nemo_automodel.components.datasets.vlm.pp_media


Module Contents

Functions

| Name | Description |
| --- | --- |
| _select_image_grid | - |
| chunk_step3_media | Chunk Step3-style image tensors for PP microbatches. |
| chunk_vlm_media | Split VLM pixel values and media metadata into PP microbatch chunks. |
| prepare_vlm_media_for_pp | Move VLM media tensors into pre-chunked PP media storage on the batch. |
| stage_vlm_media_for_pp | Attach dataloader-prepared VLM media chunks to PP stage 0 for one schedule call. |
| wrap_vlm_collate_for_pp | Wrap a VLM collate function so it prepares media tensors for PP. |

Data

VLM_PP_MEDIA_KEY

_VLM_MEDIA_KEYS

__all__

API

nemo_automodel.components.datasets.vlm.pp_media._select_image_grid(
image_grid_hws: torch.Tensor | None,
image_grid_thw: torch.Tensor | None,
image_sizes: torch.Tensor | None,
image_position_ids: torch.Tensor | None
) -> torch.Tensor | None

nemo_automodel.components.datasets.vlm.pp_media.chunk_step3_media(
pixel_values: torch.Tensor,
batch_size: int,
n_microbatches: int,
num_patches: torch.Tensor | None = None,
patch_pixel_values: torch.Tensor | None = None,
patch_newline_mask: torch.Tensor | None = None
) -> dict[str, list[torch.Tensor]]

Chunk Step3-style image tensors for PP microbatches.

Step3 processors emit one full image per sample in pixel_values and a flat list of optional crop patches in patch_pixel_values. num_patches maps samples to the flat patch tensor.
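
The mapping from samples to the flat patch tensor can be illustrated with plain Python lists standing in for tensors. This is a minimal sketch, not the module's implementation: the helper name `split_patches_by_microbatch` is hypothetical, and it assumes that patches are stored in sample order and that the batch divides evenly into microbatches.

```python
from itertools import accumulate


def split_patches_by_microbatch(patches, num_patches, n_microbatches):
    """Group a flat, sample-ordered patch list into per-microbatch lists.

    Hypothetical helper for illustration only.
    patches: flat list of patches (stand-in for patch_pixel_values).
    num_patches: patches owned by each sample (stand-in for num_patches).
    """
    batch_size = len(num_patches)
    if batch_size % n_microbatches != 0:
        # Assumption: microbatches are equally sized.
        raise ValueError("batch_size must be divisible by n_microbatches")
    per_mb = batch_size // n_microbatches

    # offsets[i] is the index of the first patch belonging to sample i.
    offsets = [0, *accumulate(num_patches)]
    if offsets[-1] != len(patches):
        raise ValueError("num_patches does not sum to the number of patches")

    chunks = []
    for mb in range(n_microbatches):
        first, last = mb * per_mb, (mb + 1) * per_mb
        chunks.append(patches[offsets[first]:offsets[last]])
    return chunks


# Four samples owning 2, 0, 3, and 1 patches, split into two microbatches.
patch_chunks = split_patches_by_microbatch(list("abcdef"), [2, 0, 3, 1], 2)
```

Note that the second microbatch receives four patches while the first receives two, which is why the flat patch tensor cannot simply be cut into equal row ranges.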

nemo_automodel.components.datasets.vlm.pp_media.chunk_vlm_media(
pixel_values: torch.Tensor,
image_grid: torch.Tensor,
batch_size: int,
n_microbatches: int,
n_images_per_sample: torch.Tensor | None = None
) -> tuple[list[torch.Tensor], list[torch.Tensor]]

Split VLM pixel values and media metadata into PP microbatch chunks.

Handles four layouts:

  1. [N, C, H, W] with N == batch_size — one full image per sample.
  2. [N, max_patches, D] with N == batch_size — padded patches per image.
  3. Flat patches [total_patches, D] with per-sample media counts from n_images_per_sample.
  4. Flat patches with n_images == batch_size — legacy one-image-per-sample.
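
Layout 3 is the case where the number of patch rows per sample varies, so the split has to follow per-sample media counts rather than row indices. The bookkeeping can be sketched with plain lists standing in for tensors. The helper name `split_flat_media` is hypothetical, and the sketch assumes that each `image_grid` row is a `(t, h, w)` triple whose product is that image's patch count; the actual function may derive patch counts differently.

```python
from math import prod


def split_flat_media(patches, image_grid, n_images_per_sample, n_microbatches):
    """Split flat patches and their grid rows into per-microbatch chunks.

    Hypothetical helper for illustration only.
    Assumption: prod(grid_row) is the number of patch rows for that image.
    """
    batch_size = len(n_images_per_sample)
    if batch_size % n_microbatches != 0:
        raise ValueError("batch_size must be divisible by n_microbatches")
    per_mb = batch_size // n_microbatches

    patch_chunks, grid_chunks = [], []
    img_start = patch_start = 0
    for mb in range(n_microbatches):
        # Images owned by the samples of this microbatch.
        counts = n_images_per_sample[mb * per_mb:(mb + 1) * per_mb]
        img_end = img_start + sum(counts)
        grids = image_grid[img_start:img_end]
        # Patch rows owned by those images.
        patch_end = patch_start + sum(prod(g) for g in grids)
        patch_chunks.append(patches[patch_start:patch_end])
        grid_chunks.append(grids)
        img_start, patch_start = img_end, patch_end
    return patch_chunks, grid_chunks


# Two samples: the first has two images (4 and 2 patches), the second has one (6 patches).
grid = [(1, 2, 2), (1, 1, 2), (1, 2, 3)]
patch_chunks, grid_chunks = split_flat_media(list(range(12)), grid, [2, 1], 2)
```
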
nemo_automodel.components.datasets.vlm.pp_media.prepare_vlm_media_for_pp(
batch: collections.abc.MutableMapping[str, typing.Any],
batch_size: int,
n_microbatches: int
) -> collections.abc.MutableMapping[str, typing.Any]

Move VLM media tensors into pre-chunked PP media storage on the batch.

This is intended to run from VLM collate/dataloader code when pipeline parallelism (PP) is enabled. The returned batch no longer carries the raw media tensors, which PyTorch PP would incorrectly chunk by row; instead it carries VLM_PP_MEDIA_KEY with per-microbatch media chunks.
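
To make the shape of the result concrete, here is a toy sketch of the transformation on a plain dict. It is not the module's code: `stash_media_chunks`, `fake_chunker`, and `MEDIA_KEYS` are hypothetical stand-ins, and the layout of the value stored under `VLM_PP_MEDIA_KEY` (a mapping from media key to a list of per-microbatch chunks) is an assumption made for illustration.

```python
VLM_PP_MEDIA_KEY = "_vlm_pp_media_chunks"  # value documented for this module
# Illustrative subset only; not the real _VLM_MEDIA_KEYS tuple.
MEDIA_KEYS = ("pixel_values", "image_grid_thw")


def stash_media_chunks(batch, chunker, n_microbatches):
    """Pop raw media entries and store pre-chunked copies under one key.

    Hypothetical helper; the stored structure is an assumption.
    """
    media = {k: batch.pop(k) for k in MEDIA_KEYS if k in batch}
    if media:
        batch[VLM_PP_MEDIA_KEY] = {
            k: chunker(v, n_microbatches) for k, v in media.items()
        }
    return batch


def fake_chunker(values, n_microbatches):
    # Stand-in only: an even split of a list. Real media needs
    # layout-aware chunking rather than a plain row split.
    step = len(values) // n_microbatches
    return [values[i * step:(i + 1) * step] for i in range(n_microbatches)]


batch = {"input_ids": [[1, 2], [3, 4]], "pixel_values": ["img0", "img1"]}
batch = stash_media_chunks(batch, fake_chunker, n_microbatches=2)
```

After the call, the text fields are untouched, the raw `pixel_values` entry is gone, and the pre-chunked media sits under a single key that generic row-wise chunking has no reason to touch.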

nemo_automodel.components.datasets.vlm.pp_media.stage_vlm_media_for_pp(
pp: typing.Any,
model_parts: list[torch.nn.Module],
batch: collections.abc.MutableMapping[str, typing.Any]
)

Attach dataloader-prepared VLM media chunks to PP stage 0 for one schedule call.

nemo_automodel.components.datasets.vlm.pp_media.wrap_vlm_collate_for_pp(
collate_fn: collections.abc.Callable[[Any], collections.abc.MutableMapping[str, typing.Any]],
n_microbatches: int
) -> collections.abc.Callable[[Any], collections.abc.MutableMapping[str, typing.Any]]

Wrap a VLM collate function so it prepares media tensors for PP.
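
The wrapping pattern can be sketched generically as a closure around the original collate function. This is an illustration rather than the module's implementation: `wrap_collate`, `toy_collate`, and `record_call` are hypothetical names, and inferring the batch size from the length of the collated `input_ids` entry is an assumption. The `prepare` callable takes the same three arguments as `prepare_vlm_media_for_pp`.

```python
from functools import wraps


def wrap_collate(collate_fn, prepare, n_microbatches):
    """Return a collate function that post-processes each batch with `prepare`.

    Hypothetical helper for illustration only.
    """
    @wraps(collate_fn)
    def wrapped(examples):
        batch = collate_fn(examples)
        # Assumption: the batch size is the length of the input_ids entry.
        batch_size = len(batch["input_ids"])
        return prepare(batch, batch_size, n_microbatches)

    return wrapped


def toy_collate(examples):
    # Stand-in collate function that only gathers token ids.
    return {"input_ids": [e["ids"] for e in examples]}


def record_call(batch, batch_size, n_microbatches):
    # Stand-in for the real preparation step; it only records its arguments.
    batch["_prepared_for"] = (batch_size, n_microbatches)
    return batch


collate = wrap_collate(toy_collate, record_call, n_microbatches=2)
out = collate([{"ids": [1]}, {"ids": [2]}, {"ids": [3]}, {"ids": [4]}])
```

Doing the preparation inside the collate function means it runs in the dataloader, before the batch reaches the pipeline schedule.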

nemo_automodel.components.datasets.vlm.pp_media.VLM_PP_MEDIA_KEY = '_vlm_pp_media_chunks'
nemo_automodel.components.datasets.vlm.pp_media._VLM_MEDIA_KEYS = ('pixel_values', 'patch_pixel_values', 'num_patches', 'patch_newline_mask', 'ima...
nemo_automodel.components.datasets.vlm.pp_media.__all__ = ['VLM_PP_MEDIA_KEY', 'chunk_vlm_media', 'prepare_vlm_media_for_pp', 'stage_vlm_m...