nemo_automodel.components.datasets.vlm.pp_media

Module Contents

Functions

Name	Description
`_select_image_grid`	-
`chunk_step3_media`	Chunk Step3-style image tensors for PP microbatches.
`chunk_vlm_media`	Split VLM pixel values and media metadata into PP microbatch chunks.
`prepare_vlm_media_for_pp`	Move VLM media tensors into pre-chunked PP media storage on the batch.
`stage_vlm_media_for_pp`	Attach dataloader-prepared VLM media chunks to PP stage 0 for one schedule call.
`wrap_vlm_collate_for_pp`	Wrap a VLM collate function so it prepares media tensors for PP.

Data

VLM_PP_MEDIA_KEY

_VLM_MEDIA_KEYS

__all__

API

nemo_automodel.components.datasets.vlm.pp_media._select_image_grid(
    image_grid_hws: torch.Tensor | None,
    image_grid_thw: torch.Tensor | None,
    image_sizes: torch.Tensor | None,
    image_position_ids: torch.Tensor | None
) -> torch.Tensor | None

nemo_automodel.components.datasets.vlm.pp_media.chunk_step3_media(
    pixel_values: torch.Tensor,
    batch_size: int,
    n_microbatches: int,
    num_patches: torch.Tensor | None = None,
    patch_pixel_values: torch.Tensor | None = None,
    patch_newline_mask: torch.Tensor | None = None
) -> dict[str, list[torch.Tensor]]

Chunk Step3-style image tensors for PP microbatches.

Step3 processors emit one full image per sample in `pixel_values` and a flat list of optional crop patches in `patch_pixel_values`. `num_patches` maps samples to the flat patch tensor.
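To illustrate how per-sample patch counts can index a flat patch tensor, here is a minimal sketch in plain Python. It is not the library's implementation: the helper name `patch_ranges_per_microbatch` is hypothetical, `num_patches` is treated as a list of per-sample counts, and the batch is assumed to divide evenly across microbatches.

```python
from itertools import accumulate


def patch_ranges_per_microbatch(num_patches, n_microbatches):
    """Return one (start, end) row range into a flat patch tensor per microbatch.

    Hypothetical helper. Assumes len(num_patches) is divisible by
    n_microbatches, which is an assumption of this sketch only.
    """
    batch_size = len(num_patches)
    if batch_size % n_microbatches != 0:
        raise ValueError("this sketch requires batch_size % n_microbatches == 0")
    per_mb = batch_size // n_microbatches
    # offsets[i] is the first flat-patch row that belongs to sample i.
    offsets = [0, *accumulate(num_patches)]
    return [
        (offsets[mb * per_mb], offsets[(mb + 1) * per_mb])
        for mb in range(n_microbatches)
    ]


# Four samples with 2, 0, 3 and 1 crop patches, split into two microbatches.
ranges = patch_ranges_per_microbatch([2, 0, 3, 1], n_microbatches=2)
```

With tensors, each `(start, end)` pair would slice the flat patch tensor as `patch_pixel_values[start:end]`, while `pixel_values` itself could be split by sample because it holds one full image per sample.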

nemo_automodel.components.datasets.vlm.pp_media.chunk_vlm_media(
    pixel_values: torch.Tensor,
    image_grid: torch.Tensor,
    batch_size: int,
    n_microbatches: int,
    n_images_per_sample: torch.Tensor | None = None
) -> tuple[list[torch.Tensor], list[torch.Tensor]]

Split VLM pixel values and media metadata into PP microbatch chunks.

Handles four layouts:

1. `[N, C, H, W]` with `N == batch_size`: one full image per sample.
2. `[N, max_patches, D]` with `N == batch_size`: padded patches per image.
3. Flat patches `[total_patches, D]` with per-sample media counts from `n_images_per_sample`.
4. Flat patches with `n_images == batch_size`: legacy one-image-per-sample.
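The flat-patch layout with per-sample media counts is the least obvious of the four, so a small sketch of the index arithmetic may help. This is an illustration under stated assumptions, not the library's code: the helper `flat_media_ranges` is hypothetical, each `image_grid` row is assumed to be a `(t, h, w)` triple whose product is that image's patch count (the convention used by Qwen2-VL-style `image_grid_thw`; other grid formats may differ), and the batch is assumed to divide evenly across microbatches.

```python
from math import prod


def flat_media_ranges(image_grid, n_images_per_sample, n_microbatches):
    """Return ((img_start, img_end), (patch_start, patch_end)) per microbatch.

    Hypothetical helper. Assumes prod(row) is the patch count of an image
    and that the batch divides evenly across microbatches.
    """
    batch_size = len(n_images_per_sample)
    if batch_size % n_microbatches != 0:
        raise ValueError("this sketch requires batch_size % n_microbatches == 0")
    per_mb = batch_size // n_microbatches

    ranges = []
    img_start = 0
    patch_start = 0
    for mb in range(n_microbatches):
        # Number of images owned by the samples in this microbatch.
        counts = n_images_per_sample[mb * per_mb:(mb + 1) * per_mb]
        img_end = img_start + sum(counts)
        # Patches owned by those images, read from the grid rows.
        n_patches = sum(prod(row) for row in image_grid[img_start:img_end])
        patch_end = patch_start + n_patches
        ranges.append(((img_start, img_end), (patch_start, patch_end)))
        img_start, patch_start = img_end, patch_end
    return ranges


# Three images (4, 8 and 8 patches) spread over four samples.
grid = [(1, 2, 2), (1, 4, 2), (1, 2, 4)]
ranges = flat_media_ranges(grid, n_images_per_sample=[2, 0, 1, 0], n_microbatches=2)
```

Here the first microbatch owns images 0 and 1 and the first 12 patch rows, even though both images belong to the same sample. Splitting the flat tensor evenly by row would not respect those boundaries.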

nemo_automodel.components.datasets.vlm.pp_media.prepare_vlm_media_for_pp(
    batch: collections.abc.MutableMapping[str, typing.Any],
    batch_size: int,
    n_microbatches: int
) -> collections.abc.MutableMapping[str, typing.Any]

Move VLM media tensors into pre-chunked PP media storage on the batch.

This is intended to run from VLM collate or dataloader code when pipeline parallelism (PP) is enabled. Left in place, the raw media tensors would be chunked by row by PyTorch PP, which is incorrect for these layouts. The returned batch therefore no longer carries them; instead it carries `VLM_PP_MEDIA_KEY`, which holds per-microbatch media chunks.
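The general idea of moving media entries out of the batch and under a single key can be sketched as follows. This is a simplified illustration, not the actual function: `move_media_under_key` and `chunk_fn` are hypothetical, only the media key names visible in the truncated `_VLM_MEDIA_KEYS` listing are used, and the dict-of-lists layout stored under the key is an assumption.

```python
VLM_PP_MEDIA_KEY = "_vlm_pp_media_chunks"  # value documented in this module

# Only the names visible in the truncated _VLM_MEDIA_KEYS listing.
VISIBLE_MEDIA_KEYS = (
    "pixel_values",
    "patch_pixel_values",
    "num_patches",
    "patch_newline_mask",
)


def move_media_under_key(batch, chunk_fn):
    """Pop raw media entries and store chunk_fn's output under VLM_PP_MEDIA_KEY.

    chunk_fn is a hypothetical callable mapping {name: value} to
    {name: [chunk_for_mb0, chunk_for_mb1, ...]} (assumed layout).
    """
    media = {k: batch.pop(k) for k in VISIBLE_MEDIA_KEYS if k in batch}
    if media:
        batch[VLM_PP_MEDIA_KEY] = chunk_fn(media)
    return batch


def one_item_per_microbatch(media):
    # Toy chunker for the example: one sample per microbatch.
    return {name: [[item] for item in values] for name, values in media.items()}


batch = {"input_ids": [[1, 2], [3, 4]], "pixel_values": ["img0", "img1"]}
prepared = move_media_under_key(batch, one_item_per_microbatch)
```

After the call, `prepared` has no top-level `pixel_values` entry, so a generic row-wise splitter would only see the non-media fields.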

nemo_automodel.components.datasets.vlm.pp_media.stage_vlm_media_for_pp(
    pp: typing.Any,
    model_parts: list[torch.nn.Module],
    batch: collections.abc.MutableMapping[str, typing.Any]
)

Attach dataloader-prepared VLM media chunks to PP stage 0 for one schedule call.

nemo_automodel.components.datasets.vlm.pp_media.wrap_vlm_collate_for_pp(
    collate_fn: collections.abc.Callable[[Any], collections.abc.MutableMapping[str, typing.Any]],
    n_microbatches: int
) -> collections.abc.Callable[[Any], collections.abc.MutableMapping[str, typing.Any]]

Wrap a VLM collate function so it prepares media tensors for PP.
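A wrapper of this kind can be sketched as a small closure. The code below is an assumption-laden illustration rather than the library's implementation: `wrap_collate`, `toy_collate` and `toy_prepare` are hypothetical names, and taking the batch size from the number of collated samples is a guess about how the real wrapper obtains it.

```python
import functools


def wrap_collate(collate_fn, n_microbatches, prepare_fn):
    """Return a collate function that post-processes its batch with prepare_fn.

    prepare_fn is called as prepare_fn(batch, batch_size, n_microbatches),
    mirroring the documented parameter order of prepare_vlm_media_for_pp.
    """

    @functools.wraps(collate_fn)
    def wrapped(samples):
        batch = collate_fn(samples)
        # Assumption: batch size equals the number of collated samples.
        return prepare_fn(batch, len(samples), n_microbatches)

    return wrapped


def toy_collate(samples):
    return {"input_ids": [s["ids"] for s in samples]}


def toy_prepare(batch, batch_size, n_microbatches):
    # Stand-in for prepare_vlm_media_for_pp; records what it was called with.
    batch["_debug"] = (batch_size, n_microbatches)
    return batch


collate = wrap_collate(toy_collate, n_microbatches=2, prepare_fn=toy_prepare)
out = collate([{"ids": [1]}, {"ids": [2]}])
```

In real use, the wrapped function returned by `wrap_vlm_collate_for_pp` would presumably be passed as the `collate_fn` argument of a `torch.utils.data.DataLoader`.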

nemo_automodel.components.datasets.vlm.pp_media.VLM_PP_MEDIA_KEY = '_vlm_pp_media_chunks'

nemo_automodel.components.datasets.vlm.pp_media._VLM_MEDIA_KEYS = ('pixel_values', 'patch_pixel_values', 'num_patches', 'patch_newline_mask', 'ima...

nemo_automodel.components.datasets.vlm.pp_media.__all__ = ['VLM_PP_MEDIA_KEY', 'chunk_vlm_media', 'prepare_vlm_media_for_pp', 'stage_vlm_m...