nemo_automodel.components.datasets.vlm.utils

Module Contents

Functions

Name	Description
`_build_video_metadata`	Build a list of `VideoMetadata` from preserved `_video_fps` / `_frame_indices`.
`_preload_media`	Pre-load image and video files in a conversation example.
`_read_video_frames`	Read and sample video frames from a video file using decord.
`_resolve_lmdb_image`	Read an image from an LMDB database.
`default_stop_tokens`	Return default generation stop tokens for a processor tokenizer.
`json2token`	Convert an ordered JSON object into a token sequence.
`process_text_batch`	Process a batch of texts and optionally images.

Data

HAVE_LMDB

_lmdb_env_cache

logger

API

nemo_automodel.components.datasets.vlm.utils._build_video_metadata(
    conversation
)

Build a list of VideoMetadata from preserved _video_fps / _frame_indices.

_preload_media(preserve_video_metadata=True) stores these on each video content item. Passing the resulting metadata to the processor ensures correct timestamps and prevents double frame-sampling.

Returns an empty list if no video metadata is found.

nemo_automodel.components.datasets.vlm.utils._preload_media(
    example,
    processor = None,
    preserve_video_metadata = False
)

Pre-load image and video files in a conversation example.

Images are loaded as PIL RGB Images. Videos are decoded into lists of PIL RGB Images (sampled frames).

When preserve_video_metadata is True, the original video fps and the sampled frame indices are stored on each video content item as _video_fps and _frame_indices. This allows downstream code (e.g. :func:_build_video_metadata) to construct VideoMetadata for the processor so it inserts correct timestamps.

nemo_automodel.components.datasets.vlm.utils._read_video_frames(
    video_path,
    processor = None,
    frame_indices = None,
    return_metadata = False
)

Read and sample video frames from a video file using decord.

If frame_indices is provided (e.g. from a dataset annotation), those exact frame numbers are used. Otherwise, frame sampling uses the same smart_nframes + linspace strategy as qwen_vl_utils to ensure that preloaded frames are identical to those produced by the processor’s own video pipeline.

Parameters:

video_path

Path to the video file.

processor

Defaults to None

HuggingFace processor whose video_processor supplies default fps / max_frames / min_frames.

frame_indices

Defaults to None

Explicit list of 0-based frame indices to extract.

return_metadata

Defaults to False

If True, return (frames, video_fps, used_indices) so callers can preserve timing information for timestamp calculation.

Returns:

list[PIL.Image.Image]: Sampled video frames as RGB PIL Images.

nemo_automodel.components.datasets.vlm.utils._resolve_lmdb_image(
    path
)

Read an image from an LMDB database.

Paths use the format <lmdb_dir>::<key>, e.g. /data/my_db.lmdb::0000000087.

Returns:

PIL.Image.Image: The decoded image.

nemo_automodel.components.datasets.vlm.utils.default_stop_tokens(
    processor
) -> typing.Iterable[str]

Return default generation stop tokens for a processor tokenizer.

nemo_automodel.components.datasets.vlm.utils.json2token(
    obj,
    sort_json_key: bool = True
)

Convert an ordered JSON object into a token sequence.

From NeMo’s automodel_datasets.py

nemo_automodel.components.datasets.vlm.utils.process_text_batch(
    processor,
    texts: list[str],
    images: list | None = None
) -> dict[str, torch.Tensor]

Process a batch of texts and optionally images.

Parameters:

processor

The processor to use for tokenization and image processing

texts

list[str]

List of text strings to process

images

list | NoneDefaults to None

Optional list of images to process

Returns: dict[str, torch.Tensor]

Dict containing processed batch data

nemo_automodel.components.datasets.vlm.utils.HAVE_LMDB = True

nemo_automodel.components.datasets.vlm.utils._lmdb_env_cache: dict[str, Environment] = {}

nemo_automodel.components.datasets.vlm.utils.logger = logging.getLogger(__name__)