nemo_automodel.components.datasets.vlm.utils#

Module Contents#

Functions#

_resolve_lmdb_image

Read an image from an LMDB database.

_read_video_frames

Read and sample video frames from a video file using decord.

_preload_media

Pre-load image and video files in a conversation example.

_build_video_metadata

Build a list of VideoMetadata from preserved _video_fps / _frame_indices.

default_stop_tokens

json2token

Convert an ordered JSON object into a token sequence.

process_text_batch

Process a batch of texts and optionally images.

Data#

API#

nemo_automodel.components.datasets.vlm.utils.logger#

‘getLogger(…)’

nemo_automodel.components.datasets.vlm.utils._lmdb_env_cache: dict[str, lmdb.Environment]#

None

nemo_automodel.components.datasets.vlm.utils._resolve_lmdb_image(path)#

Read an image from an LMDB database.

Paths use the format <lmdb_dir>::<key>, e.g. /data/my_db.lmdb::0000000087.

Returns:

The decoded image.

Return type:

PIL.Image.Image
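The two-part addressing scheme can be illustrated with a minimal sketch. Only the path parsing is shown executable here; `split_lmdb_path` is a hypothetical helper (not part of this module), and the actual lookup via `lmdb` and `PIL` is indicated in comments.

```python
def split_lmdb_path(path: str) -> tuple[str, str]:
    """Split an '<lmdb_dir>::<key>' path into (lmdb_dir, key).

    Hypothetical helper mirroring the addressing scheme described above.
    """
    lmdb_dir, sep, key = path.partition("::")
    if not sep or not key:
        raise ValueError(f"expected '<lmdb_dir>::<key>', got {path!r}")
    return lmdb_dir, key

# The real function would then (sketch, assuming the lmdb/PIL APIs):
#   env = lmdb.open(lmdb_dir, readonly=True, lock=False)  # cached in _lmdb_env_cache
#   with env.begin() as txn:
#       raw = txn.get(key.encode())
#   image = PIL.Image.open(io.BytesIO(raw)).convert("RGB")
```

Caching the opened `lmdb.Environment` per database path (as `_lmdb_env_cache` suggests) avoids reopening the environment for every sample.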

nemo_automodel.components.datasets.vlm.utils._read_video_frames(
video_path,
processor=None,
frame_indices=None,
return_metadata=False,
)#

Read and sample video frames from a video file using decord.

If frame_indices is provided (e.g. from a dataset annotation), those exact frame numbers are used. Otherwise, frame sampling uses the same smart_nframes + linspace strategy as qwen_vl_utils to ensure that preloaded frames are identical to those produced by the processor’s own video pipeline.

Parameters:
  • video_path – Path to the video file.

  • processor – HuggingFace processor whose video_processor supplies default fps / max_frames / min_frames.

  • frame_indices – Explicit list of 0-based frame indices to extract.

  • return_metadata – If True, return (frames, video_fps, used_indices) so callers can preserve timing information for timestamp calculation.

Returns:

Sampled video frames as RGB PIL Images. If return_metadata is True, returns (frames, video_fps, used_indices) instead.

Return type:

list[PIL.Image.Image]
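The smart_nframes + linspace strategy can be sketched in pure Python. This is a hedged approximation, not the module's implementation: the default `target_fps`, `min_frames`, `max_frames`, and `frame_factor` values below are assumptions, since the real function reads them from the processor's video_processor.

```python
def sample_frame_indices(total_frames: int, video_fps: float,
                         target_fps: float = 2.0,
                         min_frames: int = 4, max_frames: int = 768,
                         frame_factor: int = 2) -> list[int]:
    """Sketch of smart_nframes + linspace sampling (defaults are assumptions)."""
    # Frame count at the target sampling rate, clamped to bounds and
    # rounded to a multiple of frame_factor.
    nframes = total_frames / video_fps * target_fps
    nframes = min(max(nframes, min_frames), max_frames, total_frames)
    nframes = max(frame_factor, int(round(nframes / frame_factor)) * frame_factor)
    nframes = min(nframes, total_frames)
    if nframes <= 1:
        return [0]
    # Evenly spaced 0-based indices over the whole clip (linspace + round).
    step = (total_frames - 1) / (nframes - 1)
    return [round(i * step) for i in range(nframes)]
```

Because the same index-selection rule is applied here and in the processor's own video pipeline, preloading with these indices yields the identical frames the processor would have chosen.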

nemo_automodel.components.datasets.vlm.utils._preload_media(example, processor=None, preserve_video_metadata=False)#

Pre-load image and video files in a conversation example.

Images are loaded as PIL RGB Images. Videos are decoded into lists of PIL RGB Images (sampled frames).

When preserve_video_metadata is True, the original video fps and the sampled frame indices are stored on each video content item as _video_fps and _frame_indices. This allows downstream code (e.g. _build_video_metadata) to construct VideoMetadata for the processor so it inserts correct timestamps.
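The traversal can be sketched as follows. The "messages"/"content" layout is an assumption based on the common chat-template conversation format, and the injected `load_image` / `load_video` callables stand in for PIL image loading and the decord-based frame reader.

```python
def preload_media_sketch(example: dict, load_image, load_video) -> dict:
    """Hedged sketch: replace media path strings in a conversation with
    loaded media objects, in place.

    load_image / load_video are stand-ins for PIL.Image.open(...).convert("RGB")
    and the decord-based video frame reader.
    """
    for message in example.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for item in content:
            if not isinstance(item, dict):
                continue
            if item.get("type") == "image" and isinstance(item.get("image"), str):
                item["image"] = load_image(item["image"])
            elif item.get("type") == "video" and isinstance(item.get("video"), str):
                item["video"] = load_video(item["video"])
    return example
```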

nemo_automodel.components.datasets.vlm.utils._build_video_metadata(conversation)#

Build a list of VideoMetadata from preserved _video_fps / _frame_indices.

_preload_media(preserve_video_metadata=True) stores these on each video content item. Passing the resulting metadata to the processor ensures correct timestamps and prevents double frame-sampling.

Returns an empty list if no video metadata is found.
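The extraction can be sketched with plain dicts standing in for transformers' VideoMetadata; the output field names below are assumptions, only _video_fps and _frame_indices come from the description above.

```python
def build_video_metadata_sketch(conversation: list[dict]) -> list[dict]:
    """Hedged sketch: collect preserved _video_fps / _frame_indices from each
    video content item, skipping items without preserved metadata."""
    metadata = []
    for message in conversation:
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for item in content:
            if not isinstance(item, dict) or item.get("type") != "video":
                continue
            fps = item.get("_video_fps")
            indices = item.get("_frame_indices")
            if fps is None or indices is None:
                continue
            metadata.append({
                "fps": fps,                        # original video fps
                "frames_indices": indices,         # 0-based sampled indices
                "total_num_frames": len(indices),  # assumed field name
            })
    return metadata
```

Passing metadata like this to the processor tells it which source frames the preloaded images correspond to, so it can compute timestamps instead of re-sampling.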

nemo_automodel.components.datasets.vlm.utils.default_stop_tokens(processor) → Iterable[str]#
nemo_automodel.components.datasets.vlm.utils.json2token(obj, sort_json_key: bool = True)#

Convert an ordered JSON object into a token sequence.

From NeMo’s automodel_datasets.py
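The conversion can be sketched after the Donut convention this function is named for: dict keys become <s_key>…</s_key> wrappers, list items are joined with a separator token, and leaves are stringified. The exact special tokens and the reverse key sort are assumptions based on the original Donut implementation, not verified against this module.

```python
def json2token_sketch(obj, sort_json_key: bool = True) -> str:
    """Hedged sketch of Donut-style JSON-to-token conversion."""
    if isinstance(obj, dict):
        keys = sorted(obj.keys(), reverse=True) if sort_json_key else list(obj.keys())
        # Each key wraps its converted value in <s_key> ... </s_key>.
        return "".join(
            f"<s_{k}>{json2token_sketch(obj[k], sort_json_key)}</s_{k}>" for k in keys
        )
    if isinstance(obj, list):
        # List items are concatenated with a separator token.
        return "<sep/>".join(json2token_sketch(item, sort_json_key) for item in obj)
    return str(obj)
```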

nemo_automodel.components.datasets.vlm.utils.process_text_batch(
processor,
texts: list[str],
images: list | None = None,
) → dict[str, torch.Tensor]#

Process a batch of texts and optionally images.

Parameters:
  • processor – The processor to use for tokenization and image processing.

  • texts – List of text strings to process.

  • images – Optional list of images to process.

Returns:

Dict containing processed batch data