nemo_automodel.components.datasets.vlm.utils#

Module Contents#

Functions#

_resolve_lmdb_image

Read an image from an LMDB database.

_read_video_frames

Read and sample video frames from a video file using decord.

_preload_media

Pre-load image and video files in a conversation example.

_build_video_metadata

Build a list of VideoMetadata from preserved _video_fps / _frame_indices.

default_stop_tokens

json2token

Convert an ordered JSON object into a token sequence.

process_text_batch

Process a batch of texts and optionally images.

Data#

API#

nemo_automodel.components.datasets.vlm.utils.logger#

‘getLogger(…)’

nemo_automodel.components.datasets.vlm.utils._lmdb_env_cache: dict[str, lmdb.Environment]#

None

nemo_automodel.components.datasets.vlm.utils._resolve_lmdb_image(path)#

Read an image from an LMDB database.

Paths use the format <lmdb_dir>::<key>, e.g. /data/my_db.lmdb::0000000087.

Returns:

The decoded image.

Return type:

PIL.Image.Image
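The two-part addressing scheme can be illustrated with a minimal sketch. Only the path parsing is shown executable here; `split_lmdb_path` is a hypothetical helper (not part of this module), and the actual lookup via `lmdb` and `PIL` is indicated in comments.

```python
def split_lmdb_path(path: str) -> tuple[str, str]:
    """Split an '<lmdb_dir>::<key>' path into (lmdb_dir, key).

    Hypothetical helper mirroring the addressing scheme described above.
    """
    lmdb_dir, sep, key = path.partition("::")
    if not sep or not key:
        raise ValueError(f"expected '<lmdb_dir>::<key>', got {path!r}")
    return lmdb_dir, key

# The real function would then (sketch, assuming the lmdb/PIL APIs):
#   env = lmdb.open(lmdb_dir, readonly=True, lock=False)  # cached in _lmdb_env_cache
#   with env.begin() as txn:
#       raw = txn.get(key.encode())
#   image = PIL.Image.open(io.BytesIO(raw)).convert("RGB")
```

Caching the opened `lmdb.Environment` per database path (as `_lmdb_env_cache` suggests) avoids reopening the environment for every sample.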

nemo_automodel.components.datasets.vlm.utils._read_video_frames(
video_path,
processor=None,
frame_indices=None,
return_metadata=False,
)#

Read and sample video frames from a video file using decord.

If frame_indices is provided (e.g. from a dataset annotation), those exact frame numbers are used. Otherwise, frame sampling uses the same smart_nframes + linspace strategy as qwen_vl_utils to ensure that preloaded frames are identical to those produced by the processor’s own video pipeline.

Parameters:
  • video_path – Path to the video file.

  • processor – HuggingFace processor whose video_processor supplies default fps / max_frames / min_frames.

  • frame_indices – Explicit list of 0-based frame indices to extract.

  • return_metadata – If True, return (frames, video_fps, used_indices) so callers can preserve timing information for timestamp calculation.

Returns:

Sampled video frames as RGB PIL Images. If return_metadata is True, returns (frames, video_fps, used_indices) instead.

Return type:

list[PIL.Image.Image]
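The smart_nframes + linspace strategy can be sketched in pure Python. This is a hedged approximation, not the module's implementation: the default `target_fps`, `min_frames`, `max_frames`, and `frame_factor` values below are assumptions, since the real function reads them from the processor's video_processor.

```python
def sample_frame_indices(total_frames: int, video_fps: float,
                         target_fps: float = 2.0,
                         min_frames: int = 4, max_frames: int = 768,
                         frame_factor: int = 2) -> list[int]:
    """Sketch of smart_nframes + linspace sampling (defaults are assumptions)."""
    # Frame count at the target sampling rate, clamped to bounds and
    # rounded to a multiple of frame_factor.
    nframes = total_frames / video_fps * target_fps
    nframes = min(max(nframes, min_frames), max_frames, total_frames)
    nframes = max(frame_factor, int(round(nframes / frame_factor)) * frame_factor)
    nframes = min(nframes, total_frames)
    if nframes <= 1:
        return [0]
    # Evenly spaced 0-based indices over the whole clip (linspace + round).
    step = (total_frames - 1) / (nframes - 1)
    return [round(i * step) for i in range(nframes)]
```

Because the same index-selection rule is applied here and in the processor's own video pipeline, preloading with these indices yields the identical frames the processor would have chosen.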

nemo_automodel.components.datasets.vlm.utils._preload_media(example, processor=None, preserve_video_metadata=False)#

Pre-load image and video files in a conversation example.

Images are loaded as PIL RGB Images. Videos are decoded into lists of PIL RGB Images (sampled frames).

When preserve_video_metadata is True, the original video fps and the sampled frame indices are stored on each video content item as _video_fps and _frame_indices. This allows downstream code (e.g. _build_video_metadata) to construct VideoMetadata for the processor so it inserts correct timestamps.
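The traversal can be sketched as follows. The "messages"/"content" layout is an assumption based on the common chat-template conversation format, and the injected `load_image` / `load_video` callables stand in for PIL image loading and the decord-based frame reader.

```python
def preload_media_sketch(example: dict, load_image, load_video) -> dict:
    """Hedged sketch: replace media path strings in a conversation with
    loaded media objects, in place.

    load_image / load_video are stand-ins for PIL.Image.open(...).convert("RGB")
    and the decord-based video frame reader.
    """
    for message in example.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for item in content:
            if not isinstance(item, dict):
                continue
            if item.get("type") == "image" and isinstance(item.get("image"), str):
                item["image"] = load_image(item["image"])
            elif item.get("type") == "video" and isinstance(item.get("video"), str):
                item["video"] = load_video(item["video"])
    return example
```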

nemo_automodel.components.datasets.vlm.utils._build_video_metadata(conversation)#

Build a list of VideoMetadata from preserved _video_fps / _frame_indices.

_preload_media(preserve_video_metadata=True) stores these on each video content item. Passing the resulting metadata to the processor ensures correct timestamps and prevents double frame-sampling.

Returns an empty list if no video metadata is found.
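The extraction can be sketched with plain dicts standing in for transformers' VideoMetadata; the output field names below are assumptions, only _video_fps and _frame_indices come from the description above.

```python
def build_video_metadata_sketch(conversation: list[dict]) -> list[dict]:
    """Hedged sketch: collect preserved _video_fps / _frame_indices from each
    video content item, skipping items without preserved metadata."""
    metadata = []
    for message in conversation:
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for item in content:
            if not isinstance(item, dict) or item.get("type") != "video":
                continue
            fps = item.get("_video_fps")
            indices = item.get("_frame_indices")
            if fps is None or indices is None:
                continue
            metadata.append({
                "fps": fps,                        # original video fps
                "frames_indices": indices,         # 0-based sampled indices
                "total_num_frames": len(indices),  # assumed field name
            })
    return metadata
```

Passing metadata like this to the processor tells it which source frames the preloaded images correspond to, so it can compute timestamps instead of re-sampling.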

nemo_automodel.components.datasets.vlm.utils.default_stop_tokens(processor) → Iterable[str]#
nemo_automodel.components.datasets.vlm.utils.json2token(obj, sort_json_key: bool = True)#

Convert an ordered JSON object into a token sequence.

From NeMo’s automodel_datasets.py
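The conversion can be sketched after the Donut convention this function is named for: dict keys become <s_key>…</s_key> wrappers, list items are joined with a separator token, and leaves are stringified. The exact special tokens and the reverse key sort are assumptions based on the original Donut implementation, not verified against this module.

```python
def json2token_sketch(obj, sort_json_key: bool = True) -> str:
    """Hedged sketch of Donut-style JSON-to-token conversion."""
    if isinstance(obj, dict):
        keys = sorted(obj.keys(), reverse=True) if sort_json_key else list(obj.keys())
        # Each key wraps its converted value in <s_key> ... </s_key>.
        return "".join(
            f"<s_{k}>{json2token_sketch(obj[k], sort_json_key)}</s_{k}>" for k in keys
        )
    if isinstance(obj, list):
        # List items are concatenated with a separator token.
        return "<sep/>".join(json2token_sketch(item, sort_json_key) for item in obj)
    return str(obj)
```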

nemo_automodel.components.datasets.vlm.utils.process_text_batch(
processor,
texts: list[str],
images: list | None = None,
) → dict[str, torch.Tensor]#

Process a batch of texts and optionally images.

Parameters:
  • processor – The processor to use for tokenization and image processing.

  • texts – List of text strings to process.

  • images – Optional list of images to process.

Returns:

Dict containing processed batch data