nemo_automodel.components.datasets.vlm.utils

View as Markdown

Module Contents

Functions

NameDescription
_build_video_metadataBuild a list of VideoMetadata from preserved _video_fps / _frame_indices.
_preload_mediaPre-load image and video files in a conversation example.
_read_video_framesRead and sample video frames from a video file using decord.
_resolve_lmdb_imageRead an image from an LMDB database.
default_stop_tokensReturn default generation stop tokens for a processor tokenizer.
json2tokenConvert an ordered JSON object into a token sequence.
process_text_batchProcess a batch of texts and optionally images.

Data

HAVE_LMDB

_lmdb_env_cache

logger

API

nemo_automodel.components.datasets.vlm.utils._build_video_metadata(
conversation
)

Build a list of VideoMetadata from preserved _video_fps / _frame_indices.

_preload_media(preserve_video_metadata=True) stores these on each video content item. Passing the resulting metadata to the processor ensures correct timestamps and prevents double frame-sampling.

Returns an empty list if no video metadata is found.

nemo_automodel.components.datasets.vlm.utils._preload_media(
example,
processor = None,
preserve_video_metadata = False
)

Pre-load image and video files in a conversation example.

Images are loaded as PIL RGB Images. Videos are decoded into lists of PIL RGB Images (sampled frames).

When preserve_video_metadata is True, the original video fps and the sampled frame indices are stored on each video content item as _video_fps and _frame_indices. This allows downstream code (e.g. :func:_build_video_metadata) to construct VideoMetadata for the processor so it inserts correct timestamps.

nemo_automodel.components.datasets.vlm.utils._read_video_frames(
video_path,
processor = None,
frame_indices = None,
return_metadata = False
)

Read and sample video frames from a video file using decord.

If frame_indices is provided (e.g. from a dataset annotation), those exact frame numbers are used. Otherwise, frame sampling uses the same smart_nframes + linspace strategy as qwen_vl_utils to ensure that preloaded frames are identical to those produced by the processor’s own video pipeline.

Parameters:

video_path

Path to the video file.

processor
Defaults to None

HuggingFace processor whose video_processor supplies default fps / max_frames / min_frames.

frame_indices
Defaults to None

Explicit list of 0-based frame indices to extract.

return_metadata
Defaults to False

If True, return (frames, video_fps, used_indices) so callers can preserve timing information for timestamp calculation.

Returns:

list[PIL.Image.Image]: Sampled video frames as RGB PIL Images.

nemo_automodel.components.datasets.vlm.utils._resolve_lmdb_image(
path
)

Read an image from an LMDB database.

Paths use the format <lmdb_dir>::<key>, e.g. /data/my_db.lmdb::0000000087.

Returns:

PIL.Image.Image: The decoded image.

nemo_automodel.components.datasets.vlm.utils.default_stop_tokens(
processor
) -> typing.Iterable[str]

Return default generation stop tokens for a processor tokenizer.

nemo_automodel.components.datasets.vlm.utils.json2token(
obj,
sort_json_key: bool = True
)

Convert an ordered JSON object into a token sequence.

From NeMo’s automodel_datasets.py

nemo_automodel.components.datasets.vlm.utils.process_text_batch(
processor,
texts: list[str],
images: list | None = None
) -> dict[str, torch.Tensor]

Process a batch of texts and optionally images.

Parameters:

processor

The processor to use for tokenization and image processing

texts
list[str]

List of text strings to process

images
list | NoneDefaults to None

Optional list of images to process

Returns: dict[str, torch.Tensor]

Dict containing processed batch data

nemo_automodel.components.datasets.vlm.utils.HAVE_LMDB = True
nemo_automodel.components.datasets.vlm.utils._lmdb_env_cache: dict[str, Environment] = {}
nemo_automodel.components.datasets.vlm.utils.logger = logging.getLogger(__name__)