nemo_automodel.components.datasets.vlm.utils
Module Contents
Functions
Data
API
Build a list of VideoMetadata from preserved _video_fps / _frame_indices.
_preload_media(preserve_video_metadata=True) stores these on each
video content item. Passing the resulting metadata to the processor
ensures correct timestamps and prevents double frame-sampling.
Returns an empty list if no video metadata is found.
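The collection step described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the conversation layout (a list of messages whose content is a list of typed items) is an assumption, and VideoMetadataSketch is a hypothetical stand-in for the processor's VideoMetadata type.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class VideoMetadataSketch:
    """Hypothetical stand-in for the processor's VideoMetadata type."""

    fps: float
    frame_indices: list[int]

    def timestamps(self) -> list[float]:
        # Each sampled frame's time in seconds is its index divided by the source fps.
        return [index / self.fps for index in self.frame_indices]


def build_video_metadata_sketch(conversation: list[dict[str, Any]]) -> list[VideoMetadataSketch]:
    """Collect metadata from video items that carry _video_fps and _frame_indices."""
    metadata: list[VideoMetadataSketch] = []
    for message in conversation:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no media items
        for item in content:
            if not isinstance(item, dict) or item.get("type") != "video":
                continue
            fps = item.get("_video_fps")
            indices = item.get("_frame_indices")
            if fps is None or indices is None:
                continue  # metadata was not preserved for this item
            metadata.append(VideoMetadataSketch(fps=float(fps), frame_indices=list(indices)))
    return metadata  # empty when no video item carries metadata
```

Because the timestamps are derived from the original fps and the indices that were actually sampled, the processor does not need to guess the timing or resample the frames.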
Pre-load image and video files in a conversation example.
Images are loaded as PIL RGB Images. Videos are decoded into lists of PIL RGB Images (sampled frames).
When preserve_video_metadata is True, the original video fps and
the sampled frame indices are stored on each video content item as
_video_fps and _frame_indices. This allows downstream code
(e.g. _build_video_metadata) to construct VideoMetadata
for the processor so it inserts correct timestamps.
Read and sample video frames from a video file using decord.
If frame_indices is provided (e.g. from a dataset annotation), those
exact frame numbers are used. Otherwise, frame sampling uses the same
smart_nframes + linspace strategy as qwen_vl_utils to ensure
that preloaded frames are identical to those produced by the processor’s
own video pipeline.
Parameters:
Path to the video file.
HuggingFace processor whose video_processor supplies
default fps / max_frames / min_frames.
Explicit list of 0-based frame indices to extract.
If True, return (frames, video_fps, used_indices)
so callers can preserve timing information for timestamp calculation.
Returns:
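list[PIL.Image.Image]: Sampled video frames as RGB PIL Images. When the metadata flag described above is True, the tuple (frames, video_fps, used_indices) is returned instead.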
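The default sampling strategy mentioned above (a frame count derived from the target fps, then evenly spaced indices) can be approximated in pure Python. This is a simplified sketch of the smart_nframes + linspace idea, not the qwen_vl_utils code itself: the function name, the default values, and the frame_factor rounding are assumptions chosen for illustration.

```python
import math


def sample_frame_indices_sketch(
    total_frames: int,
    video_fps: float,
    target_fps: float = 2.0,   # assumed default, for illustration only
    min_frames: int = 4,       # assumed default
    max_frames: int = 768,     # assumed default
    frame_factor: int = 2,     # assumed: frame count is kept a multiple of this
) -> list[int]:
    """Pick evenly spaced 0-based frame indices for a video."""
    # Frame count implied by the video duration at the target fps.
    wanted = total_frames / video_fps * target_fps

    # Snap the bounds to multiples of frame_factor, never exceeding the video length.
    lower = math.ceil(min_frames / frame_factor) * frame_factor
    upper = math.floor(min(max_frames, total_frames) / frame_factor) * frame_factor

    # Clamp, then round down to a multiple of frame_factor.
    nframes = min(max(wanted, lower), upper)
    nframes = math.floor(nframes / frame_factor) * frame_factor
    if nframes < frame_factor:
        raise ValueError(f"video too short to sample: {total_frames} frame(s)")

    # Evenly spaced positions from the first to the last frame (linspace-style).
    step = (total_frames - 1) / (nframes - 1)
    return [round(i * step) for i in range(nframes)]
```

For a 100-frame clip recorded at 25 fps and sampled at 2 fps, this yields 8 indices running from frame 0 to frame 99. Reusing one deterministic rule in both the preload step and the processor is what keeps the two sets of frames identical.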
Read an image from an LMDB database.
Paths use the format <lmdb_dir>::<key>, e.g.
/data/my_db.lmdb::0000000087.
Returns:
PIL.Image.Image: The decoded image.
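The path format above can be handled as in the following sketch. The helper names are hypothetical. Two details are assumptions: that the key is stored as UTF-8 bytes, and that the stored value holds encoded image bytes (for example JPEG or PNG). The lmdb and Pillow imports are kept inside the reader so that the path parsing works without them.

```python
def split_lmdb_path(path: str) -> tuple[str, bytes]:
    """Split '<lmdb_dir>::<key>' into the database directory and the byte key."""
    # Split on the last '::' so the directory part is kept whole.
    lmdb_dir, separator, key = path.rpartition("::")
    if not separator or not lmdb_dir or not key:
        raise ValueError(f"expected '<lmdb_dir>::<key>', got {path!r}")
    return lmdb_dir, key.encode("utf-8")  # assumption: keys are UTF-8 encoded


def read_lmdb_image_sketch(path: str):
    """Load one image from an LMDB database (illustrative only)."""
    import io

    import lmdb
    from PIL import Image

    lmdb_dir, key = split_lmdb_path(path)
    env = lmdb.open(lmdb_dir, readonly=True, lock=False)
    try:
        with env.begin() as txn:
            data = txn.get(key)
    finally:
        env.close()
    if data is None:
        raise KeyError(f"key {key!r} not found in {lmdb_dir}")
    # Assumption: the value is an encoded image that Pillow can decode.
    return Image.open(io.BytesIO(data)).convert("RGB")
```

With the example path from above, split_lmdb_path("/data/my_db.lmdb::0000000087") returns the directory "/data/my_db.lmdb" and the key b"0000000087".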
Return default generation stop tokens for a processor tokenizer.
Convert an ordered JSON object into a token sequence.
From NeMo’s automodel_datasets.py
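One common way to flatten a JSON object into a token sequence is to wrap each value in tags named after its key. The sketch below assumes that tag scheme (<s_key>...</s_key>, with <sep/> between list items). The function name and the exact tag format are assumptions and may differ from the NeMo implementation.

```python
from typing import Any


def json_to_token_sequence_sketch(obj: Any) -> str:
    """Recursively flatten a JSON-like object into a tagged string."""
    if isinstance(obj, dict):
        # Keep the dictionary's own key order; each value is wrapped in key-named tags.
        return "".join(
            f"<s_{key}>{json_to_token_sequence_sketch(value)}</s_{key}>"
            for key, value in obj.items()
        )
    if isinstance(obj, list):
        # List items are joined by a separator token.
        return "<sep/>".join(json_to_token_sequence_sketch(item) for item in obj)
    # Leaves (strings, numbers, booleans) are emitted as plain text.
    return str(obj)


print(json_to_token_sequence_sketch({"menu": [{"name": "tea"}, {"name": "cake"}]}))
# <s_menu><s_name>tea</s_name><sep/><s_name>cake</s_name></s_menu>
```

Because the key order determines the token order, the same object always produces the same target sequence, which is why the input is described as an ordered JSON object.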
Process a batch of texts and optionally images.
Parameters:
The processor to use for tokenization and image processing
List of text strings to process
Optional list of images to process
Returns: dict[str, torch.Tensor]
Dict containing processed batch data