nemo_automodel.components.datasets.vlm.utils#
Module Contents#
Functions#
- Read an image from an LMDB database.
- Read and sample video frames from a video file using decord.
- Pre-load image and video files in a conversation example.
- Build a list of `VideoMetadata` from preserved video fps and frame indices.
- Convert an ordered JSON object into a token sequence.
- Process a batch of texts and optionally images.
Data#
API#
- nemo_automodel.components.datasets.vlm.utils.logger#
`getLogger(...)`
- nemo_automodel.components.datasets.vlm.utils._lmdb_env_cache: dict[str, lmdb.Environment]#
None
- nemo_automodel.components.datasets.vlm.utils._resolve_lmdb_image(path)#
Read an image from an LMDB database.
Paths use the format `<lmdb_dir>::<key>`, e.g. `/data/my_db.lmdb::0000000087`.
- Returns:
The decoded image.
- Return type:
PIL.Image.Image
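A minimal sketch of this lookup, assuming the standard `lmdb` and Pillow APIs; the helper names (`split_lmdb_path`, `_ENV_CACHE`) are illustrative, not the module's internals:

```python
import io

_ENV_CACHE: dict = {}  # reuse opened environments, as the module-level _lmdb_env_cache does


def split_lmdb_path(path: str) -> tuple[str, str]:
    """Split '<lmdb_dir>::<key>' into its two components."""
    lmdb_dir, sep, key = path.partition("::")
    if not sep:
        raise ValueError(f"expected '<lmdb_dir>::<key>', got {path!r}")
    return lmdb_dir, key


def resolve_lmdb_image(path: str):
    """Fetch the bytes stored under the key and decode them as an RGB image."""
    # Lazy imports so the path helper above works without lmdb/Pillow installed.
    import lmdb
    from PIL import Image

    lmdb_dir, key = split_lmdb_path(path)
    env = _ENV_CACHE.get(lmdb_dir)
    if env is None:
        env = _ENV_CACHE[lmdb_dir] = lmdb.open(
            lmdb_dir, readonly=True, lock=False, readahead=False
        )
    with env.begin(write=False) as txn:
        raw = txn.get(key.encode("utf-8"))
    if raw is None:
        raise KeyError(f"key {key!r} not found in {lmdb_dir}")
    return Image.open(io.BytesIO(raw)).convert("RGB")
```

Opening an LMDB environment is relatively expensive, which is why a cache keyed by directory is worthwhile when many images come from the same database.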
- nemo_automodel.components.datasets.vlm.utils._read_video_frames(
- video_path,
- processor=None,
- frame_indices=None,
- return_metadata=False,
Read and sample video frames from a video file using decord.
If frame_indices is provided (e.g. from a dataset annotation), those exact frame numbers are used. Otherwise, frame sampling uses the same `smart_nframes` + `linspace` strategy as `qwen_vl_utils` to ensure that preloaded frames are identical to those produced by the processor's own video pipeline.
- Parameters:
video_path – Path to the video file.
processor – HuggingFace processor whose `video_processor` supplies default fps / max_frames / min_frames.
frame_indices – Explicit list of 0-based frame indices to extract.
return_metadata – If True, return `(frames, video_fps, used_indices)` so callers can preserve timing information for timestamp calculation.
- Returns:
Sampled video frames as RGB PIL Images. If return_metadata is True, returns `(frames, video_fps, used_indices)` instead.
- Return type:
list[PIL.Image.Image]
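The linspace-style index selection can be sketched as below; this is an illustrative reimplementation of the sampling idea only, not the module's actual code (which delegates to `qwen_vl_utils` and decord):

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` evenly spaced 0-based frame indices over `total_frames`,
    the same idea as numpy.linspace(0, total_frames - 1, num_frames) rounded to ints."""
    if num_frames <= 0 or total_frames <= 0:
        return []
    if num_frames == 1:
        return [0]
    if num_frames >= total_frames:
        # Fewer frames available than requested: take every frame.
        return list(range(total_frames))
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]
```

For example, sampling 4 frames from a 100-frame clip yields indices `[0, 33, 66, 99]`, so the first and last frames are always included.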
- nemo_automodel.components.datasets.vlm.utils._preload_media(example, processor=None, preserve_video_metadata=False)#
Pre-load image and video files in a conversation example.
Images are loaded as PIL RGB Images. Videos are decoded into lists of PIL RGB Images (sampled frames).
When preserve_video_metadata is True, the original video fps and the sampled frame indices are stored on each video content item as `_video_fps` and `_frame_indices`. This allows downstream code (e.g. :func:`_build_video_metadata`) to construct `VideoMetadata` for the processor so it inserts correct timestamps.
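In outline, the preload step walks each message's content list and swaps media paths for decoded objects. A simplified sketch of the image side, assuming the HF chat-template conversation shape; the `load_image` callback and function name are illustrative:

```python
def preload_images(conversation: list[dict], load_image) -> list[dict]:
    """Replace string image paths with loaded images, in place.

    `load_image` is any callable mapping a path to an RGB image,
    e.g. lambda p: PIL.Image.open(p).convert("RGB").
    """
    for message in conversation:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content has no media items
        for item in content:
            if item.get("type") == "image" and isinstance(item.get("image"), str):
                item["image"] = load_image(item["image"])
    return conversation
```

Videos follow the same pattern, except each path is replaced by a list of sampled frames rather than a single image.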
- nemo_automodel.components.datasets.vlm.utils._build_video_metadata(conversation)#
Build a list of `VideoMetadata` from preserved `_video_fps` / `_frame_indices`.
`_preload_media(preserve_video_metadata=True)` stores these on each video content item. Passing the resulting metadata to the processor ensures correct timestamps and prevents double frame-sampling.
Returns an empty list if no video metadata is found.
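The collection step can be sketched as follows; plain dicts stand in for the processor's `VideoMetadata` objects, and the exact field names of the real class are not assumed here:

```python
def build_video_metadata(conversation: list[dict]) -> list[dict]:
    """Collect the fps / frame-index pairs stored by the preload step.

    Returns an empty list when no video content items carry metadata.
    """
    metadata = []
    for message in conversation:
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for item in content:
            if item.get("type") == "video" and "_video_fps" in item:
                metadata.append({
                    "fps": item["_video_fps"],
                    "frame_indices": item["_frame_indices"],
                })
    return metadata
```

Because the metadata records which frames were actually kept, the processor can compute timestamps from the original fps instead of re-sampling the already-sampled frames.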
- nemo_automodel.components.datasets.vlm.utils.default_stop_tokens(processor) Iterable[str]#
- nemo_automodel.components.datasets.vlm.utils.json2token(obj, sort_json_key: bool = True)#
Convert an ordered JSON object into a token sequence.
From NeMo’s automodel_datasets.py
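The flattening follows the Donut-style scheme: each dict key becomes a `<s_key>…</s_key>` tag pair and list items are joined with `<sep/>`. A sketch of that scheme, with details such as key ordering possibly differing from the module's version:

```python
def json2token(obj, sort_json_key: bool = True) -> str:
    """Serialize a nested JSON object into a flat tagged token string."""
    if isinstance(obj, dict):
        # Donut sorts keys in reverse order when sort_json_key is enabled.
        keys = sorted(obj, reverse=True) if sort_json_key else list(obj)
        return "".join(
            f"<s_{k}>{json2token(obj[k], sort_json_key)}</s_{k}>" for k in keys
        )
    if isinstance(obj, list):
        return "<sep/>".join(json2token(v, sort_json_key) for v in obj)
    return str(obj)
```

For example, `{"answer": "42"}` becomes `<s_answer>42</s_answer>`, which a seq2seq model can learn to emit and which is trivially parsed back into JSON.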
- nemo_automodel.components.datasets.vlm.utils.process_text_batch(
- processor,
- texts: list[str],
- images: list | None = None,
Process a batch of texts and optionally images.
- Parameters:
processor – The processor to use for tokenization and image processing
texts – List of text strings to process
images – Optional list of images to process
- Returns:
Dict containing processed batch data