> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.datasets.vlm.utils

## Module Contents

### Functions

| Name                                                                                           | Description                                                                     |
| ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| [`_build_video_metadata`](#nemo_automodel-components-datasets-vlm-utils-_build_video_metadata) | Build a list of `VideoMetadata` from preserved `_video_fps` / `_frame_indices`. |
| [`_preload_media`](#nemo_automodel-components-datasets-vlm-utils-_preload_media)               | Pre-load image and video files in a conversation example.                       |
| [`_read_video_frames`](#nemo_automodel-components-datasets-vlm-utils-_read_video_frames)       | Read and sample video frames from a video file using decord.                    |
| [`_resolve_lmdb_image`](#nemo_automodel-components-datasets-vlm-utils-_resolve_lmdb_image)     | Read an image from an LMDB database.                                            |
| [`default_stop_tokens`](#nemo_automodel-components-datasets-vlm-utils-default_stop_tokens)     | Return default generation stop tokens for a processor tokenizer.                |
| [`json2token`](#nemo_automodel-components-datasets-vlm-utils-json2token)                       | Convert an ordered JSON object into a token sequence.                           |
| [`process_text_batch`](#nemo_automodel-components-datasets-vlm-utils-process_text_batch)       | Process a batch of texts and optionally images.                                 |

### Data

[`HAVE_LMDB`](#nemo_automodel-components-datasets-vlm-utils-HAVE_LMDB)

[`_lmdb_env_cache`](#nemo_automodel-components-datasets-vlm-utils-_lmdb_env_cache)

[`logger`](#nemo_automodel-components-datasets-vlm-utils-logger)

### API

```python
nemo_automodel.components.datasets.vlm.utils._build_video_metadata(
    conversation
)
```

Build a list of `VideoMetadata` from preserved `_video_fps` / `_frame_indices`.

`_preload_media(preserve_video_metadata=True)` stores these on each
video content item.  Passing the resulting metadata to the processor
ensures correct timestamps and prevents double frame-sampling.

Returns an empty list if no video metadata is found.

```python
nemo_automodel.components.datasets.vlm.utils._preload_media(
    example,
    processor = None,
    preserve_video_metadata = False
)
```

Pre-load image and video files in a conversation example.

Images are loaded as PIL RGB Images.
Videos are decoded into lists of PIL RGB Images (sampled frames).

When *preserve\_video\_metadata* is `True`, the original video fps and
the sampled frame indices are stored on each video content item as
`_video_fps` and `_frame_indices`.  This allows downstream code
(e.g. :func:`_build_video_metadata`) to construct `VideoMetadata`
for the processor so it inserts correct timestamps.

```python
nemo_automodel.components.datasets.vlm.utils._read_video_frames(
    video_path,
    processor = None,
    frame_indices = None,
    return_metadata = False
)
```

Read and sample video frames from a video file using decord.

If *frame\_indices* is provided (e.g. from a dataset annotation), those
exact frame numbers are used.  Otherwise, frame sampling uses the same
`smart_nframes` + `linspace` strategy as `qwen_vl_utils` to ensure
that preloaded frames are identical to those produced by the processor's
own video pipeline.

**Parameters:**

Path to the video file.

HuggingFace processor whose `video_processor` supplies
default fps / max\_frames / min\_frames.

Explicit list of 0-based frame indices to extract.

If True, return `(frames, video_fps, used_indices)`
so callers can preserve timing information for timestamp calculation.

**Returns:**

list\[PIL.Image.Image]: Sampled video frames as RGB PIL Images.

```python
nemo_automodel.components.datasets.vlm.utils._resolve_lmdb_image(
    path
)
```

Read an image from an LMDB database.

Paths use the format `&lt;lmdb_dir&gt;::&lt;key&gt;`, e.g.
`/data/my_db.lmdb::0000000087`.

**Returns:**

PIL.Image.Image: The decoded image.

```python
nemo_automodel.components.datasets.vlm.utils.default_stop_tokens(
    processor
) -> typing.Iterable[str]
```

Return default generation stop tokens for a processor tokenizer.

```python
nemo_automodel.components.datasets.vlm.utils.json2token(
    obj,
    sort_json_key: bool = True
)
```

Convert an ordered JSON object into a token sequence.

From NeMo's automodel\_datasets.py

```python
nemo_automodel.components.datasets.vlm.utils.process_text_batch(
    processor,
    texts: list[str],
    images: list | None = None
) -> dict[str, torch.Tensor]
```

Process a batch of texts and optionally images.

**Parameters:**

The processor to use for tokenization and image processing

List of text strings to process

Optional list of images to process

**Returns:** `dict[str, torch.Tensor]`

Dict containing processed batch data

```python
nemo_automodel.components.datasets.vlm.utils.HAVE_LMDB = True
```

```python
nemo_automodel.components.datasets.vlm.utils._lmdb_env_cache: dict[str, Environment] = {}
```

```python
nemo_automodel.components.datasets.vlm.utils.logger = logging.getLogger(__name__)
```