nemo_automodel.components.datasets.vlm.mock

Mock VLM conversation dataset for benchmarking and testing.

Generates synthetic image(s) and minimal conversations in the standard Automodel conversation format, compatible with PreTokenizedDatasetWrapper and any HF AutoProcessor that supports the conversation schema.

The images are random-noise PIL images — no real data download is needed. The processor / vision encoder processes them through the normal pipeline, so this exercises the full VLM training path end-to-end.

When used with pretokenize: true, truncate: true, and max_length in the dataset config, PreTokenizedDatasetWrapper tokenizes each sample and truncates to exactly max_length tokens. The mock response is sized from max_length so that truncation always produces a full-length sequence.

Module Contents

Functions

Name	Description
`_generate_response`	Generate a dummy response of num_words words from a fixed pool.
`_make_random_image`	Create a random-noise RGB PIL image.
`build_mock_vlm_dataset`	Build a mock VLM dataset in Automodel conversation format.

Data

_WORD_POOL

API

nemo_automodel.components.datasets.vlm.mock._generate_response(
    rng: numpy.random.Generator,
    num_words: int
) -> str

Generate a dummy response of num_words words from a fixed pool.

nemo_automodel.components.datasets.vlm.mock._make_random_image(
    rng: numpy.random.Generator,
    size: typing.Tuple[int, int] = (256, 256)
) -> PIL.Image.Image

Create a random-noise RGB PIL image.

nemo_automodel.components.datasets.vlm.mock.build_mock_vlm_dataset(
    num_samples: int = 10,
    num_images_per_sample: int = 1,
    image_size: typing.Tuple[int, int] = (256, 256),
    prompt: str = 'Describe this image.',
    responses: typing.Optional[typing.List[str]] = None,
    max_length: typing.Optional[int] = None,
    seed: int = 0,
    kwargs = {}
) -> list

Build a mock VLM dataset in Automodel conversation format.

Each sample is a dict with a "conversation" key whose value is a list of user/assistant message dicts. User messages contain one or more {"type": "image", "image": <PIL.Image>} items followed by a text prompt. Assistant messages contain a single text response.

This is the same format produced by make_rdr_dataset, make_unimm_chat_dataset, and make_meta_dataset, so the returned list can be fed directly to PreTokenizedDatasetWrapper.

When max_length is set and responses is None, each sample’s assistant response is generated with max_length words — guaranteed to exceed max_length tokens so that PreTokenizedDatasetWrapper with truncate=True produces exactly max_length tokens per sample.

Parameters:

num_samples

intDefaults to 10

Number of conversation examples to generate.

num_images_per_sample

intDefaults to 1

Number of random images per user turn.

image_size

Tuple[int, int]Defaults to (256, 256)

(width, height) of each generated image.

prompt

strDefaults to 'Describe this image.'

Text prompt appended after the image(s) in the user turn.

responses

Optional[List[str]]Defaults to None

Optional list of assistant responses. Cycled over samples.

max_length

Optional[int]Defaults to None

Target sequence length. When set (and responses is None), generates a response of max_length words per sample so the tokenized sequence always exceeds max_length tokens.

seed

intDefaults to 0

Random seed for reproducibility.

Returns: list

A list of dicts, each with a single "conversation" key.

nemo_automodel.components.datasets.vlm.mock._WORD_POOL = 'the image shows a landscape with mountains and rivers flowing through green val...