nemo_automodel.components.datasets.vlm.mock#

Mock VLM conversation dataset for benchmarking and testing.

Generates synthetic image(s) and minimal conversations in the standard Automodel conversation format, compatible with PreTokenizedDatasetWrapper and any HF AutoProcessor that supports the conversation schema.

The images are random-noise PIL images, so no real data download is needed. The processor and vision encoder still handle them through the normal pipeline, so this exercises the full VLM training path end-to-end.

When used with pretokenize: true, truncate: true, and max_length in the dataset config, PreTokenizedDatasetWrapper tokenizes each sample and truncates to exactly max_length tokens. The mock response is sized from max_length so that truncation always produces a full-length sequence.
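As a sketch, a dataset config exercising this path might look like the following. The `pretokenize`, `truncate`, and `max_length` keys come from the description above; the `_target_` key and surrounding layout are an assumption about the config style and may differ in practice:

```yaml
dataset:
  _target_: nemo_automodel.components.datasets.vlm.mock.build_mock_vlm_dataset
  num_samples: 10
  max_length: 512        # every sample is truncated to exactly this many tokens
  pretokenize: true
  truncate: true
```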

Module Contents#

Functions#

_make_random_image

Create a random-noise RGB PIL image.

_generate_response

Generate a dummy response of num_words words from a fixed pool.

build_mock_vlm_dataset

Build a mock VLM dataset in Automodel conversation format.

Data#

API#

nemo_automodel.components.datasets.vlm.mock._WORD_POOL#

‘split(…)’

nemo_automodel.components.datasets.vlm.mock._make_random_image(
rng: numpy.random.Generator,
size: Tuple[int, int] = (256, 256),
) → PIL.Image.Image#

Create a random-noise RGB PIL image.
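A minimal sketch of how such a helper could be implemented, assuming uniform uint8 noise (the actual module's implementation may differ):

```python
import numpy as np
from PIL import Image


def make_random_image(rng: np.random.Generator,
                      size: tuple[int, int] = (256, 256)) -> Image.Image:
    """Create a random-noise RGB PIL image of the given (width, height)."""
    width, height = size
    # PIL expects an (height, width, channels) uint8 array for RGB images.
    pixels = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
    return Image.fromarray(pixels, mode="RGB")
```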

nemo_automodel.components.datasets.vlm.mock._generate_response(rng: numpy.random.Generator, num_words: int) → str#

Generate a dummy response of num_words words from a fixed pool.
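A minimal sketch of this helper, assuming words are drawn uniformly from the pool and joined with spaces (the real `_WORD_POOL` contents are not shown on this page, so the pool below is a stand-in):

```python
import numpy as np

# Stand-in for the module's _WORD_POOL; the real pool differs.
_WORD_POOL = "the quick brown fox jumps over a lazy dog near blue hills".split()


def generate_response(rng: np.random.Generator, num_words: int) -> str:
    """Generate a dummy response of num_words words from a fixed pool."""
    words = rng.choice(_WORD_POOL, size=num_words)
    return " ".join(words)
```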

nemo_automodel.components.datasets.vlm.mock.build_mock_vlm_dataset(
*,
num_samples: int = 10,
num_images_per_sample: int = 1,
image_size: Tuple[int, int] = (256, 256),
prompt: str = 'Describe this image.',
responses: Optional[List[str]] = None,
max_length: Optional[int] = None,
seed: int = 0,
**kwargs,
) → list#

Build a mock VLM dataset in Automodel conversation format.

Each sample is a dict with a "conversation" key whose value is a list of user/assistant message dicts. User messages contain one or more {"type": "image", "image": <PIL.Image>} items followed by a text prompt. Assistant messages contain a single text response.

This is the same format produced by make_rdr_dataset, make_unimm_chat_dataset, and make_meta_dataset, so the returned list can be fed directly to PreTokenizedDatasetWrapper.

When max_length is set and responses is None, each sample's assistant response is generated with max_length words. Since every word tokenizes to at least one token, the tokenized sequence always exceeds max_length tokens, so PreTokenizedDatasetWrapper with truncate=True produces exactly max_length tokens per sample.

Parameters:
  • num_samples – Number of conversation examples to generate.

  • num_images_per_sample – Number of random images per user turn.

  • image_size – (width, height) of each generated image.

  • prompt – Text prompt appended after the image(s) in the user turn.

  • responses – Optional list of assistant responses. Cycled over samples.

  • max_length – Target sequence length. When set (and responses is None), generates a response of max_length words per sample so the tokenized sequence always exceeds max_length tokens.

  • seed – Random seed for reproducibility.

Returns:

A list of dicts, each with a single "conversation" key.
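The returned structure can be sketched with a self-contained stand-in that builds one sample per the format described above. The message/content field names follow the docstring's description of the conversation schema; the real builder additionally handles response cycling, multiple images per turn, and max_length-sized responses:

```python
import numpy as np
from PIL import Image


def build_mock_vlm_dataset(num_samples: int = 2,
                           image_size: tuple[int, int] = (32, 32),
                           prompt: str = "Describe this image.",
                           seed: int = 0) -> list:
    """Simplified stand-in: one image per user turn, fixed assistant reply."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(num_samples):
        # Random-noise RGB image; no data download needed.
        noise = rng.integers(0, 256,
                             size=(image_size[1], image_size[0], 3),
                             dtype=np.uint8)
        image = Image.fromarray(noise, mode="RGB")
        conversation = [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": "a mock response"},
            ]},
        ]
        samples.append({"conversation": conversation})
    return samples
```

Each element is a plain dict with a single "conversation" key, so the list can be indexed and iterated like any in-memory dataset.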