nemo_automodel.components.datasets.vlm.mock
nemo_automodel.components.datasets.vlm.mock
Mock VLM conversation dataset for benchmarking and testing.
Generates synthetic image(s) and minimal conversations in the standard
Automodel conversation format, compatible with PreTokenizedDatasetWrapper
and any HF AutoProcessor that supports the conversation schema.
The images are random-noise PIL images — no real data download is needed. The processor / vision encoder processes them through the normal pipeline, so this exercises the full VLM training path end-to-end.
When used with pretokenize: true, truncate: true, and max_length
in the dataset config, PreTokenizedDatasetWrapper tokenizes each sample
and truncates to exactly max_length tokens. The mock response is
sized from max_length so that truncation always produces a full-length
sequence.
Module Contents
Functions
Data
API
Generate a dummy response of num_words words from a fixed pool.
Create a random-noise RGB PIL image.
Build a mock VLM dataset in Automodel conversation format.
Each sample is a dict with a "conversation" key whose value is a list
of user/assistant message dicts. User messages contain one or more
{"type": "image", "image": <PIL.Image>} items followed by a text prompt.
Assistant messages contain a single text response.
This is the same format produced by make_rdr_dataset,
make_unimm_chat_dataset, and make_meta_dataset, so the returned
list can be fed directly to PreTokenizedDatasetWrapper.
When max_length is set and responses is None, each sample’s
assistant response is generated with max_length words — guaranteed
to exceed max_length tokens so that PreTokenizedDatasetWrapper
with truncate=True produces exactly max_length tokens per sample.
Parameters:
Number of conversation examples to generate.
Number of random images per user turn.
(width, height) of each generated image.
Text prompt appended after the image(s) in the user turn.
Optional list of assistant responses. Cycled over samples.
Target sequence length. When set (and responses is
None), generates a response of max_length words per sample
so the tokenized sequence always exceeds max_length tokens.
Random seed for reproducibility.
Returns: list
A list of dicts, each with a single "conversation" key.