nemo_automodel.components.datasets.vlm.datasets#

Module Contents#

Classes#

_ExamplesWithStats

list subclass that carries pre-computed dataset statistics.

PreTokenizedDatasetWrapper

Dataset wrapper that tokenizes samples in __getitem__.

RobustDatasetWrapper

Wrapper that catches __getitem__ and collate errors, substituting random replacement samples.

Functions#

make_rdr_dataset

Load and preprocess the RDR dataset for image-to-text fine-tuning.

make_cord_v2_dataset

Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.

make_medpix_dataset

Load and preprocess the MedPix dataset for image-to-text fine-tuning.

make_llava_onevision_dataset

Load and preprocess the LLaVA-Instruct-150K dataset for LLaVA-OneVision-1.5.

make_tulu3_magicoder_text_mix_dataset

Build a text-only 80/20 mix of Tulu-3-SFT-mixture and Magicoder-OSS-Instruct-75K.

make_cv17_dataset

Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.

_decode_audio_cell_to_mono_float32

Decode a HuggingFace Audio(decode=False) cell to a 1-D float32 waveform.

_build_asr_conversation

Assemble the Qwen3-Omni ASR chat-template conversation for one sample.

make_hf_audio_asr_dataset

Lazy HuggingFace audio→text dataset builder for Qwen3-Omni ASR fine-tuning.

make_unimm_chat_dataset

Load and preprocess the UniMM-Chat dataset for image-to-text fine-tuning.

_convert_sharegpt_to_conversation

Convert a single sharegpt-format example to Automodel conversation format.

_load_json_or_jsonl

Load data from a JSON or JSONL file.

_load_jsonl_for_rank

Load only the JSONL lines needed for this rank, avoiding full json.loads on skipped lines.

_collect_sample_stats

Count images, videos, text-only samples and estimate token counts.

_log_dataset_loading_summary

Print a visual summary of per-dataset loading times and data statistics.

make_meta_dataset

Load datasets defined in a meta JSON file and convert to Automodel conversation format.

Data#

API#

nemo_automodel.components.datasets.vlm.datasets.logger#

‘getLogger(…)’

nemo_automodel.components.datasets.vlm.datasets.make_rdr_dataset(
path_or_dataset='quintend/rdr-items',
split='train',
**kwargs,
)#

Load and preprocess the RDR dataset for image-to-text fine-tuning.

Parameters:
  • path_or_dataset (str) – Path or identifier for the RDR dataset.

  • split (str) – Dataset split to load.

  • **kwargs – Additional arguments.

Returns:

The processed dataset.

Return type:

Dataset

nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset(
path_or_dataset='naver-clova-ix/cord-v2',
split='train',
**kwargs,
)#

Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.

nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset(
path_or_dataset='medpix-dataset/medpix-dataset',
split='train',
**kwargs,
)#

Load and preprocess the MedPix dataset for image-to-text fine-tuning.

nemo_automodel.components.datasets.vlm.datasets.make_llava_onevision_dataset(
path_or_dataset='liuhaotian/LLaVA-Instruct-150K',
split='train',
**kwargs,
)#

Load and preprocess the LLaVA-Instruct-150K dataset for LLaVA-OneVision-1.5.

This function loads conversation-format data with images and returns it in the standard NeMo VLM format expected by the collate function.

Parameters:
  • path_or_dataset – Path to the dataset on HuggingFace Hub or local path.

  • split – Dataset split to load (e.g., “train”, “train[:1000]”).

  • **kwargs – Additional arguments passed to load_dataset.

Returns:

List of dicts with “conversation” and “image” keys.

nemo_automodel.components.datasets.vlm.datasets.make_tulu3_magicoder_text_mix_dataset(
tulu_split: str = 'train',
magicoder_split: str = 'train',
seed: int = 42,
max_turns: int = 16,
limit_total: int | None = None,
**kwargs,
) -> list#

Build a text-only 80/20 mix of Tulu-3-SFT-mixture and Magicoder-OSS-Instruct-75K.

Both datasets are converted into the NeMo VLM {"conversation": [...]} shape consumed by nemo_automodel.components.datasets.vlm.collate_fns.default_collate_fn. Because default_collate_fn is image-aware only when a conversation turn contains a {"type": "image", ...} entry, returning text-only conversations here yields batches with no pixel_values or other vision tensors, which is what the Gemma 4 base+drafter composite expects for text-only training.

Sources:

  • allenai/tulu-3-sft-mixture (multi-turn, messages field of {"role", "content"} dicts).

  • ise-uiuc/Magicoder-OSS-Instruct-75K (2-turn, problem and solution fields).

Mixing uses datasets.interleave_datasets with probabilities [0.8, 0.2] and stopping_strategy="all_exhausted" so both datasets are sampled until every example has been drawn at least once.

Parameters:
  • tulu_split – HF split expression for the Tulu-3 source (e.g. "train" or "train[:50000]").

  • magicoder_split – HF split expression for the Magicoder source.

  • seed – Seed forwarded to interleave_datasets for reproducibility.

  • max_turns – Drop Tulu-3 conversations with more than this many turns to keep memory bounded. Magicoder samples are always 2 turns.

  • limit_total – If set, cap the merged dataset to this many rows.

  • **kwargs – Additional arguments forwarded to load_dataset for both sources.

Returns:

List of {"conversation": [...]} dicts. Each conversation is a list of {"role": "user"|"assistant"|"system", "content": [{"type": "text", "text": ...}]} turns with no image field anywhere in the structure.
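The conversion into the text-only conversation shape described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the helper names (_text_turn, tulu_to_conversation, magicoder_to_conversation) are hypothetical, and returning None for conversations that exceed max_turns is an assumption about how such rows might be dropped.

```python
from typing import Any, Optional


def _text_turn(role: str, text: str) -> dict[str, Any]:
    """Build one turn whose content holds a single text item (no image field)."""
    return {"role": role, "content": [{"type": "text", "text": text}]}


def tulu_to_conversation(row: dict[str, Any], max_turns: int = 16) -> Optional[dict[str, Any]]:
    """Map a Tulu-3 style row ({"messages": [{"role", "content"}, ...]}).

    Returns None when the row has more than max_turns turns (assumed drop rule).
    """
    messages = row["messages"]
    if len(messages) > max_turns:
        return None
    return {"conversation": [_text_turn(m["role"], m["content"]) for m in messages]}


def magicoder_to_conversation(row: dict[str, Any]) -> dict[str, Any]:
    """Map a Magicoder style row (problem/solution fields) to a 2-turn conversation."""
    return {
        "conversation": [
            _text_turn("user", row["problem"]),
            _text_turn("assistant", row["solution"]),
        ]
    }
```

Because every content item has type "text", a collate function that looks for image entries would find none in these samples.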

nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset(
path_or_dataset='ysdede/commonvoice_17_tr_fixed',
split='train',
**kwargs,
)#

Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.

nemo_automodel.components.datasets.vlm.datasets._decode_audio_cell_to_mono_float32(audio_cell, target_sampling_rate)#

Decode a HuggingFace Audio(decode=False) cell to a 1-D float32 waveform.

Avoids torchcodec by using soundfile for both byte and path branches, matching the pattern in result/decode_vllm.py.

Parameters:
  • audio_cell – Dict with bytes and/or path keys, as returned by HuggingFace datasets when the column has Audio(decode=False).

  • target_sampling_rate – Desired output sampling rate (Hz). If the source differs, the waveform is resampled via scipy.signal.resample_poly.

Returns:

Tuple of (waveform_float32_mono, target_sampling_rate).

Raises:

ValueError – If both bytes and path are missing.
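As a rough illustration of the mono mix and resampling step, the sketch below uses only the standard library. The helper names are hypothetical, and both reducing the rate ratio with gcd and averaging channels for the mono mix are assumptions; the source only states that scipy.signal.resample_poly is used when the rates differ.

```python
from math import gcd


def resample_factors(source_rate: int, target_rate: int) -> tuple[int, int]:
    """Return (up, down) integer factors for polyphase resampling.

    These are the values one would pass as scipy.signal.resample_poly(x, up, down).
    """
    if source_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    g = gcd(source_rate, target_rate)
    return target_rate // g, source_rate // g


def mix_to_mono(frames: list[list[float]]) -> list[float]:
    """Average the channels of each frame to get a 1-D mono signal."""
    return [sum(frame) / len(frame) for frame in frames]
```

For a 44.1 kHz source and a 16 kHz target, the factors reduce to (160, 441).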

nemo_automodel.components.datasets.vlm.datasets._build_asr_conversation(
waveform,
transcript,
*,
system_prompt,
user_prompt,
has_system,
has_user_text,
)#

Assemble the Qwen3-Omni ASR chat-template conversation for one sample.

nemo_automodel.components.datasets.vlm.datasets.make_hf_audio_asr_dataset(
path_or_dataset,
split='train',
name=None,
sampling_rate=16000,
system_prompt=None,
user_prompt=None,
audio_column='audio',
text_column='text',
drop_empty_text=True,
min_audio_duration_seconds=None,
**load_kwargs,
)#

Lazy HuggingFace audio→text dataset builder for Qwen3-Omni ASR fine-tuning.

Loads any HuggingFace ASR dataset that exposes an audio column (Audio feature with bytes and/or path populated after cast_column(decode=False)) and a transcript column, and yields the Qwen3-Omni chat-template conversation expected by qwen3_omni_asr_collate_fn. No audio is decoded at construction time: both the soundfile decode (mono mix + float32 cast + optional scipy.signal.resample_poly) and the conversation assembly run inside a HuggingFace with_transform callback, so the only fixed startup cost is the Arrow-level metadata read of the parquet shards (and the on-demand download of those shards if they are not already in the HF cache). Empty-transcript filtering happens via dataset.filter against the text column only, which is also Arrow-level, so audio bytes are never materialized at startup.

Defaults are tuned for the common case (audio / text columns, 16 kHz, no system turn). Datasets that diverge can override per-field via YAML; see docs/guides/audio/qwen3-omni-asr.md for an override table.

The conversation shape follows the prompt-presence matrix:

  • both system_prompt and user_prompt set → system user(text+audio) assistant

  • only system_prompt set → system user(audio) assistant

  • only user_prompt set → user(text+audio) assistant (no system turn)

  • neither set (the default) → user(audio) assistant

Whitespace-only prompts are treated as absent.
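The prompt-presence matrix above can be sketched as a small function. This is an illustrative approximation: the function name is hypothetical, and the exact content item keys ({"type": "audio", "audio": ...} and {"type": "text", "text": ...}) are assumed rather than taken from the source.

```python
def _present(prompt):
    """A prompt counts as set only if it has non-whitespace characters."""
    return prompt is not None and prompt.strip() != ""


def build_asr_conversation_sketch(waveform, transcript, system_prompt=None, user_prompt=None):
    """Assemble system / user / assistant turns following the prompt-presence matrix."""
    conversation = []
    if _present(system_prompt):
        conversation.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        )

    # The user turn always carries the audio; the text prompt, if any, comes first.
    user_content = []
    if _present(user_prompt):
        user_content.append({"type": "text", "text": user_prompt})
    user_content.append({"type": "audio", "audio": waveform})
    conversation.append({"role": "user", "content": user_content})

    # The transcript is the training target in the assistant turn.
    conversation.append(
        {"role": "assistant", "content": [{"type": "text", "text": transcript}]}
    )
    return conversation
```

With both prompts left at None, the result has only a user turn containing the audio item and an assistant turn containing the transcript.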

Parameters:
  • path_or_dataset – HuggingFace dataset id or local path.

  • split – Dataset split to load (e.g. "train", "train[:5000]").

  • name – Optional dataset configuration / subset. Forwarded to datasets.load_dataset(path, name=name, ...). Required by some datasets (e.g. edinburghcstr/ami needs "ihm" or "sdm"; CommonVoice needs the language code).

  • sampling_rate – Target sampling rate in Hz. Audio is resampled inside the lazy transform if the source rate differs.

  • system_prompt – Instruction placed in a system turn. Default None skips the system turn entirely; pass a string to emit one.

  • user_prompt – Instruction prepended to the audio inside the user turn. Pass None to emit a user turn with only the audio item.

  • audio_column – Name of the audio column in the source dataset (default "audio" — works for AMI / LibriSpeech / GigaSpeech / WenetSpeech / CommonVoice).

  • text_column – Name of the transcript column (default "text" — works for AMI / LibriSpeech / GigaSpeech / WenetSpeech; override to "sentence" for CommonVoice).

  • drop_empty_text – If True, samples whose transcript is empty or whitespace are dropped via dataset.filter (Arrow-level, no audio decode). If False, an empty transcript triggers a ValueError inside the transform at access time.

  • min_audio_duration_seconds – Optional minimum audio duration. Samples shorter than this threshold are dropped via dataset.filter using soundfile.info (header-only read, no full decode). The HF Qwen3-Omni Whisper feature extractor has a known off-by-one between input_features and feature_attention_mask for sub-second clips (~0.27 s manifests as a 27-vs-26 frame mismatch); set this to 1.0 for AMI / CommonVoice-style corpora that contain very short utterances.

  • **load_kwargs – Forwarded to datasets.load_dataset (e.g. trust_remote_code=True).

Returns:

A HuggingFace Dataset whose elements are {"conversation": <chat-template list>} and whose audio is decoded on demand via dataloader workers.

Raises:

ValueError – When audio_column or text_column is missing, when an audio cell has neither bytes nor path, or when drop_empty_text=False and a transcript is empty.

nemo_automodel.components.datasets.vlm.datasets.make_unimm_chat_dataset(
path_or_dataset='Yirany/UniMM-Chat',
split='train',
**kwargs,
)#

Load and preprocess the UniMM-Chat dataset for image-to-text fine-tuning.

nemo_automodel.components.datasets.vlm.datasets._convert_sharegpt_to_conversation(
example,
columns=None,
tags=None,
media_dir=None,
)#

Convert a single sharegpt-format example to Automodel conversation format.

Parameters:
  • example (dict) – A single data example in sharegpt format.

  • columns (dict) – Column name mapping with keys ‘messages’, ‘images’, ‘videos’.

  • tags (dict) – Tag mapping with keys ‘role_tag’, ‘content_tag’, ‘user_tag’, ‘assistant_tag’.

  • media_dir (str | None) – Directory prefix for resolving relative media paths.

Returns:

Example in Automodel conversation format.

Return type:

dict
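A simplified sketch of such a conversion is shown below. Several details are assumptions made for illustration only: the default column and tag values (conversations, from, value, human, gpt), the shape of image content items, and the choice to attach all images to the first user turn. The actual function may place media differently and also handles videos, which this sketch omits.

```python
import os

# Assumed defaults, following common sharegpt conventions; not confirmed by the source.
DEFAULT_COLUMNS = {"messages": "conversations", "images": "images"}
DEFAULT_TAGS = {
    "role_tag": "from",
    "content_tag": "value",
    "user_tag": "human",
    "assistant_tag": "gpt",
}


def convert_sharegpt_sketch(example, columns=None, tags=None, media_dir=None):
    """Convert one sharegpt-style example into {"conversation": [...]}."""
    columns = {**DEFAULT_COLUMNS, **(columns or {})}
    tags = {**DEFAULT_TAGS, **(tags or {})}
    role_map = {tags["user_tag"]: "user", tags["assistant_tag"]: "assistant"}

    # Resolve relative media paths against media_dir when one is given.
    images = list(example.get(columns["images"]) or [])
    if media_dir is not None:
        images = [p if os.path.isabs(p) else os.path.join(media_dir, p) for p in images]

    conversation = []
    for message in example[columns["messages"]]:
        role = role_map[message[tags["role_tag"]]]
        content = []
        if role == "user" and images:
            # Illustrative placement: all images go in front of the first user text.
            content.extend({"type": "image", "image": p} for p in images)
            images = []
        content.append({"type": "text", "text": message[tags["content_tag"]]})
        conversation.append({"role": role, "content": content})
    return {"conversation": conversation}
```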

nemo_automodel.components.datasets.vlm.datasets._load_json_or_jsonl(file_path)#

Load data from a JSON or JSONL file.

Parameters:

file_path (str) – Path to the JSON or JSONL file.

Returns:

List of data examples.

Return type:

list[dict]
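A minimal sketch of a loader with this contract follows. Choosing the parser by the .jsonl file extension and wrapping a single top-level JSON object in a list are assumptions; the function name is hypothetical.

```python
import json


def load_json_or_jsonl_sketch(file_path):
    """Return a list of dict examples from a .json or .jsonl file."""
    with open(file_path, encoding="utf-8") as f:
        if file_path.endswith(".jsonl"):
            # One JSON document per line; blank lines are skipped.
            return [json.loads(line) for line in f if line.strip()]
        data = json.load(f)
    # A plain JSON file is expected to hold a list of examples.
    return data if isinstance(data, list) else [data]
```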

nemo_automodel.components.datasets.vlm.datasets._load_jsonl_for_rank(file_path, sample_ratio, rank, world_size)#

Load only the JSONL lines needed for this rank, avoiding full json.loads on skipped lines.

Handles sample_ratio and sharding so that each rank only parses and stores its own subset. The semantics match the original load-all-then-slice approach:

  1. Apply sample_ratio (deterministic Random(42).sample) on the full index range.

  2. Shard the resulting list with [rank::world_size].

Returns:

(parsed examples for this rank, total line count).

Return type:

tuple[list[dict], int]
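The two-step selection can be sketched as below. This is an approximation under stated assumptions: the sample size is taken as int(total * sample_ratio), the sampled indices are sorted before sharding, and every line in the file (including blank ones) counts toward the total. The function name is hypothetical.

```python
import json
import random


def load_jsonl_for_rank_sketch(file_path, sample_ratio, rank, world_size):
    """Parse only the JSONL lines that belong to this rank."""
    # First pass: count lines without parsing any JSON.
    with open(file_path, encoding="utf-8") as f:
        total = sum(1 for _ in f)

    # Step 1: deterministic sub-sampling over the full index range.
    indices = list(range(total))
    if sample_ratio < 1.0:
        k = int(total * sample_ratio)
        indices = sorted(random.Random(42).sample(indices, k))

    # Step 2: interleaved sharding across ranks.
    wanted = set(indices[rank::world_size])

    # Second pass: json.loads runs only on the selected lines.
    examples = []
    with open(file_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i in wanted:
                examples.append(json.loads(line))
    return examples, total
```

Because every rank uses the same seed, the ranks agree on the sampled index list and their shards do not overlap.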

nemo_automodel.components.datasets.vlm.datasets._collect_sample_stats(examples)#

Count images, videos, text-only samples and estimate token counts.

Token estimation mirrors the logic in LengthGroupedSampler._estimate_tokens:

  • Text tokens: uses pre-computed _text_tokens when present (written by scripts/precompute_tokens.py), otherwise falls back to chars // 3.

  • Media tokens: uses mm_inputs_meta image/video dimensions when present (populated by the precompute script), otherwise 500 per media item.

Returns:

dict with keys n_images, n_videos, n_text_only, n_text_tokens, n_media_tokens, n_missing_text_tokens, n_missing_mm_inputs_meta. n_text_tokens + n_media_tokens gives the best available estimate of total training tokens.
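The fallback estimates described above (chars // 3 for text, 500 per media item) can be sketched as follows. The example layout is assumed: a top-level _text_tokens integer and a top-level mm_inputs_meta key. The dimension-based media estimate is not reproduced, so this sketch always applies the 500-token fallback and only records whether mm_inputs_meta is missing.

```python
def collect_sample_stats_sketch(examples):
    """Count media items and estimate tokens using the documented fallbacks."""
    keys = [
        "n_images", "n_videos", "n_text_only", "n_text_tokens",
        "n_media_tokens", "n_missing_text_tokens", "n_missing_mm_inputs_meta",
    ]
    stats = dict.fromkeys(keys, 0)

    for ex in examples:
        n_chars = n_img = n_vid = 0
        for turn in ex["conversation"]:
            for item in turn["content"]:
                if item["type"] == "text":
                    n_chars += len(item["text"])
                elif item["type"] == "image":
                    n_img += 1
                elif item["type"] == "video":
                    n_vid += 1

        stats["n_images"] += n_img
        stats["n_videos"] += n_vid
        n_media = n_img + n_vid
        if n_media == 0:
            stats["n_text_only"] += 1

        # Text tokens: prefer the pre-computed count, else chars // 3.
        if "_text_tokens" in ex:
            stats["n_text_tokens"] += ex["_text_tokens"]
        else:
            stats["n_text_tokens"] += n_chars // 3
            stats["n_missing_text_tokens"] += 1

        # Media tokens: fallback of 500 per item (dimension-based path omitted here).
        if n_media and "mm_inputs_meta" not in ex:
            stats["n_missing_mm_inputs_meta"] += 1
        stats["n_media_tokens"] += 500 * n_media

    return stats
```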

nemo_automodel.components.datasets.vlm.datasets._log_dataset_loading_summary(
timings,
wall_time,
total_samples,
rank=None,
)#

Print a visual summary of per-dataset loading times and data statistics.

class nemo_automodel.components.datasets.vlm.datasets._ExamplesWithStats#

Bases: list

list subclass that carries pre-computed dataset statistics.

Attached by make_meta_dataset so downstream code (e.g. _log_global_dataset_stats) can read aggregated stats without re-scanning all examples.

__slots__#

(‘stats’,)
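The pattern of a list subclass with a single stats slot can be illustrated as below. The class name and the contents of the stats dict are hypothetical; only the list base class and the ('stats',) slot come from the documentation above.

```python
class ExamplesWithStatsSketch(list):
    """A list that can carry one extra attribute, stats, and nothing else."""

    __slots__ = ("stats",)


examples = ExamplesWithStatsSketch([{"conversation": []}, {"conversation": []}])
examples.stats = {"n_text_only": 2}  # attached once by the loader, read later
```

Declaring __slots__ keeps instances free of a per-object __dict__, so the object still behaves like a plain list while exposing the one named attribute.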

nemo_automodel.components.datasets.vlm.datasets.make_meta_dataset(
path_or_dataset,
dataset_names=None,
split='train',
shard_data=False,
rank=None,
world_size=None,
**kwargs,
)#

Load datasets defined in a meta JSON file and convert to Automodel conversation format.

The meta JSON file maps dataset names to their configurations. Each configuration can have:

  • file_name (str) – Path to the data file (JSON/JSONL). Relative paths are resolved against the meta file’s directory.

  • columns (dict) – Column name mapping (messages, images, videos).

  • tags (dict) – Tag mapping (role_tag, content_tag, user_tag, assistant_tag).

  • media_dir (str) – Directory prefix for media files.

  • sample_ratio (float) – Sampling ratio (0.0 to 1.0, default 1.0).

When shard_data=True, each rank loads only its 1/world_size slice of every dataset file (interleaved: raw_data[rank::world_size]). This reduces per-rank memory and I/O. The caller should use a local sampler (e.g. RandomSampler) instead of DistributedSampler since data is already partitioned.

Video frame sampling (fps, min_frames, max_frames) should be configured on the processor rather than here. For example, in YAML:

processor:
  _target_: transformers.AutoProcessor.from_pretrained
  pretrained_model_name_or_path: ...
  fps: 1
  min_frames: 4
  max_frames: 128

Example meta JSON:

{
    "my_dataset": {
        "file_name": "data/train.jsonl",
        "columns": {"messages": "conversations"},
        "media_dir": "/data/media"
    }
}

Parameters:
  • path_or_dataset (str) – Path to the meta JSON file.

  • dataset_names (list[str] | None) – Which datasets to load. None means all.

  • split (str) – Unused, kept for API consistency.

  • shard_data (bool) – If True, each rank loads only its 1/world_size slice.

  • rank (int | None) – Data-parallel rank. Inferred from torch.distributed if None.

  • world_size (int | None) – Data-parallel world size. Inferred from torch.distributed if None.

  • **kwargs – Additional arguments (unused).

Returns:

Combined list of examples in Automodel conversation format.

Return type:

list[dict]
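How a meta file might be read and its entries resolved can be sketched as below. The function name and the shape of the returned entries are hypothetical; the sketch covers only path resolution, dataset selection, and the sample_ratio default, not the loading or conversion of the data files.

```python
import json
import os


def resolve_meta_entries(meta_path, dataset_names=None):
    """Read a meta JSON file and return one resolved config per selected dataset."""
    with open(meta_path, encoding="utf-8") as f:
        meta = json.load(f)

    meta_dir = os.path.dirname(os.path.abspath(meta_path))
    names = list(meta) if dataset_names is None else dataset_names

    entries = []
    for name in names:
        cfg = meta[name]
        file_name = cfg["file_name"]
        # Relative data paths are resolved against the meta file's directory.
        if not os.path.isabs(file_name):
            file_name = os.path.join(meta_dir, file_name)
        entries.append(
            {
                "name": name,
                "file_name": file_name,
                "columns": cfg.get("columns"),
                "tags": cfg.get("tags"),
                "media_dir": cfg.get("media_dir"),
                "sample_ratio": cfg.get("sample_ratio", 1.0),
            }
        )
    return entries
```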

class nemo_automodel.components.datasets.vlm.datasets.PreTokenizedDatasetWrapper(
dataset,
processor,
max_length=None,
max_retries=10,
truncate=False,
post_tokenize_hook=None,
)#

Bases: torch.utils.data.Dataset

Dataset wrapper that tokenizes samples in __getitem__.

Instead of deferring apply_chat_template to the collate function, this wrapper performs tokenization per-sample so that:

  • The collate function only needs to pad and stack.

  • Overlong samples are detected after precise tokenization (including media-token expansion) and replaced with a different random sample.

  • Tokenization work is distributed across DataLoader workers.

Each __getitem__ call returns a dict with at least:

{
    "input_ids":      (seq_len,),
    "attention_mask": (seq_len,),
    "labels":         (seq_len,),
}

The dict may also include optional media tensors (pixel_values, image_grid_thw, pixel_values_videos, video_grid_thw).

Initialization

__len__()#
__getitem__(idx)#
robust_collate(collate_fn)#

Wrap collate_fn so that on failure the entire batch is re-sampled.

class nemo_automodel.components.datasets.vlm.datasets.RobustDatasetWrapper(dataset, max_retries: int = 10)#

Bases: torch.utils.data.Dataset

Wrapper that catches __getitem__ and collate errors, substituting random replacement samples.

This handles failures such as corrupted files, missing media, bad data, or processor errors (e.g. multimodal token mismatch from truncation) without crashing the entire training run.

Initialization

__len__()#
__getitem__(idx)#
robust_collate(collate_fn)#

Wrap a collate_fn so that on failure the entire batch is re-sampled and retried.
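The retry-with-replacement behaviour can be sketched without PyTorch as below. This is an illustration under assumptions, not the class's actual code: the class name and the seed parameter are hypothetical, replacements are drawn uniformly at random, and a RuntimeError is raised once the retries are used up. The documented class derives from torch.utils.data.Dataset, whereas this sketch is a plain Python class.

```python
import random


class RetryingDataset:
    """Substitute a random sample when loading or collating fails."""

    def __init__(self, dataset, max_retries=10, seed=0):
        self.dataset = dataset
        self.max_retries = max_retries
        self._rng = random.Random(seed)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        last_error = None
        for _ in range(self.max_retries + 1):
            try:
                return self.dataset[idx]
            except Exception as err:  # e.g. corrupted file, missing media
                last_error = err
                idx = self._rng.randrange(len(self.dataset))
        raise RuntimeError("no loadable sample found") from last_error

    def robust_collate(self, collate_fn):
        """Return a collate function that re-samples the whole batch on failure."""

        def wrapped(batch):
            last_error = None
            for _ in range(self.max_retries + 1):
                try:
                    return collate_fn(batch)
                except Exception as err:
                    last_error = err
                    batch = [self[self._rng.randrange(len(self))] for _ in batch]
            raise RuntimeError("collate failed after retries") from last_error

        return wrapped
```

The wrapped collate function would typically be passed to the DataLoader in place of the original collate_fn, so that one bad batch does not stop training.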