nemo_automodel.components.datasets.vlm.datasets#
Module Contents#
Classes#
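_ExamplesWithStats | list subclass that carries pre-computed dataset statistics.
PreTokenizedDatasetWrapper | Dataset wrapper that tokenizes samples in __getitem__.
RobustDatasetWrapper | Wrapper that catches __getitem__ and collate errors, substituting random replacement samples.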
Functions#
make_rdr_dataset | Load and preprocess the RDR dataset for image-to-text fine-tuning.
make_cord_v2_dataset | Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.
make_medpix_dataset | Load and preprocess the MedPix dataset for image-to-text fine-tuning.
make_llava_onevision_dataset | Load and preprocess the LLaVA-Instruct-150K dataset for LLaVA-OneVision-1.5.
make_tulu3_magicoder_text_mix_dataset | Build a text-only 80/20 mix of Tulu-3-SFT-mixture and Magicoder-OSS-Instruct-75K.
make_cv17_dataset | Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.
_decode_audio_cell_to_mono_float32 | Decode a HuggingFace Audio(decode=False) cell to a 1-D float32 waveform.
_build_asr_conversation | Assemble the Qwen3-Omni ASR chat-template conversation for one sample.
make_hf_audio_asr_dataset | Lazy HuggingFace audio→text dataset builder for Qwen3-Omni ASR fine-tuning.
make_unimm_chat_dataset | Load and preprocess the UniMM-Chat dataset for image-to-text fine-tuning.
Convert a single sharegpt-format example to Automodel conversation format.
_load_json_or_jsonl | Load data from a JSON or JSONL file.
_load_jsonl_for_rank | Load only the JSONL lines needed for this rank, avoiding full json.loads on skipped lines.
_collect_sample_stats | Count images, videos, text-only samples and estimate token counts.
_log_dataset_loading_summary | Print a visual summary of per-dataset loading times and data statistics.
make_meta_dataset | Load datasets defined in a meta JSON file and convert to Automodel conversation format.
Data#
API#
- nemo_automodel.components.datasets.vlm.datasets.logger#
‘getLogger(…)’
- nemo_automodel.components.datasets.vlm.datasets.make_rdr_dataset(
- path_or_dataset='quintend/rdr-items',
- split='train',
- **kwargs,
- )
Load and preprocess the RDR dataset for image-to-text fine-tuning.
- Parameters:
path_or_dataset (str) – Path or identifier for the RDR dataset.
split (str) – Dataset split to load.
**kwargs – Additional arguments.
- Returns:
The processed dataset.
- Return type:
Dataset
- nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset(
- path_or_dataset='naver-clova-ix/cord-v2',
- split='train',
- **kwargs,
- )
Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.
- nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset(
- path_or_dataset='medpix-dataset/medpix-dataset',
- split='train',
- **kwargs,
- )
Load and preprocess the MedPix dataset for image-to-text fine-tuning.
- nemo_automodel.components.datasets.vlm.datasets.make_llava_onevision_dataset(
- path_or_dataset='liuhaotian/LLaVA-Instruct-150K',
- split='train',
- **kwargs,
- )
Load and preprocess the LLaVA-Instruct-150K dataset for LLaVA-OneVision-1.5.
This function loads conversation-format data with images and returns it in the standard NeMo VLM format expected by the collate function.
- Parameters:
path_or_dataset – Path to the dataset on HuggingFace Hub or local path.
split – Dataset split to load (e.g., “train”, “train[:1000]”).
**kwargs – Additional arguments passed to load_dataset.
- Returns:
List of dicts with “conversation” and “image” keys.
- nemo_automodel.components.datasets.vlm.datasets.make_tulu3_magicoder_text_mix_dataset(
- tulu_split: str = 'train',
- magicoder_split: str = 'train',
- seed: int = 42,
- max_turns: int = 16,
- limit_total: int | None = None,
- **kwargs,
- )
Build a text-only 80/20 mix of Tulu-3-SFT-mixture and Magicoder-OSS-Instruct-75K.
Both datasets are converted into the NeMo VLM {"conversation": [...]} shape consumed by nemo_automodel.components.datasets.vlm.collate_fns.default_collate_fn. Because default_collate_fn is image-aware only when a conversation turn contains an {"type": "image", ...} entry, returning text-only conversations here yields batches with no pixel_values / vision tensors, which is what the Gemma 4 base+drafter composite expects for text-only training.
Sources:
- allenai/tulu-3-sft-mixture (multi-turn, messages field of {"role", "content"} dicts).
- ise-uiuc/Magicoder-OSS-Instruct-75K (2-turn, problem and solution fields).
Mixing uses datasets.interleave_datasets with probabilities [0.8, 0.2] and stopping_strategy="all_exhausted" so both datasets are sampled until every example has been drawn at least once.
- Parameters:
tulu_split – HF split expression for the Tulu-3 source (e.g. "train" or "train[:50000]").
magicoder_split – HF split expression for the Magicoder source.
seed – Seed forwarded to interleave_datasets for reproducibility.
max_turns – Drop Tulu-3 conversations with more than this many turns to keep memory bounded. Magicoder samples are always 2 turns.
limit_total – If set, cap the merged dataset to this many rows.
**kwargs – Additional arguments forwarded to load_dataset for both sources.
- Returns:
List of {"conversation": [...]} dicts. Each conversation is a list of {"role": "user"|"assistant"|"system", "content": [{"type": "text", "text": ...}]} turns with no image field anywhere in the structure.
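The conversion into the text-only conversation shape described above can be sketched as follows. This is a minimal illustration, not the module's code: the helper names (text_turn, tulu_to_conversation, magicoder_to_conversation) are hypothetical, and only the field names (messages, problem, solution) and the output shape come from the description above.

```python
# Sketch only: helper names are hypothetical; field names follow the docstring.

def text_turn(role, text):
    """Build one text-only turn in the {"role", "content": [...]} shape."""
    return {"role": role, "content": [{"type": "text", "text": text}]}


def tulu_to_conversation(example, max_turns=16):
    """Convert a Tulu-3 style row (a `messages` list); return None if too long."""
    messages = example["messages"]
    if len(messages) > max_turns:
        return None  # dropped, mirroring the documented max_turns filter
    return {"conversation": [text_turn(m["role"], m["content"]) for m in messages]}


def magicoder_to_conversation(example):
    """Convert a Magicoder style row (`problem` / `solution`) into two turns."""
    return {
        "conversation": [
            text_turn("user", example["problem"]),
            text_turn("assistant", example["solution"]),
        ]
    }


tulu_row = {
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ]
}
magicoder_row = {"problem": "Reverse a string.", "solution": "s[::-1]"}

tulu_conv = tulu_to_conversation(tulu_row)
magicoder_conv = magicoder_to_conversation(magicoder_row)
```

Because no turn carries an image item, a collate function that keys on {"type": "image", ...} entries would treat both results as text-only.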
- nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset(
- path_or_dataset='ysdede/commonvoice_17_tr_fixed',
- split='train',
- **kwargs,
- )
Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.
- nemo_automodel.components.datasets.vlm.datasets._decode_audio_cell_to_mono_float32(audio_cell, target_sampling_rate)#
Decode a HuggingFace Audio(decode=False) cell to a 1-D float32 waveform.
Avoids torchcodec by using soundfile for both byte and path branches, matching the pattern in result/decode_vllm.py.
- Parameters:
audio_cell – Dict with bytes and/or path keys, as returned by HuggingFace datasets when the column has Audio(decode=False).
target_sampling_rate – Desired output sampling rate (Hz). If the source differs, the waveform is resampled via scipy.signal.resample_poly.
- Returns:
Tuple of (waveform_float32_mono, target_sampling_rate).
- Raises:
ValueError – If both bytes and path are missing.
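As a rough illustration of the two numeric steps mentioned above (mono mix and polyphase resampling), the sketch below shows one plausible way to average channels and to derive the integer up/down factors that scipy.signal.resample_poly(x, up, down) takes. It uses only the standard library, does not decode any audio, and is not taken from the module's source; both helper names are hypothetical.

```python
from math import gcd


def resample_factors(source_rate, target_rate):
    """Return (up, down) integers with source_rate * up / down == target_rate."""
    divisor = gcd(source_rate, target_rate)
    return target_rate // divisor, source_rate // divisor


def mix_to_mono(frames):
    """Average the channels of each frame; `frames` is a list of per-frame channel lists."""
    return [sum(frame) / len(frame) for frame in frames]


# 44.1 kHz source, 16 kHz target: gcd is 100, so up=160 and down=441.
up, down = resample_factors(44100, 16000)

# Two stereo frames collapse to two mono samples.
mono = mix_to_mono([[0.5, -0.5], [1.0, 0.0]])
```

When the source rate already equals the target rate, both factors reduce to 1 and resampling can be skipped.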
- nemo_automodel.components.datasets.vlm.datasets._build_asr_conversation(
- waveform,
- transcript,
- *,
- system_prompt,
- user_prompt,
- has_system,
- has_user_text,
- )
Assemble the Qwen3-Omni ASR chat-template conversation for one sample.
- nemo_automodel.components.datasets.vlm.datasets.make_hf_audio_asr_dataset(
- path_or_dataset,
- split='train',
- name=None,
- sampling_rate=16000,
- system_prompt=None,
- user_prompt=None,
- audio_column='audio',
- text_column='text',
- drop_empty_text=True,
- min_audio_duration_seconds=None,
- **load_kwargs,
- )
Lazy HuggingFace audio→text dataset builder for Qwen3-Omni ASR fine-tuning.
Loads any HuggingFace ASR dataset that exposes an audio column (Audio feature with bytes and/or path populated after cast_column(decode=False)) and a transcript column, and yields the Qwen3-Omni chat-template conversation expected by qwen3_omni_asr_collate_fn. No audio is decoded at construction time: both the soundfile decode (mono mix + float32 cast + optional scipy.signal.resample_poly) and the conversation assembly run inside a HuggingFace with_transform callback, so the only fixed startup cost is the Arrow-level metadata read of the parquet shards (and the on-demand download of those shards if they are not already in the HF cache). Empty-transcript filtering happens via dataset.filter against the text column only (also Arrow-level), so audio bytes are never materialized at startup.
Defaults are tuned for the common case (audio / text columns, 16 kHz, no system turn). Datasets that diverge can override per-field via YAML; see docs/guides/audio/qwen3-omni-asr.md for an override table.
The conversation shape follows the prompt-presence matrix:
- both system_prompt and user_prompt set → system → user(text+audio) → assistant
- only system_prompt set → system → user(audio) → assistant
- only user_prompt set → user(text+audio) → assistant (no system turn)
- neither set (the default) → user(audio) → assistant
Whitespace-only prompts are treated as absent.
- Parameters:
path_or_dataset – HuggingFace dataset id or local path.
split – Dataset split to load (e.g. "train", "train[:5000]").
name – Optional dataset configuration / subset. Forwarded to datasets.load_dataset(path, name=name, ...). Required by some datasets (e.g. edinburghcstr/ami needs "ihm" or "sdm"; CommonVoice needs the language code).
sampling_rate – Target sampling rate in Hz. Audio is resampled inside the lazy transform if the source rate differs.
system_prompt – Instruction placed in a system turn. Default None skips the system turn entirely; pass a string to emit one.
user_prompt – Instruction prepended to the audio inside the user turn. Pass None to emit a user turn with only the audio item.
audio_column – Name of the audio column in the source dataset (default "audio", which works for AMI / LibriSpeech / GigaSpeech / WenetSpeech / CommonVoice).
text_column – Name of the transcript column (default "text", which works for AMI / LibriSpeech / GigaSpeech / WenetSpeech; override to "sentence" for CommonVoice).
drop_empty_text – If True, samples whose transcript is empty or whitespace are dropped via dataset.filter (Arrow-level, no audio decode). If False, an empty transcript triggers a ValueError inside the transform at access time.
min_audio_duration_seconds – Optional minimum audio duration. Samples shorter than this threshold are dropped via dataset.filter using soundfile.info (header-only read, no full decode). The HF Qwen3-Omni Whisper feature extractor has a known off-by-one between input_features and feature_attention_mask for sub-second clips (~0.27 s manifests as a 27-vs-26 frame mismatch); set this to 1.0 for AMI / CommonVoice-style corpora that contain very short utterances.
**load_kwargs – Forwarded to datasets.load_dataset (e.g. trust_remote_code=True).
- Returns:
A HuggingFace Dataset whose elements are {"conversation": <chat-template list>} and whose audio is decoded on demand via dataloader workers.
- Raises:
ValueError – When audio_column or text_column is missing, when an audio cell has neither bytes nor path, or when drop_empty_text=False and a transcript is empty.
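The prompt-presence matrix described for this builder can be illustrated with a small standalone sketch. The function below is hypothetical and not the module's _build_asr_conversation. In particular the {"type": "audio", "audio": ...} item layout is an assumption, and the real key names expected by the collate function may differ.

```python
def build_asr_conversation(waveform, transcript, system_prompt=None, user_prompt=None):
    """Assemble system/user/assistant turns according to which prompts are set."""
    # Whitespace-only prompts count as absent.
    has_system = bool(system_prompt and system_prompt.strip())
    has_user_text = bool(user_prompt and user_prompt.strip())

    conversation = []
    if has_system:
        conversation.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        )

    user_content = []
    if has_user_text:
        # The instruction is placed before the audio item.
        user_content.append({"type": "text", "text": user_prompt})
    # Assumed audio item layout; the real key names may differ.
    user_content.append({"type": "audio", "audio": waveform})
    conversation.append({"role": "user", "content": user_content})

    conversation.append(
        {"role": "assistant", "content": [{"type": "text", "text": transcript}]}
    )
    return conversation


waveform = [0.0, 0.1, -0.1]  # stand-in for a decoded float32 waveform
default_conv = build_asr_conversation(waveform, "hello")
full_conv = build_asr_conversation(
    waveform, "hello", system_prompt="You transcribe audio.", user_prompt="Transcribe:"
)
blank_conv = build_asr_conversation(waveform, "hello", system_prompt="   ")
```

With the defaults, the result has only a user turn (audio) and an assistant turn, and a whitespace-only system prompt produces the same shape.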
- nemo_automodel.components.datasets.vlm.datasets.make_unimm_chat_dataset(
- path_or_dataset='Yirany/UniMM-Chat',
- split='train',
- **kwargs,
- )
Load and preprocess the UniMM-Chat dataset for image-to-text fine-tuning.
- example,
- columns=None,
- tags=None,
- media_dir=None,
Convert a single sharegpt-format example to Automodel conversation format.
- Parameters:
example (dict) – A single data example in sharegpt format.
columns (dict) – Column name mapping with keys ‘messages’, ‘images’, ‘videos’.
tags (dict) – Tag mapping with keys ‘role_tag’, ‘content_tag’, ‘user_tag’, ‘assistant_tag’.
media_dir (str | None) – Directory prefix for resolving relative media paths.
- Returns:
Example in Automodel conversation format.
- Return type:
dict
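A simplified sketch of such a conversion is shown below. It is not the module's implementation: the function name, the default column and tag values, the handling of "<image>" placeholders, and the {"type": "image", "image": path} item layout are all assumptions chosen for illustration, and videos are not handled.

```python
import os

# Assumed defaults; the real defaults are not stated in this reference.
DEFAULT_COLUMNS = {"messages": "conversations", "images": "images"}
DEFAULT_TAGS = {
    "role_tag": "from",
    "content_tag": "value",
    "user_tag": "human",
    "assistant_tag": "gpt",
}


def sharegpt_to_conversation(example, columns=None, tags=None, media_dir=None):
    """Convert one sharegpt-style example into {"conversation": [...]}."""
    columns = {**DEFAULT_COLUMNS, **(columns or {})}
    tags = {**DEFAULT_TAGS, **(tags or {})}
    role_map = {tags["user_tag"]: "user", tags["assistant_tag"]: "assistant"}

    images = list(example.get(columns["images"]) or [])
    if media_dir is not None:
        # Only relative paths get the directory prefix.
        images = [p if os.path.isabs(p) else os.path.join(media_dir, p) for p in images]
    image_iter = iter(images)

    conversation = []
    for message in example[columns["messages"]]:
        role = role_map[message[tags["role_tag"]]]
        pieces = message[tags["content_tag"]].split("<image>")
        content = []
        for index, piece in enumerate(pieces):
            if index > 0:
                # Each "<image>" placeholder consumes the next image path.
                content.append({"type": "image", "image": next(image_iter)})
            if piece.strip():
                content.append({"type": "text", "text": piece.strip()})
        conversation.append({"role": role, "content": content})
    return {"conversation": conversation}


example = {
    "conversations": [
        {"from": "human", "value": "<image>What is shown?"},
        {"from": "gpt", "value": "A cat."},
    ],
    "images": ["cat.jpg"],
}
converted = sharegpt_to_conversation(example, media_dir="/data/media")
```

In this sketch a message with more placeholders than image paths raises StopIteration; a production version would need to validate that count first.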
- nemo_automodel.components.datasets.vlm.datasets._load_json_or_jsonl(file_path)#
Load data from a JSON or JSONL file.
- Parameters:
file_path (str) – Path to the JSON or JSONL file.
- Returns:
List of data examples.
- Return type:
list[dict]
- nemo_automodel.components.datasets.vlm.datasets._load_jsonl_for_rank(file_path, sample_ratio, rank, world_size)#
Load only the JSONL lines needed for this rank, avoiding full json.loads on skipped lines.
Handles sample_ratio and sharding so that each rank only parses and stores its own subset. The semantics match the original load-all-then-slice approach:
1. Apply sample_ratio (deterministic Random(42).sample) on the full index range.
2. Shard the resulting list with [rank::world_size].
- Returns:
(parsed examples for this rank, total line count).
- Return type:
tuple[list[dict], int]
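The two-step semantics can be sketched with the standard library alone. This is an illustrative approximation, not the module's code: the function name is hypothetical, and the rounding of the sample size and the sorting of the sampled indices are assumptions. Only the seed 42 and the [rank::world_size] slice come from the description above.

```python
import json
import os
import random
import tempfile


def load_jsonl_for_rank(file_path, sample_ratio, rank, world_size):
    """Parse only the lines that belong to `rank`; return (examples, total lines)."""
    # First pass: count lines without parsing any JSON.
    with open(file_path, "r", encoding="utf-8") as handle:
        total = sum(1 for _ in handle)

    indices = list(range(total))
    if sample_ratio < 1.0:
        # Assumed rounding and ordering; the seed comes from the docstring.
        keep = int(total * sample_ratio)
        indices = sorted(random.Random(42).sample(indices, keep))

    wanted = set(indices[rank::world_size])

    examples = []
    with open(file_path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if line_number in wanted:
                examples.append(json.loads(line))  # skipped lines are never parsed
    return examples, total


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        for i in range(10):
            handle.write(json.dumps({"id": i}) + "\n")

    rank0, total = load_jsonl_for_rank(path, 1.0, rank=0, world_size=4)
    rank1, _ = load_jsonl_for_rank(path, 1.0, rank=1, world_size=4)
    halves = [load_jsonl_for_rank(path, 0.5, rank=r, world_size=2)[0] for r in range(2)]
```

Because every rank seeds the sampler identically, all ranks agree on the sampled index list and their slices never overlap.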
- nemo_automodel.components.datasets.vlm.datasets._collect_sample_stats(examples)#
Count images, videos, text-only samples and estimate token counts.
Token estimation mirrors the logic in LengthGroupedSampler._estimate_tokens:
- Text tokens: uses pre-computed _text_tokens when present (written by scripts/precompute_tokens.py), otherwise falls back to chars // 3.
- Media tokens: uses mm_inputs_meta image/video dimensions when present (populated by the precompute script), otherwise 500 per media item.
- Returns:
dict with keys n_images, n_videos, n_text_only, n_text_tokens, n_media_tokens, n_missing_text_tokens, n_missing_mm_inputs_meta. n_text_tokens + n_media_tokens gives the best available estimate of total training tokens.
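The fallback arithmetic can be illustrated with a reduced sketch. It covers only the fallback paths (chars // 3 for text, a flat 500 per media item); the mm_inputs_meta dimension-based estimate is not modeled because its formula is not given here. The function name and the assumed example layout (a conversation list of turns with typed content items) are illustrative.

```python
def collect_sample_stats(examples):
    """Count media items and estimate tokens using the documented fallbacks only."""
    stats = {
        "n_images": 0,
        "n_videos": 0,
        "n_text_only": 0,
        "n_text_tokens": 0,
        "n_media_tokens": 0,
        "n_missing_text_tokens": 0,
    }
    for example in examples:
        n_chars = n_images = n_videos = 0
        for turn in example["conversation"]:
            for item in turn["content"]:
                if item["type"] == "text":
                    n_chars += len(item["text"])
                elif item["type"] == "image":
                    n_images += 1
                elif item["type"] == "video":
                    n_videos += 1

        stats["n_images"] += n_images
        stats["n_videos"] += n_videos
        if n_images + n_videos == 0:
            stats["n_text_only"] += 1

        if "_text_tokens" in example:
            stats["n_text_tokens"] += example["_text_tokens"]
        else:
            # No pre-computed count: estimate roughly three characters per token.
            stats["n_missing_text_tokens"] += 1
            stats["n_text_tokens"] += n_chars // 3

        # Flat fallback; the dimension-based estimate is intentionally omitted.
        stats["n_media_tokens"] += 500 * (n_images + n_videos)
    return stats


examples = [
    {"conversation": [{"role": "user", "content": [{"type": "text", "text": "abcdefghij"}]}]},
    {
        "conversation": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": "cat.jpg"},
                    {"type": "text", "text": "describe"},
                ],
            }
        ],
        "_text_tokens": 7,
    },
]
stats = collect_sample_stats(examples)
```

Here the first example contributes 10 // 3 = 3 estimated text tokens and the second contributes its pre-computed 7, plus 500 media tokens for its single image.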
- nemo_automodel.components.datasets.vlm.datasets._log_dataset_loading_summary(
- timings,
- wall_time,
- total_samples,
- rank=None,
- )
Print a visual summary of per-dataset loading times and data statistics.
- class nemo_automodel.components.datasets.vlm.datasets._ExamplesWithStats#
Bases: list
list subclass that carries pre-computed dataset statistics.
Attached by make_meta_dataset so downstream code (e.g. _log_global_dataset_stats) can read aggregated stats without re-scanning all examples.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- __slots__#
(‘stats’,)
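The pattern of a list subclass with a single slot can be sketched as below. The class name is a stand-in for illustration and the stats keys are made up; only the list base and the ('stats',) slot come from this reference.

```python
class ExamplesWithStats(list):
    """A list that can carry one extra attribute, `stats`, and nothing else."""

    __slots__ = ("stats",)


examples = ExamplesWithStats([{"conversation": []}, {"conversation": []}])
examples.stats = {"n_images": 0, "n_text_only": 2}  # hypothetical stats payload

# It still behaves like a normal list.
count = len(examples)

# Slicing returns a plain list, so the attached stats do not travel with slices.
head = examples[:1]

# __slots__ prevents any other attribute from being attached.
try:
    examples.other = 1
    extra_allowed = True
except AttributeError:
    extra_allowed = False
```

Code that slices or copies such an object should therefore read stats from the original container rather than from the derived list.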
- nemo_automodel.components.datasets.vlm.datasets.make_meta_dataset(
- path_or_dataset,
- dataset_names=None,
- split='train',
- shard_data=False,
- rank=None,
- world_size=None,
- **kwargs,
- )
Load datasets defined in a meta JSON file and convert to Automodel conversation format.
The meta JSON file maps dataset names to their configurations. Each configuration can have:
- file_name (str): Path to the data file (JSON/JSONL). Relative paths are resolved against the meta file’s directory.
- columns (dict): Column name mapping (messages, images, videos).
- tags (dict): Tag mapping (role_tag, content_tag, user_tag, assistant_tag).
- media_dir (str): Directory prefix for media files.
- sample_ratio (float): Sampling ratio (0.0 to 1.0, default 1.0).
When shard_data=True, each rank loads only its 1/world_size slice of every dataset file (interleaved: raw_data[rank::world_size]). This reduces per-rank memory and I/O. The caller should use a local sampler (e.g. RandomSampler) instead of DistributedSampler since data is already partitioned.
Video frame sampling (fps, min_frames, max_frames) should be configured on the processor rather than here. For example in YAML::
    processor:
      _target_: transformers.AutoProcessor.from_pretrained
      pretrained_model_name_or_path: ...
      fps: 1
      min_frames: 4
      max_frames: 128
Example meta JSON::
    {
      "my_dataset": {
        "file_name": "data/train.jsonl",
        "columns": {"messages": "conversations"},
        "media_dir": "/data/media"
      }
    }
- Parameters:
path_or_dataset (str) – Path to the meta JSON file.
dataset_names (list[str] | None) – Which datasets to load. None means all.
split (str) – Unused, kept for API consistency.
shard_data (bool) – If True, each rank loads only its 1/world_size slice.
rank (int | None) – Data-parallel rank. Inferred from torch.distributed if None.
world_size (int | None) – Data-parallel world size. Inferred from torch.distributed if None.
**kwargs – Additional arguments (unused).
- Returns:
Combined list of examples in Automodel conversation format.
- Return type:
list[dict]
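How a meta file might be read and its relative file_name entries resolved can be sketched as follows. The helper name resolve_meta_entries is hypothetical and the sketch stops before any data loading or conversion. It only illustrates the documented rules: relative paths resolve against the meta file's directory, dataset_names=None selects everything, and sample_ratio defaults to 1.0.

```python
import json
import os
import tempfile


def resolve_meta_entries(meta_path, dataset_names=None):
    """Return (name, absolute data path, sample_ratio) for each selected dataset."""
    with open(meta_path, "r", encoding="utf-8") as handle:
        meta = json.load(handle)

    meta_dir = os.path.dirname(os.path.abspath(meta_path))
    names = list(meta) if dataset_names is None else dataset_names

    entries = []
    for name in names:
        config = meta[name]
        file_name = config["file_name"]
        if not os.path.isabs(file_name):
            # Relative paths are anchored at the meta file's directory.
            file_name = os.path.join(meta_dir, file_name)
        entries.append((name, file_name, config.get("sample_ratio", 1.0)))
    return entries


with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = os.path.join(tmp_dir, "meta.json")
    with open(meta_path, "w", encoding="utf-8") as handle:
        json.dump(
            {"my_dataset": {"file_name": "data/train.jsonl", "media_dir": "/data/media"}},
            handle,
        )
    entries = resolve_meta_entries(meta_path)
```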
- class nemo_automodel.components.datasets.vlm.datasets.PreTokenizedDatasetWrapper(
- dataset,
- processor,
- max_length=None,
- max_retries=10,
- truncate=False,
- post_tokenize_hook=None,
- )
Bases: torch.utils.data.Dataset
Dataset wrapper that tokenizes samples in __getitem__.
Instead of deferring apply_chat_template to the collate function, this wrapper performs tokenization per-sample so that:
- The collate function only needs to pad and stack.
- Overlong samples are detected after precise tokenization (including media-token expansion) and replaced with a different random sample.
- Tokenization work is distributed across DataLoader workers.
Each __getitem__ call returns a dict with at least::
    {
      "input_ids": (seq_len,),
      "attention_mask": (seq_len,),
      "labels": (seq_len,),
    }
Plus optional media tensors (pixel_values, image_grid_thw, pixel_values_videos, video_grid_thw).
Initialization
- __len__()#
- __getitem__(idx)#
- robust_collate(collate_fn)#
Wrap collate_fn so that on failure the entire batch is re-sampled.
- class nemo_automodel.components.datasets.vlm.datasets.RobustDatasetWrapper(dataset, max_retries: int = 10)#
Bases: torch.utils.data.Dataset
Wrapper that catches __getitem__ and collate errors, substituting random replacement samples.
This handles failures such as corrupted files, missing media, bad data, or processor errors (e.g. multimodal token mismatch from truncation) without crashing the entire training run.
Initialization
- __len__()#
- __getitem__(idx)#
- robust_collate(collate_fn)#
Wrap a collate_fn so that on failure the entire batch is re-sampled and retried.
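The retry-and-replace behavior described for these wrappers can be sketched without PyTorch. The class below is a simplified stand-in, not the library's RobustDatasetWrapper: it does not subclass torch.utils.data.Dataset, the seed argument is a sketch-only convenience, and the exact retry accounting and error types are assumptions.

```python
import random


class RobustWrapper:
    """Retry failed sample loads and failed collates with random replacements."""

    def __init__(self, dataset, max_retries=10, seed=0):
        self.dataset = dataset
        self.max_retries = max_retries
        self._rng = random.Random(seed)  # seed is a sketch-only convenience

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        for _ in range(self.max_retries + 1):
            try:
                return self.dataset[idx]
            except Exception:
                # Substitute a random index and try again.
                idx = self._rng.randrange(len(self.dataset))
        raise RuntimeError("no loadable sample found within max_retries")

    def robust_collate(self, collate_fn):
        def wrapped(batch):
            for _ in range(self.max_retries + 1):
                try:
                    return collate_fn(batch)
                except Exception:
                    # Re-sample the entire batch, keeping its size.
                    batch = [self[self._rng.randrange(len(self))] for _ in batch]
            raise RuntimeError("collate kept failing within max_retries")

        return wrapped


class FlakyDataset:
    """Toy dataset whose index 2 always fails, standing in for a corrupted file."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        if idx == 2:
            raise IOError("corrupted sample")
        return {"value": idx}


def collate(batch):
    # Raises TypeError if any value is None, standing in for a processor error.
    return sum(item["value"] for item in batch)


wrapped = RobustWrapper(FlakyDataset(), max_retries=10)
sample = wrapped[2]  # index 2 fails, so a replacement sample is returned

safe_collate = wrapped.robust_collate(collate)
batch_total = safe_collate([{"value": None}, {"value": 1}])
```

Substituting random samples slightly changes the sampling distribution, which is usually an acceptable trade for not crashing a long training run on a handful of bad records.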