nemo_automodel.components.datasets.vlm.datasets#
Module Contents#
Classes#
- _ExamplesWithStats: list subclass that carries pre-computed dataset statistics.
- PreTokenizedDatasetWrapper: Dataset wrapper that tokenizes samples in __getitem__.
- RobustDatasetWrapper: Wrapper that catches __getitem__ and collate errors, substituting random replacement samples.
Functions#
- make_rdr_dataset: Load and preprocess the RDR dataset for image-to-text fine-tuning.
- make_cord_v2_dataset: Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.
- make_medpix_dataset: Load and preprocess the MedPix dataset for image-to-text fine-tuning.
- make_cv17_dataset: Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.
- make_unimm_chat_dataset: Load and preprocess the UniMM-Chat dataset for image-to-text fine-tuning.
- Convert a single sharegpt-format example to Automodel conversation format.
- _load_json_or_jsonl: Load data from a JSON or JSONL file.
- _load_jsonl_for_rank: Load only the JSONL lines needed for this rank, avoiding full json.loads on skipped lines.
- _collect_sample_stats: Count images, videos, and text-only samples, and estimate token counts.
- _log_dataset_loading_summary: Print a visual summary of per-dataset loading times and data statistics.
- make_meta_dataset: Load datasets defined in a meta JSON file and convert to Automodel conversation format.
Data#
API#
- nemo_automodel.components.datasets.vlm.datasets.logger#
'getLogger(...)'
- nemo_automodel.components.datasets.vlm.datasets.make_rdr_dataset(path_or_dataset='quintend/rdr-items', split='train', **kwargs)#
Load and preprocess the RDR dataset for image-to-text fine-tuning.
- Parameters:
path_or_dataset (str) – Path or identifier for the RDR dataset.
split (str) – Dataset split to load.
**kwargs – Additional arguments.
- Returns:
The processed dataset.
- Return type:
Dataset
- nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset(path_or_dataset='naver-clova-ix/cord-v2', split='train', **kwargs)#
Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.
- nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset(path_or_dataset='medpix-dataset/medpix-dataset', split='train', **kwargs)#
Load and preprocess the MedPix dataset for image-to-text fine-tuning.
- nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset(path_or_dataset='ysdede/commonvoice_17_tr_fixed', split='train', **kwargs)#
Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.
- nemo_automodel.components.datasets.vlm.datasets.make_unimm_chat_dataset(path_or_dataset='Yirany/UniMM-Chat', split='train', **kwargs)#
Load and preprocess the UniMM-Chat dataset for image-to-text fine-tuning.
- example,
- columns=None,
- tags=None,
- media_dir=None,
Convert a single sharegpt-format example to Automodel conversation format.
- Parameters:
example (dict) – A single data example in sharegpt format.
columns (dict) – Column name mapping with keys ‘messages’, ‘images’, ‘videos’.
tags (dict) – Tag mapping with keys ‘role_tag’, ‘content_tag’, ‘user_tag’, ‘assistant_tag’.
media_dir (str | None) – Directory prefix for resolving relative media paths.
- Returns:
Example in Automodel conversation format.
- Return type:
dict
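The mapping above can be illustrated with a small sketch. The function name, default tag values, and the output key `"conversation"` are assumptions for illustration; the real converter also resolves image/video paths against `media_dir`.

```python
# Hypothetical sketch of the sharegpt -> conversation mapping described above.
def convert_sharegpt_example(example, columns=None, tags=None):
    """Map a sharegpt-format example to a role/content message list."""
    columns = columns or {"messages": "conversations"}
    tags = tags or {
        "role_tag": "from",
        "content_tag": "value",
        "user_tag": "human",
        "assistant_tag": "gpt",
    }
    role_map = {tags["user_tag"]: "user", tags["assistant_tag"]: "assistant"}
    messages = []
    for turn in example[columns["messages"]]:
        messages.append({
            "role": role_map[turn[tags["role_tag"]]],
            "content": turn[tags["content_tag"]],
        })
    return {"conversation": messages}

sample = {"conversations": [
    {"from": "human", "value": "What is in the image?"},
    {"from": "gpt", "value": "A cat."},
]}
out = convert_sharegpt_example(sample)
```

The `columns` and `tags` dicts make the converter agnostic to how a particular dataset names its turn and role fields.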
- nemo_automodel.components.datasets.vlm.datasets._load_json_or_jsonl(file_path)#
Load data from a JSON or JSONL file.
- Parameters:
file_path (str) – Path to the JSON or JSONL file.
- Returns:
List of data examples.
- Return type:
list[dict]
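A minimal sketch of such a loader, assuming JSONL is detected by file extension and that a plain JSON file holds a list of examples:

```python
import json
import tempfile
from pathlib import Path

def load_json_or_jsonl(file_path):
    """Sketch: parse JSONL line by line, plain JSON in one go;
    always return a list of example dicts."""
    path = Path(file_path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    return data if isinstance(data, list) else [data]

# demo on a throwaway JSONL file
with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "train.jsonl"
    f.write_text('{"id": 1}\n{"id": 2}\n')
    examples = load_json_or_jsonl(str(f))
```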
- nemo_automodel.components.datasets.vlm.datasets._load_jsonl_for_rank(file_path, sample_ratio, rank, world_size)#
Load only the JSONL lines needed for this rank, avoiding full json.loads on skipped lines.
Handles sample_ratio and sharding so that each rank only parses and stores its own subset. The semantics match the original load-all-then-slice approach:
1. Apply sample_ratio (deterministic Random(42).sample) on the full index range.
2. Shard the resulting list with [rank::world_size].
- Returns:
(parsed examples for this rank, total line count).
- Return type:
tuple[list[dict], int]
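The index selection described above can be sketched as follows; the `sorted` call (preserving file order after sampling) is an assumption, but the seeded `Random(42).sample` and the `[rank::world_size]` interleaving come from the docstring:

```python
import random

def select_indices(total, sample_ratio, rank, world_size):
    """Sketch of the documented semantics: deterministic Random(42)
    sampling over the full index range, then interleaved sharding."""
    indices = list(range(total))
    if sample_ratio < 1.0:
        k = int(total * sample_ratio)
        # seeded sampling keeps every rank's view of the subset consistent
        indices = sorted(random.Random(42).sample(indices, k))
    return indices[rank::world_size]

shards = [select_indices(100, 0.5, r, 4) for r in range(4)]
```

Because every rank samples with the same seed, the shards are disjoint and together cover exactly the sampled subset.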
- nemo_automodel.components.datasets.vlm.datasets._collect_sample_stats(examples)#
Count images, videos, and text-only samples, and estimate token counts.
Token estimation mirrors the logic in LengthGroupedSampler._estimate_tokens:
- Text tokens: uses pre-computed _text_tokens when present (written by scripts/precompute_tokens.py), otherwise falls back to chars // 3.
- Media tokens: uses mm_inputs_meta image/video dimensions when present (populated by the precompute script), otherwise 500 per media item.
- Returns:
dict with keys n_images, n_videos, n_text_only, n_text_tokens, n_media_tokens, n_missing_text_tokens, n_missing_mm_inputs_meta. n_text_tokens + n_media_tokens gives the best available estimate of total training tokens.
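The per-example fallback arithmetic can be sketched as below. The example layout (a `"conversation"` message list plus `"images"`/`"videos"` keys) is an assumption for illustration:

```python
def estimate_tokens(example):
    """Sketch of the documented fallbacks: pre-computed _text_tokens,
    else chars // 3 for text; a flat 500 tokens per media item when
    mm_inputs_meta is absent."""
    if "_text_tokens" in example:
        text = example["_text_tokens"]
    else:
        chars = sum(
            len(m["content"])
            for m in example.get("conversation", [])
            if isinstance(m.get("content"), str)
        )
        text = chars // 3  # rough chars-per-token heuristic
    n_media = len(example.get("images", [])) + len(example.get("videos", []))
    return text + 500 * n_media  # 500-token fallback per media item

ex = {"conversation": [{"role": "user", "content": "x" * 30}], "images": ["a.jpg"]}
```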
- nemo_automodel.components.datasets.vlm.datasets._log_dataset_loading_summary(timings, wall_time, total_samples, rank=None)#
Print a visual summary of per-dataset loading times and data statistics.
- class nemo_automodel.components.datasets.vlm.datasets._ExamplesWithStats#
Bases: list

list subclass that carries pre-computed dataset statistics.

Attached by :func:`make_meta_dataset` so downstream code (e.g. _log_global_dataset_stats) can read aggregated stats without re-scanning all examples.

Initialization

Initialize self. See help(type(self)) for accurate signature.
- __slots__#
('stats',)
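The pattern is a plain list subclass whose only extra state is declared via __slots__; a minimal sketch (constructor signature is an assumption):

```python
class ExamplesWithStats(list):
    """Sketch of the documented pattern: a list subclass carrying a
    stats attribute, declared via __slots__ to avoid a per-instance
    __dict__."""
    __slots__ = ("stats",)

    def __init__(self, examples, stats=None):
        super().__init__(examples)
        self.stats = stats or {}

examples = ExamplesWithStats([{"id": 1}, {"id": 2}], stats={"n_images": 1})
```

Downstream code can treat the object exactly like a list while still reading `examples.stats`.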
- nemo_automodel.components.datasets.vlm.datasets.make_meta_dataset(path_or_dataset, dataset_names=None, split='train', shard_data=False, rank=None, world_size=None, **kwargs)#
Load datasets defined in a meta JSON file and convert to Automodel conversation format.
The meta JSON file maps dataset names to their configurations. Each configuration can have:
- file_name (str): Path to the data file (JSON/JSONL). Relative paths are resolved against the meta file's directory.
- columns (dict): Column name mapping (messages, images, videos).
- tags (dict): Tag mapping (role_tag, content_tag, user_tag, assistant_tag).
- media_dir (str): Directory prefix for media files.
- sample_ratio (float): Sampling ratio (0.0 to 1.0, default 1.0).
When shard_data=True, each rank loads only its 1/world_size slice of every dataset file (interleaved: raw_data[rank::world_size]). This reduces per-rank memory and I/O. The caller should use a local sampler (e.g. RandomSampler) instead of DistributedSampler since data is already partitioned.

Video frame sampling (fps, min_frames, max_frames) should be configured on the processor rather than here. For example in YAML::

    processor:
      _target_: transformers.AutoProcessor.from_pretrained
      pretrained_model_name_or_path: ...
      fps: 1
      min_frames: 4
      max_frames: 128
Example meta JSON::

    {
      "my_dataset": {
        "file_name": "data/train.jsonl",
        "columns": {"messages": "conversations"},
        "media_dir": "/data/media"
      }
    }

- Parameters:
path_or_dataset (str) – Path to the meta JSON file.
dataset_names (list[str] | None) – Which datasets to load. None means all.
split (str) – Unused, kept for API consistency.
shard_data (bool) – If True, each rank loads only its 1/world_size slice.
rank (int | None) – Data-parallel rank. Inferred from torch.distributed if None.
world_size (int | None) – Data-parallel world size. Inferred from torch.distributed if None.
**kwargs – Additional arguments (unused).
- Returns:
Combined list of examples in Automodel conversation format.
- Return type:
list[dict]
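The resolution of the meta file can be sketched as below. This is a simplified stand-in, not the real make_meta_dataset: it only handles file_name resolution and dataset selection, omitting columns/tags mapping, media_dir, sample_ratio, and sharding.

```python
import json
import tempfile
from pathlib import Path

def load_meta_sketch(meta_path, dataset_names=None):
    """Sketch: relative file_name values resolve against the meta
    file's directory; dataset_names=None means load everything."""
    meta_path = Path(meta_path)
    meta = json.loads(meta_path.read_text())
    examples = []
    for name, cfg in meta.items():
        if dataset_names is not None and name not in dataset_names:
            continue
        data_file = meta_path.parent / cfg["file_name"]  # relative to meta dir
        for line in data_file.read_text().splitlines():
            if line.strip():
                examples.append(json.loads(line))
    return examples

# demo with a throwaway meta file and JSONL data file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "train.jsonl").write_text('{"id": 1}\n{"id": 2}\n')
    (root / "meta.json").write_text(json.dumps(
        {"my_dataset": {"file_name": "data/train.jsonl"}}))
    loaded = load_meta_sketch(root / "meta.json")
```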
- class nemo_automodel.components.datasets.vlm.datasets.PreTokenizedDatasetWrapper(dataset, processor, max_length=None, max_retries=10)#
Bases: torch.utils.data.Dataset

Dataset wrapper that tokenizes samples in __getitem__.

Instead of deferring apply_chat_template to the collate function, this wrapper performs tokenization per-sample so that:
- The collate function only needs to pad and stack.
- Overlong samples are detected after precise tokenization (including media-token expansion) and replaced with a different random sample.
- Tokenization work is distributed across DataLoader workers.

Each __getitem__ call returns a dict with at least::

    {
      "input_ids": (seq_len,),
      "attention_mask": (seq_len,),
      "labels": (seq_len,),
    }

Plus optional media tensors (pixel_values, image_grid_thw, pixel_values_videos, video_grid_thw).

Initialization
- __len__()#
- __getitem__(idx)#
- robust_collate(collate_fn)#
Wrap collate_fn so that on failure the entire batch is re-sampled.
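The resample-on-overlong behaviour can be sketched without torch. This is not the real class: `tokenize` stands in for the processor's apply_chat_template, and the seeded RNG is an assumption for reproducibility of the sketch.

```python
import random

class PreTokenizedSketch:
    """Sketch of the documented __getitem__ behaviour: tokenize
    per-sample; when the result exceeds max_length, substitute a
    different random sample."""

    def __init__(self, dataset, tokenize, max_length=None, max_retries=10):
        self.dataset = dataset
        self.tokenize = tokenize          # stand-in for apply_chat_template
        self.max_length = max_length
        self.max_retries = max_retries
        self._rng = random.Random(0)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        for _ in range(self.max_retries):
            input_ids = self.tokenize(self.dataset[idx])
            if self.max_length is None or len(input_ids) <= self.max_length:
                return {
                    "input_ids": input_ids,
                    "attention_mask": [1] * len(input_ids),
                    "labels": list(input_ids),
                }
            idx = self._rng.randrange(len(self.dataset))  # overlong: resample
        raise RuntimeError("no sample fit within max_length")

ds = PreTokenizedSketch(["hi", "a" * 50], tokenize=lambda s: list(s), max_length=8)
item = ds[0]
```

Because padding happens later, the collate function only needs to pad and stack the per-sample dicts.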
- class nemo_automodel.components.datasets.vlm.datasets.RobustDatasetWrapper(dataset, max_retries: int = 10)#
Bases: torch.utils.data.Dataset

Wrapper that catches __getitem__ and collate errors, substituting random replacement samples.

This handles failures such as corrupted files, missing media, bad data, or processor errors (e.g. multimodal token mismatch from truncation) without crashing the entire training run.
Initialization
- __len__()#
- __getitem__(idx)#
- robust_collate(collate_fn)#
Wrap a collate_fn so that on failure the entire batch is re-sampled and retried.
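The recovery loop can be sketched without torch. This is not the real class; the seeded RNG and `Flaky` demo dataset are assumptions for illustration.

```python
import random

class RobustSketch:
    """Sketch of the documented recovery: when __getitem__ fails,
    substitute a random replacement sample rather than crash."""

    def __init__(self, dataset, max_retries=10):
        self.dataset = dataset
        self.max_retries = max_retries
        self._rng = random.Random(0)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        last_error = None
        for _ in range(self.max_retries):
            try:
                return self.dataset[idx]
            except Exception as err:      # corrupted file, missing media, ...
                last_error = err
                idx = self._rng.randrange(len(self.dataset))
        raise RuntimeError("all retries failed") from last_error

class Flaky(list):
    """Demo dataset whose first sample always fails."""
    def __getitem__(self, idx):
        if idx == 0:
            raise ValueError("corrupted sample")
        return super().__getitem__(idx)

ds = RobustSketch(Flaky(["bad", "ok1", "ok2"]), max_retries=20)
```

The same idea applies at the batch level: robust_collate wraps a collate_fn so that a failing batch is rebuilt from freshly drawn samples.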