nemo_automodel.components.datasets.vlm.datasets#

Module Contents#

Classes#

_ExamplesWithStats

list subclass that carries pre-computed dataset statistics.

PreTokenizedDatasetWrapper

Dataset wrapper that tokenizes samples in __getitem__.

RobustDatasetWrapper

Wrapper that catches __getitem__ and collate errors, substituting random replacement samples.

Functions#

make_rdr_dataset

Load and preprocess the RDR dataset for image-to-text fine-tuning.

make_cord_v2_dataset

Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.

make_medpix_dataset

Load and preprocess the MedPix dataset for image-to-text fine-tuning.

make_cv17_dataset

Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.

make_unimm_chat_dataset

Load and preprocess the UniMM-Chat dataset for image-to-text fine-tuning.

_convert_sharegpt_to_conversation

Convert a single sharegpt-format example to Automodel conversation format.

_load_json_or_jsonl

Load data from a JSON or JSONL file.

_load_jsonl_for_rank

Load only the JSONL lines needed for this rank, avoiding full json.loads on skipped lines.

_collect_sample_stats

Count images, videos, and text-only samples, and estimate token counts.

_log_dataset_loading_summary

Print a visual summary of per-dataset loading times and data statistics.

make_meta_dataset

Load datasets defined in a meta JSON file and convert to Automodel conversation format.

Data#

API#

nemo_automodel.components.datasets.vlm.datasets.logger#

'getLogger(...)'

nemo_automodel.components.datasets.vlm.datasets.make_rdr_dataset(
path_or_dataset='quintend/rdr-items',
split='train',
**kwargs,
)#

Load and preprocess the RDR dataset for image-to-text fine-tuning.

Parameters:
  • path_or_dataset (str) – Path or identifier for the RDR dataset.

  • split (str) – Dataset split to load.

  • **kwargs – Additional arguments.

Returns:

The processed dataset.

Return type:

Dataset

nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset(
path_or_dataset='naver-clova-ix/cord-v2',
split='train',
**kwargs,
)#

Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.

nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset(
path_or_dataset='medpix-dataset/medpix-dataset',
split='train',
**kwargs,
)#

Load and preprocess the MedPix dataset for image-to-text fine-tuning.

nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset(
path_or_dataset='ysdede/commonvoice_17_tr_fixed',
split='train',
**kwargs,
)#

Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.

nemo_automodel.components.datasets.vlm.datasets.make_unimm_chat_dataset(
path_or_dataset='Yirany/UniMM-Chat',
split='train',
**kwargs,
)#

Load and preprocess the UniMM-Chat dataset for image-to-text fine-tuning.

nemo_automodel.components.datasets.vlm.datasets._convert_sharegpt_to_conversation(
example,
columns=None,
tags=None,
media_dir=None,
)#

Convert a single sharegpt-format example to Automodel conversation format.

Parameters:
  • example (dict) – A single data example in sharegpt format.

  • columns (dict) – Column name mapping with keys 'messages', 'images', 'videos'.

  • tags (dict) – Tag mapping with keys 'role_tag', 'content_tag', 'user_tag', 'assistant_tag'.

  • media_dir (str | None) – Directory prefix for resolving relative media paths.

Returns:

Example in Automodel conversation format.

Return type:

dict
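The tag remapping can be sketched as follows. This is a hypothetical illustration, not the real `_convert_sharegpt_to_conversation`: the output keys (`conversation`, `images`) and the default tag values are assumptions; only the role/content remapping and media-path resolution described above are illustrated.

```python
import os

# Assumed sharegpt defaults: messages use "from"/"value" with "human"/"gpt" roles.
DEFAULT_TAGS = {
    "role_tag": "from", "content_tag": "value",
    "user_tag": "human", "assistant_tag": "gpt",
}

def convert_sharegpt(example, columns=None, tags=None, media_dir=None):
    columns = columns or {"messages": "conversations", "images": "images", "videos": "videos"}
    tags = tags or DEFAULT_TAGS
    role_map = {tags["user_tag"]: "user", tags["assistant_tag"]: "assistant"}
    # Remap each message's role/content fields to a normalized schema.
    messages = [
        {"role": role_map[m[tags["role_tag"]]], "content": m[tags["content_tag"]]}
        for m in example[columns["messages"]]
    ]
    # Resolve relative media paths against media_dir, if given.
    images = [
        os.path.join(media_dir, p) if media_dir and not os.path.isabs(p) else p
        for p in example.get(columns.get("images", "images"), [])
    ]
    return {"conversation": messages, "images": images}
```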

nemo_automodel.components.datasets.vlm.datasets._load_json_or_jsonl(file_path)#

Load data from a JSON or JSONL file.

Parameters:

file_path (str) – Path to the JSON or JSONL file.

Returns:

List of data examples.

Return type:

list[dict]
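The JSON/JSONL distinction boils down to line-wise parsing versus a single top-level load. A minimal sketch, assuming the extension decides the format (how the real `_load_json_or_jsonl` detects JSONL is not specified here):

```python
import json

def load_json_or_jsonl(file_path):
    # JSONL: one JSON object per non-empty line; JSON: a top-level list.
    with open(file_path) as f:
        if file_path.endswith(".jsonl"):
            return [json.loads(line) for line in f if line.strip()]
        data = json.load(f)
        return data if isinstance(data, list) else [data]
```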

nemo_automodel.components.datasets.vlm.datasets._load_jsonl_for_rank(file_path, sample_ratio, rank, world_size)#

Load only the JSONL lines needed for this rank, avoiding full json.loads on skipped lines.

Handles sample_ratio and sharding so that each rank only parses and stores its own subset. The semantics match the original load-all-then-slice approach:

  1. Apply sample_ratio (deterministic Random(42).sample) on the full index range.

  2. Shard the resulting list with [rank::world_size].

Returns:

(parsed examples for this rank, total line count).

Return type:

tuple[list[dict], int]
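The two-step selection semantics above can be sketched in isolation; this illustrative helper (`select_indices` is a hypothetical name) only reproduces which line indices a rank keeps, not the streaming parse:

```python
import random

def select_indices(total_lines, sample_ratio, rank, world_size):
    indices = list(range(total_lines))
    if sample_ratio < 1.0:
        # Deterministic subsampling over the full index range, as documented.
        k = int(total_lines * sample_ratio)
        indices = random.Random(42).sample(indices, k)
    # Interleaved sharding: rank r keeps every world_size-th surviving index.
    return indices[rank::world_size]
```

Because every rank draws the same Random(42) sample, the shards are disjoint and together cover exactly the sampled subset.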

nemo_automodel.components.datasets.vlm.datasets._collect_sample_stats(examples)#

Count images, videos, and text-only samples, and estimate token counts.

Token estimation mirrors the logic in LengthGroupedSampler._estimate_tokens:

  • Text tokens: uses pre-computed _text_tokens when present (written by scripts/precompute_tokens.py), otherwise falls back to chars // 3.

  • Media tokens: uses mm_inputs_meta image/video dimensions when present (populated by the precompute script), otherwise 500 per media item.

Returns:

dict with keys n_images, n_videos, n_text_only, n_text_tokens, n_media_tokens, n_missing_text_tokens, n_missing_mm_inputs_meta. n_text_tokens + n_media_tokens gives the best available estimate of total training tokens.
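The fallback arithmetic can be shown in a few lines. This sketch covers only the two fallback rules named above (chars // 3 for text, 500 per media item); the precomputed `mm_inputs_meta` dimension path and the full stats dict are omitted, and the example schema is an assumption:

```python
def estimate_tokens(example):
    # Prefer the precomputed count when scripts/precompute_tokens.py has run.
    if "_text_tokens" in example:
        text_tokens = example["_text_tokens"]
    else:
        # Fallback: roughly one token per three characters of message text.
        chars = sum(len(m["content"]) for m in example.get("conversation", []))
        text_tokens = chars // 3
    # Fallback media cost: 500 tokens per image or video without metadata.
    n_media = len(example.get("images", [])) + len(example.get("videos", []))
    return text_tokens + 500 * n_media
```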

nemo_automodel.components.datasets.vlm.datasets._log_dataset_loading_summary(
timings,
wall_time,
total_samples,
rank=None,
)#

Print a visual summary of per-dataset loading times and data statistics.

class nemo_automodel.components.datasets.vlm.datasets._ExamplesWithStats#

Bases: list

list subclass that carries pre-computed dataset statistics.

Attached by make_meta_dataset so downstream code (e.g. _log_global_dataset_stats) can read aggregated stats without re-scanning all examples.

Initialization

Initialize self. See help(type(self)) for accurate signature.

__slots__#

('stats',)

nemo_automodel.components.datasets.vlm.datasets.make_meta_dataset(
path_or_dataset,
dataset_names=None,
split='train',
shard_data=False,
rank=None,
world_size=None,
**kwargs,
)#

Load datasets defined in a meta JSON file and convert to Automodel conversation format.

The meta JSON file maps dataset names to their configurations. Each configuration can have:

  • file_name (str) – Path to the data file (JSON/JSONL). Relative paths are resolved against the meta file's directory.

  • columns (dict) – Column name mapping (messages, images, videos).

  • tags (dict) – Tag mapping (role_tag, content_tag, user_tag, assistant_tag).

  • media_dir (str) – Directory prefix for media files.

  • sample_ratio (float) – Sampling ratio (0.0 to 1.0, default 1.0).

When shard_data=True, each rank loads only its 1/world_size slice of every dataset file (interleaved: raw_data[rank::world_size]). This reduces per-rank memory and I/O. The caller should use a local sampler (e.g. RandomSampler) instead of DistributedSampler since data is already partitioned.

Video frame sampling (fps, min_frames, max_frames) should be configured on the processor rather than here. For example in YAML::

processor:
  _target_: transformers.AutoProcessor.from_pretrained
  pretrained_model_name_or_path: ...
  fps: 1
  min_frames: 4
  max_frames: 128

Example meta JSON::

{
    "my_dataset": {
        "file_name": "data/train.jsonl",
        "columns": {"messages": "conversations"},
        "media_dir": "/data/media"
    }
}

Parameters:
  • path_or_dataset (str) – Path to the meta JSON file.

  • dataset_names (list[str] | None) – Which datasets to load. None means all.

  • split (str) – Unused, kept for API consistency.

  • shard_data (bool) – If True, each rank loads only its 1/world_size slice.

  • rank (int | None) – Data-parallel rank. Inferred from torch.distributed if None.

  • world_size (int | None) – Data-parallel world size. Inferred from torch.distributed if None.

  • **kwargs – Additional arguments (unused).

Returns:

Combined list of examples in Automodel conversation format.

Return type:

list[dict]
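The path-resolution rule for file_name can be sketched independently of the full loader. This illustrative helper (`resolve_meta` is a hypothetical name) only shows how relative file_name entries are resolved against the meta file's directory and how dataset_names filters the meta entries:

```python
import json
import os

def resolve_meta(meta_path, dataset_names=None):
    with open(meta_path) as f:
        meta = json.load(f)
    base = os.path.dirname(os.path.abspath(meta_path))
    resolved = {}
    for name, cfg in meta.items():
        # None means load all datasets; otherwise filter by name.
        if dataset_names is not None and name not in dataset_names:
            continue
        file_name = cfg["file_name"]
        if not os.path.isabs(file_name):
            # Relative paths are resolved against the meta file's directory.
            file_name = os.path.join(base, file_name)
        resolved[name] = file_name
    return resolved
```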

class nemo_automodel.components.datasets.vlm.datasets.PreTokenizedDatasetWrapper(
dataset,
processor,
max_length=None,
max_retries=10,
)#

Bases: torch.utils.data.Dataset

Dataset wrapper that tokenizes samples in __getitem__.

Instead of deferring apply_chat_template to the collate function, this wrapper performs tokenization per-sample so that:

  • The collate function only needs to pad and stack.

  • Overlong samples are detected after precise tokenization (including media-token expansion) and replaced with a different random sample.

  • Tokenization work is distributed across DataLoader workers.

Each __getitem__ call returns a dict with at least::

{
    "input_ids":      (seq_len,),
    "attention_mask": (seq_len,),
    "labels":         (seq_len,),
}

Plus optional media tensors (pixel_values, image_grid_thw, pixel_values_videos, video_grid_thw).
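Because tokenization already happened per-sample, the collate step reduces to pad-and-stack. A framework-free sketch of that idea, using plain Python lists in place of tensors (the real collate operates on tensors and also handles the optional media keys):

```python
def pad_and_stack(batch, pad_id=0, label_ignore=-100):
    # Pad every sequence in the batch to the longest sample's length.
    max_len = max(len(s["input_ids"]) for s in batch)

    def pad(seq, value):
        return seq + [value] * (max_len - len(seq))

    return {
        "input_ids": [pad(s["input_ids"], pad_id) for s in batch],
        "attention_mask": [pad(s["attention_mask"], 0) for s in batch],
        # Padded label positions are masked out of the loss with -100.
        "labels": [pad(s["labels"], label_ignore) for s in batch],
    }
```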

Initialization

__len__()#
__getitem__(idx)#
robust_collate(collate_fn)#

Wrap collate_fn so that on failure the entire batch is re-sampled.

class nemo_automodel.components.datasets.vlm.datasets.RobustDatasetWrapper(dataset, max_retries: int = 10)#

Bases: torch.utils.data.Dataset

Wrapper that catches __getitem__ and collate errors, substituting random replacement samples.

This handles failures such as corrupted files, missing media, bad data, or processor errors (e.g. multimodal token mismatch from truncation) without crashing the entire training run.

Initialization

__len__()#
__getitem__(idx)#
robust_collate(collate_fn)#

Wrap a collate_fn so that on failure the entire batch is re-sampled and retried.
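The retry-with-replacement behavior of __getitem__ can be sketched as follows. This is an illustrative stand-in (`RobustSketch` is a hypothetical name), assuming only that the wrapped dataset raises on bad samples; the real wrapper also guards the collate path:

```python
import random

class RobustSketch:
    def __init__(self, dataset, max_retries=10, seed=None):
        self.dataset = dataset
        self.max_retries = max_retries
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        for _ in range(self.max_retries + 1):
            try:
                return self.dataset[idx]
            except Exception:
                # Bad sample: substitute a random replacement index and retry.
                idx = self.rng.randrange(len(self.dataset))
        raise RuntimeError("exceeded max_retries while replacing bad samples")
```

Swallowing exceptions this way keeps a long training run alive through corrupted files or processor errors, at the cost of silently skewing the sample distribution if failures are frequent.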