Integrate Your Own Multi-Modal Dataset#
This guide shows you how to integrate your own dataset into NeMo Automodel for training. You'll learn about multi-modal datasets that combine text with images or other modalities, how to create custom datasets by implementing the required methods and preprocessing functions, and how to specify your own data logic in a YAML configuration using file paths, which lets you define custom dataset processing without modifying the main codebase.
Multi-modal Datasets#
Multi-modal datasets combine text with other input types (e.g., images, audio, or video) and are essential for training Vision-Language Models (VLMs). These datasets introduce specific challenges such as aligning modalities, batching diverse data types, and formatting prompts for multi-turn, multi-modal dialogue.
NeMo Automodel supports multi-modal dataset integration through flexible preprocessing, custom formatting, and YAML-based configuration.
Typical Types in Multi-modal Datasets#
A multi-modal dataset typically contains:
- Images, videos, audio, or other non-text modalities.
- Textual inputs such as questions, instructions, or captions.
- Answers or expected outputs from the model.
These are formatted into structured conversations or instruction-response pairs for use with VLMs like BLIP, LLaVA, or Flamingo.
Example: MedPix-VQA Dataset#
The MedPix-VQA dataset is a comprehensive medical Visual Question Answering dataset designed for training and evaluating VQA models in the medical domain. It contains radiological images (from MedPix, a well-known medical image database) and associated QA pairs used for medical image interpretation.
Structure:
- 20,500 total examples
- Columns: `image_id`, `mode`, `case_id`, `question`, `answer`
{
  "image_id": "medpix_0143.jpg",
  "mode": "CT",
  "case_id": "case_101",
  "question": "What abnormality is visible in the left hemisphere?",
  "answer": "Subdural hematoma"
}
The example dataset preprocessing performs the following steps:
1. Loads the dataset using Hugging Face's `datasets` library.
2. Extracts the `question` and `answer` fields.
3. Transforms the data into a chat-like format that is compatible with the Hugging Face AutoProcessor's `apply_chat_template` function. For example:
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": example["image_id"]},
            {"type": "text", "text": example["question"]},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": example["answer"]}],
    },
]
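Putting these steps together, a minimal preprocessing sketch could look like the following. The Hub dataset id is illustrative, and the field names follow the example record above:

```python
from datasets import load_dataset

def make_conversation(example):
    """Map one MedPix-VQA record to a chat-style conversation."""
    return {
        "conversation": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": example["image_id"]},
                    {"type": "text", "text": example["question"]},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": example["answer"]}],
            },
        ]
    }

# Illustrative Hub id; substitute the MedPix-VQA path you actually use.
dataset = load_dataset("mmoukouba/MedPix-VQA", split="train")
dataset = dataset.map(make_conversation)
```

Each mapped row then carries a `conversation` column that can be fed to the processor's `apply_chat_template`.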
For more detailed examples of how to process multi-modal datasets for VLMs, see the examples in `datasets.py`.
Audio Datasets#
Audio datasets combine speech input with text transcriptions and are essential for training models capable of speech recognition and transcription tasks. NeMo Automodel supports audio dataset integration through specialized preprocessing functions and custom collate functions for multimodal models like Phi-4.
Example: Common Voice 17 Dataset#
The Common Voice 17 dataset is a comprehensive speech recognition dataset containing audio clips and corresponding transcriptions. This particular version focuses on Turkish speech data and has been preprocessed and fixed for compatibility with modern training frameworks.
Structure:
- Audio: Speech recordings in various formats
- Transcription: Text transcriptions of the spoken content
- Use case: Speech-to-text transcription for multimodal models
{
  "audio": {
    "path": "common_voice_tr_17528071.mp3",
    "array": [-0.1600779, -0.13843077],
    "sampling_rate": 16000
  },
  "transcription": "Kosova başkentindeki yolcu sayısı arttı."
}
The example dataset preprocessing performs the following steps:
1. Loads the dataset using Hugging Face's `datasets` library.
2. Extracts the `audio` and `transcription` fields.
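A minimal sketch of these two steps, assuming the dataset exposes `audio` and a text column (the Hub id below is illustrative):

```python
from datasets import load_dataset

# Illustrative Hub id; the raw Common Voice release is gated and names its
# text column "sentence", while preprocessed variants may use "transcription".
ds = load_dataset("mozilla-foundation/common_voice_17_0", "tr", split="train")

# Keep only the two fields the training recipe needs. The "audio" column
# decodes to a dict with "path", "array", and "sampling_rate" keys.
text_column = "transcription" if "transcription" in ds.column_names else "sentence"
ds = ds.select_columns(["audio", text_column])
if text_column != "transcription":
    ds = ds.rename_column(text_column, "transcription")
```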
For more detailed examples of how to process multi-modal datasets, see the examples in `datasets.py`.
Collate Functions#
NeMo Automodel provides specialized collate functions for different VLM processors. The collate function is responsible for batching examples and preparing them for model input.
Multi-modal models require custom collate functions to batch and process each sample correctly. If your model uses a Hugging Face `AutoProcessor`, you can use it directly. Otherwise, you can define your own collate logic; we provide example custom collate functions that you can use as references, and once you implement your own, you can point to it in your YAML config.
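As a rough sketch (not one of the bundled implementations), a collate function built around a chat-style `AutoProcessor` could look like this. The checkpoint name, example keys, and label masking are assumptions to adapt to your model:

```python
from transformers import AutoProcessor

# Illustrative checkpoint; use the processor that matches your model.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

def vlm_collate_fn(examples):
    """Batch chat-formatted examples into padded tensors for the model."""
    texts = [
        processor.apply_chat_template(ex["conversation"], tokenize=False)
        for ex in examples
    ]
    images = [ex["image"] for ex in examples]  # assumes pre-loaded images
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")
    # Train on input_ids, masking padding tokens out of the loss.
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch
```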
YAML-based Custom Dataset Configuration#
NeMo Automodel supports YAML-based dataset specification using the `_target_` key. This lets you reference dataset-building classes or functions using either:
Python Dotted Path
dataset:
  _target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
  path_or_dataset: rowan/hellaswag
  split: train
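Here, the remaining keys are passed to the target as keyword arguments, so this is roughly equivalent to calling `HellaSwag(path_or_dataset="rowan/hellaswag", split="train")` in Python.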
File Path + Function Name
<file-path>:<function-name>
Where:
- `<file-path>`: The absolute path to a Python file containing your dataset function.
- `<function-name>`: The name of the function to call from that file.
dataset:
  _target_: /path/to/your/custom_dataset.py:build_my_dataset
  num_blocks: 111
This will call `build_my_dataset()` from the specified file, passing the other keys (e.g., `num_blocks`) as arguments. This approach allows you to integrate custom datasets via config alone, with no need to alter the codebase or package structure.
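For illustration, `/path/to/your/custom_dataset.py` could define the function like this; the body is a placeholder, and only the signature needs to match the keys in your config:

```python
# /path/to/your/custom_dataset.py (hypothetical example)
from datasets import Dataset

def build_my_dataset(num_blocks: int):
    """NeMo Automodel calls this with the extra YAML keys as kwargs."""
    rows = [{"text": f"block {i}"} for i in range(num_blocks)]
    return Dataset.from_list(rows)
```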
Troubleshooting Tips#
- Tokenization mismatch? Ensure your tokenizer aligns with the model's expected inputs.
- Dataset too large? Use `limit_dataset_samples` in your YAML config to load a subset; this is useful for quick debugging (see the snippet after this list).
- Loss not decreasing? Verify that your loss mask correctly ignores prompt tokens.
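For example, a quick debugging run could cap the number of loaded samples as follows; the exact placement of the key depends on your recipe's schema, so treat the layout below as an assumption:

```yaml
dataset:
  _target_: /path/to/your/custom_dataset.py:build_my_dataset
  num_blocks: 111

# Assumed placement; consult your recipe's schema.
limit_dataset_samples: 64
```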