Dataset Overview: LLM and VLM Datasets in NeMo Automodel#
This page summarizes the datasets supported in NeMo Automodel for LLM and VLM and shows how to plug in your own datasets using Python functions or the YAML _target_ mechanism.
See also: LLM datasets and VLM datasets for deeper, task-specific guides.
If a dataset you need is missing, please open a GitHub issue with a short description and example schema so we can prioritize support.
LLM Datasets#
NeMo Automodel supports several common patterns for language modeling and instruction tuning.
HellaSwag (completion SFT)
Wrapper:
nemo_automodel.components.datasets.llm.hellaswag.HellaSwagUse case: single-turn completion style SFT where a prompt (ctx) is followed by a gold continuation (ending)
Key args:
path_or_dataset,split,num_samples_limitExample YAML:
dataset:
_target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
path_or_dataset: rowan/hellaswag
split: train
SQuAD-style Question Answering (QA) (instruction SFT)
Factory:
nemo_automodel.components.datasets.llm.squad.make_squad_datasetUse case: instruction/QA tuning with either prompt+answer formatting or chat-template formatting
Notes:
If the tokenizer has a chat template and you want answer-only loss, you must provide
start_of_turn_token.Optional
seq_lengthcan be used for padding/truncation.
Example YAML:
dataset:
_target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
split: train
dataset_name: rajpurkar/squad
start_of_turn_token: "<|assistant|>"
ColumnMappedTextInstructionDataset (generic instruction SFT)
Class:
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDatasetUse case: quickly adapt instruction datasets by mapping your schema’s columns to
context,question,answerSources: local JSON/JSONL or Hugging Face Hub dataset ID
Notes:
For tokenizers with chat templates and answer-only loss, you may set
answer_only_loss_mask: trueand providestart_of_turn_token.
Example YAML:
dataset:
_target_: nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset
path_or_dataset_id: Muennighoff/natural-instructions
split: train
column_mapping:
context: definition
question: inputs
answer: targets
answer_only_loss_mask: true
start_of_turn_token: "<|assistant|>"
See the detailed guide, Column-Mapped Text Instruction Dataset, for more information.
ChatDataset (multi-turn conversations and tool calling)
Class:
nemo_automodel.components.datasets.llm.ChatDatasetUse case: multi-turn conversations and tool calling in OpenAI chat format
Sources: local JSON/JSONL or Hugging Face Hub dataset ID
Key args:
path_or_dataset_id: path to local file(s) or HuggingFace dataset IDtokenizer: tokenizer instance (required. Must have chat template support)split: dataset split (e.g., “train”, “validation”)name: dataset configuration/subset nameseq_length: maximum sequence length for padding/truncationpadding: padding strategy (“do_not_pad”, “max_length”, etc.)truncation: truncation strategy (“do_not_truncate”, “longest_first”, etc.)start_of_turn_token: token marking assistant response start (for answer-only loss)chat_template: optional override for tokenizer’s chat template
Notes:
Requires a tokenizer with chat template support
Supports both single-turn and multi-turn tool calling
Tool definitions are provided in a
toolsfield at the conversation levelTool calls appear in assistant messages via
tool_callsfieldTool responses use the
toolrole
Example YAML:
dataset:
_target_: nemo_automodel.components.datasets.llm.ChatDataset
path_or_dataset_id: Salesforce/xlam-function-calling-60k
split: train
tokenizer:
_target_: transformers.AutoTokenizer.from_pretrained
pretrained_model_name_or_path: google/functiongemma-270m-it
seq_length: 2048
start_of_turn_token: "<start_of_turn>"
Expected data format (OpenAI messages format):
{
"messages": [
{
"role": "user",
"content": "What's the weather in Seattle?"
},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_1",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"city\": \"Seattle\"}"
}
}
]
},
{
"role": "tool",
"tool_call_id": "call_1",
"content": "{\"temperature\": 65, \"condition\": \"cloudy\"}"
},
{
"role": "assistant",
"content": "It's 65°F and cloudy in Seattle."
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"}
},
"required": ["city"]
}
}
}
]
}
For single-turn tool calling (one tool call per conversation), omit the tool response and final assistant message:
{
"messages": [
{
"role": "user",
"content": "Book a table for two at 7pm in Seattle."
},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_1",
"type": "function",
"function": {
"name": "book_table",
"arguments": "{\"party_size\": 2, \"time\": \"19:00\", \"city\": \"Seattle\"}"
}
}
]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "book_table",
"description": "Book a restaurant table",
"parameters": {
"type": "object",
"properties": {
"party_size": {"type": "integer"},
"time": {"type": "string"},
"city": {"type": "string"}
}
}
}
}
]
}
See the Function Calling guide for an end-to-end example with FunctionGemma.
NanoGPT Binary Shards (pretraining)
Class:
nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDatasetUse case: token-level LM pretraining over
.binshards produced by NanoGPT-style preprocessors (supports legacy and current formats)Notes:
Streams contiguous
seq_lenslices, supports optional BOS alignment and.bos.idxsidecar filesRelated tool:
tools/nanogpt_data_processor.py
Megatron (pretraining; interoperable with pre-tokenized Megatron data)
Class:
nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretrainingUse case: large-scale LM pretraining over Megatron-LM formatted tokenized corpora
Interoperability: If your corpus has already been tokenized/indexed for Megatron (i.e.,
.bin/.idxpairs), you can point Automodel to those assets directly. No re-tokenization required.Key args:
paths(single path, glob, weighted list, or per-split dict),seq_length,tokenizer,split,index_mapping_dir,splits_to_buildExample YAML:
dataset:
_target_: nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining
paths: /abs/path/to/processed_data_*_text_document* # glob or explicit list
index_mapping_dir: /abs/path/to/mapping_dir
tokenizer:
_target_: transformers.AutoTokenizer.from_pretrained
pretrained_model_name_or_path: openai-community/gpt2
seq_length: 1024
split: "0.99, 0.01, 0.00" # train, validation, test
splits_to_build: "train"
See the detailed pretraining guide, which uses MegatronPretraining data.
Packed Sequence Support#
To reduce padding and improve throughput with variable-length sequences:
packed_sequence:
packed_sequence_size: 8192 # > 0 enables packing
split_across_pack: false
Use a collater that pads to an FP8-friendly multiple when training with FP8:
dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
collate_fn:
_target_: nemo_automodel.components.datasets.utils.default_collater
pad_seq_len_divisible: 16
VLM Datasets (Vision/Audio + Language)#
VLM datasets are represented as conversations (message lists) that combine text with images or audio and are processed with the model’s AutoProcessor.apply_chat_template and a suitable collate function.
Built-in dataset makers (return lists of conversation dicts):
RDR items:
nemo_automodel.components.datasets.vlm.datasets.make_rdr_dataset(HF:quintend/rdr-items)CORD-V2 receipts (Consolidated Receipt Dataset for Post-OCR Parsing):
nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset(HF:naver-clova-ix/cord-v2)MedPix-VQA (Medical Pixel Question Answering):
nemo_automodel.components.datasets.vlm.datasets.make_medpix_datasetCommonVoice 17 (CV17) (audio):
nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset
Each example follows the conversation schema expected by apply_chat_template, e.g.:
{
"conversation": [
{
"role": "user",
"content": [
{"type": "image", "image": example_image},
{"type": "text", "text": "Describe this image."}
]
},
{
"role": "assistant",
"content": [{"type": "text", "text": ground_truth_text}]
}
]
}
Collate Functions#
nemo_automodel.components.datasets.vlm.collate_fns.default_collate_fnnemo_automodel.components.datasets.vlm.collate_fns.qwen2_5_collate_fn(Qwen2.5 VL)nemo_automodel.components.datasets.vlm.collate_fns.phi4_mm_collate_fn(audio)
Select in your YAML:
dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
batch_size: 1
collate_fn:
_target_: nemo_automodel.components.datasets.vlm.collate_fns.qwen2_5_collate_fn
If you want answer-only loss masking, provide a model-appropriate start_of_response_token to the collate function.
See Gemma-3n and VLM dataset for end-to-end examples.
Bring Your Own Dataset#
You can integrate custom datasets with zero code changes to NeMo Automodel by using _target_ in YAML. There are three approaches:
1) Point to an existing class or function (dotted path)#
LLM example (class):
dataset:
_target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
path_or_dataset: rowan/hellaswag
split: train
LLM example (factory function):
dataset:
_target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
split: train
dataset_name: rajpurkar/squad
VLM example (factory function):
dataset:
_target_: nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset
split: train
2) Point to a local Python file and function#
dataset:
_target_: /abs/path/to/my_custom_dataset.py:build_my_dataset
some_arg: 123
split: train
Where build_my_dataset returns either a datasets.Dataset or a list/iterator of conversation dicts (for VLM).
3) Use ColumnMappedTextInstructionDataset for most instruction datasets (LLM)#
Ideal when your data has columns like
instruction,input, oroutputbut with arbitrary namesSupports local JSON/JSONL and HF Hub
dataset:
_target_: nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset
path_or_dataset_id: /abs/path/to/*.jsonl # or org/repo on HF
column_mapping:
context: definition
question: inputs
answer: targets
answer_only_loss_mask: true
start_of_turn_token: "<|assistant|>"
Minimal Custom Class Pattern (LLM Completion)#
If you prefer Python, implement get_context and get_target and reuse the built-in preprocessor:
from datasets import load_dataset
from nemo_automodel.components.datasets.utils import SFTSingleTurnPreprocessor
class MyCompletionDataset:
def __init__(self, path_or_dataset, tokenizer, split="train"):
raw_ds = load_dataset(path_or_dataset, split=split)
self.dataset = SFTSingleTurnPreprocessor(tokenizer).process(raw_ds, self)
def get_context(self, examples):
return examples["my_context_field"]
def get_target(self, examples):
return examples["my_target_field"]
Then reference your class via _target_ in YAML.
Important Considerations#
Chat templates: If your tokenizer has a chat template and you want answer-only loss, provide the correct
start_of_turn_token(LLM) orstart_of_response_token(VLM collate functions).Padding for FP8: If training with FP8, set
pad_seq_len_divisible: 16in your collate function to align sequence lengths.Packed sequences: Prefer packed sequences for throughput when fine-tuning LLMs on variable-length corpora.
Validation: You can define a separate
validation_datasetandvalidation_dataloaderblock mirroring your training config.
For detailed, end-to-end recipes, browse the example configs under examples/llm_finetune/, examples/llm_pretrain/, and examples/vlm_finetune/.