Integrate Your Own Text Dataset

This guide shows you how to integrate your own dataset into NeMo Automodel for training. You’ll learn about two main dataset types: completion datasets for language modeling (like HellaSwag) and instruction datasets for question-answering tasks (like SQuAD). We’ll cover how to create custom datasets by implementing the required methods and preprocessing functions, and finally show you how to specify your own data logic using YAML configuration with file paths—allowing you to define custom dataset processing without modifying the main codebase.

Quick Start Summary

Type	Use Case	Example	Preprocessor	Section
✍️ Completion	Language modeling	HellaSwag	`SFTSingleTurnPreprocessor`	Jump
🗣️ Instruction	Question answering	SQuAD	`make_*` function	Jump

Types of Supported Datasets

NeMo Automodel supports a variety of datasets, depending on the task.

Completion Datasets

Completion datasets are single text sequences designed for language modeling where the model learns to predict the next token given a context. These datasets typically contain a context (prompt) and a target (completion) that the model should learn to generate.

Example: HellaSwag

The HellaSwag dataset is a popular completion dataset used for commonsense reasoning. It contains situations with multiple-choice endings where the model must choose the most plausible continuation.

HellaSwag dataset structure:

Context (ctx): A situation or scenario description
Endings: Multiple possible completions (4 options)
Label: Index of the correct ending

Example:

Context: "A man is sitting at a piano in a large room."
Endings: [
  "He starts playing a beautiful melody.",
  "He eats a sandwich while sitting there.",
  "He suddenly becomes invisible.",
  "He transforms into a robot."
]
Label: 0  # First ending is correct

Preprocessing with SFTSingleTurnPreprocessor

NeMo Automodel provides the SFTSingleTurnPreprocessor class to handle completion datasets. This processor:

Extracts context and target using get_context() and get_target().
Tokenizes and cleans context and target separately.
Concatenates them into one sequence.
Creates loss mask: -100 for context, target IDs for target.
Pads sequences to equal length.

Create Your Own Completion Dataset

To adapt your dataset into this format, define a class like this:

1 from datasets import load_dataset
2 from nemo_automodel.components.datasets.utils import SFTSingleTurnPreprocessor
3 
4 class MyCompletionDataset:
5     def __init__(self, path_or_dataset, tokenizer, split="train"):
6         raw_datasets = load_dataset(path_or_dataset, split=split)
7         processor = SFTSingleTurnPreprocessor(tokenizer)
8         self.dataset = processor.process(raw_datasets, self)
9 
10     def get_context(self, examples):
11         """Extract context/prompt from your dataset"""
12         return examples["context_field"]  # Replace with your context field
13 
14     def get_target(self, examples):
15         """Extract target/completion from your dataset"""
16         return examples["target_field"]   # Replace with your target field
17 
18     def __getitem__(self, index):
19         return self.dataset[index]
20 
21     def __len__(self):
22         return len(self.dataset)

Instruction Datasets

Instruction datasets are question-answer pairs where the model learns to respond to specific instructions or questions. These datasets are structured as context-question pairs with corresponding answers, making them ideal for teaching models to follow instructions and provide accurate responses.

Example: SQuAD

The SQuAD (Stanford Question Answering Dataset) is a popular instruction dataset for reading comprehension. It contains questions based on Wikipedia articles along with their answers.

SQuAD dataset structure:

Context: A paragraph of text from Wikipedia
Question: A question about the context
Answers: The correct answer with its position in the context

Create Your Own Instruction Dataset

The squad.py file contains the implementation for processing the SQuAD dataset into a format suitable for instruction tuning. It defines a dataset class and preprocessing functions that extract the context, question, and answer fields, concatenate them into a prompt-completion format, and apply tokenization, padding, and loss masking. This serves as a template for building custom instruction datasets by following a similar structure and adapting the extraction logic to your dataset’s schema.

Based on the SQuAD implementation in squad.py, you can create your own instruction dataset using the make_squad_dataset pattern:

1 from datasets import load_dataset
2 
3 def make_my_instruction_dataset(
4     tokenizer,
5     seq_length=None,
6     limit_dataset_samples=None,
7     split="train",
8     dataset_name="your-dataset-name",
9 ):
10     if limit_dataset_samples:
11         split = f"{split}[:{limit_dataset_samples}]"
12 
13     dataset = load_dataset(dataset_name, split=split)
14 
15     return dataset.map(
16         your_own_fmt_fn,  # Your formatting function
17         batched=False,
18         remove_columns=dataset.column_names,
19     )

YAML-based Custom Dataset Configuration

NeMo Automodel supports YAML-based dataset specification using the target key. This lets you reference dataset-building classes or functions using either:

1. Python Dotted Path

1 dataset:
2   _target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
3   path_or_dataset: rowan/hellaswag
4   split: train

1. File Path + Function Name

<file-path>:<function-name>

Where:

<file-path>: The absolute path to a Python file containing your dataset function
<function-name>: The name of the function to call from that file

1 dataset:
2   _target_: /path/to/your/custom_dataset.py:build_my_dataset
3   num_blocks: 111

This will call build_my_dataset() from the specified file with the other keys (e.g., num_blocks) as arguments. This approach allows you to integrate custom datasets via config alone—no need to alter the codebase or package structure.

Packed Sequence Support in NeMo AutoModel

NeMo AutoModel supports packed sequences, a technique to optimize training with variable-length sequences (e.g., text) by minimizing padding.

What is a Packed Sequence?

Instead of padding each sequence to a fixed length (wasting computation on [PAD] tokens), packed sequences:

Concatenate short sequences into a single continuous sequence.
Separate sequences with special tokens (e.g., [EOS]).
Track lengths via a “attention mask” to prevent cross-sequence information leakage.

Benefits

Reduces redundant computation on padding tokens leading to faster training.
Enables larger effective batch sizes leading to better GPU utilization.
Especially useful for language modeling and text datasets.

Enable Packed Sequences in NeMo Automodel

To enable packed sequences, add these keys to your recipe’s YAML config:

packed_sequence:
   # Set packed_sequence_size > 0 to run with packed sequences
   packed_sequence_size: 1024
   split_across_pack: False

The packed_sequence has two options:

packed_sequence_size: Defines the total token length of each packed sequence, higher values require higher GPU memory usage.
split_across_pack: If two will split a sequence across different packed sequences.

Troubleshooting Tips

Tokenization Mismatch? Ensure your tokenizer aligns with the model’s expected inputs.
Dataset too large? Use limit_dataset_samples in your YAML config to load a subset, useful for quick debugging.
Loss not decreasing? Verify that your loss mask correctly ignores prompt tokens.