Use the ColumnMappedTextInstructionDataset
This guide explains how to use ColumnMappedTextInstructionDataset to quickly and flexibly load instruction-answer datasets for LLM fine-tuning, with minimal code changes and support for common tokenization strategies.
The ColumnMappedTextInstructionDataset is a lightweight, plug-and-play helper that lets you train on instruction-answer style corpora without writing custom Python for every new schema. You simply specify which columns map to logical fields like context, question, and answer, and the loader handles the rest automatically. This enables:
- Quick prototyping across diverse instruction datasets
- Schema flexibility without requiring code changes
- Consistent field names for training loops, regardless of dataset source
ColumnMappedTextInstructionDataset is a map-style dataset (torch.utils.data.Dataset): it supports len(ds) and ds[i], and it loads data non-streaming.
It supports two data sources out-of-the-box:
- Local JSON/JSONL files - pass a single file path or a list of paths on disk. Newline-delimited JSON works great.
- Hugging Face Hub - point to any dataset repo (org/dataset) that contains the required columns.
For streaming (including Delta Lake / Databricks), use ColumnMappedTextInstructionIterableDataset. The iterable variant always streams by design to avoid accidentally materializing entire datasets to disk/memory.
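The difference between the two access styles can be sketched in plain Python. This is only an illustration of the map-style and iterable protocols; `ToyMapStyleDataset` and `ToyIterableDataset` are hypothetical stand-ins, not the library's classes, and the in-memory list stands in for a real streamed source.

```python
from typing import Dict, Iterator, List


class ToyMapStyleDataset:
    """Map-style: all rows are held in memory, so len() and indexing work."""

    def __init__(self, rows: List[Dict[str, str]]) -> None:
        self._rows = list(rows)

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, idx: int) -> Dict[str, str]:
        return self._rows[idx]


class ToyIterableDataset:
    """Iterable-style: rows are yielded one at a time; no len() or indexing."""

    def __init__(self, rows: List[Dict[str, str]]) -> None:
        # A real streaming source would not be a list; this is a stand-in.
        self._rows = rows

    def __iter__(self) -> Iterator[Dict[str, str]]:
        for row in self._rows:
            yield row


rows = [
    {"question": "q0", "answer": "a0"},
    {"question": "q1", "answer": "a1"},
]
map_ds = ToyMapStyleDataset(rows)
iter_ds = ToyIterableDataset(rows)

print(len(map_ds), map_ds[1]["answer"])  # random access by index
print(next(iter(iter_ds))["question"])   # sequential access only
```

A map-style dataset can be shuffled by index and sized up front, while an iterable one can only be consumed in order, which is why it suits sources too large to materialize.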
Quickstart
The fastest way to sanity-check the loader is to point it at an existing Hugging Face dataset and print the first sample. This section provides a minimal, runnable example to help you quickly try out the dataset.
The quickstart is intended only as a sanity check of the dataset and its tokenization output. For training or production use, configure the dataset in YAML, as in the usage examples below. YAML gives you a reproducible, maintainable way to specify dataset and tokenization settings.
Usage Examples
This section provides practical usage examples, including how to load remote datasets, work with local files, and configure pipelines using YAML recipes.
Local JSONL Example
Assume you have a local newline-delimited JSON file at /data/my_corpus.jsonl with the simple schema {instruction, output}. A few sample rows:
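As a hedged illustration, the rows below are hypothetical examples in the {instruction, output} schema, not taken from a real corpus. The sketch writes them to a temporary JSONL file and reads them back with the standard library only, to show the one-JSON-object-per-line layout. The helper names `write_jsonl` and `read_jsonl` are made up for this example.

```python
import json
import os
import tempfile
from typing import Dict, List

# Hypothetical rows in the {instruction, output} schema (illustrative only).
sample_rows = [
    {"instruction": "Translate 'Hello' to French.", "output": "Bonjour"},
    {"instruction": "What is 2 + 2?", "output": "4"},
]


def write_jsonl(path: str, rows: List[Dict[str, str]]) -> None:
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path: str) -> List[Dict[str, str]]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_corpus.jsonl")
    write_jsonl(path, sample_rows)
    loaded_rows = read_jsonl(path)

print(loaded_rows[0]["instruction"])
```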
You can load it using Python code like:
You can configure the dataset entirely from your recipe YAML. For example:
Remote Dataset Example
In the following section, we demonstrate how to load the instruction-tuning corpus Muennighoff/natural-instructions.
The dataset schema is {task_name, id, definition, inputs, targets}.
The following are examples from the training split:
For basic QA fine-tuning, we usually map definition → context, inputs → question, and targets → answer as follows:
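A minimal sketch of that mapping in plain Python, without importing the library. It assumes the mapping is written as a dict keyed by logical field with the source column as the value; that orientation, the helper `apply_column_mapping`, and the sample row are all assumptions made for illustration.

```python
from typing import Dict

# Assumed orientation: logical field -> source column name.
column_mapping = {
    "context": "definition",
    "question": "inputs",
    "answer": "targets",
}


def apply_column_mapping(row: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Return a new dict holding only the mapped logical fields."""
    return {field: row[column] for field, column in mapping.items()}


# Hypothetical row in the {task_name, id, definition, inputs, targets} schema.
row = {
    "task_name": "example_task",
    "id": "example-0",
    "definition": "Answer the arithmetic question.",
    "inputs": "What is 3 + 4?",
    "targets": "7",
}

sample = apply_column_mapping(row, column_mapping)
print(sample)
```

Columns that are not named in the mapping (here task_name and id) are simply dropped in this sketch.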
You can configure the entire dataset directly from your recipe YAML. For example:
Streaming / Delta Lake / Databricks
ColumnMappedTextInstructionDataset does not support streaming or Delta Lake / Databricks sources. These, including delta_sql_query and authentication, are supported only by ColumnMappedTextInstructionIterableDataset. See column-mapped-text-instruction-iterable-dataset.md for details.
Advanced Options
Tokenization Paths
This section explains how the dataset formats and tokenizes samples.
ColumnMappedTextInstructionDataset produces standard next-token training tensors:
- input_ids
- labels
- attention_mask
When answer_only_loss_mask=True, prompt tokens are masked in labels with -100 (the standard CrossEntropy ignore_index).
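The masking idea can be sketched as follows. This is not the library's implementation: `mask_prompt_labels` is a hypothetical helper, the token ids are made up, and the next-token shift is deliberately left out to isolate the masking step.

```python
from typing import List

IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy loss


def mask_prompt_labels(token_ids: List[int], prompt_len: int) -> List[int]:
    """Copy token ids into labels, replacing the prompt positions with -100."""
    if not 0 <= prompt_len <= len(token_ids):
        raise ValueError("prompt_len must be between 0 and len(token_ids)")
    return [IGNORE_INDEX] * prompt_len + token_ids[prompt_len:]


# Hypothetical sequence: 3 prompt tokens followed by 2 answer tokens.
token_ids = [11, 12, 13, 21, 22]
labels = mask_prompt_labels(token_ids, prompt_len=3)
print(labels)  # [-100, -100, -100, 21, 22]
```

With this masking, only the answer tokens contribute to the loss, so the model is not trained to reproduce the prompt.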
The dataset supports two formatting paths:
- Chat-template path (opt-in): if use_hf_chat_template=True and the tokenizer exposes a chat_template and apply_chat_template, the dataset builds messages like [{"role": "system", "content": <context or "">}, {"role": "user", "content": <question or "">}, {"role": "assistant", "content": <answer>}] and tokenizes them via tokenizer.apply_chat_template(..., tokenize=True, return_dict=True).
- Plain prompt/completion path (default): otherwise the dataset concatenates prompt and answer and tokenizes the result.
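The message structure for the chat-template path can be sketched without a tokenizer. `build_messages` is a hypothetical helper that mirrors the layout described above; it does not call apply_chat_template, and the example strings are invented.

```python
from typing import Dict, List, Optional


def build_messages(
    answer: str,
    context: Optional[str] = None,
    question: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Build system/user/assistant messages; missing fields become ""."""
    return [
        {"role": "system", "content": context or ""},
        {"role": "user", "content": question or ""},
        {"role": "assistant", "content": answer},
    ]


messages = build_messages(
    answer="7",
    context="Answer the arithmetic question.",
    question="What is 3 + 4?",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

A list in this shape is what a Hugging Face tokenizer with a chat template expects as the conversation argument to apply_chat_template.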
In both cases, labels are the next-token targets (shifted by one relative to input_ids). The dataset also includes an internal ___PAD_TOKEN_IDS___ field used downstream for padding.
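The shift-by-one relationship can be sketched as below. This shows one common way to derive next-token targets from a single token sequence; whether the dataset performs the shift exactly this way is an assumption, and `shift_for_next_token` is a hypothetical helper.

```python
from typing import List, Tuple


def shift_for_next_token(token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Split a sequence so that labels[i] is the token following input_ids[i]."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an input/label pair")
    return token_ids[:-1], token_ids[1:]


tokens = [5, 6, 7, 8]  # made-up token ids
input_ids, labels = shift_for_next_token(tokens)
print(input_ids)  # [5, 6, 7]
print(labels)     # [6, 7, 8]
```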
Parameter Requirements
The following section lists important requirements and caveats for correct usage.
- column_mapping must include answer, and must include at least one of context or question (2- or 3-column mapping only).
- If use_hf_chat_template=True, the tokenizer must support chat templates (chat_template + apply_chat_template).
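The column_mapping rule can be expressed as a small pre-check. `validate_column_mapping` is a hypothetical helper, not part of the library, and rejecting keys other than context, question, and answer is an assumption drawn from the "2- or 3-column mapping only" wording.

```python
from typing import Dict

ALLOWED_FIELDS = {"context", "question", "answer"}


def validate_column_mapping(mapping: Dict[str, str]) -> None:
    """Raise ValueError if the mapping breaks the documented requirements."""
    unknown = set(mapping) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown logical fields: {sorted(unknown)}")
    if "answer" not in mapping:
        raise ValueError("column_mapping must include 'answer'")
    if not ({"context", "question"} & set(mapping)):
        raise ValueError("column_mapping needs 'context' or 'question'")


# A valid 2-column mapping for the {instruction, output} schema.
validate_column_mapping({"question": "instruction", "answer": "output"})

# An invalid mapping: 'answer' alone is not enough.
try:
    validate_column_mapping({"answer": "output"})
except ValueError as err:
    print("rejected:", err)
```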
Slurm Configuration for Distributed Training
Slurm jobs are submitted with sbatch directly; no slurm section is needed in the YAML configuration.
Copy the reference script, set the CONFIG variable to your YAML, and submit:
All cluster-specific settings (nodes, GPUs, partition, container, mounts, secrets) live in your sbatch script. See the cluster guide for full examples (Pyxis, bare-metal, Apptainer).
Multi-Node Slurm Configuration
Multi-Node Training: When using Hugging Face datasets in multi-node setups, you need shared storage accessible by all nodes. Set HF_DATASETS_CACHE to a shared directory in your sbatch script (e.g., export HF_DATASETS_CACHE=/shared/hf_cache) to ensure all nodes can access the cached datasets.
When using multiple nodes with Hugging Face datasets:
- Shared Storage: Ensure all nodes can access the same storage paths
- HF Cache: Export HF_HOME and HF_DATASETS_CACHE in your sbatch script pointing to shared directories
- Mounts: Add shared directories as container mounts in your sbatch script
Configure all of this in your sbatch script (my_cluster.sub), not in the YAML.
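As an optional sanity check, a small Python preflight can confirm that the cache variables are set before training starts. This is a sketch only: `find_cache_problems` is a hypothetical helper, and it can verify that each path exists on the current node but not that the storage is actually shared across nodes.

```python
import os
from typing import List, Mapping

REQUIRED_CACHE_VARS = ("HF_HOME", "HF_DATASETS_CACHE")


def find_cache_problems(env: Mapping[str, str]) -> List[str]:
    """Return a human-readable problem for each missing or invalid variable."""
    problems = []
    for name in REQUIRED_CACHE_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name}={value} is not an existing directory")
    return problems


# Report problems for the current process environment.
for problem in find_cache_problems(os.environ):
    print("warning:", problem)
```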
That’s It!
With the mapping specified, the rest of the NeMo Automodel pipeline (pre-tokenization, packing, collate-fn, etc.) works as usual.