
# Use the ColumnMappedTextInstructionDataset

This guide explains how to use `ColumnMappedTextInstructionDataset` to quickly and flexibly load instruction-answer datasets for LLM fine-tuning, with minimal code changes and support for common tokenization strategies.

The `ColumnMappedTextInstructionDataset` is a lightweight, plug-and-play helper that lets you train on instruction-answer style corpora without writing custom Python for every new schema. You simply specify which columns map to logical fields like `context`, `question`, and `answer`, and the loader handles the rest automatically. This enables:

* Quick prototyping across diverse instruction datasets
* Schema flexibility without requiring code changes
* Consistent field names for training loops, regardless of dataset source
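
To make the idea of column mapping concrete, here is a minimal sketch of what the mapping expresses. It is an illustration only, not the loader's implementation: `apply_column_mapping` is a hypothetical helper, and the sample row simply follows the `{definition, inputs, targets}` schema used later in this guide.

```python
def apply_column_mapping(row, column_mapping):
    """Rename source columns to logical fields (hypothetical helper).

    column_mapping maps logical field -> source column name,
    e.g. {"question": "inputs"} reads row["inputs"] as the question.
    """
    return {field: row[source] for field, source in column_mapping.items()}


# A row shaped like the natural-instructions schema (illustrative values).
row = {
    "task_name": "task002_math_word_problems",
    "definition": "Solve the following word problem.",
    "inputs": "If there are 3 apples and you take 2...",
    "targets": "1",
}

column_mapping = {"context": "definition", "question": "inputs", "answer": "targets"}

mapped = apply_column_mapping(row, column_mapping)
print(sorted(mapped.keys()))
```

Columns that are not named in the mapping (here `task_name`) are simply not carried over in this sketch, so downstream code only ever sees `context`, `question`, and `answer`.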

`ColumnMappedTextInstructionDataset` is a **map-style** dataset (`torch.utils.data.Dataset`): it supports `len(ds)` and `ds[i]`, and it loads data in **non-streaming** mode.

It supports two data sources out-of-the-box:

1. **Local JSON/JSONL files** - pass a single file path or a list of paths on disk. Both JSON and newline-delimited JSON (JSONL) files are accepted.
2. **Hugging Face Hub** - point to any dataset repo (`org/dataset`) that contains the required columns.

For **streaming** (including **Delta Lake / Databricks**), use [`ColumnMappedTextInstructionIterableDataset`](/datasets/columnmapped-iterable). The iterable variant always streams by design to avoid accidentally materializing entire datasets to disk/memory.

***

## Quickstart

The fastest way to sanity-check the loader is to point it at an existing Hugging Face dataset and print the first sample:

```python
from transformers import AutoTokenizer
from nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset import ColumnMappedTextInstructionDataset

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

ds = ColumnMappedTextInstructionDataset(
    path_or_dataset_id="Muennighoff/natural-instructions",
    column_mapping={
      "context": "definition",
      "question": "inputs",
      "answer": "targets"
    },
    tokenizer=tokenizer,
    answer_only_loss_mask=True,
)

sample = ds[0]
print(sample.keys())

# Typical keys include: input_ids, labels, attention_mask (and an internal ___PAD_TOKEN_IDS___ helper).
# Note: when answer_only_loss_mask=True, prompt tokens are masked in labels with -100
# (the standard CrossEntropy "ignore_index").
```

The code above is intended only for a quick sanity check of the dataset and its tokenization output. For training or production use, configure the dataset using YAML as shown below. YAML offers a reproducible, maintainable, and scalable way to specify dataset and tokenization settings.

***

## Usage Examples

This section provides practical usage examples, including how to load remote datasets, work with local files, and configure pipelines using YAML recipes.

### Local JSONL Example

Assume you have two local newline-delimited JSON files, `/data/my_corpus_1.jsonl` and `/data/my_corpus_2.jsonl`,
each with the simple schema `{instruction, output}`. A few sample rows:

```json
{"instruction": "Translate 'Hello' to French", "output": "Bonjour"}
{"instruction": "Summarize the planet Neptune.", "output": "Neptune is the eighth planet from the Sun."}
```

Load the data in Python, reusing the `tokenizer` created in the Quickstart:

```python
local_ds = ColumnMappedTextInstructionDataset(
    path_or_dataset_id=["/data/my_corpus_1.jsonl", "/data/my_corpus_2.jsonl"], # can also be a single path (string)
    column_mapping={
        "question": "instruction",
        "answer": "output",
    },
    tokenizer=tokenizer,
    answer_only_loss_mask=False,  # compute loss over full sequence
)

print(local_ds[0].keys())   # dict_keys(['input_ids', 'labels', 'attention_mask', '___PAD_TOKEN_IDS___'])
```

You can configure the dataset entirely from your recipe YAML. For example:

```yaml
dataset:
  _target_: nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset
  path_or_dataset_id:
    - /data/my_corpus_1.jsonl
    - /data/my_corpus_2.jsonl
  column_mapping:
    question: instruction
    answer: output
  answer_only_loss_mask: false
```

### Remote Dataset Example

This example loads the instruction-tuning corpus
[`Muennighoff/natural-instructions`](https://huggingface.co/datasets/Muennighoff/natural-instructions).
The dataset schema is `{task_name, id, definition, inputs, targets}`.

The following are examples from the training split:

```json
{
  "task_name": "task001_quoref_question_generation",
  "id": "task001-abc123",
  "definition": "In this task, you're given passages that...",
  "inputs": "Passage: A man is sitting at a piano...",
  "targets": "What is the first name of the person who doubted it would be explosive?"
}
{
  "task_name": "task002_math_word_problems",
  "id": "task002-def456",
  "definition": "Solve the following word problem.",
  "inputs": "If there are 3 apples and you take 2...",
  "targets": "1"
}
```

For basic QA fine-tuning, we usually map `definition → context`, `inputs → question`, and `targets → answer` as follows:

```python
from nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset import (
    ColumnMappedTextInstructionDataset,
)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

remote_ds = ColumnMappedTextInstructionDataset(
    path_or_dataset_id="Muennighoff/natural-instructions",  # Hugging Face repo ID
    column_mapping={
        "context": "definition",  # high-level context
        "question": "inputs",      # the actual prompt / input
        "answer": "targets",       # expected answer string
    },
    tokenizer=tokenizer,
    split="train[:5%]",        # demo slice; use split="train" (the default) for the full data
    answer_only_loss_mask=True,
)
```

You can configure the entire dataset directly from your recipe YAML. For example:

```yaml
# dataset section of your recipe's config.yaml
dataset:
  _target_: nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset
  path_or_dataset_id: Muennighoff/natural-instructions
  split: train
  column_mapping:
    context: definition
    question: inputs
    answer: targets
  answer_only_loss_mask: true
```

### Streaming / Delta Lake / Databricks

`ColumnMappedTextInstructionDataset` does not support streaming or Delta Lake / Databricks sources. For those, including `delta_sql_query` and authentication, use [`ColumnMappedTextInstructionIterableDataset`](/datasets/columnmapped-iterable).

### Advanced Options

| Arg                     | Default             | Description                                                                                                                                   |
| ----------------------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| `split`                 | `"train"`           | Which split to pull from a HF repo (`train`, `validation`, etc.). Ignored for local JSON/JSONL.                                               |
| `name`                  | `None`              | Name of the Hugging Face dataset configuration/subset to load.                                                                                |
| `answer_only_loss_mask` | `True`              | Mask prompt tokens in `labels` with `-100` (the standard CrossEntropy `ignore_index`).                                                        |
| `use_hf_chat_template`  | `False`             | If `True` and the tokenizer supports chat templates, format as a system/user/assistant conversation via `tokenizer.apply_chat_template(...)`. |
| `seq_length`            | `None`              | Optional max sequence length; used for padding/truncation when enabled.                                                                       |
| `padding`               | `"do_not_pad"`      | Padding strategy passed to the tokenizer (`"do_not_pad"`, `"max_length"`, `True`, etc.).                                                      |
| `truncation`            | `"do_not_truncate"` | Truncation strategy passed to the tokenizer (`"do_not_truncate"`, `True`, etc.).                                                              |
| `limit_dataset_samples` | `None`              | Optionally load only the first `N` samples (useful for debugging).                                                                            |

***

## Tokenization Paths

This section explains how the dataset formats and tokenizes samples.

`ColumnMappedTextInstructionDataset` produces standard next-token training tensors:

* `input_ids`
* `labels`
* `attention_mask`

When `answer_only_loss_mask=True`, prompt tokens are masked in `labels` with `-100` (the standard CrossEntropy `ignore_index`).

The dataset supports two formatting paths:

1. **Chat-template path (opt-in)**: if `use_hf_chat_template=True` and the tokenizer exposes a `chat_template` and `apply_chat_template`, the dataset builds messages like:

   `[{"role": "system", "content": <context or "">}, {"role": "user", "content": <question or "">}, {"role": "assistant", "content": <answer>}]`

   and tokenizes them via `tokenizer.apply_chat_template(..., tokenize=True, return_dict=True)`.

2. **Plain prompt/completion path (default)**: otherwise the dataset concatenates prompt and answer and tokenizes the result.

In both cases, `labels` are the next-token targets (shifted by one relative to `input_ids`). The dataset also includes an internal `___PAD_TOKEN_IDS___` field used downstream for padding.
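
The relationship between `input_ids`, `labels`, and the `-100` mask can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the dataset's actual code: `build_training_example` is a hypothetical helper, the token IDs are made up, and the prompt and answer are assumed to be already tokenized into separate lists.

```python
IGNORE_INDEX = -100  # the standard CrossEntropy ignore_index


def build_training_example(prompt_ids, answer_ids, answer_only_loss_mask=True):
    """Build shifted next-token tensors from tokenized prompt and answer (sketch)."""
    full = list(prompt_ids) + list(answer_ids)

    # Next-token prediction: position i of input_ids predicts token i + 1 of full.
    input_ids = full[:-1]
    labels = full[1:]

    if answer_only_loss_mask:
        # labels[i] is a prompt token whenever i + 1 < len(prompt_ids),
        # so the first len(prompt_ids) - 1 targets are excluded from the loss.
        n_masked = max(len(prompt_ids) - 1, 0)
        labels = [IGNORE_INDEX] * n_masked + labels[n_masked:]

    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": [1] * len(input_ids),
    }


# Made-up token IDs: three prompt tokens followed by two answer tokens.
example = build_training_example(prompt_ids=[11, 12, 13], answer_ids=[21, 22])
print(example["input_ids"])  # [11, 12, 13, 21]
print(example["labels"])     # [-100, -100, 21, 22]
```

With `answer_only_loss_mask=False`, the same call would keep every target, so the loss covers the prompt tokens as well.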

***

## Parameter Requirements

The following section lists important requirements and caveats for correct usage.

* `column_mapping` must include `answer`, and must include at least one of `context` or `question` (2- or 3-column mapping only).
* If `use_hf_chat_template=True`, the tokenizer must support chat templates (`chat_template` + `apply_chat_template`).
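
If you want to catch configuration mistakes before constructing the dataset, the two requirements above can be checked with a small pre-flight helper. The sketch below is an assumption-laden illustration: `validate_column_mapping` and `supports_chat_template` are hypothetical helpers, the library's own checks and error messages may differ, and rejecting unknown keys is my reading of "2- or 3-column mapping only".

```python
ALLOWED_FIELDS = {"context", "question", "answer"}


def validate_column_mapping(column_mapping):
    """Raise ValueError if the mapping breaks the documented requirements (sketch)."""
    keys = set(column_mapping)

    # Assumption: only the three logical fields are valid keys.
    unknown = keys - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"Unsupported logical fields: {sorted(unknown)}")

    if "answer" not in keys:
        raise ValueError("column_mapping must include 'answer'")

    if not keys & {"context", "question"}:
        raise ValueError("column_mapping must include 'context' or 'question'")


def supports_chat_template(tokenizer):
    """Return True if the tokenizer has a chat template and can apply it."""
    has_template = bool(getattr(tokenizer, "chat_template", None))
    can_apply = callable(getattr(tokenizer, "apply_chat_template", None))
    return has_template and can_apply


# A valid 2-column mapping passes silently.
validate_column_mapping({"question": "instruction", "answer": "output"})

# A mapping without 'answer' is rejected.
try:
    validate_column_mapping({"context": "definition", "question": "inputs"})
except ValueError as err:
    print(f"rejected: {err}")
```

Calling `supports_chat_template(tokenizer)` before setting `use_hf_chat_template=True` gives an early, readable failure point for tokenizers that ship without a chat template.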

***

## Slurm Configuration for Distributed Training

For distributed training on Slurm clusters, submit the job with an sbatch script. The batch job parameters, including the `#SBATCH` directives, are set in that script.

### Slurm Configuration

Slurm jobs are submitted directly with `sbatch`; no YAML section is needed.
Copy the reference script, set the `CONFIG` variable to your YAML, and submit:

```sh
cp slurm.sub my_cluster.sub
# Edit my_cluster.sub — change CONFIG, #SBATCH directives, container, mounts, etc.
sbatch my_cluster.sub
```

All cluster-specific settings (nodes, GPUs, partition, container, mounts, secrets)
live in your sbatch script. See the [cluster guide](/job-launchers/slurm-cluster) for
full examples (Pyxis, bare-metal, Apptainer).

### Multi-Node Slurm Configuration

When using Hugging Face datasets across multiple nodes, every node must be able to read the same cached data:

1. **Shared Storage**: Ensure all nodes can access the same storage paths.
2. **HF Cache**: Export `HF_HOME` and `HF_DATASETS_CACHE` in your sbatch script, pointing to shared directories (e.g., `export HF_DATASETS_CACHE=/shared/hf_cache`).
3. **Mounts**: Add the shared directories as container mounts in your sbatch script.

Configure all of this in your sbatch script (`my_cluster.sub`), not in the YAML.

***

### That's It!

With the mapping specified, the rest of the NeMo Automodel pipeline (pre-tokenization, packing, collate-fn, *etc.*) works as usual.