> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset

## Module Contents

### Classes

| Name                                                                                                                                                      | Description                                                          |
| --------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| [`ColumnMappedTextInstructionDataset`](#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_dataset-ColumnMappedTextInstructionDataset) | Generic instruction-tuning dataset that maps arbitrary column names. |
| [`ColumnTypes`](#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_dataset-ColumnTypes)                                               | Supported logical column roles for text instruction datasets.        |

### Functions

| Name                                                                                                                                              | Description                                                                     |
| ------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| [`_check_all_values_equal_length`](#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_dataset-_check_all_values_equal_length) | Check if all values in the sample are of the same length.                       |
| [`_load_dataset`](#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_dataset-_load_dataset)                                   | Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files. |
| [`_str_is_hf_repo_id`](#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_dataset-_str_is_hf_repo_id)                         | Check if a string is a valid huggingface dataset id.                            |
| [`make_iterable`](#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_dataset-make_iterable)                                   | Utility that converts *val* into an iterator of strings.                        |

### Data

[`logger`](#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_dataset-logger)

### API

```python
class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset(
    path_or_dataset_id: typing.Union[str, typing.List[str]],
    column_mapping: typing.Dict[str, str],
    tokenizer,
    split: typing.Optional[str] = 'train',
    name: typing.Optional[str] = None,
    answer_only_loss_mask: bool = True,
    seq_length: typing.Optional[int] = None,
    padding: typing.Union[str, bool] = 'do_not_pad',
    truncation: typing.Union[str, bool] = 'do_not_truncate',
    limit_dataset_samples: typing.Optional[int] = None,
    use_hf_chat_template: bool = False
)
```

**Bases:** `Dataset`

Generic instruction-tuning dataset that maps arbitrary column names.

The class is intentionally lightweight: it simply loads the raw samples
(either from HF or from local JSON/JSONL files) and remaps the columns so
that downstream components can rely on a consistent field interface.

Optionally, if *answer\_only\_loss\_mask* is requested, the dataset will also
compute a *loss\_mask* indicating which tokens should contribute to the
loss (typically only those belonging to the assistant answer).

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset.__getitem__(
    idx
)
```

Returns the item at the given index.

**Parameters:**

The index of the item to return.

**Returns:**

A dictionary with the mapped columns.

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset.__iter__() -> typing.Iterator[typing.Dict[str, typing.List[int]]]
```

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset.__len__() -> int
```

Returns the length of the dataset.

**Returns:** `int`

The length of the dataset.

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset._apply_tokenizer(
    sample: typing.Dict[str, str]
) -> typing.Dict[str, typing.List[int]]
```

Tokenize a mapped *sample* and compute auxiliary fields.

If the tokenizer is provided:

* If the tokenizer supports a chat template, the dataset will be tokenized in a conversation style.
* Otherwise, the dataset will be tokenized in a simple prompt-completion style.

**Parameters:**

A dictionary with the mapped columns.

**Returns:** `Dict[str, List[int]]`

A dictionary with the tokenized columns.

```python
class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnTypes
```

**Bases:** `enum.Enum`

Supported logical column roles for text instruction datasets.

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._check_all_values_equal_length(
    sample: typing.Dict[str, typing.List[int]]
) -> bool
```

Check if all values in the sample are of the same length.

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._load_dataset(
    path_or_dataset_id: typing.Union[str, typing.List[str]],
    split: typing.Optional[str] = None,
    streaming: bool = False,
    name: typing.Optional[str] = None
)
```

Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files.

If *path\_or\_dataset\_id* resembles a HF repo ID (i.e. of the form
`org/dataset` and the path does **not** exist on the local filesystem),
we defer to `datasets.load_dataset` directly. Otherwise, we assume the
argument points to one or more local JSON/JSONL files and let
`datasets.load_dataset` with the *"json"* script handle the parsing.

**Parameters:**

Either a HF dataset identifier (`org/name`) or
a path / list of paths to local `.json` / `.jsonl` files.

Optional split to load when retrieving a remote dataset. This
parameter is ignored for local files as the *json* script always
returns a single split.

Whether to stream the dataset.

Optional name of the dataset configuration/subset to load

**Returns:**

datasets.Dataset: The loaded dataset.

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._str_is_hf_repo_id(
    val: str
) -> bool
```

Check if a string is a valid huggingface dataset id.

**Parameters:**

A string to check.

**Returns:** `bool`

True if the string is a valid huggingface dataset id, False otherwise.

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.make_iterable(
    val: typing.Union[str, typing.List[str]]
) -> typing.Iterator[str]
```

Utility that converts *val* into an iterator of strings.

The helper accepts either a single string or a list of strings and
yields its contents. This is handy when we want to treat the two cases
uniformly downstream (e.g. when iterating over *data\_files* that can be
provided as either a single path or a collection of paths).

**Parameters:**

Either a single string or a list/tuple of strings.

**Raises:**

* `ValueError`: If *val* is neither a string nor an iterable of strings.

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.logger = logging.getLogger(__name__)
```