# nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset

## Module Contents

### Classes

| Name                                                                                                                                                                               | Description                                                                   |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| [`ColumnMappedTextInstructionIterableDataset`](#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_iterable_dataset-ColumnMappedTextInstructionIterableDataset) | Streaming iterable variant that reuses the column-mapping/tokenization logic. |

### Functions

| Name                                                                                                                                         | Description                                                                        |
| -------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| [`_load_streaming_dataset`](#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_iterable_dataset-_load_streaming_dataset) | Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables. |

### Data

[`logger`](#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_iterable_dataset-logger)

### API

```python
class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset(
    path_or_dataset_id: typing.Union[str, typing.List[str]],
    column_mapping: typing.Dict[str, str],
    tokenizer,
    split: typing.Optional[str] = None,
    name: typing.Optional[str] = None,
    answer_only_loss_mask: bool = True,
    seq_length: typing.Optional[int] = None,
    padding: typing.Union[str, bool] = 'do_not_pad',
    truncation: typing.Union[str, bool] = 'do_not_truncate',
    start_of_turn_token: typing.Optional[str] = None,
    limit_dataset_samples: typing.Optional[int] = None,
    repeat_on_exhaustion: bool = True,
    use_hf_chat_template: bool = False,
    delta_storage_options: typing.Optional[typing.Dict[str, str]] = None,
    delta_version: typing.Optional[int] = None,
    delta_sql_query: typing.Optional[str] = None
)
```

**Bases:** `IterableDataset`, [ColumnMappedTextInstructionDataset](/nemo-automodel/nemo_automodel/components/datasets/llm/column_mapped_text_instruction_dataset#nemo_automodel-components-datasets-llm-column_mapped_text_instruction_dataset-ColumnMappedTextInstructionDataset)

Streaming iterable variant that reuses the column-mapping/tokenization logic.

This wraps a Hugging Face streaming dataset (`IterableDataset` from `datasets`)
or a Delta Lake table and yields tokenized samples compatible with the non-streaming
variant. It also supports sharding and epoch setting so that upstream shuffles stay deterministic.

Supports the following data sources:

* HuggingFace Hub datasets
* Local JSON/JSONL files
* Delta Lake tables (via `delta://`, `dbfs:/`, or local directories containing `_delta_log`)
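
The sharding and epoch-setting ideas mentioned above can be illustrated with a toy, standard-library-only sketch. It is not the library's implementation: the helpers `shard_stream` and `epoch_shuffled` are hypothetical, round-robin sharding is an assumption (a real streaming backend may split by file instead), and deriving the seed as `base_seed + epoch` is just one common convention.

```python
import random
from typing import Iterable, Iterator, List


def shard_stream(items: Iterable[int], num_shards: int, index: int) -> Iterator[int]:
    """Yield every num_shards-th item, starting at position `index` (hypothetical helper)."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    for position, item in enumerate(items):
        if position % num_shards == index:
            yield item


def epoch_shuffled(items: List[int], base_seed: int, epoch: int) -> List[int]:
    """Return a shuffled copy of `items`, seeded from the epoch (assumed convention)."""
    rng = random.Random(base_seed + epoch)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


samples = list(range(8))

# Two workers each see a disjoint half of the stream.
shard0 = list(shard_stream(samples, num_shards=2, index=0))
shard1 = list(shard_stream(samples, num_shards=2, index=1))

# The same (seed, epoch) pair always reproduces the same order.
first_pass = epoch_shuffled(samples, base_seed=0, epoch=1)
second_pass = epoch_shuffled(samples, base_seed=0, epoch=1)
```

In this sketch every worker that agrees on the epoch number derives the same ordering, which is what makes a shuffle reproducible across ranks and restarts.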

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.__getitem__(
    idx: int
) -> typing.Dict[str, typing.List[int]]
```

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.__iter__() -> typing.Iterator[typing.Dict[str, typing.List[int]]]
```

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.__len__() -> int
```

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.set_epoch(
    epoch: int
) -> None
```

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.shard(
    num_shards: int,
    index: int
)
```

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.shuffle(
    buffer_size: int = 1000,
    seed: typing.Optional[int] = None
)
```
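
The `buffer_size` and `seed` parameters suggest buffer-based shuffling, which is how streaming datasets are usually shuffled when the full data cannot be held in memory. The following is a minimal sketch of that general technique, assuming semantics similar to streaming shuffles in Hugging Face `datasets`. The function `buffered_shuffle` is hypothetical and is not part of this module.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    stream: Iterable[T],
    buffer_size: int = 1000,
    seed: Optional[int] = None,
) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer (hypothetical helper)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=4, seed=0))
```

A larger buffer gives a shuffle closer to uniform at the cost of memory. With `buffer_size=1` this sketch degenerates to the original order.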

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset._load_streaming_dataset(
    path_or_dataset_id: typing.Union[str, typing.List[str]],
    split: typing.Optional[str] = None,
    streaming: bool = False,
    name: typing.Optional[str] = None,
    delta_storage_options: typing.Optional[typing.Dict[str, str]] = None,
    delta_version: typing.Optional[int] = None,
    delta_sql_query: typing.Optional[str] = None
)
```

Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.

If *path\_or\_dataset\_id* resembles a HF repo ID (i.e. of the form
`org/dataset` and the path does **not** exist on the local filesystem),
we defer to `datasets.load_dataset` directly. If the path is a Delta Lake
table (prefixed with `delta://`, `dbfs:/`, or a directory containing
`_delta_log`), we load using the Delta Lake reader. Otherwise, we assume
the argument points to one or more local JSON/JSONL files and let
`datasets.load_dataset` with the *"json"* script handle the parsing.
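
The dispatch order described above can be sketched as a small classifier. This is an illustration only, not the module's code: `classify_source`, the returned labels, and the repo-ID regular expression are all assumptions made for the example.

```python
import os
import re
from typing import List, Union

# Prefixes the text above names as Delta Lake indicators.
_DELTA_PREFIXES = ("delta://", "dbfs:/")
# Assumed shape of an `org/dataset` repo ID; the real check may differ.
_REPO_ID = re.compile(r"^[\w.\-]+/[\w.\-]+$")


def classify_source(path_or_dataset_id: Union[str, List[str]]) -> str:
    """Return 'hf_hub', 'delta_lake', or 'local_json' (hypothetical labels)."""
    # A list of paths can only refer to local files.
    if isinstance(path_or_dataset_id, list):
        return "local_json"
    path = path_or_dataset_id
    # Looks like a repo ID and nothing with that name exists locally.
    if _REPO_ID.match(path) and not os.path.exists(path):
        return "hf_hub"
    # Explicit Delta prefix, or a directory that holds a `_delta_log` folder.
    if path.startswith(_DELTA_PREFIXES) or os.path.isdir(os.path.join(path, "_delta_log")):
        return "delta_lake"
    # Everything else is treated as local JSON/JSONL.
    return "local_json"
```

Note that under these rules a relative path such as `data/train.jsonl` that does not exist locally would be treated as a Hub repo ID, so a typo in a local path can surface as a Hub lookup error.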

**Parameters:**

* `path_or_dataset_id`: Either a HF dataset identifier (`org/name`),
  a Delta Lake table path (`delta://path/to/table`), or
  a path / list of paths to local `.json` / `.jsonl` files.
* `split`: Optional split to load when retrieving a remote dataset. This
  parameter is ignored for local files and Delta Lake tables.
* `streaming`: Whether to stream the dataset.
* `name`: Optional name of the dataset configuration/subset to load.
* `delta_storage_options`: Optional dict of storage options for Delta Lake
  cloud authentication (e.g., `{"DATABRICKS_TOKEN": "dapi..."}`).
* `delta_version`: Optional specific version of the Delta table to read.
* `delta_sql_query`: Optional SQL query to execute against the Delta Lake source.
  This is supported when running with a SparkSession (Databricks / pyspark)
  or when using the Databricks SQL Connector. The query must return the
  columns expected by `column_mapping`.

**Returns:**

`datasets.Dataset`: The loaded dataset.

**Examples:**

```python
>>> # Load from HuggingFace Hub
>>> ds = _load_streaming_dataset("org/dataset", split="train")
```

```python
>>> # Load from local Delta Lake table
>>> ds = _load_streaming_dataset("delta:///path/to/delta_table", streaming=True)
```

```python
>>> # Load from Databricks with authentication
>>> ds = _load_streaming_dataset(
...     "delta://catalog.schema.table",
...     delta_storage_options={"DATABRICKS_TOKEN": "dapi..."},
...     streaming=True,
... )
```

```python
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.logger = logging.getLogger(__name__)
```