nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset

Module Contents

Classes

Name	Description
`ColumnMappedTextInstructionIterableDataset`	Streaming iterable variant that reuses the column-mapping/tokenization logic.

Functions

Name	Description
`_load_streaming_dataset`	Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.

Data

logger

API

class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset(
    path_or_dataset_id: typing.Union[str, typing.List[str]],
    column_mapping: typing.Dict[str, str],
    tokenizer,
    split: typing.Optional[str] = None,
    name: typing.Optional[str] = None,
    answer_only_loss_mask: bool = True,
    seq_length: typing.Optional[int] = None,
    padding: typing.Union[str, bool] = 'do_not_pad',
    truncation: typing.Union[str, bool] = 'do_not_truncate',
    start_of_turn_token: typing.Optional[str] = None,
    limit_dataset_samples: typing.Optional[int] = None,
    repeat_on_exhaustion: bool = True,
    use_hf_chat_template: bool = False,
    delta_storage_options: typing.Optional[typing.Dict[str, str]] = None,
    delta_version: typing.Optional[int] = None,
    delta_sql_query: typing.Optional[str] = None
)

Bases: IterableDataset, ColumnMappedTextInstructionDataset

Streaming iterable variant that reuses the column-mapping/tokenization logic.

This wraps a Hugging Face streaming dataset (IterableDataset from datasets) or Delta Lake table and yields tokenized samples compatible with the non-streaming variant, while supporting sharding and epoch-setting for deterministic shuffles upstream.

Supports the following data sources:

HuggingFace Hub datasets
Local JSON/JSONL files
Delta Lake tables (via delta://, dbfs:/, or local directories with _delta_log)

_current_epoch_for_repeat = 0

num_shards

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.__getitem__(
    idx: int
) -> typing.Dict[str, typing.List[int]]

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.__iter__() -> typing.Iterator[typing.Dict[str, typing.List[int]]]

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.__len__() -> int

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.set_epoch(
    epoch: int
) -> None

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.shard(
    num_shards: int,
    index: int
)

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.shuffle(
    buffer_size: int = 1000,
    seed: typing.Optional[int] = None
)
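This page lists `shard`, `shuffle`, and `set_epoch` without describing how they behave. As a generic illustration only, the sketch below shows what per-sample sharding and a seeded, buffer-based shuffle look like on a plain Python stream. The helper names `shard_stream` and `buffered_shuffle`, the round-robin sharding rule, and the `seed + epoch` mixing are assumptions made for this example; they are not taken from nemo_automodel, and the wrapped Hugging Face stream may shard at a different granularity.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def shard_stream(stream: Iterable[T], num_shards: int, index: int) -> Iterator[T]:
    """Keep every num_shards-th sample, starting at position `index` (hypothetical helper)."""
    for i, sample in enumerate(stream):
        if i % num_shards == index:
            yield sample


def buffered_shuffle(
    stream: Iterable[T],
    buffer_size: int = 1000,
    seed: Optional[int] = None,
    epoch: int = 0,
) -> Iterator[T]:
    """Approximately shuffle a stream with a fixed-size buffer (hypothetical helper)."""
    # Assumption: mixing the epoch into the seed gives a different but
    # reproducible order for each epoch.
    rng = random.Random(None if seed is None else seed + epoch)
    buffer: List[T] = []
    for sample in stream:
        if len(buffer) < buffer_size:
            buffer.append(sample)
            continue
        # Buffer is full: emit a random element and store the new sample in its slot.
        j = rng.randrange(buffer_size)
        yield buffer[j]
        buffer[j] = sample
    # Drain the remaining samples in random order.
    rng.shuffle(buffer)
    yield from buffer
```

With a fixed `seed` and `epoch`, every worker that runs `buffered_shuffle` over the same stream sees the same order, which is the property that makes upstream shuffles deterministic.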

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset._load_streaming_dataset(
    path_or_dataset_id: typing.Union[str, typing.List[str]],
    split: typing.Optional[str] = None,
    streaming: bool = False,
    name: typing.Optional[str] = None,
    delta_storage_options: typing.Optional[typing.Dict[str, str]] = None,
    delta_version: typing.Optional[int] = None,
    delta_sql_query: typing.Optional[str] = None
)

Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.

If path_or_dataset_id resembles a HF repo ID (i.e. of the form org/dataset and the path does not exist on the local filesystem), we defer to datasets.load_dataset directly. If the path is a Delta Lake table (prefixed with delta://, dbfs:/, or a directory containing _delta_log), we load using the Delta Lake reader. Otherwise, we assume the argument points to one or more local JSON/JSONL files and let datasets.load_dataset with the "json" script handle the parsing.
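The dispatch rules above can be sketched as a small classifier. This is a minimal illustration, not the library's implementation: `classify_source` is a hypothetical helper, and it checks the Delta Lake conditions first (an assumption) so that a path such as `dbfs:/table` is not mistaken for an `org/dataset` repo ID.

```python
import os
from typing import List, Union


def classify_source(path_or_dataset_id: Union[str, List[str]]) -> str:
    """Return "delta", "hub", or "json" for a dataset source.

    Hypothetical helper for illustration; not part of nemo_automodel.
    """
    # A list can only be a collection of local JSON/JSONL files.
    if isinstance(path_or_dataset_id, list):
        return "json"

    path = path_or_dataset_id

    # Delta Lake: explicit URI prefix, or a directory that contains _delta_log.
    if path.startswith(("delta://", "dbfs:/")):
        return "delta"
    if os.path.isdir(os.path.join(path, "_delta_log")):
        return "delta"

    # HF repo ID: exactly one "/" (org/dataset) and nothing at that path locally.
    parts = path.split("/")
    if len(parts) == 2 and all(parts) and not os.path.exists(path):
        return "hub"

    # Everything else is treated as local JSON/JSONL.
    return "json"
```

Note that under these rules a relative path such as `data/train.jsonl` that does not exist locally also looks like a repo ID, so a mistyped local path would be sent to the Hub loader rather than rejected as a missing file.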

Parameters:

path_or_dataset_id

Union[str, List[str]]

Either a HF dataset identifier (org/name), a Delta Lake table path (delta://path/to/table), or a path / list of paths to local .json / .jsonl files.

split

Optional[str], defaults to None

Optional split to load when retrieving a remote dataset. This parameter is ignored for local files and Delta Lake tables.

streaming

bool, defaults to False

Whether to stream the dataset.

name

Optional[str], defaults to None

Optional name of the dataset configuration/subset to load.

delta_storage_options

Optional[Dict[str, str]], defaults to None

Optional dict of storage options for Delta Lake cloud authentication (e.g., {"DATABRICKS_TOKEN": "dapi..."}).

delta_version

Optional[int], defaults to None

Optional specific version of the Delta table to read.

delta_sql_query

Optional[str], defaults to None

Optional SQL query to execute against the Delta Lake source. This is supported when running with a SparkSession (Databricks / pyspark) or when using the Databricks SQL Connector. The query must return the columns expected by column_mapping.

Returns:

datasets.Dataset: The loaded dataset.

Examples:

>>> # Load from HuggingFace Hub
>>> ds = _load_streaming_dataset("org/dataset", split="train")

>>> # Load from local Delta Lake table
>>> ds = _load_streaming_dataset("delta:///path/to/delta_table", streaming=True)

>>> # Load from Databricks with authentication
>>> ds = _load_streaming_dataset(
...     "delta://catalog.schema.table",
...     delta_storage_options={"DATABRICKS_TOKEN": "dapi..."},
...     streaming=True,
... )

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.logger = logging.getLogger(__name__)