nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset#
Module Contents#
Classes#
Streaming iterable variant that reuses the column-mapping/tokenization logic.
Functions#
Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.
Data#
API#
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.logger#
'getLogger(...)'
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset._load_streaming_dataset(
- path_or_dataset_id: Union[str, List[str]],
- split: Optional[str] = None,
- streaming: bool = False,
- name: Optional[str] = None,
- delta_storage_options: Optional[Dict[str, str]] = None,
- delta_version: Optional[int] = None,
- delta_sql_query: Optional[str] = None,
Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.
If `path_or_dataset_id` resembles a HF repo ID (i.e. of the form `org/dataset` and the path does not exist on the local filesystem), we defer to `datasets.load_dataset` directly. If the path is a Delta Lake table (prefixed with `delta://`, `dbfs:/`, or a directory containing `_delta_log`), we load using the Delta Lake reader. Otherwise, we assume the argument points to one or more local JSON/JSONL files and let `datasets.load_dataset` with the "json" script handle the parsing.
- Parameters:
path_or_dataset_id – Either a HF dataset identifier (`org/name`), a Delta Lake table path (`delta://path/to/table`), or a path / list of paths to local `.json`/`.jsonl` files.
split – Optional split to load when retrieving a remote dataset. This parameter is ignored for local files and Delta Lake tables.
streaming – Whether to stream the dataset.
name – Optional name of the dataset configuration/subset to load.
delta_storage_options – Optional dict of storage options for Delta Lake cloud authentication (e.g., `{"DATABRICKS_TOKEN": "dapi..."}`).
delta_version – Optional specific version of the Delta table to read.
delta_sql_query – Optional SQL query to execute against the Delta Lake source. This is supported when running with a SparkSession (Databricks / pyspark) or when using the Databricks SQL Connector. The query must return the columns expected by `column_mapping`.
- Returns:
The loaded dataset.
- Return type:
datasets.Dataset
.. rubric:: Examples
Load from HuggingFace Hub:
ds = _load_streaming_dataset("org/dataset", split="train")
Load from a local Delta Lake table:
ds = _load_streaming_dataset("delta:///path/to/delta_table", streaming=True)
Load from Databricks with authentication:
ds = _load_streaming_dataset(
    "delta://catalog.schema.table",
    delta_storage_options={"DATABRICKS_TOKEN": "dapi..."},
    streaming=True,
)
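The source-type dispatch described above (HF repo ID vs. Delta Lake table vs. local JSON/JSONL) can be sketched as a standalone heuristic. This is illustrative only, not the module's actual code; `guess_source_kind` is a hypothetical helper:

```python
import os
import re


def guess_source_kind(path_or_dataset_id: str) -> str:
    """Illustrative sketch of the dispatch order described above."""
    # Delta Lake: explicit scheme, DBFS path, or a directory holding _delta_log.
    if path_or_dataset_id.startswith(("delta://", "dbfs:/")):
        return "delta"
    if os.path.isdir(path_or_dataset_id) and os.path.isdir(
        os.path.join(path_or_dataset_id, "_delta_log")
    ):
        return "delta"
    # HF repo ID: an org/name-shaped string that does not exist locally.
    if re.fullmatch(r"[\w.-]+/[\w.-]+", path_or_dataset_id) and not os.path.exists(
        path_or_dataset_id
    ):
        return "hub"
    # Otherwise assume local JSON/JSONL handled by the "json" loader.
    return "json"
```

Note the ordering matters: the Delta checks run first so that a `delta://` URI is never mistaken for an `org/name` repo ID.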
- class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset(
- path_or_dataset_id: Union[str, List[str]],
- column_mapping: Dict[str, str],
- tokenizer,
- *,
- split: Optional[str] = None,
- name: Optional[str] = None,
- answer_only_loss_mask: bool = True,
- seq_length: Optional[int] = None,
- padding: Union[str, bool] = 'do_not_pad',
- truncation: Union[str, bool] = 'do_not_truncate',
- start_of_turn_token: Optional[str] = None,
- limit_dataset_samples: Optional[int] = None,
- repeat_on_exhaustion: bool = True,
- use_hf_chat_template: bool = False,
- delta_storage_options: Optional[Dict[str, str]] = None,
- delta_version: Optional[int] = None,
- delta_sql_query: Optional[str] = None,
Bases:
`torch.utils.data.IterableDataset`, `nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset`
Streaming iterable variant that reuses the column-mapping/tokenization logic.
This wraps a Hugging Face streaming dataset (`IterableDataset` from `datasets`) or a Delta Lake table and yields tokenized samples compatible with the non-streaming variant, while supporting sharding and epoch-setting for deterministic shuffles upstream.
Supports the following data sources:
HuggingFace Hub datasets
Local JSON/JSONL files
Delta Lake tables (via delta://, dbfs:/, or local directories with _delta_log)
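The column-mapping step this class reuses amounts to renaming raw dataset columns to the canonical fields the tokenization step expects. A minimal sketch, assuming the mapping runs from canonical field name to source column name (`apply_column_mapping` is a hypothetical helper, not part of the API):

```python
from typing import Dict


def apply_column_mapping(
    example: Dict[str, str], column_mapping: Dict[str, str]
) -> Dict[str, str]:
    """Rename raw columns to canonical fields (assumed direction:
    canonical field -> source column)."""
    return {field: example[column] for field, column in column_mapping.items()}
```

For instance, a dataset with `instruction`/`response` columns could be mapped to `question`/`answer` fields before tokenization.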
Initialization
Initialize the dataset.
- Parameters:
path_or_dataset_id – The path or dataset id of the dataset.
column_mapping – The mapping between the dataset's columns and the canonical fields used for tokenization.
tokenizer – The tokenizer to use.
split – The split of the dataset to load.
name – The name of the dataset configuration/subset to load.
answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
seq_length – The sequence length to use for padding.
limit_dataset_samples – The number of samples to load from the dataset.
- __iter__() Iterator[Dict[str, List[int]]]#
- __len__() int#
- __getitem__(idx: int) Dict[str, List[int]]#
- set_epoch(epoch: int) None#
- shard(num_shards: int, index: int)#
- shuffle(buffer_size: int = 1000, seed: Optional[int] = None)#
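The `shard()`/`set_epoch()` contract above can be modeled with a toy iterable. This sketch assumes a strided sharding scheme (worker `index` sees every `num_shards`-th sample), which is one common convention for streaming datasets; `ToyShardedStream` is a hypothetical stand-in, not the real class:

```python
from itertools import islice
from typing import List


class ToyShardedStream:
    """Toy model of the shard()/set_epoch() contract (assumed semantics)."""

    def __init__(self, samples: List[int], num_shards: int = 1, index: int = 0):
        self.samples = samples
        self.num_shards = num_shards
        self.index = index
        self.epoch = 0

    def shard(self, num_shards: int, index: int) -> "ToyShardedStream":
        # Return a view that yields only this worker's slice of the stream.
        return ToyShardedStream(self.samples, num_shards, index)

    def set_epoch(self, epoch: int) -> None:
        # Upstream shuffles can key off the epoch for deterministic reshuffling.
        self.epoch = epoch

    def __iter__(self):
        # Strided sharding: start at `index`, step by `num_shards`.
        return islice(iter(self.samples), self.index, None, self.num_shards)
```

In a multi-worker `DataLoader` setup, each worker would typically call `shard(num_workers, worker_id)` once and `set_epoch(e)` at the start of each epoch.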