nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset#

Module Contents#

Classes#

ColumnMappedTextInstructionIterableDataset

Streaming iterable variant that reuses the column-mapping/tokenization logic.

Functions#

_load_streaming_dataset

Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.

Data#

API#

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.logger#

'getLogger(...)'

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset._load_streaming_dataset(
path_or_dataset_id: Union[str, List[str]],
split: Optional[str] = None,
streaming: bool = False,
name: Optional[str] = None,
delta_storage_options: Optional[Dict[str, str]] = None,
delta_version: Optional[int] = None,
delta_sql_query: Optional[str] = None,
)#

Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.

If path_or_dataset_id resembles a HF repo ID (i.e. of the form org/dataset, and the path does not exist on the local filesystem), we defer to datasets.load_dataset directly. If the path is a Delta Lake table (prefixed with delta://, dbfs:/, or a directory containing _delta_log), we load it with the Delta Lake reader. Otherwise, we assume the argument points to one or more local JSON/JSONL files and let datasets.load_dataset with the "json" loading script handle the parsing.
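The dispatch described above can be sketched roughly as follows. Note that `resolve_source_kind` is a hypothetical helper written for illustration only; it is not part of this module:

```python
import os
import re

def resolve_source_kind(path_or_dataset_id: str) -> str:
    """Illustrative sketch of the source-resolution heuristic described above.

    Returns one of: 'delta', 'hf_hub', 'json'. Hypothetical helper, not
    part of nemo_automodel.
    """
    # Delta Lake: explicit scheme, or a directory containing a _delta_log folder.
    if path_or_dataset_id.startswith(("delta://", "dbfs:/")):
        return "delta"
    if os.path.isdir(path_or_dataset_id) and os.path.isdir(
        os.path.join(path_or_dataset_id, "_delta_log")
    ):
        return "delta"
    # HF repo ID: looks like org/name and does not exist on the local filesystem.
    if re.fullmatch(r"[\w.-]+/[\w.-]+", path_or_dataset_id) and not os.path.exists(
        path_or_dataset_id
    ):
        return "hf_hub"
    # Otherwise: assume local JSON/JSONL, parsed via load_dataset("json", ...).
    return "json"
```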

Parameters:
  • path_or_dataset_id – Either a HF dataset identifier (org/name), a Delta Lake table path (delta://path/to/table), or a path / list of paths to local .json / .jsonl files.

  • split – Optional split to load when retrieving a remote dataset. This parameter is ignored for local files and Delta Lake tables.

  • streaming – Whether to stream the dataset.

  • name – Optional name of the dataset configuration/subset to load.

  • delta_storage_options – Optional dict of storage options for Delta Lake cloud authentication (e.g., {"DATABRICKS_TOKEN": "dapi..."}).

  • delta_version – Optional specific version of the Delta table to read.

  • delta_sql_query – Optional SQL query to execute against the Delta Lake source. This is supported when running with a SparkSession (Databricks / pyspark) or when using the Databricks SQL Connector. The query must return the columns expected by column_mapping.

Returns:

The loaded dataset.

Return type:

datasets.Dataset

Examples

Load from HuggingFace Hub:

ds = _load_streaming_dataset("org/dataset", split="train")

Load from a local Delta Lake table:

ds = _load_streaming_dataset("delta:///path/to/delta_table", streaming=True)

Load from Databricks with authentication:

ds = _load_streaming_dataset(
    "delta://catalog.schema.table",
    delta_storage_options={"DATABRICKS_TOKEN": "dapi..."},
    streaming=True,
)

class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset(
path_or_dataset_id: Union[str, List[str]],
column_mapping: Dict[str, str],
tokenizer,
*,
split: Optional[str] = None,
name: Optional[str] = None,
answer_only_loss_mask: bool = True,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
start_of_turn_token: Optional[str] = None,
limit_dataset_samples: Optional[int] = None,
repeat_on_exhaustion: bool = True,
use_hf_chat_template: bool = False,
delta_storage_options: Optional[Dict[str, str]] = None,
delta_version: Optional[int] = None,
delta_sql_query: Optional[str] = None,
)#

Bases: torch.utils.data.IterableDataset, nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset

Streaming iterable variant that reuses the column-mapping/tokenization logic.

This wraps a Hugging Face streaming dataset (IterableDataset from datasets) or Delta Lake table and yields tokenized samples compatible with the non-streaming variant, while supporting sharding and epoch-setting for deterministic shuffles upstream.

Supports the following data sources:

  • HuggingFace Hub datasets

  • Local JSON/JSONL files

  • Delta Lake tables (via delta://, dbfs:/, or local directories with _delta_log)
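The deterministic epoch-based shuffling mentioned above typically works by deriving the shuffle seed from a base seed plus the epoch number, so each epoch yields a different but reproducible order. A minimal sketch of this idea (assumed behavior, not the library's implementation):

```python
import random
from typing import List, Sequence

def epoch_order(indices: Sequence[int], base_seed: int, epoch: int) -> List[int]:
    """Sketch of deterministic per-epoch shuffling (illustrative only).

    The same (base_seed, epoch) pair always produces the same order,
    while different epochs produce different permutations.
    """
    rng = random.Random(base_seed + epoch)
    out = list(indices)
    rng.shuffle(out)
    return out
```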

Initialization

Initialize the dataset.

Parameters:
  • path_or_dataset_id – The path or dataset id of the dataset.

  • column_mapping – Mapping from logical column names to the dataset's actual column names.

  • tokenizer – The tokenizer to use.

  • split – The split of the dataset to load.

  • name – The name of the dataset configuration/subset to load.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

  • seq_length – The sequence length to use for padding.

  • padding – Padding strategy applied when tokenizing, as accepted by the tokenizer.

  • truncation – Truncation strategy applied when tokenizing, as accepted by the tokenizer.

  • start_of_turn_token – Optional token marking the start of a turn, used when building the answer-only loss mask.

  • limit_dataset_samples – The number of samples to load from the dataset.

  • repeat_on_exhaustion – Whether to restart iteration from the beginning once the underlying stream is exhausted.

  • use_hf_chat_template – Whether to format samples with the tokenizer's chat template.

  • delta_storage_options – Optional dict of storage options for Delta Lake cloud authentication.

  • delta_version – Optional specific version of the Delta table to read.

  • delta_sql_query – Optional SQL query to execute against the Delta Lake source.

__iter__() → Iterator[Dict[str, List[int]]]#
__len__() → int#
__getitem__(idx: int) → Dict[str, List[int]]#
set_epoch(epoch: int) → None#
shard(num_shards: int, index: int)#
shuffle(buffer_size: int = 1000, seed: Optional[int] = None)#
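The shard and shuffle methods above follow the usual streaming-dataset semantics: sharding assigns every num_shards-th sample to a given worker, and shuffling draws randomly from a bounded buffer rather than materializing the whole stream. A minimal sketch under those assumed semantics (not the library's actual implementation):

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")

def shard_stream(stream: Iterable[T], num_shards: int, index: int) -> Iterator[T]:
    """Round-robin sharding sketch: worker `index` keeps every
    `num_shards`-th sample (assumed semantics, for illustration)."""
    for i, sample in enumerate(stream):
        if i % num_shards == index:
            yield sample

def buffered_shuffle(
    stream: Iterable[T], buffer_size: int = 1000, seed: Optional[int] = None
) -> Iterator[T]:
    """Buffered-shuffle sketch: keep up to `buffer_size` samples in memory
    and emit a random one as each new sample arrives, then drain the buffer."""
    rng = random.Random(seed)
    buf: List[T] = []
    for sample in stream:
        buf.append(sample)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf
```

With a buffer at least as large as the stream, buffered_shuffle degenerates to a full shuffle; with a small buffer it trades shuffle quality for bounded memory, which is the standard compromise for streaming data.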