nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset
Module Contents
Classes
Functions
Data
API
Bases: IterableDataset, ColumnMappedTextInstructionDataset
Streaming iterable variant that reuses the column-mapping/tokenization logic.
It wraps a Hugging Face streaming dataset (an IterableDataset from the datasets library)
or a Delta Lake table and yields tokenized samples in the same format as the non-streaming
variant. It also supports sharding and epoch-setting, so that upstream shuffles stay deterministic.
Supports the following data sources:
- Hugging Face Hub datasets
- Local JSON/JSONL files
- Delta Lake tables (via delta://, dbfs:/, or local directories with _delta_log)
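As a rough illustration of what sharding and epoch-setting mean for a streaming source, here is a minimal pure-Python sketch. The class name ShardedStream, its constructor arguments, and the buffer-based shuffle are all assumptions made for this example; they are not the library's implementation, which delegates these operations to the wrapped dataset.

```python
import random
from typing import Any, Callable, Dict, Iterator, List


class ShardedStream:
    """Hypothetical toy stream; NOT the class documented on this page."""

    def __init__(
        self,
        rows: List[Dict[str, Any]],
        transform: Callable[[Dict[str, Any]], Any],
        buffer_size: int = 4,
        seed: int = 0,
    ) -> None:
        self.rows = rows
        self.transform = transform  # stands in for column mapping + tokenization
        self.buffer_size = buffer_size
        self.seed = seed
        self.epoch = 0
        self.num_shards = 1
        self.index = 0

    def set_epoch(self, epoch: int) -> None:
        # The shuffle order depends only on (seed, epoch), so it is reproducible.
        self.epoch = epoch

    def shard(self, num_shards: int, index: int) -> "ShardedStream":
        # Each worker keeps every num_shards-th row, starting at its own index.
        self.num_shards = num_shards
        self.index = index
        return self

    def __iter__(self) -> Iterator[Any]:
        rng = random.Random(self.seed + self.epoch)
        buffer: List[Dict[str, Any]] = []
        for i, row in enumerate(self.rows):
            if i % self.num_shards != self.index:
                continue
            buffer.append(row)
            if len(buffer) >= self.buffer_size:
                # Emit a random element once the shuffle buffer is full.
                yield self.transform(buffer.pop(rng.randrange(len(buffer))))
        # Drain whatever is left in the buffer, in shuffled order.
        rng.shuffle(buffer)
        for row in buffer:
            yield self.transform(row)
```

In this sketch, two workers calling shard(2, 0) and shard(2, 1) see disjoint halves of the rows, and iterating twice without changing the epoch yields the same order.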
Load a dataset from the Hugging Face Hub, local JSON/JSONL files, or a Delta Lake table.
If path_or_dataset_id resembles an HF repo ID (i.e. it has the form
org/dataset and the path does not exist on the local filesystem),
we defer to datasets.load_dataset directly. If the path is a Delta Lake
table (prefixed with delta:// or dbfs:/, or a directory containing
_delta_log), we load it using the Delta Lake reader. Otherwise, we assume
the argument points to one or more local JSON/JSONL files and let
datasets.load_dataset with the "json" script handle the parsing.
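The selection order described above can be sketched as a small classifier. This is only an approximation under stated assumptions: the helper name classify_source and the repo-ID regular expression are hypothetical, and the real loader may apply its checks differently.

```python
import os
import re
from typing import List, Union

# Assumption: a repo ID looks like "org/dataset" with exactly one slash.
_REPO_ID = re.compile(r"^[\w.\-]+/[\w.\-]+$")


def classify_source(path_or_dataset_id: Union[str, List[str]]) -> str:
    """Return "hub", "delta", or "json" (hypothetical helper, not library API)."""
    if isinstance(path_or_dataset_id, str):
        path = path_or_dataset_id
        # 1. Looks like org/dataset and nothing with that name exists locally.
        if _REPO_ID.match(path) and not os.path.exists(path):
            return "hub"
        # 2. Explicit Delta Lake prefixes.
        if path.startswith(("delta://", "dbfs:/")):
            return "delta"
        # 3. A local directory that contains a _delta_log subdirectory.
        if os.path.isdir(os.path.join(path, "_delta_log")):
            return "delta"
    # 4. Anything else, including a list of paths, is treated as JSON/JSONL.
    return "json"
```

Note that under these rules a relative path such as data/train.jsonl that does not exist locally also matches the org/dataset shape, so it would be treated as a Hub identifier.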
Parameters:
- path_or_dataset_id: Either a HF dataset identifier (org/name), a Delta Lake table path (delta://path/to/table), or a path / list of paths to local .json / .jsonl files.
- Optional split to load when retrieving a remote dataset. This parameter is ignored for local files and Delta Lake tables.
- Whether to stream the dataset.
- Optional name of the dataset configuration/subset to load.
- Optional dict of storage options for Delta Lake cloud authentication (e.g., {"DATABRICKS_TOKEN": "dapi..."}).
- Optional specific version of the Delta table to read.
- Optional SQL query to execute against the Delta Lake source. This is supported when running with a SparkSession (Databricks / pyspark) or when using the Databricks SQL Connector. The query must return the columns expected by column_mapping.
Returns:
datasets.Dataset: The loaded dataset.
Examples: