nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset

Module Contents

Classes

Name | Description
ColumnMappedTextInstructionIterableDataset | Streaming iterable variant that reuses the column-mapping/tokenization logic.

Functions

Name | Description
_load_streaming_dataset | Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.

Data

logger

API

class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset(
path_or_dataset_id: typing.Union[str, typing.List[str]],
column_mapping: typing.Dict[str, str],
tokenizer,
split: typing.Optional[str] = None,
name: typing.Optional[str] = None,
answer_only_loss_mask: bool = True,
seq_length: typing.Optional[int] = None,
padding: typing.Union[str, bool] = 'do_not_pad',
truncation: typing.Union[str, bool] = 'do_not_truncate',
start_of_turn_token: typing.Optional[str] = None,
limit_dataset_samples: typing.Optional[int] = None,
repeat_on_exhaustion: bool = True,
use_hf_chat_template: bool = False,
delta_storage_options: typing.Optional[typing.Dict[str, str]] = None,
delta_version: typing.Optional[int] = None,
delta_sql_query: typing.Optional[str] = None
)

Bases: IterableDataset, ColumnMappedTextInstructionDataset

Streaming iterable variant that reuses the column-mapping/tokenization logic.

This wraps a Hugging Face streaming dataset (IterableDataset from datasets) or Delta Lake table and yields tokenized samples compatible with the non-streaming variant, while supporting sharding and epoch-setting for deterministic shuffles upstream.

Supports the following data sources:

  • HuggingFace Hub datasets
  • Local JSON/JSONL files
  • Delta Lake tables (via delta://, dbfs:/, or local directories with _delta_log)

_current_epoch_for_repeat = 0
num_shards
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.__getitem__(
idx: int
) -> typing.Dict[str, typing.List[int]]
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.__iter__() -> typing.Iterator[typing.Dict[str, typing.List[int]]]
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.__len__() -> int
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.set_epoch(
epoch: int
) -> None
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.shard(
num_shards: int,
index: int
)
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset.shuffle(
buffer_size: int = 1000,
seed: typing.Optional[int] = None
)
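
The shard, shuffle, and set_epoch signatures above are listed without a description, so the following is only a minimal sketch of how sharding and seeded buffer shuffling commonly behave on a stream. It is not the library's implementation. The helper names shard_stream and buffered_shuffle are hypothetical, item-level sharding is an assumption (a streaming backend may shard at file level instead), and folding the epoch into the seed is likewise an assumption about how set_epoch could keep shuffles deterministic.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def shard_stream(items: Iterable[T], num_shards: int, index: int) -> Iterator[T]:
    """Hypothetical helper: yield every num_shards-th item, starting at `index`."""
    for i, item in enumerate(items):
        if i % num_shards == index:
            yield item


def buffered_shuffle(
    items: Iterable[T],
    buffer_size: int = 1000,
    seed: Optional[int] = None,
    epoch: int = 0,
) -> Iterator[T]:
    """Hypothetical helper: approximate shuffle of a stream with a fixed-size buffer."""
    # Assumption: the epoch is added to the seed so that each epoch gets a
    # different but reproducible order.
    rng = random.Random(None if seed is None else seed + epoch)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        j = rng.randrange(buffer_size)
        yield buffer[j]
        buffer[j] = item
    # Drain the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


# Two workers each see a disjoint half of the stream.
rank0 = list(shard_stream(range(10), num_shards=2, index=0))
rank1 = list(shard_stream(range(10), num_shards=2, index=1))

# The same seed and epoch reproduce the same order.
epoch0_a = list(buffered_shuffle(range(10), buffer_size=4, seed=0, epoch=0))
epoch0_b = list(buffered_shuffle(range(10), buffer_size=4, seed=0, epoch=0))
```

A smaller buffer_size uses less memory but keeps items closer to their original position in the stream.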
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset._load_streaming_dataset(
path_or_dataset_id: typing.Union[str, typing.List[str]],
split: typing.Optional[str] = None,
streaming: bool = False,
name: typing.Optional[str] = None,
delta_storage_options: typing.Optional[typing.Dict[str, str]] = None,
delta_version: typing.Optional[int] = None,
delta_sql_query: typing.Optional[str] = None
)

Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.

If path_or_dataset_id resembles a Hugging Face repo ID (that is, it has the form org/dataset and the path does not exist on the local filesystem), the function defers to datasets.load_dataset directly. If the path is a Delta Lake table (prefixed with delta:// or dbfs:/, or a directory containing _delta_log), it is loaded with the Delta Lake reader. Otherwise, the argument is assumed to point to one or more local JSON/JSONL files, and datasets.load_dataset parses them with the "json" script.
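
The three-way dispatch described above can be sketched in plain Python. This is an illustration under stated assumptions, not the function's actual code: classify_source is a hypothetical helper, and checking the Delta prefixes before the repo-ID pattern is an assumption made here because a string such as dbfs:/table would otherwise also match the org/dataset shape.

```python
import os
from typing import List, Union


def classify_source(path_or_dataset_id: Union[str, List[str]]) -> str:
    """Hypothetical helper: return "delta", "hub", or "json" for a source argument."""
    # A list of paths is treated as local JSON/JSONL files.
    if isinstance(path_or_dataset_id, list):
        return "json"

    path = path_or_dataset_id

    # Delta Lake: explicit scheme prefixes, or a local directory with _delta_log.
    if path.startswith(("delta://", "dbfs:/")):
        return "delta"
    if os.path.isdir(os.path.join(path, "_delta_log")):
        return "delta"

    # Hub repo ID: has the form "org/dataset" and does not exist locally.
    parts = path.split("/")
    looks_like_repo_id = len(parts) == 2 and all(parts)
    if looks_like_repo_id and not os.path.exists(path):
        return "hub"

    # Everything else is assumed to be a local JSON/JSONL file.
    return "json"
```

For example, classify_source("delta:///path/to/delta_table") returns "delta", and a list such as ["a.jsonl", "b.jsonl"] returns "json".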

Parameters:

path_or_dataset_id
Union[str, List[str]]

Either a HF dataset identifier (org/name), a Delta Lake table path (delta://path/to/table), or a path / list of paths to local .json / .jsonl files.

split
Optional[str], defaults to None

Optional split to load when retrieving a remote dataset. This parameter is ignored for local files and Delta Lake tables.

streaming
bool, defaults to False

Whether to stream the dataset.

name
Optional[str], defaults to None

Optional name of the dataset configuration/subset to load.

delta_storage_options
Optional[Dict[str, str]], defaults to None

Optional dict of storage options for Delta Lake cloud authentication (e.g., {"DATABRICKS_TOKEN": "dapi..."}).

delta_version
Optional[int], defaults to None

Optional specific version of the Delta table to read.

delta_sql_query
Optional[str], defaults to None

Optional SQL query to execute against the Delta Lake source. This is supported when running with a SparkSession (Databricks / pyspark) or when using the Databricks SQL Connector. The query must return the columns expected by column_mapping.

Returns:

datasets.Dataset: The loaded dataset.

Examples:

>>> # Load from HuggingFace Hub
>>> ds = _load_streaming_dataset("org/dataset", split="train")
>>> # Load from local Delta Lake table
>>> ds = _load_streaming_dataset("delta:///path/to/delta_table", streaming=True)
>>> # Load from Databricks with authentication
>>> ds = _load_streaming_dataset(
...     "delta://catalog.schema.table",
...     delta_storage_options={"DATABRICKS_TOKEN": "dapi..."},
...     streaming=True,
... )
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.logger = logging.getLogger(__name__)