nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset#
Module Contents#
Classes#
Streaming iterable variant that reuses the column-mapping/tokenization logic.
Functions#
Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.
Data#
API#
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.logger#
'getLogger(...)'
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset._load_streaming_dataset(
- path_or_dataset_id: Union[str, List[str]],
- split: Optional[str] = None,
- streaming: bool = False,
- name: Optional[str] = None,
- delta_storage_options: Optional[Dict[str, str]] = None,
- delta_version: Optional[int] = None,
- delta_sql_query: Optional[str] = None,
Load a dataset from HuggingFace Hub, local JSON/JSONL files, or Delta Lake tables.
If `path_or_dataset_id` resembles a HF repo ID (i.e. of the form `org/dataset` and the path does not exist on the local filesystem), we defer to `datasets.load_dataset` directly. If the path is a Delta Lake table (prefixed with `delta://`, `dbfs:/`, or a directory containing `_delta_log`), we load using the Delta Lake reader. Otherwise, we assume the argument points to one or more local JSON/JSONL files and let `datasets.load_dataset` with the "json" script handle the parsing.
- Parameters:
path_or_dataset_id – Either a HF dataset identifier (`org/name`), a Delta Lake table path (`delta://path/to/table`), or a path / list of paths to local `.json`/`.jsonl` files.
split – Optional split to load when retrieving a remote dataset. This parameter is ignored for local files and Delta Lake tables.
streaming – Whether to stream the dataset.
name – Optional name of the dataset configuration/subset to load.
delta_storage_options – Optional dict of storage options for Delta Lake cloud authentication (e.g., `{"DATABRICKS_TOKEN": "dapi..."}`).
delta_version – Optional specific version of the Delta table to read.
delta_sql_query – Optional SQL query to execute against the Delta Lake source. This is supported when running with a SparkSession (Databricks / pyspark) or when using the Databricks SQL Connector. The query must return the columns expected by `column_mapping`.
- Returns:
The loaded dataset.
- Return type:
datasets.Dataset
.. rubric:: Examples
Load from HuggingFace Hub:
ds = _load_streaming_dataset("org/dataset", split="train")
Load from a local Delta Lake table:
ds = _load_streaming_dataset("delta:///path/to/delta_table", streaming=True)
Load from Databricks with authentication:
ds = _load_streaming_dataset(
    "delta://catalog.schema.table",
    delta_storage_options={"DATABRICKS_TOKEN": "dapi..."},
    streaming=True,
)
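The source-type dispatch described above (HF repo ID vs. Delta Lake table vs. local JSON/JSONL) can be sketched as a standalone heuristic. This is illustrative only, not the module's actual code; `guess_source_kind` is a hypothetical helper:

```python
import os
import re


def guess_source_kind(path_or_dataset_id: str) -> str:
    """Illustrative sketch of the dispatch order described above."""
    # Delta Lake: explicit scheme, DBFS path, or a directory holding _delta_log.
    if path_or_dataset_id.startswith(("delta://", "dbfs:/")):
        return "delta"
    if os.path.isdir(path_or_dataset_id) and os.path.isdir(
        os.path.join(path_or_dataset_id, "_delta_log")
    ):
        return "delta"
    # HF repo ID: an org/name-shaped string that does not exist locally.
    if re.fullmatch(r"[\w.-]+/[\w.-]+", path_or_dataset_id) and not os.path.exists(
        path_or_dataset_id
    ):
        return "hub"
    # Otherwise assume local JSON/JSONL handled by the "json" loader.
    return "json"
```

Note the ordering matters: the Delta checks run first so that a `delta://` URI is never mistaken for an `org/name` repo ID.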
- class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset(
- path_or_dataset_id: Union[str, List[str]],
- column_mapping: Dict[str, str],
- tokenizer,
- *,
- split: Optional[str] = None,
- name: Optional[str] = None,
- answer_only_loss_mask: bool = True,
- seq_length: Optional[int] = None,
- padding: Union[str, bool] = 'do_not_pad',
- truncation: Union[str, bool] = 'do_not_truncate',
- start_of_turn_token: Optional[str] = None,
- limit_dataset_samples: Optional[int] = None,
- repeat_on_exhaustion: bool = True,
- use_hf_chat_template: bool = False,
- delta_storage_options: Optional[Dict[str, str]] = None,
- delta_version: Optional[int] = None,
- delta_sql_query: Optional[str] = None,
Bases:
`torch.utils.data.IterableDataset`, `nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset`
Streaming iterable variant that reuses the column-mapping/tokenization logic.
This wraps a Hugging Face streaming dataset (`IterableDataset` from `datasets`) or a Delta Lake table and yields tokenized samples compatible with the non-streaming variant, while supporting sharding and epoch-setting for deterministic shuffles upstream.
Supports the following data sources:
HuggingFace Hub datasets
Local JSON/JSONL files
Delta Lake tables (via delta://, dbfs:/, or local directories with _delta_log)
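The column-mapping step this class reuses amounts to renaming raw dataset columns to the canonical fields the tokenization step expects. A minimal sketch, assuming the mapping runs from canonical field name to source column name (`apply_column_mapping` is a hypothetical helper, not part of the API):

```python
from typing import Dict


def apply_column_mapping(
    example: Dict[str, str], column_mapping: Dict[str, str]
) -> Dict[str, str]:
    """Rename raw columns to canonical fields (assumed direction:
    canonical field -> source column)."""
    return {field: example[column] for field, column in column_mapping.items()}
```

For instance, a dataset with `instruction`/`response` columns could be mapped to `question`/`answer` fields before tokenization.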
Initialization
Initialize the dataset.
- Parameters:
path_or_dataset_id – The path or dataset id of the dataset.
column_mapping – The mapping between the dataset's columns and the canonical fields used for tokenization.
tokenizer – The tokenizer to use.
split – The split of the dataset to load.
name – The name of the dataset configuration/subset to load.
answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
seq_length – The sequence length to use for padding.
limit_dataset_samples – The number of samples to load from the dataset.
- __iter__() Iterator[Dict[str, List[int]]]#
- __len__() int#
- __getitem__(idx: int) Dict[str, List[int]]#
- set_epoch(epoch: int) None#
- shard(num_shards: int, index: int)#
- shuffle(buffer_size: int = 1000, seed: Optional[int] = None)#
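The `shard()`/`set_epoch()` contract above can be modeled with a toy iterable. This sketch assumes a strided sharding scheme (worker `index` sees every `num_shards`-th sample), which is one common convention for streaming datasets; `ToyShardedStream` is a hypothetical stand-in, not the real class:

```python
from itertools import islice
from typing import List


class ToyShardedStream:
    """Toy model of the shard()/set_epoch() contract (assumed semantics)."""

    def __init__(self, samples: List[int], num_shards: int = 1, index: int = 0):
        self.samples = samples
        self.num_shards = num_shards
        self.index = index
        self.epoch = 0

    def shard(self, num_shards: int, index: int) -> "ToyShardedStream":
        # Return a view that yields only this worker's slice of the stream.
        return ToyShardedStream(self.samples, num_shards, index)

    def set_epoch(self, epoch: int) -> None:
        # Upstream shuffles can key off the epoch for deterministic reshuffling.
        self.epoch = epoch

    def __iter__(self):
        # Strided sharding: start at `index`, step by `num_shards`.
        return islice(iter(self.samples), self.index, None, self.num_shards)
```

In a multi-worker `DataLoader` setup, each worker would typically call `shard(num_workers, worker_id)` once and `set_epoch(e)` at the start of each epoch.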