nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset
Module Contents
Classes
ColumnMappedTextInstructionIterableDataset | Streaming iterable variant that reuses the column-mapping/tokenization logic.
Data
API
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.logger
'getLogger(…)'
- class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset(
    path_or_dataset_id: Union[str, List[str]],
    column_mapping: Dict[str, str],
    tokenizer,
    *,
    split: Optional[str] = None,
    name: Optional[str] = None,
    answer_only_loss_mask: bool = True,
    seq_length: Optional[int] = None,
    padding: Union[str, bool] = 'do_not_pad',
    truncation: Union[str, bool] = 'do_not_truncate',
    start_of_turn_token: Optional[str] = None,
    limit_dataset_samples: Optional[int] = None,
    repeat_on_exhaustion: bool = True,
    use_hf_chat_template: bool = False,
)
Bases: torch.utils.data.IterableDataset, nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset
Streaming iterable variant that reuses the column-mapping/tokenization logic.
This wraps a Hugging Face streaming dataset (IterableDataset from datasets) and yields tokenized samples compatible with the non-streaming variant, while supporting sharding and epoch-setting for deterministic shuffles upstream.
Initialization
Initialize the dataset.
- Parameters:
path_or_dataset_id – The path or dataset id of the dataset.
column_mapping – The mapping of the columns.
tokenizer – The tokenizer to use.
split – The split of the dataset to load.
name – The name of the dataset configuration/subset to load.
answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
seq_length – The sequence length to use for padding.
limit_dataset_samples – The number of samples to load from the dataset.
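The interplay of column_mapping and answer_only_loss_mask can be illustrated with a minimal, library-free sketch. Everything in it is an assumption made for illustration: toy_tokenize and map_and_tokenize are hypothetical helpers, the mapping direction (logical role to source column), the role names "question" and "answer", and the output keys input_ids and loss_mask are not taken from this module, and the real class delegates tokenization to the tokenizer passed in.

```python
from typing import Dict, Iterable, Iterator, List


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one pseudo-id per whitespace-separated word."""
    return [sum(ord(ch) for ch in word) % 1000 for word in text.split()]


def map_and_tokenize(
    rows: Iterable[Dict[str, str]],
    column_mapping: Dict[str, str],
    answer_only_loss_mask: bool = True,
) -> Iterator[Dict[str, List[int]]]:
    """Lazily yield one tokenized sample per raw row (illustrative only)."""
    for row in rows:
        # Assumption: column_mapping maps a logical role to a source column name.
        fields = {role: row[column] for role, column in column_mapping.items()}
        prompt_ids = toy_tokenize(fields["question"])
        answer_ids = toy_tokenize(fields["answer"])
        # With answer-only masking, prompt positions get 0 and answer positions get 1;
        # otherwise every position contributes to the loss.
        prompt_flag = 0 if answer_only_loss_mask else 1
        yield {
            "input_ids": prompt_ids + answer_ids,
            "loss_mask": [prompt_flag] * len(prompt_ids) + [1] * len(answer_ids),
        }


rows = [{"q": "What is two plus two", "a": "four"}]
mapping = {"question": "q", "answer": "a"}
masked = next(map_and_tokenize(rows, mapping))
unmasked = next(map_and_tokenize(rows, mapping, answer_only_loss_mask=False))
```

Because the generator pulls one row at a time, nothing is materialized up front, which is the property a streaming variant relies on.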
- __iter__() → Iterator[Dict[str, List[int]]]
- set_epoch(epoch: int) → None
- shard(num_shards: int, index: int)
- shuffle(buffer_size: int, seed: int)
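How sharding, buffered shuffling, and epoch-setting can combine into a deterministic stream is sketched below on a toy class. This is not the implementation in this module: ToyStream is hypothetical, the strided sharding rule and the use of seed + epoch as the effective shuffle seed are assumptions chosen for illustration, and the real methods forward to the wrapped Hugging Face streaming dataset.

```python
import random
from typing import Iterable, Iterator, List


class ToyStream:
    """Hypothetical stand-in for a streaming dataset; not the library class."""

    def __init__(self, items: Iterable[int]):
        self._items = list(items)
        self._num_shards, self._index = 1, 0
        self._buffer_size, self._seed = 0, 0
        self._epoch = 0

    def shard(self, num_shards: int, index: int):
        # Remember which slice of the stream this worker owns.
        self._num_shards, self._index = num_shards, index

    def shuffle(self, buffer_size: int, seed: int):
        # Remember the shuffle settings; they are applied lazily in __iter__.
        self._buffer_size, self._seed = buffer_size, seed

    def set_epoch(self, epoch: int) -> None:
        self._epoch = epoch

    def __iter__(self) -> Iterator[int]:
        # Assumed sharding rule: keep every num_shards-th item, starting at index.
        stream = self._items[self._index :: self._num_shards]
        if self._buffer_size <= 0:
            yield from stream
            return
        # Assumed seeding rule: seed + epoch, so a given epoch always replays the same order.
        rng = random.Random(self._seed + self._epoch)
        buffer: List[int] = []
        for item in stream:
            if len(buffer) < self._buffer_size:
                buffer.append(item)
                continue
            # Emit a random buffered item and put the new item in its slot.
            slot = rng.randrange(self._buffer_size)
            yield buffer[slot]
            buffer[slot] = item
        # Drain whatever is left in the buffer in random order.
        rng.shuffle(buffer)
        yield from buffer


shard0 = ToyStream(range(10))
shard0.shard(num_shards=2, index=0)
shard1 = ToyStream(range(10))
shard1.shard(num_shards=2, index=1)
unshuffled = list(shard0)

shard0.shuffle(buffer_size=3, seed=0)
shard0.set_epoch(1)
first_pass = list(shard0)
second_pass = list(shard0)
```

In this sketch the two shards are disjoint and together cover the full stream, and iterating twice within the same epoch replays the same order, which is the behavior a distributed training loop typically wants before calling set_epoch again.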