nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset#

Module Contents#

Classes#

ColumnMappedTextInstructionIterableDataset

Streaming iterable variant that reuses the column-mapping/tokenization logic.

Data#

API#

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.logger#

'getLogger(…)'

class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_iterable_dataset.ColumnMappedTextInstructionIterableDataset(
path_or_dataset_id: Union[str, List[str]],
column_mapping: Dict[str, str],
tokenizer,
*,
split: Optional[str] = None,
name: Optional[str] = None,
answer_only_loss_mask: bool = True,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
start_of_turn_token: Optional[str] = None,
limit_dataset_samples: Optional[int] = None,
repeat_on_exhaustion: bool = True,
use_hf_chat_template: bool = False,
)#

Bases: torch.utils.data.IterableDataset, nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset

Streaming iterable variant that reuses the column-mapping/tokenization logic.

This wraps a Hugging Face streaming dataset (IterableDataset from datasets) and yields tokenized samples compatible with the non-streaming variant, while supporting sharding and epoch-setting for deterministic shuffles upstream.
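The sharding behavior can be pictured with a minimal sketch. The `shard_stream` helper below is hypothetical and not part of nemo_automodel; it only illustrates the round-robin pattern streaming iterable datasets commonly use, where shard `index` of `num_shards` keeps every `num_shards`-th sample:

```python
from typing import Iterable, Iterator


def shard_stream(stream: Iterable, num_shards: int, index: int) -> Iterator:
    """Yield only the samples assigned to shard `index` of `num_shards`."""
    for i, sample in enumerate(stream):
        if i % num_shards == index:
            yield sample


# Each of two workers sees a disjoint, interleaved slice of the stream.
worker0 = list(shard_stream(range(10), num_shards=2, index=0))  # [0, 2, 4, 6, 8]
worker1 = list(shard_stream(range(10), num_shards=2, index=1))  # [1, 3, 5, 7, 9]
```

Because each worker filters the same underlying stream, no coordination between shards is needed at iteration time.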

Initialization

Initialize the dataset.

Parameters:
  • path_or_dataset_id – Local path or Hugging Face Hub id of the dataset (or a list of such).

  • column_mapping – Mapping between the dataset's columns and the fields expected by the tokenization logic.

  • tokenizer – The tokenizer used to encode samples.

  • split – The dataset split to load.

  • name – The name of the dataset configuration/subset to load.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

  • seq_length – The sequence length to use for padding.

  • padding – Padding strategy forwarded to the tokenizer.

  • truncation – Truncation strategy forwarded to the tokenizer.

  • start_of_turn_token – Optional token that marks the start of a turn.

  • limit_dataset_samples – If set, limits iteration to the first N samples of the dataset.

  • repeat_on_exhaustion – Whether to restart from the beginning of the stream once it is exhausted.

  • use_hf_chat_template – Whether to format samples with the tokenizer's Hugging Face chat template.

__iter__() Iterator[Dict[str, List[int]]]#
set_epoch(epoch: int) None#
shard(num_shards: int, index: int)#
shuffle(buffer_size: int, seed: int)#
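How `shuffle(buffer_size, seed)` and `set_epoch(epoch)` combine for deterministic shuffles can be sketched as follows. The `buffer_shuffle` helper is a hypothetical illustration of the buffer-based approximate shuffle that Hugging Face streaming datasets use, not this class's implementation:

```python
import random
from typing import Iterable, Iterator, List


def buffer_shuffle(stream: Iterable, buffer_size: int, seed: int, epoch: int = 0) -> Iterator:
    """Approximate shuffle of a stream: keep a fixed-size buffer and emit a
    random buffered element as each new sample arrives. Seeding the RNG with
    (seed + epoch) makes the order reproducible per epoch, which is the role
    set_epoch plays upstream."""
    rng = random.Random(seed + epoch)
    buffer: List = []
    for sample in stream:
        if len(buffer) < buffer_size:
            buffer.append(sample)
        else:
            j = rng.randrange(buffer_size)
            yield buffer[j]
            buffer[j] = sample
    # Flush whatever remains in the buffer, in random order.
    rng.shuffle(buffer)
    yield from buffer


# Same seed and epoch reproduce the same order; bumping the epoch reshuffles.
run_a = list(buffer_shuffle(range(8), buffer_size=4, seed=0, epoch=0))
run_b = list(buffer_shuffle(range(8), buffer_size=4, seed=0, epoch=0))
run_c = list(buffer_shuffle(range(8), buffer_size=4, seed=0, epoch=1))
```

The buffer size trades memory for shuffle quality: a buffer as large as the dataset gives a full shuffle, while a small buffer only mixes nearby samples.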