nemo_automodel.components.datasets.llm.mock_iterable_dataset

Module Contents

Classes

MockIterableDataset

Mock dataset that generates synthetic data for benchmarking.

API

class nemo_automodel.components.datasets.llm.mock_iterable_dataset.MockIterableDataset(
    vocab_size: int,
    seq_len: int,
    num_samples: int = 1000000,
    batch_size: int = 1,
)

Bases: torch.utils.data.IterableDataset

Mock dataset that generates synthetic data for benchmarking.

This dataset generates random tokens, similar to the benchmarking script, and creates input_ids, labels, and position_ids for each sample.
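As a rough illustration of what one such sample could look like, here is a minimal, dependency-free sketch. The helper name `make_mock_sample` is hypothetical and not part of the library. The choices that labels copy the inputs and that position IDs count up from 0 are assumptions, and plain Python lists stand in for the tensors the real class presumably produces.

```python
import random


def make_mock_sample(vocab_size, seq_len, rng):
    """Build one synthetic sample (hypothetical helper, not the library's code)."""
    # Random token IDs drawn uniformly from [0, vocab_size).
    input_ids = [rng.randrange(vocab_size) for _ in range(seq_len)]
    return {
        "input_ids": input_ids,
        # Assumption: labels are a copy of the inputs.
        "labels": list(input_ids),
        # Assumption: positions simply count up from 0.
        "position_ids": list(range(seq_len)),
    }


sample = make_mock_sample(vocab_size=32000, seq_len=8, rng=random.Random(0))
print(sorted(sample))  # ['input_ids', 'labels', 'position_ids']
```

Passing in a seeded `random.Random` keeps the sketch reproducible; whether the real dataset seeds its generator is not stated here.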

Initialization

Initialize the mock dataset.

Parameters:
  • vocab_size – Size of the vocabulary for generating random tokens

  • seq_len – Sequence length for each sample

  • num_samples – Total number of samples to generate (default: 1,000,000, large enough to act as an effectively infinite dataset)

  • batch_size – Batch size of each yielded item (default: 1, which yields unbatched samples)

__iter__()

Generate synthetic batches.

__len__()

Return the number of samples.
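To show how `__iter__` and `__len__` might fit together, here is a minimal sketch of a class with the same constructor arguments. It is not the library's implementation: `MockIterableSketch` and its `seed` argument are hypothetical, it uses plain lists instead of tensors, and it does not subclass `torch.utils.data.IterableDataset`. It also assumes that `num_samples` is split into full batches and that every yielded item carries a batch dimension, even when `batch_size` is 1, which may differ from the real class.

```python
import random


class MockIterableSketch:
    """Dependency-free stand-in for illustration only."""

    def __init__(self, vocab_size, seq_len, num_samples=1_000_000, batch_size=1, seed=0):
        self.vocab_size = vocab_size
        self.seq_len = seq_len
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.seed = seed  # hypothetical argument, added for reproducibility

    def __iter__(self):
        rng = random.Random(self.seed)
        # Assumption: only full batches are yielded; any remainder is dropped.
        for _ in range(self.num_samples // self.batch_size):
            input_ids = [
                [rng.randrange(self.vocab_size) for _ in range(self.seq_len)]
                for _ in range(self.batch_size)
            ]
            yield {
                "input_ids": input_ids,
                # Assumption: labels mirror the inputs.
                "labels": [list(row) for row in input_ids],
                "position_ids": [list(range(self.seq_len)) for _ in range(self.batch_size)],
            }

    def __len__(self):
        # As documented above: the number of samples, not the number of batches.
        return self.num_samples


dataset = MockIterableSketch(vocab_size=100, seq_len=4, num_samples=10, batch_size=2)
batches = list(dataset)
print(len(dataset), len(batches))  # 10 5
```

Note that in this sketch `len(dataset)` counts samples while iteration yields batches, so the two numbers differ whenever `batch_size` is greater than 1.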