nemo_automodel.components.datasets.llm.mock_iterable_dataset
Module Contents
Classes
MockIterableDataset | Mock dataset that generates synthetic data for benchmarking.
API
- class nemo_automodel.components.datasets.llm.mock_iterable_dataset.MockIterableDataset(
- vocab_size: int,
- seq_len: int,
- num_samples: int = 1000000,
- batch_size: int = 1,
- )
Bases: torch.utils.data.IterableDataset
Mock dataset that generates synthetic data for benchmarking.
This dataset generates random tokens similar to the benchmarking script, creating input_ids, labels, and position_ids for each sample.
Initialization
Initialize the mock dataset.
- Parameters:
vocab_size – Size of the vocabulary for generating random tokens
seq_len – Sequence length for each sample
num_samples – Total number of samples to generate (default: 1,000,000, large enough to act as an effectively infinite dataset)
batch_size – Batch size to yield (default: 1 for unbatched samples)
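To make the roles of vocab_size, seq_len, and batch_size concrete, here is a minimal pure-Python sketch of what one synthetic batch might look like. It is not the library's implementation: make_mock_batch is a hypothetical helper, plain lists stand in for tensors, and the dict layout, the labels being a copy of input_ids, and the position_ids running from 0 to seq_len - 1 are all assumptions.

```python
import random


def make_mock_batch(vocab_size, seq_len, batch_size=1, rng=None):
    """Build one synthetic batch as nested Python lists (hypothetical helper)."""
    rng = rng or random.Random()
    # Random token ids drawn uniformly from [0, vocab_size).
    input_ids = [
        [rng.randrange(vocab_size) for _ in range(seq_len)]
        for _ in range(batch_size)
    ]
    # Assumption: labels simply mirror the inputs.
    labels = [row[:] for row in input_ids]
    # Assumption: positions count up from 0 within each sequence.
    position_ids = [list(range(seq_len)) for _ in range(batch_size)]
    return {"input_ids": input_ids, "labels": labels, "position_ids": position_ids}


batch = make_mock_batch(vocab_size=100, seq_len=8, batch_size=2, rng=random.Random(0))
# Each field has batch_size rows of seq_len entries.
print(len(batch["input_ids"]), len(batch["input_ids"][0]))
```

Under these assumptions, every field has batch_size rows of seq_len entries, and every token id is smaller than vocab_size.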
- __iter__()
Generate synthetic batches.
- __len__()
Return the number of samples.
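The interplay of iteration and length can be sketched with a small stand-in class. ToyMockIterable is hypothetical and uses only the standard library. It assumes that one batch is yielded per counted sample and that len() reports num_samples; the real class may count differently, and the seed argument is an addition for reproducibility that is not part of the documented signature.

```python
import random


class ToyMockIterable:
    """Hypothetical stand-in that mimics the documented interface with plain lists."""

    def __init__(self, vocab_size, seq_len, num_samples=1_000_000, batch_size=1, seed=0):
        self.vocab_size = vocab_size
        self.seq_len = seq_len
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.seed = seed  # not in the documented signature; added for repeatable runs

    def __iter__(self):
        # A fresh generator per pass, so repeated iteration yields the same data.
        rng = random.Random(self.seed)
        for _ in range(self.num_samples):  # assumption: one yield per counted sample
            tokens = [
                [rng.randrange(self.vocab_size) for _ in range(self.seq_len)]
                for _ in range(self.batch_size)
            ]
            yield {
                "input_ids": tokens,
                "labels": [row[:] for row in tokens],
                "position_ids": [list(range(self.seq_len)) for _ in range(self.batch_size)],
            }

    def __len__(self):
        return self.num_samples


toy = ToyMockIterable(vocab_size=50, seq_len=4, num_samples=3, batch_size=2)
items = list(toy)
print(len(toy), len(items))
```

Defining __len__ on an iterable dataset lets progress bars and schedulers estimate the total work, which is presumably why the large default for num_samples is described as infinite-like.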