nemo_automodel.components.datasets.llm.mock_iterable_dataset

Module Contents

Classes

| Name | Description |
| --- | --- |
| MockIterableDataset | Mock dataset that generates synthetic data for benchmarking. |

API

class nemo_automodel.components.datasets.llm.mock_iterable_dataset.MockIterableDataset(
vocab_size: int = 1024,
seq_len: int = 1024,
num_samples: int = 1000000,
batch_size: int = 1
)

Bases: IterableDataset

Mock dataset that generates synthetic data for benchmarking.

The dataset generates random tokens, in a manner similar to the benchmarking script, and produces input_ids, labels, and position_ids for each sample.

nemo_automodel.components.datasets.llm.mock_iterable_dataset.MockIterableDataset.__iter__()

Generate synthetic batches.

nemo_automodel.components.datasets.llm.mock_iterable_dataset.MockIterableDataset.__len__()

Return the number of samples.
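
The behavior described above can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the real class subclasses IterableDataset, while the stand-in below uses plain Python lists so it runs without PyTorch. The class name `SketchMockDataset`, the `seed` parameter, the choice to stop after `num_samples // batch_size` batches, and the use of a copy of input_ids as labels are all assumptions made for illustration only.

```python
import random
from typing import Dict, Iterator, List

Batch = Dict[str, List[List[int]]]


class SketchMockDataset:
    """Hypothetical stand-in for illustration; not the library's class."""

    def __init__(
        self,
        vocab_size: int = 1024,
        seq_len: int = 1024,
        num_samples: int = 1000000,
        batch_size: int = 1,
        seed: int = 0,  # hypothetical; not part of the documented signature
    ) -> None:
        self.vocab_size = vocab_size
        self.seq_len = seq_len
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.seed = seed

    def __iter__(self) -> Iterator[Batch]:
        rng = random.Random(self.seed)
        # Assumption: iteration ends after num_samples // batch_size batches.
        for _ in range(self.num_samples // self.batch_size):
            # One row of random token ids per sample in the batch.
            input_ids = [
                [rng.randrange(self.vocab_size) for _ in range(self.seq_len)]
                for _ in range(self.batch_size)
            ]
            yield {
                "input_ids": input_ids,
                # Assumption: labels are an unshifted copy of input_ids.
                "labels": [row[:] for row in input_ids],
                # Positions 0..seq_len-1, repeated for every row.
                "position_ids": [
                    list(range(self.seq_len)) for _ in range(self.batch_size)
                ],
            }

    def __len__(self) -> int:
        # Mirrors the documented contract: the number of samples, not batches.
        return self.num_samples


dataset = SketchMockDataset(vocab_size=32, seq_len=8, num_samples=6, batch_size=2)
batches = list(dataset)
first = batches[0]
```

In this sketch `len(dataset)` reports samples while iteration yields batches, so the two counts differ whenever `batch_size` is greater than 1. Whether the real class behaves the same way should be checked against its source.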