nemo_automodel.components.datasets.lazy_mapped_dataset

Module Contents

Classes

Name | Description
LazyMappedDataset | Dataset wrapper that applies a transform function on-the-fly instead of preprocessing the whole dataset upfront with .map(fn).

Data

logger

API

class nemo_automodel.components.datasets.lazy_mapped_dataset.LazyMappedDataset(
dataset,
map_fn,
cache_size = 10000
)

Bases: Dataset

Dataset wrapper that applies a transform function on-the-fly instead of preprocessing the whole dataset upfront with .map(fn).

Parameters:

dataset

Any object that supports __len__ and __getitem__ (e.g. a Hugging Face datasets.Dataset).

map_fn

A callable that accepts a single example and returns the transformed example.

cache_size
Defaults to 10000

Maximum number of processed items to keep in the LRU cache (10000 by default). Set to 0 to disable caching, or None to cache all items.

Returns:

A map-style dataset that applies map_fn lazily on each item access.
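The lazy behaviour described above can be illustrated with a minimal sketch. LazyMapSketch, tokenize, and the sample data are hypothetical stand-ins written for this example, not the library's implementation; the sketch only assumes that the wrapped object supports __len__ and __getitem__ and that map_fn takes a single example.

```python
from typing import Any, Callable, Sequence


class LazyMapSketch:
    """Illustrative stand-in (not the library class): applies map_fn per access."""

    def __init__(self, dataset: Sequence[Any], map_fn: Callable[[Any], Any]) -> None:
        self.dataset = dataset
        self.map_fn = map_fn

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, idx: int) -> Any:
        # The transform runs only when an item is requested.
        return self.map_fn(self.dataset[idx])


calls = []  # records which raw examples were actually transformed


def tokenize(example: str) -> list:
    """Hypothetical transform: split a string into whitespace-separated tokens."""
    calls.append(example)
    return example.split()


raw = ["a b", "c d e", "f"]
lazy = LazyMapSketch(raw, tokenize)
calls_after_init = len(calls)  # construction does not transform anything
second = lazy[1]               # transforms only the second example
```

In this sketch, building the wrapper costs nothing per example, whereas an upfront .map(fn) would have transformed all three items before the first access.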

cache_info
Any | None

Return LRU cache statistics, or None if caching is disabled.
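As a rough illustration of what LRU cache statistics look like, the sketch below uses functools.lru_cache from the standard library. That the class relies on this particular decorator is an assumption here, and transform is a hypothetical function used only for the example.

```python
from functools import lru_cache


def transform(idx: int) -> int:
    """Hypothetical stand-in for an expensive per-item transform."""
    return idx * idx


# Bounded cache holding at most two results.
cached = lru_cache(maxsize=2)(transform)

cached(1)  # miss: computed and stored
cached(1)  # hit: served from the cache
cached(2)  # miss
cached(3)  # miss: evicts the least recently used entry (idx 1)

info = cached.cache_info()  # named tuple: hits, misses, maxsize, currsize
```

Here info reports 1 hit, 3 misses, a maxsize of 2, and a currsize of 2.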

nemo_automodel.components.datasets.lazy_mapped_dataset.LazyMappedDataset.__getitem__(
idx: int
) -> typing.Any

Returns the transformed item at the given index.

nemo_automodel.components.datasets.lazy_mapped_dataset.LazyMappedDataset.__getstate__() -> dict

Returns picklable state by dropping the unpicklable _get_item function.

nemo_automodel.components.datasets.lazy_mapped_dataset.LazyMappedDataset.__len__() -> int

Returns the number of items in the dataset.

nemo_automodel.components.datasets.lazy_mapped_dataset.LazyMappedDataset.__repr__() -> str

Returns a string representation of the dataset.

nemo_automodel.components.datasets.lazy_mapped_dataset.LazyMappedDataset.__setstate__(
state: dict
) -> None

Restores state and rebuilds _get_item after unpickling.
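The drop-and-rebuild pattern behind __getstate__ and __setstate__ might look like the sketch below. PicklableLazySketch is a hypothetical stand-in, and its internals (a locally defined accessor wrapped in functools.lru_cache) are assumptions made for illustration rather than the library's actual code.

```python
import pickle
from functools import lru_cache


class PicklableLazySketch:
    """Hypothetical stand-in showing the drop-and-rebuild pickling pattern."""

    def __init__(self, dataset, map_fn, cache_size=10000):
        self.dataset = dataset
        self.map_fn = map_fn
        self.cache_size = cache_size
        self._build_get_item()

    def _build_get_item(self):
        def load(idx):
            return self.map_fn(self.dataset[idx])

        # A locally defined function cannot be pickled by the standard
        # pickle module, so this attribute must be excluded from the state.
        self._get_item = lru_cache(maxsize=self.cache_size)(load)

    def __getitem__(self, idx):
        return self._get_item(idx)

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_get_item", None)  # drop the unpicklable accessor
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._build_get_item()  # fresh accessor, so the cache starts empty


original = PicklableLazySketch(["a", "b"], str.upper, cache_size=4)
first = original[0]  # populates the original's cache
restored = pickle.loads(pickle.dumps(original))
restored_cache_size = restored._get_item.cache_info().currsize
second = restored[1]
```

One consequence of this design is that cached results do not travel with the pickle: each unpickled copy, for example in a data-loading worker process, starts with an empty cache.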

nemo_automodel.components.datasets.lazy_mapped_dataset.LazyMappedDataset._build_get_item() -> None

Builds the internal item accessor, with or without LRU caching.
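One way to build such an accessor for the three documented cache_size cases (0, None, or a positive integer) is sketched below. build_get_item is a hypothetical free function and functools.lru_cache is an assumed caching mechanism; the real method may differ.

```python
from functools import lru_cache
from typing import Any, Callable, Optional, Sequence


def build_get_item(
    dataset: Sequence[Any],
    map_fn: Callable[[Any], Any],
    cache_size: Optional[int],
) -> Callable[[int], Any]:
    """Return an index -> transformed item accessor (illustrative sketch)."""

    def load(idx: int) -> Any:
        return map_fn(dataset[idx])

    if cache_size == 0:
        # Caching disabled: return the plain accessor.
        return load
    # cache_size=None gives an unbounded cache; an int gives a bounded LRU cache.
    return lru_cache(maxsize=cache_size)(load)


data = [1, 2, 3]
uncached = build_get_item(data, lambda x: x * 10, cache_size=0)
bounded = build_get_item(data, lambda x: x * 10, cache_size=2)
unbounded = build_get_item(data, lambda x: x * 10, cache_size=None)
```

In this sketch only the cached variants expose cache_info(), which is consistent with statistics being unavailable when caching is disabled.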

nemo_automodel.components.datasets.lazy_mapped_dataset.logger = logging.getLogger(__name__)