nemo_automodel.components.datasets.lazy_mapped_dataset#

Module Contents#

Classes#

LazyMappedDataset

Dataset wrapper that applies a transform function on-the-fly instead of preprocessing the whole dataset upfront with .map(fn).

Data#

API#

nemo_automodel.components.datasets.lazy_mapped_dataset.logger#

‘getLogger(…)’

class nemo_automodel.components.datasets.lazy_mapped_dataset.LazyMappedDataset(dataset, map_fn, cache_size=10000)#

Bases: torch.utils.data.Dataset

Dataset wrapper that applies a transform function on-the-fly instead of preprocessing the whole dataset upfront with .map(fn).

Parameters:
  • dataset – Any object that supports __len__ and __getitem__ (e.g. a Hugging Face datasets.Dataset).

  • map_fn – A callable that accepts a single example and returns the transformed example.

  • cache_size – Maximum number of processed items to keep in the cache. Defaults to 10000. Set to 0 to disable caching, or None to cache every item.

Returns:

A map-style dataset that applies map_fn lazily on each item access.
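The lazy behaviour described above can be illustrated with a small stand-in class. This is a minimal sketch, not the nemo_automodel implementation: LazyMappedSketch, tokenize, and the sample records are hypothetical, and the use of functools.lru_cache is an assumption based on the "LRU caching" wording in this page.

```python
from functools import lru_cache


class LazyMappedSketch:
    """Illustrative stand-in, NOT the nemo_automodel implementation."""

    def __init__(self, dataset, map_fn, cache_size=10000):
        self.dataset = dataset
        self.map_fn = map_fn
        self.cache_size = cache_size
        self._build_get_item()

    def _build_get_item(self):
        def get_item(idx):
            # The transform runs here, at access time, not at construction.
            return self.map_fn(self.dataset[idx])

        if self.cache_size == 0:
            self._get_item = get_item  # caching disabled
        else:
            # maxsize=None makes functools.lru_cache unbounded ("cache all").
            self._get_item = lru_cache(maxsize=self.cache_size)(get_item)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        return self._get_item(idx)


calls = []  # records every example the transform actually touches


def tokenize(example):  # hypothetical map_fn
    calls.append(example["id"])
    return {"id": example["id"], "tokens": example["text"].split()}


raw = [{"id": i, "text": f"sample number {i}"} for i in range(5)]
ds = LazyMappedSketch(raw, tokenize, cache_size=2)

assert calls == []  # nothing is transformed at construction
first = ds[3]       # transform runs once for index 3
again = ds[3]       # second access is served from the cache
assert calls == [3]
```

Compared with an eager .map(fn) pass, the cost of the transform is paid only for the indices that are actually read, at the price of recomputing items that have been evicted from the cache.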

Initialization

_build_get_item() → None#

Builds the internal item accessor, with or without LRU caching.

__getstate__() → dict#

Returns picklable state by dropping the unpicklable _get_item function.

__setstate__(state: dict) → None#

Restores state and rebuilds _get_item after unpickling.
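The drop-and-rebuild pattern behind __getstate__ and __setstate__ can be sketched as follows. PicklableLazySketch is a hypothetical class written for illustration, and it assumes the accessor is a locally defined function wrapped in functools.lru_cache, which the standard pickle module cannot serialize. This kind of round trip may matter, for example, when a data loader ships the dataset to worker processes.

```python
import pickle
from functools import lru_cache


class PicklableLazySketch:
    """Illustrative stand-in, NOT the nemo_automodel implementation."""

    def __init__(self, dataset, map_fn, cache_size=128):
        self.dataset = dataset
        self.map_fn = map_fn
        self.cache_size = cache_size
        self._build_get_item()

    def _build_get_item(self):
        def get_item(idx):
            return self.map_fn(self.dataset[idx])

        # A locally defined, lru_cache-wrapped function cannot be pickled.
        self._get_item = lru_cache(maxsize=self.cache_size)(get_item)

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_get_item", None)  # drop the unpicklable accessor
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._build_get_item()  # fresh accessor with an empty cache

    def __getitem__(self, idx):
        return self._get_item(idx)


# len is used as map_fn because built-in functions pickle by name.
ds = PicklableLazySketch(["a", "bb", "ccc"], len)
ds[0]  # warm the cache in the original object

clone = pickle.loads(pickle.dumps(ds))  # round trip through pickle
```

In this sketch the clone keeps the wrapped dataset, map_fn, and cache_size but starts with an empty cache, since cached items are not part of the pickled state.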

__len__() → int#

Returns the number of items in the dataset.

__getitem__(idx: int) → Any#

Returns the transformed item at the given index.

property cache_info: Any | None#

Return LRU cache statistics, or None if caching is disabled.
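The return type here is documented only as Any | None. If the statistics come from functools.lru_cache (an assumption; this page does not say so), they would be a CacheInfo named tuple with hits, misses, maxsize, and currsize fields. The sketch below uses plain functools with a hypothetical transform function to show how those counters evolve.

```python
from functools import lru_cache


@lru_cache(maxsize=2)
def transform(idx):  # hypothetical stand-in for a cached item accessor
    return idx * idx


transform(1)  # miss: computed and stored
transform(1)  # hit: served from the cache
transform(2)  # miss
transform(3)  # miss: cache is full, least recently used entry (1) is evicted

info = transform.cache_info()
print(info)  # CacheInfo(hits=1, misses=3, maxsize=2, currsize=2)
```

A low hit count relative to misses suggests that the access pattern rarely revisits indices, in which case a large cache_size mostly costs memory.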

__repr__() → str#

Returns a string representation of the dataset.