nemo_automodel.components.datasets.reservoir_sampler

Module Contents

Classes

ReservoirSampler

Streaming shuffle with a fixed-size buffer.

API

class nemo_automodel.components.datasets.reservoir_sampler.ReservoirSampler(
iterator: Iterable[Dict[str, Any]],
buffer_size: int,
seed: Optional[int] = None,
)

Streaming shuffle with a fixed-size buffer.

This is a bounded-memory shuffling wrapper for streaming datasets/iterables. It maintains a buffer of buffer_size items. Once the buffer is filled, it repeatedly:

  • picks a random buffer slot

  • yields the item currently held in that slot

  • refills the slot with the next item from the underlying iterator

When the underlying iterator is exhausted, the remaining buffer items are yielded.
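
The steps above can be sketched as a plain generator. This is a minimal illustration of the described behavior, not the library's implementation: the function name `buffered_shuffle`, the `ValueError` check, and the order in which the leftover buffer is drained are all assumptions of this sketch.

```python
import random
from typing import Any, Dict, Iterable, Iterator, List, Optional


def buffered_shuffle(
    iterator: Iterable[Dict[str, Any]],
    buffer_size: int,
    seed: Optional[int] = None,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the buffer shuffle described above."""
    if buffer_size <= 0:
        # Assumption: this sketch rejects empty buffers; the real class may differ.
        raise ValueError("buffer_size must be positive")

    rng = random.Random(seed)  # private RNG, leaves the global random state alone
    buffer: List[Dict[str, Any]] = []

    for item in iterator:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is yielded until the buffer is full.
            buffer.append(item)
            continue
        # Steady state: pick a slot, yield its occupant, store the new item there.
        slot = rng.randrange(buffer_size)
        evicted = buffer[slot]
        buffer[slot] = item
        yield evicted

    # Source exhausted: yield whatever is left.
    # Assumption: buffer order is used here; the source does not specify it.
    yield from buffer


samples = [{"id": i} for i in range(10)]
shuffled = list(buffered_shuffle(samples, buffer_size=4, seed=0))
```

Every input item is yielded exactly once, and at most `buffer_size` items are held in memory at any time.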

Initialization

Wraps an iterator and draws items from it through a fixed-size buffer, so memory use is bounded by buffer_size items regardless of how long the stream is.

Parameters:
  • iterator – Iterator to sample from.

  • buffer_size – Size of the buffer.

  • seed – Seed for the random number generator.
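
To give a feel for how buffer_size and seed interact, the sketch below runs a buffer shuffle of the kind described above over integer positions. `shuffled_positions` and `max_early_shift` are hypothetical helpers written for this illustration, and the result only applies to the real class to the extent that it follows the documented steps.

```python
import random
from typing import List, Optional


def shuffled_positions(n: int, buffer_size: int, seed: Optional[int] = None) -> List[int]:
    """Hypothetical helper: buffer-shuffle range(n) and return the output order."""
    rng = random.Random(seed)
    buffer: List[int] = []
    out: List[int] = []
    for i in range(n):
        if len(buffer) < buffer_size:
            buffer.append(i)          # fill phase
        else:
            slot = rng.randrange(buffer_size)
            out.append(buffer[slot])  # emit the occupant of a random slot
            buffer[slot] = i          # refill the slot with the new item
    out.extend(buffer)                # drain what is left
    return out


def max_early_shift(order: List[int]) -> int:
    """Largest number of positions any item moved toward the front."""
    return max(src - pos for pos, src in enumerate(order))


small = shuffled_positions(1000, buffer_size=4, seed=0)
large = shuffled_positions(1000, buffer_size=256, seed=0)

# The same seed over the same input reproduces the same order.
assert small == shuffled_positions(1000, buffer_size=4, seed=0)
# An item is never emitted more than buffer_size - 1 positions early.
assert max_early_shift(small) <= 3
assert max_early_shift(large) <= 255
```

An item can be delayed for a long time if its slot is never picked, but it can only move forward by fewer than buffer_size positions. A small buffer therefore shuffles only locally, and a larger buffer trades memory for a more thorough shuffle.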

__iter__() → Iterator[Dict[str, Any]]

Iterate over the underlying iterator, yielding its items in shuffled order through the buffer.

__len__() → int

Not supported: ReservoirSampler does not provide a length.

__getitem__(idx: int) → Dict[str, Any]

Not supported: ReservoirSampler cannot be indexed.
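
Because neither a length nor indexing is available, callers consume the sampler by iteration only. The sketch below uses a hypothetical `StreamOnly` class to show the pattern. It is not the library class, and the exception type it raises is an assumption, since the source does not state what ReservoirSampler does when these methods are called.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


class StreamOnly:
    """Hypothetical iterable-only wrapper used for illustration."""

    def __init__(self, source: Iterable[Dict[str, Any]]) -> None:
        self._source = source

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        return iter(self._source)

    def __len__(self) -> int:
        # Assumption: TypeError is this sketch's choice of exception.
        raise TypeError("length is undefined for a stream")

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        # Assumption: TypeError is this sketch's choice of exception.
        raise TypeError("random access is undefined for a stream")


stream = StreamOnly({"id": i} for i in range(10))

# Take a prefix with islice instead of slicing.
first_three = list(islice(stream, 3))

# Count by consuming the rest of the stream instead of calling len().
remaining = sum(1 for _ in stream)
```

Code that needs a length up front, such as a progress bar or a scheduler that counts steps per epoch, has to obtain it from elsewhere, for example from dataset metadata.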