nemo_automodel.components.datasets.reservoir_sampler#
Module Contents#
Classes#
ReservoirSampler | Streaming shuffle with a fixed-size buffer.
API#
- class nemo_automodel.components.datasets.reservoir_sampler.ReservoirSampler(iterator: Iterable[Dict[str, Any]], buffer_size: int, seed: Optional[int] = None)
Streaming shuffle with a fixed-size buffer.
This is a bounded-memory shuffling wrapper for streaming datasets/iterables. It maintains a buffer of buffer_size items. Once the buffer is filled, it repeatedly:
- samples a random buffer slot
- yields the evicted item
- replaces it with the next item from the underlying iterator
When the underlying iterator is exhausted, the remaining buffer items are yielded.
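The steps above can be illustrated with a minimal sketch. This is not the library's implementation: the function name shuffle_buffer_sketch, the ValueError check on buffer_size, and the order in which leftover items are drained are all assumptions made for illustration. Only the parameter names and the fill/sample/evict/replace behaviour come from the description.

```python
import random
from typing import Any, Dict, Iterable, Iterator, List, Optional


def shuffle_buffer_sketch(
    iterator: Iterable[Dict[str, Any]],
    buffer_size: int,
    seed: Optional[int] = None,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the behaviour described above (not the library's code)."""
    if buffer_size <= 0:
        # Assumption: the real class may validate its arguments differently.
        raise ValueError("buffer_size must be positive")

    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    buffer: List[Dict[str, Any]] = []

    for item in iterator:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is yielded until the buffer is full.
            buffer.append(item)
            continue
        slot = rng.randrange(buffer_size)  # sample a random buffer slot
        evicted = buffer[slot]
        buffer[slot] = item                # replace it with the incoming item
        yield evicted                      # yield the evicted item

    # Underlying iterator exhausted: yield whatever is left.
    # Assumption: leftovers are emitted in buffer order; the source does not say.
    yield from buffer


# Usage: shuffle a small stream of dictionaries with a 4-item buffer.
stream = ({"id": i} for i in range(10))
shuffled = list(shuffle_buffer_sketch(stream, buffer_size=4, seed=0))
```

In this sketch every input item is yielded exactly once, and an item can move at most buffer_size - 1 positions ahead of where it sat in the input. A larger buffer therefore mixes the stream more thoroughly, at the cost of holding more items in memory.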
Initialization
ReservoirSampler samples items from an iterator through a fixed-size buffer, so memory use stays bounded by buffer_size regardless of the length of the stream.
- Parameters:
iterator – Iterable of items (dictionaries) to sample from.
buffer_size – Number of items held in the buffer.
seed – Seed for the random number generator. Defaults to None.