nemo_automodel.components.datasets.reservoir_sampler

Module Contents

Classes

Name | Description
ReservoirSampler | Streaming shuffle with a fixed-size buffer.

API

class nemo_automodel.components.datasets.reservoir_sampler.ReservoirSampler(
iterator: typing.Iterable[typing.Dict[str, typing.Any]],
buffer_size: int,
seed: typing.Optional[int] = None
)

Streaming shuffle with a fixed-size buffer.

This is a bounded-memory shuffling wrapper for streaming datasets/iterables. It maintains a buffer of buffer_size items. Once the buffer is filled, it repeatedly:

  • samples a random buffer slot
  • yields the evicted item
  • replaces it with the next item from the underlying iterator

When the underlying iterator is exhausted, the remaining buffer items are yielded.
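
The loop described above can be sketched as a standalone generator. This is an illustrative approximation, not the library's implementation: the function name `buffered_shuffle`, the `buffer_size` check, and the choice to shuffle the leftover buffer before draining it are all assumptions made for the example.

```python
import random
from typing import Any, Dict, Iterable, Iterator, List, Optional


def buffered_shuffle(
    iterator: Iterable[Dict[str, Any]],
    buffer_size: int,
    seed: Optional[int] = None,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the behavior described above."""
    # Assumption: reject non-positive sizes; the real class may behave differently.
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")

    rng = random.Random(seed)  # private RNG so a fixed seed gives a repeatable order
    buffer: List[Dict[str, Any]] = []

    for item in iterator:
        # Phase 1: fill the buffer without yielding anything.
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Phase 2: pick a random slot, swap in the new item, yield the old one.
        slot = rng.randrange(buffer_size)
        evicted = buffer[slot]
        buffer[slot] = item
        yield evicted

    # Phase 3: the source is exhausted, so drain what is left.
    # Assumption: the leftovers are shuffled first; the source does not say.
    rng.shuffle(buffer)
    yield from buffer


rows = [{"id": i} for i in range(10)]
shuffled = list(buffered_shuffle(rows, buffer_size=4, seed=0))
```

Every input item is yielded exactly once, and memory use stays bounded by `buffer_size`. A larger buffer mixes items more thoroughly; with `buffer_size=1` this sketch returns items in their original order.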

_buffer_size = int(buffer_size)

nemo_automodel.components.datasets.reservoir_sampler.ReservoirSampler.__getitem__(
idx: int
) -> typing.Dict[str, typing.Any]

__getitem__ is not supported by ReservoirSampler.

nemo_automodel.components.datasets.reservoir_sampler.ReservoirSampler.__iter__() -> typing.Iterator[typing.Dict[str, typing.Any]]

Iterate over the underlying iterator, yielding items sampled from the buffer.

nemo_automodel.components.datasets.reservoir_sampler.ReservoirSampler.__len__() -> int

__len__ is not supported by ReservoirSampler.
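
Since neither length nor indexing is available, callers that need the first few items or a total count have to consume the stream by iteration. The sketch below shows that pattern with a plain generator standing in for the sampler. The `stream_rows` helper is hypothetical and exists only for this example.

```python
from itertools import islice
from typing import Any, Dict, Iterator


def stream_rows(n: int) -> Iterator[Dict[str, Any]]:
    """Hypothetical iteration-only source (no len(), no indexing)."""
    for i in range(n):
        yield {"id": i}


# Take the first five items instead of indexing with stream[0:5].
first_five = list(islice(stream_rows(1000), 5))

# Count items by exhausting the stream instead of calling len().
total = sum(1 for _ in stream_rows(1000))
```

Counting this way consumes the whole stream, so it is only practical when the source is finite and cheap to re-create.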