nemo_automodel.components.datasets.multimodal.distributed_iterable


DistributedIterableDataset base for BAGEL-style data pipelines.

Module Contents

Classes

Name | Description
DistributedIterableDataset | Base class for rank/worker-aware iterable datasets.

Data

logger

API

class nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset(
dataset_name,
local_rank = 0,
world_size = 1,
num_workers = 8
)

Bases: IterableDataset

Base class for rank/worker-aware iterable datasets.

Owns a private rng used only to shuffle file paths deterministically in set_epoch; it is NOT used for per-sample randomness. Per-sample randomness still goes through the Python global random module (see the packing module for the reseed hook).
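The separation between a private rng and the global random module can be illustrated with a minimal sketch. This is not the library's implementation: the class name ShuffledPaths, the file names, and the choice to sort before shuffling are assumptions made for illustration only. It shows that reseeding a private random.Random instance yields the same file order for the same seed while leaving the global random state untouched.

```python
import random


class ShuffledPaths:
    """Hypothetical stand-in for illustration; not the library class."""

    def __init__(self, paths):
        self.paths = list(paths)
        # Private generator, independent of the global `random` module state.
        self.rng = random.Random()

    def set_epoch(self, seed=42):
        # Assumption: reseed the private rng and shuffle a sorted copy, so the
        # resulting order depends only on the seed, not on the input order.
        self.rng.seed(seed)
        ordered = sorted(self.paths)
        self.rng.shuffle(ordered)
        self.paths = ordered


# Snapshot the global state to check that the shuffle does not disturb it.
random.seed(0)
global_state_before = random.getstate()

a = ShuffledPaths(["c.parquet", "a.parquet", "b.parquet", "d.parquet"])
b = ShuffledPaths(["d.parquet", "b.parquet", "a.parquet", "c.parquet"])
a.set_epoch(seed=42)
b.set_epoch(seed=42)

same_order = a.paths == b.paths
global_untouched = random.getstate() == global_state_before
print(same_order, global_untouched)
```

Keeping the file shuffle on its own generator means that reseeding the global random module for per-sample augmentation should not change which files a rank visits, and vice versa.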

_drop_counters = {}
data_paths_per_rank = []
num_files_per_rank = 0
rng = random.Random()
nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset.__iter__()
nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset._get_worker_data_status(
worker_id
)
nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset._log_drop(
reason,
message,
args = (),
every = 100,
exc_info = False
)
nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset._set_worker_resume_data_status(
worker_id,
status
)
nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset.get_data_paths(
args = (),
kwargs = {}
)
nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset.get_data_paths_per_worker()
nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset.load_state_dict(
state_dict
)
nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset.set_data_status(
data_status
)
nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset.set_epoch(
seed = 42
)
nemo_automodel.components.datasets.multimodal.distributed_iterable.DistributedIterableDataset.state_dict()
nemo_automodel.components.datasets.multimodal.distributed_iterable.logger = logging.getLogger(__name__)