nemo_automodel.components.datasets.multimodal.packing

PackedDataset + DataConfig — packed-sequence iterable for BAGEL training.

Module Contents

Classes

| Name | Description |
| --- | --- |
| DataConfig | Container for the packing-level knobs + grouped-dataset YAML dict. |
| PackedDataset | Greedy pack of samples drawn from weighted groups into token-budgeted batches. |

Data

logger

API

class nemo_automodel.components.datasets.multimodal.packing.DataConfig(
grouped_datasets,
text_cond_dropout_prob = 0.1,
vit_cond_dropout_prob = 0.4,
vae_cond_dropout_prob = 0.1,
vae_image_downsample = 16,
max_latent_size = 32,
vit_patch_size = 14,
max_num_patch_per_side = 70
)

Container for the packing-level knobs + grouped-dataset YAML dict.
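To make the defaults above easier to read, here is a minimal sketch built on a stand-in dataclass. `DataConfigSketch` is hypothetical and only mirrors the documented signature; the keys inside `grouped_datasets` and the two derived pixel sizes are assumptions inferred from the parameter names, not behavior confirmed by this page.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DataConfigSketch:
    """Hypothetical stand-in that mirrors the documented DataConfig signature."""

    grouped_datasets: Dict[str, Any]
    text_cond_dropout_prob: float = 0.1
    vit_cond_dropout_prob: float = 0.4
    vae_cond_dropout_prob: float = 0.1
    vae_image_downsample: int = 16
    max_latent_size: int = 32
    vit_patch_size: int = 14
    max_num_patch_per_side: int = 70


# The group name and its keys are placeholders, not the real YAML schema.
cfg = DataConfigSketch(grouped_datasets={"example_group": {"weight": 1.0}})

# Assumption: the largest image side is (units per side) * (pixels per unit).
max_vae_side_px = cfg.max_latent_size * cfg.vae_image_downsample
max_vit_side_px = cfg.vit_patch_size * cfg.max_num_patch_per_side
print(max_vae_side_px, max_vit_side_px)
```

If that reading of the names is right, the defaults would cap VAE inputs at 512 pixels per side and ViT inputs at 980 pixels per side.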

class nemo_automodel.components.datasets.multimodal.packing.PackedDataset(
data_config,
tokenizer,
special_tokens,
local_rank,
world_size,
num_workers,
expected_num_tokens = 32768,
max_num_tokens_per_sample = 16384,
max_num_tokens = 36864,
prefer_buffer_before = 16384,
max_buffer_size = 50,
interpolate_pos = False,
use_flex = False,
data_status = None,
dataset_info = None
)

Bases: IterableDataset

Greedy pack of samples drawn from weighted groups into token-budgeted batches.
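The following is a simplified sketch of greedy, token-budgeted packing, assuming the three token limits in the constructor behave as their names suggest. `greedy_pack` is a hypothetical helper that works on bare sample lengths; it leaves out the overflow buffer (`prefer_buffer_before`, `max_buffer_size`) and the weighted group sampling, and it is not the library's implementation.

```python
from typing import Iterable, Iterator, List


def greedy_pack(
    lengths: Iterable[int],
    expected_num_tokens: int = 32768,
    max_num_tokens_per_sample: int = 16384,
    max_num_tokens: int = 36864,
) -> Iterator[List[int]]:
    """Group sample lengths into batches under a token budget (sketch)."""
    batch: List[int] = []
    used = 0
    for n in lengths:
        # Assumption: a sample above the per-sample cap is dropped.
        if n > max_num_tokens_per_sample:
            continue
        # Assumption: a sample that would exceed the hard cap closes the batch.
        if used + n > max_num_tokens:
            if batch:
                yield batch
            batch, used = [], 0
        batch.append(n)
        used += n
        # Assumption: the batch is emitted once the soft target is reached.
        if used >= expected_num_tokens:
            yield batch
            batch, used = [], 0
    if batch:
        yield batch


packed = list(
    greedy_pack(
        [5, 4, 7, 20, 4],
        expected_num_tokens=10,
        max_num_tokens_per_sample=8,
        max_num_tokens=12,
    )
)
print(packed)
```

With these toy limits the length-20 sample is dropped, and the length-7 sample starts a new batch because adding it to the first one would exceed the hard cap.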

The dataset reseeds at iterator start so that AutoModel (AM) sees a deterministic, BAGEL-compatible packed-data stream regardless of earlier RNG consumption during model construction or checkpoint loading.
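The effect of reseeding at iterator start can be illustrated with a small sketch. `ReseedingStream` is a hypothetical class that uses a private `random.Random` created inside `__iter__`; the real dataset may reseed other generators instead, so treat this as one simple way to get the same property rather than the actual mechanism.

```python
import itertools
import random
from typing import Iterator, List


class ReseedingStream:
    """Hypothetical weighted sampler that reseeds whenever iteration starts."""

    def __init__(self, groups: List[str], weights: List[float], seed: int = 0) -> None:
        self.groups = groups
        self.weights = weights
        self.seed = seed

    def __iter__(self) -> Iterator[str]:
        # A fresh RNG per iterator: earlier random draws elsewhere in the
        # process cannot change the sequence produced here.
        rng = random.Random(self.seed)
        while True:
            yield rng.choices(self.groups, weights=self.weights, k=1)[0]


stream = ReseedingStream(["group_a", "group_b"], [0.75, 0.25], seed=42)
first_run = list(itertools.islice(stream, 8))

# Simulate unrelated RNG consumption, e.g. during model construction.
for _ in range(1000):
    random.random()

second_run = list(itertools.islice(stream, 8))
print(first_run == second_run)
```

Both runs produce the same group sequence because each call to `__iter__` starts from the same seed.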

_drop_counters = {}
_resume_buffer = []
_resume_sequence_status = self.set_sequence_status()
_yielded_batches = 0
nemo_automodel.components.datasets.multimodal.packing.PackedDataset.__iter__()
nemo_automodel.components.datasets.multimodal.packing.PackedDataset._grouped_dataset_state_dicts()
nemo_automodel.components.datasets.multimodal.packing.PackedDataset._load_grouped_dataset_state_dicts(
states
)
nemo_automodel.components.datasets.multimodal.packing.PackedDataset._load_rng_state_dict(
state
)
nemo_automodel.components.datasets.multimodal.packing.PackedDataset._log_drop(
reason,
message,
args = (),
every = 100
)
nemo_automodel.components.datasets.multimodal.packing.PackedDataset._rng_state_dict()
nemo_automodel.components.datasets.multimodal.packing.PackedDataset._set_resume_point(
buffer,
yielded_batches
)
nemo_automodel.components.datasets.multimodal.packing.PackedDataset.build_datasets(
datasets_metainfo,
data_status
)
nemo_automodel.components.datasets.multimodal.packing.PackedDataset.load_state_dict(
state_dict
)
nemo_automodel.components.datasets.multimodal.packing.PackedDataset.pack_sequence(
sample,
sequence_status
)
nemo_automodel.components.datasets.multimodal.packing.PackedDataset.set_epoch(
seed
)
nemo_automodel.components.datasets.multimodal.packing.PackedDataset.set_sequence_status()
nemo_automodel.components.datasets.multimodal.packing.PackedDataset.state_dict()
nemo_automodel.components.datasets.multimodal.packing.PackedDataset.to_tensor(
sequence_status
)
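The `state_dict()` and `load_state_dict()` pair suggests a checkpoint and resume flow. Below is a minimal sketch of that pattern, assuming the saved state covers at least an RNG state and a yielded-batch counter. `ResumableStream` and its dictionary keys are hypothetical; the real state layout is not documented on this page.

```python
import random
from typing import Any, Dict, Iterator


class ResumableStream:
    """Hypothetical stream whose position can be saved and restored."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._yielded_batches = 0

    def __iter__(self) -> Iterator[int]:
        while True:
            batch = self._rng.randrange(1_000_000)  # stand-in for a packed batch
            self._yielded_batches += 1
            yield batch

    def state_dict(self) -> Dict[str, Any]:
        # Capture everything needed to continue the stream where it stopped.
        return {
            "rng": self._rng.getstate(),
            "yielded_batches": self._yielded_batches,
        }

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self._rng.setstate(state_dict["rng"])
        self._yielded_batches = state_dict["yielded_batches"]


original = ResumableStream(seed=7)
it = iter(original)
consumed = [next(it) for _ in range(3)]
checkpoint = original.state_dict()
expected_next = [next(it) for _ in range(2)]

resumed = ResumableStream(seed=123)  # a different seed, overwritten on load
resumed.load_state_dict(checkpoint)
resumed_it = iter(resumed)
resumed_next = [next(resumed_it) for _ in range(2)]
print(resumed_next == expected_next)
```

The resumed stream continues with the same batches the original would have produced after the checkpoint.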
nemo_automodel.components.datasets.multimodal.packing.logger = logging.getLogger(__name__)
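The `_log_drop(reason, message, args=(), every=100)` signature and the `_drop_counters` attribute point to rate-limited logging of dropped samples. The sketch below is one plausible reading: count drops per reason and log only the first and every `every`-th occurrence. `DropLogger` and the exact logging rule are assumptions, not the library's code.

```python
import logging
from typing import Dict, Tuple

logger = logging.getLogger(__name__)


class DropLogger:
    """Hypothetical per-reason drop counter with rate-limited logging."""

    def __init__(self) -> None:
        self._drop_counters: Dict[str, int] = {}

    def log_drop(
        self, reason: str, message: str, args: Tuple = (), every: int = 100
    ) -> bool:
        count = self._drop_counters.get(reason, 0) + 1
        self._drop_counters[reason] = count
        # Assumed rule: log the first drop, then one in every `every` drops.
        if count == 1 or count % every == 0:
            logger.warning(message + " (drop #%d, reason=%s)", *args, count, reason)
            return True
        return False


drops = DropLogger()
logged = [
    drops.log_drop("too_long", "sample has %d tokens", (20000,), every=3)
    for _ in range(7)
]
print(logged)
```

Keeping one counter per reason means a frequent drop cause cannot hide a rare one in the log output.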