bridge.data.vlm_datasets.step37_flickr8k.packed_dataloader#

Synchronous MixedPackedDataloader.

Instead of being a stateful __next__ iterator, this class exposes __len__ and __getitem__(idx), so it plugs into mbridge’s MegatronPretrainingSampler + standard PyTorch DataLoader flow.
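The difference between the two access styles can be sketched with two toy classes. Both StatefulPacks and MapStylePacks are hypothetical names used only for illustration; neither is part of this module, and the real class additionally subclasses torch.utils.data.Dataset, which this dependency-free sketch omits.

```python
class StatefulPacks:
    """Iterator style: each call to __next__ advances a hidden cursor."""

    def __init__(self, packs):
        self._packs = list(packs)
        self._cursor = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._cursor >= len(self._packs):
            raise StopIteration
        pack = self._packs[self._cursor]
        self._cursor += 1
        return pack


class MapStylePacks:
    """Map style: the result depends only on the requested index."""

    def __init__(self, packs):
        self._packs = list(packs)

    def __len__(self):
        return len(self._packs)

    def __getitem__(self, idx):
        return self._packs[idx]


packs = [("a", "b"), ("c",), ("d", "e", "f")]
stateful = StatefulPacks(packs)
map_style = MapStylePacks(packs)

# An external sampler hands out index batches in whatever order it likes;
# the map-style object serves them without tracking position.
index_batches = [[2, 0], [1]]
loaded = [[map_style[i] for i in batch] for batch in index_batches]

# The iterator can only move forward from wherever its cursor is.
first_from_iterator = next(stateful)
```

Because the map-style object keeps no position, a sampler is free to choose, shard, or reorder indices, which is the property the sampler-plus-DataLoader flow relies on.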

The internal schedule (sample order + non-truncation packing) is computed once at __init__ from fixed seeds, so the contents of the pack at a given idx are deterministic. The order in which packs are visited during training may still differ, because mbridge’s sampler shuffles pack indices independently, but each individual pack is reproducible.
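A minimal sketch of this idea follows. The helper build_schedule, the sample lengths, and the seed values are all hypothetical, and the greedy in-order packing shown here is an assumption rather than the module's actual packing algorithm. The sketch also assumes every sample fits within max_length, so oversize handling is left out.

```python
import random


def build_schedule(lengths, max_length, seed):
    """Hypothetical helper: seeded sample order + greedy non-truncating packing."""
    order = list(range(len(lengths)))
    random.Random(seed).shuffle(order)  # fixed seed -> same order on every run

    packs, current, used = [], [], 0
    for sample_idx in order:
        n = lengths[sample_idx]
        # Never split a sample: close the current pack if it would overflow.
        if current and used + n > max_length:
            packs.append(current)
            current, used = [], 0
        current.append(sample_idx)
        used += n
    if current:
        packs.append(current)
    return packs


lengths = [5, 3, 7, 2, 6, 4]
first = build_schedule(lengths, max_length=10, seed=1234)
second = build_schedule(lengths, max_length=10, seed=1234)

# A sampler may visit pack indices in any order; the contents of each
# pack index are unaffected by that order.
visit_order = list(range(len(first)))
random.Random(99).shuffle(visit_order)
seen = {i: first[i] for i in visit_order}
```

Here the two builds produce identical schedules because they share a seed, while visit_order changes only which pack is looked at when.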

Module Contents#

Classes#

MixedPackedDataloader

Map-style packed dataset.

API#

class bridge.data.vlm_datasets.step37_flickr8k.packed_dataloader.MixedPackedDataloader(
datasets: list,
epochs: list[float],
max_length: int,
oversize_policy: Literal[drop, extend] = 'extend',
transform: Optional[collections.abc.Callable] = None,
dataset_sampling: Union[Literal[sequential, random], list[Literal[sequential, random]]] = 'random',
)#

Bases: torch.utils.data.Dataset

Map-style packed dataset.

Returns a fully assembled packed sample (already passed through transform) for each index. Used by Step37Flickr8kSFTDataProvider to feed mbridge’s standard MegatronPretrainingSampler + DataLoader.

Initialization

static _normalize_dataset_sampling(
dataset_sampling: Union[Literal[sequential, random], list[Literal[sequential, random]]],
num_datasets: int,
) → list[Literal[sequential, random]]#
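The signature above suggests that a single strategy is expanded to one entry per dataset, while a list is passed through. The sketch below is inferred from the signature alone: the function name without the leading underscore, the VALID tuple, and the validation rules are assumptions, not the module's confirmed behavior.

```python
from typing import List, Union

VALID = ("sequential", "random")


def normalize_dataset_sampling(
    dataset_sampling: Union[str, List[str]],
    num_datasets: int,
) -> List[str]:
    """Hypothetical sketch: return one sampling strategy per dataset."""
    if isinstance(dataset_sampling, str):
        # A single strategy applies to every dataset.
        strategies = [dataset_sampling] * num_datasets
    else:
        strategies = list(dataset_sampling)

    # Assumed checks: length must match and every entry must be known.
    if len(strategies) != num_datasets:
        raise ValueError(
            f"expected {num_datasets} strategies, got {len(strategies)}"
        )
    for strategy in strategies:
        if strategy not in VALID:
            raise ValueError(f"unknown sampling strategy: {strategy!r}")
    return strategies


broadcast = normalize_dataset_sampling("random", 3)
per_dataset = normalize_dataset_sampling(["sequential", "random"], 2)
```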
static _build_in_domain_sampler(
sampling_strategy: Literal[sequential, random],
size: int,
idx: int,
) → Union[megatron.bridge.data.vlm_datasets.step37_flickr8k.samplers.LoopedShuffleSampler, megatron.bridge.data.vlm_datasets.step37_flickr8k.samplers.LoopedSequentialSampler]#
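The two return types suggest samplers that keep yielding indices past the end of a dataset, presumably so the fractional values in epochs can be honored. The generators below are hypothetical stand-ins, not the real LoopedShuffleSampler or LoopedSequentialSampler. Treating idx as a dataset position that only offsets the seed is also an assumption, as is the base_seed parameter.

```python
import itertools
import random


def looped_sequential(size):
    """Yield 0..size-1 over and over (hypothetical stand-in)."""
    while True:
        yield from range(size)


def looped_shuffle(size, seed):
    """Yield a fresh seeded permutation of 0..size-1 on every pass."""
    rng = random.Random(seed)
    while True:
        perm = list(range(size))
        rng.shuffle(perm)
        yield from perm


def build_in_domain_sampler(sampling_strategy, size, idx, base_seed=0):
    # Assumption: idx is the dataset's position and only offsets the seed.
    if sampling_strategy == "sequential":
        return looped_sequential(size)
    if sampling_strategy == "random":
        return looped_shuffle(size, seed=base_seed + idx)
    raise ValueError(f"unknown sampling strategy: {sampling_strategy!r}")


seq = list(itertools.islice(build_in_domain_sampler("sequential", 3, idx=0), 7))
rnd_a = list(itertools.islice(build_in_domain_sampler("random", 4, idx=1), 8))
rnd_b = list(itertools.islice(build_in_domain_sampler("random", 4, idx=1), 8))
```

In this sketch each pass of the shuffled sampler covers every index exactly once, and rebuilding the sampler with the same arguments replays the same stream.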
_schedule_all(
max_length: int,
oversize_policy: str = 'drop',
) → tuple[list[tuple[int, int]], megatron.bridge.data.vlm_datasets.step37_flickr8k.packing.PackingResult]#
__len__() → int#
__getitem__(idx: int) → Any#

Assemble the pack at index idx without using a mutable internal cursor.

Returns the same result for a given idx on every call: the precomputed in-domain order selects the same samples, which are then run through transform.
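Cursor-free assembly of this kind might look like the sketch below. ToyPackAssembler is a hypothetical class, and storing each pack as a list of (dataset_idx, sample_idx) pairs is an assumption loosely based on the tuple[int, int] entries in the _schedule_all return type.

```python
class ToyPackAssembler:
    """Hypothetical sketch of cursor-free pack assembly (not the real class)."""

    def __init__(self, datasets, packs, transform=None):
        self._datasets = datasets
        # packs[i] is a fixed list of (dataset_idx, sample_idx) pairs,
        # decided once up front and never mutated afterwards.
        self._packs = packs
        self._transform = transform

    def __len__(self):
        return len(self._packs)

    def __getitem__(self, idx):
        # Look up the scheduled samples, then apply the transform.
        samples = [self._datasets[d][s] for d, s in self._packs[idx]]
        if self._transform is None:
            return samples
        return self._transform(samples)


datasets = [["a0", "a1", "a2"], ["b0", "b1"]]
packs = [[(0, 2), (1, 0)], [(0, 0)], [(1, 1), (0, 1)]]
toy = ToyPackAssembler(datasets, packs, transform=lambda xs: "|".join(xs))

# Indices can be requested in any order, and repeatedly.
out_of_order = [toy[2], toy[0], toy[2]]
```

The repeated request for index 2 returns the same value both times. That guarantee extends to the transformed output only when transform is itself deterministic.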