bridge.data.vlm_datasets.step37_flickr8k.packed_dataloader#
Synchronous MixedPackedDataloader.
Instead of being a stateful __next__ iterator, this exposes __len__ and
__getitem__(idx) so it plugs into mbridge’s MegatronPretrainingSampler + standard PyTorch DataLoader flow.
The internal schedule (sample order + non-truncation packing) is computed
once at __init__ from fixed seeds, so the contents of pack idx are
deterministic. Per-step ordering across the train loop may still differ
because mbridge’s sampler shuffles pack indices independently — but each
individual pack is reproducible.
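The schedule described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name build_schedule, its seed parameter, and the greedy packing strategy are hypothetical. The meanings given to the two oversize policies ('drop' skips a sample longer than max_length, 'extend' gives it a pack of its own that exceeds max_length) are assumptions that the source does not confirm.

```python
import random
from typing import Literal


def build_schedule(
    lengths: list[int],
    max_length: int,
    oversize_policy: Literal["drop", "extend"] = "extend",
    seed: int = 0,
) -> list[list[int]]:
    """Return a list of packs; each pack is a list of sample indices."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    order = list(range(len(lengths)))
    rng.shuffle(order)  # fixed seed -> fixed sample order

    packs: list[list[int]] = []
    current: list[int] = []
    used = 0
    for i in order:
        n = lengths[i]
        if n > max_length:
            # Assumed policy semantics (hypothetical, see lead-in).
            if oversize_policy == "extend":
                packs.append([i])  # oversize sample gets its own, longer pack
            continue  # "drop": the sample is skipped entirely
        if used + n > max_length:
            # Non-truncation packing: close the pack instead of cutting a sample.
            packs.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        packs.append(current)
    return packs
```

Because the schedule depends only on the sample lengths, max_length, the policy, and the seed, calling build_schedule twice with the same arguments yields the same list of packs. That is the property that lets an external sampler shuffle pack indices without changing what any single pack contains.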
Module Contents#
Classes#
MixedPackedDataloader | Map-style packed dataset.
API#
- class bridge.data.vlm_datasets.step37_flickr8k.packed_dataloader.MixedPackedDataloader(
- datasets: list,
- epochs: list[float],
- max_length: int,
- oversize_policy: Literal[drop, extend] = 'extend',
- transform: Optional[collections.abc.Callable] = None,
- dataset_sampling: Union[Literal[sequential, random], list[Literal[sequential, random]]] = 'random',
Bases: torch.utils.data.Dataset
Map-style packed dataset.
Returns a fully assembled packed sample (already passed through
transform) for each index. Used by Step37Flickr8kSFTDataProvider to feed mbridge’s standard MegatronPretrainingSampler + DataLoader.
Initialization
- static _normalize_dataset_sampling(
- dataset_sampling: Union[Literal[sequential, random], list[Literal[sequential, random]]],
- num_datasets: int,
- static _build_in_domain_sampler(
- sampling_strategy: Literal[sequential, random],
- size: int,
- idx: int,
- _schedule_all(
- max_length: int,
- oversize_policy: str = 'drop',
- __len__() -> int
- __getitem__(idx: int) -> Any
Assemble the pack at index idx without using a mutable internal cursor.
Returns the same result for a given idx on every call: the precomputed in-domain order selects the same samples, which are then run through transform.
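The cursor-free lookup can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the real implementation: TinyPackedDataset and its pack_size and seed parameters are hypothetical, packs are formed by sample count rather than by token length, and the class does not subclass torch.utils.data.Dataset so that the example runs without PyTorch installed.

```python
import random
from typing import Any, Callable, Optional, Sequence


class TinyPackedDataset:
    """Map-style sketch: every index resolves to a precomputed pack."""

    def __init__(
        self,
        samples: Sequence[Any],
        pack_size: int,
        seed: int = 0,
        transform: Optional[Callable[[list], Any]] = None,
    ) -> None:
        self.samples = samples
        self.transform = transform
        order = list(range(len(samples)))
        random.Random(seed).shuffle(order)  # order fixed once, at construction
        # Simplification: group by count; the documented class packs up to max_length.
        self._packs = [
            order[i:i + pack_size] for i in range(0, len(order), pack_size)
        ]

    def __len__(self) -> int:
        return len(self._packs)

    def __getitem__(self, idx: int) -> Any:
        # No cursor is read or advanced: the result depends only on idx.
        pack = [self.samples[i] for i in self._packs[idx]]
        return self.transform(pack) if self.transform is not None else pack


ds = TinyPackedDataset(list("abcdefg"), pack_size=3, seed=42, transform="".join)
forward = [ds[i] for i in range(len(ds))]
backward = [ds[i] for i in reversed(range(len(ds)))]
print(forward == backward[::-1])  # visiting order does not change any pack
```

Since __getitem__ keeps no state between calls, an index-based sampler can request packs in any order, or request the same pack twice, and still receive identical contents.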