bridge.data.vlm_datasets.step37_flickr8k.provider#
mbridge DatasetProvider that exposes the Flickr8k → packed-sample
pipeline for Step3.7 SFT.
This is the single integration point between the data primitives
(template, dataset, samplers, packed dataloader, pack transform) and
Megatron-Bridge’s setup → build_pretraining_data_loader flow.
What this provider does (deterministic):
1. Download intro/flickr8k train CSV + per-row JPGs via huggingface_hub (sync, no async wrapping).
2. Build :class: Step37Flickr8kDataset with a fresh :class: Step37MultimodalTemplate (loaded from tokenizer_path with trust_remote_code=False).
3. Build a sync :class: MixedPackedDataloader that probes every sample for its NTP length, runs the weighted in/cross-domain samplers, then runs non_truncation.pack(max_len=...).
4. Return the packed dataloader as a map-style Dataset so mbridge’s MegatronPretrainingSampler can drive it. Validation / test splits are skipped (Flickr8k only has a train split here).
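The packing stage described above can be illustrated with a minimal sketch. This is not the non_truncation.pack implementation, whose internals are not shown on this page. It is a hypothetical greedy first-fit packer, pack_without_truncation, which assumes each sample is described only by its NTP length and that oversized samples are dropped rather than split.

```python
from typing import List


def pack_without_truncation(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into packs whose total length is <= max_len.

    Hypothetical helper for illustration only: greedy first-fit, no sample
    is ever split or truncated, and samples longer than max_len are dropped.
    """
    packs: List[List[int]] = []  # sample indices held by each pack
    totals: List[int] = []       # running token count of each pack
    for idx, length in enumerate(lengths):
        if length > max_len:
            continue  # assumed "drop" behaviour for oversized samples
        for p, total in enumerate(totals):
            if total + length <= max_len:
                packs[p].append(idx)
                totals[p] += length
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([idx])
            totals.append(length)
    return packs


packs = pack_without_truncation([900, 1200, 700, 3000, 400], max_len=2048)
```

Here samples 0, 2 and 4 share one pack (2000 tokens), sample 1 gets its own pack, and sample 3 is dropped because it exceeds max_len on its own.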
The collate is the identity — MixedPackedDataloader[idx] already
returns the packed dict; we just unwrap the singleton mini-batch list.
The downstream forward step (step37_flickr8k_step) does the GPU
move + image loading via :func:preprocess_packed_batch.
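A minimal sketch of such an identity collate follows. The function name unwrap_singleton_collate is hypothetical, and the sketch assumes the loader always hands the collate a list holding exactly one packed dict.

```python
from typing import Any, Dict, List


def unwrap_singleton_collate(batch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the single packed dict unchanged (no stacking, no padding)."""
    if len(batch) != 1:
        # A pack is already a full micro-batch, so anything else is a bug.
        raise ValueError(f"expected exactly 1 pack per mini-batch, got {len(batch)}")
    return batch[0]


packed = {"input_ids": [1, 2, 3], "cu_seqlens": [0, 3]}
out = unwrap_singleton_collate([packed])
```

Because the dict is returned as-is, any tensor movement or image loading still has to happen later, in the forward step.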
Module Contents#
Classes#
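| _FixedPackDataset | Pin every __getitem__ to the same pack, regardless of idx. |
| Step37Flickr8kSFTDataProvider | Step3.7 Flickr8k SFT dataset provider. |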
API#
- class bridge.data.vlm_datasets.step37_flickr8k.provider._FixedPackDataset(inner, fixed_idx: int)#
Bases: torch.utils.data.Dataset
Pin every __getitem__ to the same pack, regardless of idx.
Wraps a :class: MixedPackedDataloader so the Megatron sampler can hand out arbitrary indices on every DP rank, every step, and they all map to pack fixed_idx. __len__ is reported as a large sentinel (_SENTINEL_LEN) because mbridge size-checks len(dataset) against global_batch_size × train_iters.
Initialization
- _SENTINEL_LEN#
10000000
- __len__() int#
- __getitem__(idx: int)#
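The behaviour of this wrapper can be sketched without torch. FixedPackView below is a hypothetical stand-in, not the actual class: it assumes the inner object supports integer indexing and reuses the sentinel length listed above.

```python
class FixedPackView:
    """Hypothetical stand-in for _FixedPackDataset (no torch dependency)."""

    SENTINEL_LEN = 10_000_000  # value listed for _SENTINEL_LEN above

    def __init__(self, inner, fixed_idx: int):
        self.inner = inner
        self.fixed_idx = fixed_idx

    def __len__(self) -> int:
        # Deliberately not len(inner): the reported size must pass a
        # global_batch_size * train_iters style check.
        return self.SENTINEL_LEN

    def __getitem__(self, idx: int):
        # The requested idx is ignored; every lookup returns the same pack.
        return self.inner[self.fixed_idx]


packs = [{"pack": 0}, {"pack": 1}, {"pack": 2}]
view = FixedPackView(packs, fixed_idx=1)
first, far = view[0], view[123_456]
```

Both lookups return pack 1, which is what lets every DP rank see identical input on every step.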
- class bridge.data.vlm_datasets.step37_flickr8k.provider.Step37Flickr8kSFTDataProvider#
Bases: megatron.bridge.training.config.DatasetProvider
Step3.7 Flickr8k SFT dataset provider.
Set cfg.dataset = Step37Flickr8kSFTDataProvider(...) on a Step3.7 SFT recipe to swap the default CORD-V2 path for Flickr8k packing. Use step37_flickr8k_step as the forward step so the per-step preprocess loads images + builds ImageForInsert.
Note: trust_remote_code is forced False for the tokenizer load. We never instantiate any HF custom Python code.
- tokenizer_path: str#
None
Local HF snapshot path with tokenizer.json + chat_template.jinja.
- repo_id: str#
‘intro/flickr8k’
- split: str#
‘train’
- sample_count: Optional[int]#
8
Take only the first N samples; the default of 8 is intended for a smoke run.
The full Flickr8k train split is ~6000 image+caption pairs (~1 GB of JPGs). With None, every row is fetched via hf_hub_download, which takes 10+ minutes on a cold cache and is almost never what a user wants. Set it explicitly to None from a recipe / CLI override to opt into the full dataset.
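The None-versus-integer semantics can be sketched as follows. take_first is a hypothetical helper written for illustration, and it assumes the rows are already available as an in-memory sequence rather than fetched lazily.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def take_first(rows: Sequence[T], sample_count: Optional[int]) -> List[T]:
    """Keep the first sample_count rows, or every row when it is None."""
    if sample_count is None:
        return list(rows)  # explicit opt-in to the full dataset
    return list(rows[:sample_count])


rows = [f"img_{i}.jpg" for i in range(20)]
smoke = take_first(rows, 8)
full = take_first(rows, None)
```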
- caption_key: str#
‘caption_0’
- cache_dir: str#
‘.cache/step37_flickr8k’
- prompt: str#
‘Describe this image in one sentence.’
- image_token_count: int#
None
- patch_token_count: int#
None
- image_token: str#
None
- image_start_token: str#
None
- image_end_token: str#
None
- patch_start_token: str#
None
- patch_end_token: str#
None
- max_packing_seqlen: int#
2048
Max number of NTP-length tokens per pack.
- seqlen_divisible_by: int#
64
- oversize_policy: Literal[drop, extend]#
‘drop’
- dataset_sampling: Literal[sequential, random]#
‘random’
- fixed_pack_idx: Optional[int]#
None
If set, __getitem__ always returns the pack at this index, ignoring the requested idx. Used by the smoke recipe to feed identical input to every DP rank on every iteration (deterministic single-pack overfit). __len__ is reported as a large sentinel so the Megatron sampler can request any index without IndexError. Leave None for normal training.
- img_start_token_id: int#
None
Tokenizer id for <im_start>. Resolved at build time from the actual tokenizer if left at the sentinel -1.
- patch_start_token_id: int#
None
Tokenizer id for <patch_start>. Same sentinel rule.
- image_size: int#
728
- patch_image_size: int#
504
- encoder_patch_size: int#
14
- seq_length: int#
2048
- dataloader_type: Optional[Literal[single, cyclic, external]]#
‘single’
- skip_getting_attention_mask_from_dataset: bool#
True
- global_data_keys: list#
‘field(…)’
Batch keys broadcast to every PP rank (PP ranks > 0 need cu_seqlens / position_id even though input_ids / images are only on PP rank 0).
- __post_init__()#
- _make_template() megatron.bridge.data.vlm_datasets.step37_flickr8k.template.Step37MultimodalTemplate#
- _resolve_special_token_ids(
- template: megatron.bridge.data.vlm_datasets.step37_flickr8k.template.Step37MultimodalTemplate,
Fill in img_start_token_id / patch_start_token_id from the tokenizer if the user left them at the sentinel value.
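The sentinel rule can be sketched as below. Everything here is illustrative: StubTokenizer, TokenIds, resolve_special_token_ids and the ids 101 / 202 are hypothetical, and -1 is assumed to be the "unset" marker, as the field descriptions suggest. The stub mirrors the convert_tokens_to_ids method that Hugging Face tokenizers expose.

```python
from dataclasses import dataclass

SENTINEL = -1  # assumed "unset" marker, per the field descriptions


class StubTokenizer:
    """Tiny stand-in exposing convert_tokens_to_ids like HF tokenizers do."""

    def __init__(self, vocab):
        self._vocab = vocab

    def convert_tokens_to_ids(self, token: str) -> int:
        return self._vocab[token]


@dataclass
class TokenIds:
    img_start_token_id: int = SENTINEL
    patch_start_token_id: int = SENTINEL


def resolve_special_token_ids(ids: TokenIds, tokenizer) -> TokenIds:
    """Look up only the ids still at the sentinel; keep user overrides."""
    if ids.img_start_token_id == SENTINEL:
        ids.img_start_token_id = tokenizer.convert_tokens_to_ids("<im_start>")
    if ids.patch_start_token_id == SENTINEL:
        ids.patch_start_token_id = tokenizer.convert_tokens_to_ids("<patch_start>")
    return ids


tok = StubTokenizer({"<im_start>": 101, "<patch_start>": 202})
resolved = resolve_special_token_ids(TokenIds(patch_start_token_id=7), tok)
```

In this run img_start_token_id is filled from the tokenizer while the explicitly supplied patch_start_token_id is left untouched.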
- _build_train_packed_dataloader() megatron.bridge.data.vlm_datasets.step37_flickr8k.packed_dataloader.MixedPackedDataloader#
- build_datasets(
- context: megatron.bridge.training.config.DatasetBuildContext,
Build train (packed) / valid / test datasets.
Flickr8k has no canonical val/test split here, so we return None for those two and let mbridge skip eval. (Override split=... if you want to repurpose the train split for validation instead.)