`bridge.data.vlm_datasets.step37_flickr8k.provider`#

mbridge DatasetProvider that exposes the Flickr8k → packed-sample pipeline for Step3.7 SFT.

This is the single integration point between the data primitives (template, dataset, samplers, packed dataloader, pack transform) and Megatron-Bridge’s setup → build_pretraining_data_loader flow.

What this provider does (deterministic):

1. Download the intro/flickr8k train CSV and per-row JPGs via huggingface_hub (sync, no async wrapping).
2. Build Step37Flickr8kDataset with a fresh Step37MultimodalTemplate (loaded from tokenizer_path with trust_remote_code=False).
3. Build a sync MixedPackedDataloader that probes every sample for its NTP length, runs the weighted in/cross-domain samplers, then runs non_truncation.pack(max_len=...).
4. Return the packed dataloader as a map-style Dataset so mbridge’s MegatronPretrainingSampler can drive it. Validation / test splits are skipped (Flickr8k only has a train split here).
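The packing step in the pipeline above can be illustrated with a minimal sketch. It assumes a greedy first-fit strategy and a hypothetical helper name, `pack_without_truncation`; the actual algorithm behind non_truncation.pack is not described here and may differ. Skipping oversize samples whole is likewise an assumption, chosen only to show that no sample is ever cut.

```python
from typing import List


def pack_without_truncation(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into packs whose total length stays <= max_len.

    Hypothetical stand-in for the real pack step: greedy first-fit,
    with oversize samples skipped whole instead of being truncated.
    """
    packs: List[List[int]] = []  # sample indices per pack
    totals: List[int] = []       # running token count per pack
    for idx, n in enumerate(lengths):
        if n > max_len:
            continue  # assumption: an oversize sample is dropped, never cut
        for p, used in enumerate(totals):
            if used + n <= max_len:
                # First existing pack with enough room takes the sample.
                packs[p].append(idx)
                totals[p] += n
                break
        else:
            # No existing pack had room, so open a new one.
            packs.append([idx])
            totals.append(n)
    return packs


if __name__ == "__main__":
    print(pack_without_truncation([1000, 900, 1200, 100, 3000], max_len=2048))
```

Each inner list holds the indices of the samples that share one pack, so a pack never exceeds max_len and no sample is split across packs.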

The collate is the identity: MixedPackedDataloader[idx] already returns the packed dict, so we just unwrap the singleton mini-batch list. The downstream forward step (step37_flickr8k_step) does the GPU move and image loading via preprocess_packed_batch.
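A minimal sketch of such an identity collate follows. The function name `unwrap_singleton_collate` and the explicit length check are illustrative assumptions, not the provider's actual code.

```python
from typing import Any, Dict, List


def unwrap_singleton_collate(batch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the single packed dict from a one-element mini-batch list.

    Hypothetical helper: the pack is passed through untouched, and only
    the list wrapper added by the dataloader is removed.
    """
    if len(batch) != 1:
        # Assumption: each mini-batch holds exactly one pack.
        raise ValueError(f"expected exactly one pack per mini-batch, got {len(batch)}")
    return batch[0]
```

Because the pack is returned by reference, no tensors are copied or stacked at collate time.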

Module Contents#

Classes#

`_FixedPackDataset`	Pin every `__getitem__` to the same pack, regardless of `idx`.
`Step37Flickr8kSFTDataProvider`	Step3.7 Flickr8k SFT dataset provider.

API#

class bridge.data.vlm_datasets.step37_flickr8k.provider._FixedPackDataset(inner, fixed_idx: int)#

Bases: torch.utils.data.Dataset

Pin every __getitem__ to the same pack, regardless of idx.

Wraps a MixedPackedDataloader so the Megatron sampler can hand out arbitrary indices on every DP rank, every step, and they all map to pack fixed_idx. __len__ is reported as a large sentinel (_SENTINEL_LEN) because mbridge size-checks len(dataset) against global_batch_size × train_iters.

Initialization

_SENTINEL_LEN#: 10000000

__len__() → int#

__getitem__(idx: int)#
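The behaviour described for this class can be sketched as below. The sketch deliberately omits the torch.utils.data.Dataset base so that it runs without PyTorch; the class name `FixedPackSketch` and the bounds check in the constructor are assumptions for illustration.

```python
from typing import Any, Sequence


class FixedPackSketch:
    """Illustrative wrapper: every index maps to the same underlying pack."""

    SENTINEL_LEN = 10_000_000  # mirrors the documented _SENTINEL_LEN value

    def __init__(self, inner: Sequence[Any], fixed_idx: int) -> None:
        # Assumption: fail early if the pinned index does not exist.
        if not 0 <= fixed_idx < len(inner):
            raise IndexError(f"fixed_idx {fixed_idx} out of range for {len(inner)} packs")
        self.inner = inner
        self.fixed_idx = fixed_idx

    def __len__(self) -> int:
        # Large reported length so a size check against
        # global_batch_size * train_iters passes.
        return self.SENTINEL_LEN

    def __getitem__(self, idx: int) -> Any:
        # The requested idx is ignored on purpose.
        return self.inner[self.fixed_idx]
```

With this shape, any index the sampler requests returns the pinned pack, which is what makes a single-pack overfit run feed identical input to every rank.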

class bridge.data.vlm_datasets.step37_flickr8k.provider.Step37Flickr8kSFTDataProvider#

Bases: megatron.bridge.training.config.DatasetProvider

Step3.7 Flickr8k SFT dataset provider.

Set cfg.dataset = Step37Flickr8kSFTDataProvider(...) on a Step3.7 SFT recipe to swap the default CORD-V2 path for Flickr8k packing. Use step37_flickr8k_step as the forward step so the per-step preprocess loads images + builds ImageForInsert.

Note: trust_remote_code is forced False for the tokenizer load. We never instantiate any HF custom Python code.

tokenizer_path: str#

None

Local HF snapshot path with tokenizer.json + chat_template.jinja.

repo_id: str#: ‘intro/flickr8k’

split: str#: ‘train’

sample_count: Optional[int]#

Take only the first N samples. The default is 8, which suits a smoke run.

The full Flickr8k train split is ~6000 image+caption pairs (~1 GB of JPGs); setting this to None triggers an hf_hub_download for every row, which takes 10+ minutes on a cold cache and is almost never what a user wants. Set it explicitly to None from a recipe / CLI override to opt into the full dataset.

caption_key: str#: ‘caption_0’

cache_dir: str#: ‘.cache/step37_flickr8k’

prompt: str#: ‘Describe this image in one sentence.’

image_token_count: int#: None

patch_token_count: int#: None

image_token: str#: None

image_start_token: str#: None

image_end_token: str#: None

patch_start_token: str#: None

patch_end_token: str#: None

max_packing_seqlen: int#

2048

Max number of NTP-length tokens per pack.

seqlen_divisible_by: int#: 64

oversize_policy: Literal[drop, extend]#: ‘drop’

dataset_sampling: Literal[sequential, random]#: ‘random’

fixed_pack_idx: Optional[int]#

None

If set, __getitem__ always returns the pack at this index, ignoring the requested idx. Used by the smoke recipe to feed identical input to every DP rank on every iteration (deterministic single-pack overfit). __len__ is reported as a large sentinel so the Megatron sampler can request any index without IndexError. Leave None for normal training.

img_start_token_id: int#

None

Tokenizer id for <im_start>. Resolved at build time from the actual tokenizer if left at the sentinel -1.

patch_start_token_id: int#

None

Tokenizer id for <patch_start>. Same sentinel rule.

image_size: int#: 728

patch_image_size: int#: 504

encoder_patch_size: int#: 14

seq_length: int#: 2048

dataloader_type: Optional[Literal[single, cyclic, external]]#: ‘single’

skip_getting_attention_mask_from_dataset: bool#: True

global_data_keys: list#

‘field(…)’

Batch keys broadcast to every PP rank (PP ranks > 0 need cu_seqlens / position_id even though input_ids / images are only on PP rank 0).

__post_init__()#

_make_template() → megatron.bridge.data.vlm_datasets.step37_flickr8k.template.Step37MultimodalTemplate#

_resolve_special_token_ids( template: megatron.bridge.data.vlm_datasets.step37_flickr8k.template.Step37MultimodalTemplate, ) → None#: Fill in img_start_token_id / patch_start_token_id from the tokenizer if the user left them at the sentinel value.
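The sentinel rule can be sketched as follows. A plain mapping stands in for the tokenizer vocabulary, and the helper name `resolve_token_id` is hypothetical; the real method works on the template's tokenizer and fills the provider fields in place.

```python
from typing import Mapping

SENTINEL = -1  # documented "not set" marker for the token-id fields


def resolve_token_id(configured_id: int, token: str, vocab: Mapping[str, int]) -> int:
    """Return configured_id unless it is the sentinel, else look the token up.

    Hypothetical helper: `vocab` is a stand-in for a tokenizer's
    token-to-id table.
    """
    if configured_id != SENTINEL:
        return configured_id  # an explicit user value always wins
    if token not in vocab:
        raise KeyError(f"token {token!r} not found in tokenizer vocabulary")
    return vocab[token]
```

For example, with `vocab = {"<im_start>": 7}`, `resolve_token_id(-1, "<im_start>", vocab)` returns 7, while `resolve_token_id(42, "<im_start>", vocab)` keeps the explicit 42.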

_build_train_packed_dataloader() → megatron.bridge.data.vlm_datasets.step37_flickr8k.packed_dataloader.MixedPackedDataloader#

build_datasets( context: megatron.bridge.training.config.DatasetBuildContext, ) → Tuple[Optional[Any], Optional[Any], Optional[Any]]#

Build train (packed) / valid / test datasets.

Flickr8k has no canonical val/test split here, so we return None for those two and let mbridge skip eval. (Override split=... if you want to repurpose the train split for validation instead.)
