bridge.data.vlm_datasets.step37_flickr8k.provider#

mbridge DatasetProvider that exposes the Flickr8k → packed-sample pipeline for Step3.7 SFT.

This is the single integration point between the data primitives (template, dataset, samplers, packed dataloader, pack transform) and Megatron-Bridge’s setup → build_pretraining_data_loader flow.

What this provider does (deterministic):

  1. Download intro/flickr8k train CSV + per-row JPGs via huggingface_hub (sync, no async wrapping).

  2. Build :class:Step37Flickr8kDataset with a fresh :class:Step37MultimodalTemplate (loaded from tokenizer_path with trust_remote_code=False).

  3. Build a sync :class:MixedPackedDataloader that probes every sample for its NTP length, runs the weighted in/cross-domain samplers, then runs non_truncation.pack(max_len=...).

  4. Return the packed dataloader as a map-style Dataset so mbridge’s MegatronPretrainingSampler can drive it. Validation / test splits are skipped (Flickr8k only has a train split here).

The collate is the identity — MixedPackedDataloader[idx] already returns the packed dict; we just unwrap the singleton mini-batch list. The downstream forward step (step37_flickr8k_step) does the GPU move + image loading via :func:preprocess_packed_batch.
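The identity collate described above can be sketched as follows. This is a minimal illustration rather than the provider's own code: `identity_collate` is a hypothetical name, the dict keys are placeholders, and the sketch assumes the sampler hands the collate function a one-element list holding an already packed dict.

```python
from typing import Any, Dict, List


def identity_collate(batch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Unwrap a singleton mini-batch; the pack is already a complete batch."""
    if len(batch) != 1:
        raise ValueError(f"expected exactly one pack per mini-batch, got {len(batch)}")
    return batch[0]


# Stand-in for what MixedPackedDataloader[idx] might return (keys are illustrative).
pack = {"input_ids": [1, 2, 3, 4], "cu_seqlens": [0, 2, 4]}
collated = identity_collate([pack])
```

Because nothing is stacked or copied, the object that leaves the collate is the same dict the dataset produced.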

Module Contents#

Classes#

_FixedPackDataset

Pin every __getitem__ to the same pack, regardless of idx.

Step37Flickr8kSFTDataProvider

Step3.7 Flickr8k SFT dataset provider.

API#

class bridge.data.vlm_datasets.step37_flickr8k.provider._FixedPackDataset(inner, fixed_idx: int)#

Bases: torch.utils.data.Dataset

Pin every __getitem__ to the same pack, regardless of idx.

Wraps a :class:MixedPackedDataloader so the Megatron sampler can hand out arbitrary indices on every DP rank, every step, and they all map to pack fixed_idx. __len__ is reported as a large sentinel (_SENTINEL_LEN) because mbridge size-checks len(dataset) against global_batch_size × train_iters.
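A minimal sketch of the pinning behaviour, assuming only what the description above states. `FixedPackView` is a hypothetical stand-in that skips the torch.utils.data.Dataset base class to stay dependency-free, and the bounds check in `__init__` is an added assumption, not documented behaviour.

```python
class FixedPackView:
    """Hypothetical stand-in for _FixedPackDataset (no torch dependency)."""

    SENTINEL_LEN = 10_000_000  # same value as the documented _SENTINEL_LEN

    def __init__(self, inner, fixed_idx: int):
        # Assumption: fail early if the pinned index does not exist.
        if not 0 <= fixed_idx < len(inner):
            raise IndexError(f"fixed_idx {fixed_idx} is out of range")
        self.inner = inner
        self.fixed_idx = fixed_idx

    def __len__(self) -> int:
        # Large reported length so a sampler may request any index.
        return self.SENTINEL_LEN

    def __getitem__(self, idx: int):
        # The requested idx is ignored on purpose.
        return self.inner[self.fixed_idx]


packs = ["pack-0", "pack-1", "pack-2"]
view = FixedPackView(packs, fixed_idx=1)
seen = {view[i] for i in (0, 7, 9_999_999)}
```

Every requested index, including ones far beyond the three real packs, resolves to the pinned pack.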

Initialization

_SENTINEL_LEN#

10000000

__len__() → int#

__getitem__(idx: int)#

class bridge.data.vlm_datasets.step37_flickr8k.provider.Step37Flickr8kSFTDataProvider#

Bases: megatron.bridge.training.config.DatasetProvider

Step3.7 Flickr8k SFT dataset provider.

Set cfg.dataset = Step37Flickr8kSFTDataProvider(...) on a Step3.7 SFT recipe to swap the default CORD-V2 path for Flickr8k packing. Use step37_flickr8k_step as the forward step so the per-step preprocess loads images + builds ImageForInsert.

Note: trust_remote_code is forced False for the tokenizer load. We never instantiate any HF custom Python code.

tokenizer_path: str#

None

Local HF snapshot path with tokenizer.json + chat_template.jinja.

repo_id: str#

‘intro/flickr8k’

split: str#

‘train’

sample_count: Optional[int]#

8

Take only the first N samples — default 8 for a smoke run.

The full Flickr8k train split is ~6000 image+caption pairs (~1 GB of jpgs). Setting this to None triggers an hf_hub_download of every row, which takes 10+ minutes on a cold cache and is rarely what a user wants, so the default stays at 8. Pass None explicitly from a recipe / CLI override to opt into the full dataset.
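The "first N, or everything when None" semantics map naturally onto a Python slice. The sketch below is illustrative only: `take_first` and the file names are hypothetical, and the provider's actual row selection may differ.

```python
from typing import List, Optional, Sequence


def take_first(rows: Sequence[str], sample_count: Optional[int]) -> List[str]:
    """Keep the first sample_count rows; None keeps every row."""
    # A slice whose stop is None runs to the end of the sequence.
    return list(rows[:sample_count])


rows = [f"img_{i}.jpg" for i in range(20)]
smoke = take_first(rows, 8)     # smoke-run default
full = take_first(rows, None)   # explicit opt-in to the whole split
```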

caption_key: str#

‘caption_0’

cache_dir: str#

‘.cache/step37_flickr8k’

prompt: str#

‘Describe this image in one sentence.’

image_token_count: int#

None

patch_token_count: int#

None

image_token: str#

None

image_start_token: str#

None

image_end_token: str#

None

patch_start_token: str#

None

patch_end_token: str#

None

max_packing_seqlen: int#

2048

Max number of NTP-length tokens per pack.

seqlen_divisible_by: int#

64
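This page does not say how seqlen_divisible_by is applied. A common use for such a field is padding each pack length up to the next multiple, which the sketch below shows under that assumption; `round_up` is a hypothetical helper.

```python
def round_up(length: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= length."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((length + multiple - 1) // multiple) * multiple


# Lengths already on a boundary are unchanged; others move up to the next one.
padded = [round_up(n, 64) for n in (1, 64, 65, 2000)]
```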

oversize_policy: Literal[drop, extend]#

‘drop’
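The interaction between max_packing_seqlen and oversize_policy can be sketched with a greedy, order-preserving packer that never truncates a sample. This is not the non_truncation.pack implementation: `pack_lengths` is hypothetical, and the meanings assigned to the two policies ('drop' discards a sample longer than max_len, 'extend' gives it a pack of its own that exceeds max_len) are assumptions.

```python
from typing import List, Literal


def pack_lengths(
    lengths: List[int],
    max_len: int,
    oversize_policy: Literal["drop", "extend"] = "drop",
) -> List[List[int]]:
    """Group sample indices into packs whose total length stays <= max_len."""
    packs: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, n in enumerate(lengths):
        if n > max_len:
            if oversize_policy == "extend":
                packs.append([idx])  # assumed: oversize sample sits alone
            continue  # "drop": the sample is skipped entirely
        if current and used + n > max_len:
            packs.append(current)  # close the pack instead of truncating
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        packs.append(current)
    return packs


lengths = [900, 1000, 300, 2500, 700]
dropped = pack_lengths(lengths, max_len=2048, oversize_policy="drop")
extended = pack_lengths(lengths, max_len=2048, oversize_policy="extend")
```

With these lengths, sample 3 (2500 tokens) is the only oversize sample; it disappears under 'drop' and forms a single-sample pack under 'extend'.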

dataset_sampling: Literal[sequential, random]#

‘random’

fixed_pack_idx: Optional[int]#

None

If set, __getitem__ always returns the pack at this index, ignoring the requested idx. Used by the smoke recipe to feed identical input to every DP rank on every iteration (deterministic single-pack overfit). __len__ is reported as a large sentinel so the Megatron sampler can request any index without IndexError. Leave None for normal training.

img_start_token_id: int#

None

Tokenizer id for <im_start>. Resolved at build time from the actual tokenizer if left at the sentinel -1.

patch_start_token_id: int#

None

Tokenizer id for <patch_start>. Same sentinel rule.

image_size: int#

728

patch_image_size: int#

504

encoder_patch_size: int#

14
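For orientation, the three defaults above imply the following raw patch grids, assuming square inputs and a ViT-style encoder that splits each image into non-overlapping patches. How these counts relate to image_token_count and patch_token_count (for example through any downsampling) is not stated on this page, and `patches_per_image` is a hypothetical helper.

```python
def patches_per_image(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


global_patches = patches_per_image(728, 14)  # 52 x 52 grid
crop_patches = patches_per_image(504, 14)    # 36 x 36 grid
```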

seq_length: int#

2048

dataloader_type: Optional[Literal[single, cyclic, external]]#

‘single’

skip_getting_attention_mask_from_dataset: bool#

True

global_data_keys: list#

‘field(…)’

Batch keys broadcast to every PP rank: ranks with PP rank > 0 still need cu_seqlens / position_id, even though input_ids / images live only on PP rank 0.
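As background on the two keys named above: for a packed sequence, cu_seqlens conventionally holds cumulative sample boundaries and position ids restart at zero for each sample. The sketch below follows that common convention; `pack_metadata` is hypothetical and the provider's exact tensor layout is not shown here.

```python
from itertools import accumulate
from typing import List, Tuple


def pack_metadata(sample_lengths: List[int]) -> Tuple[List[int], List[int]]:
    """Return (cu_seqlens, position_ids) for samples packed back to back."""
    # Boundaries: 0, len0, len0+len1, ...
    cu_seqlens = [0, *accumulate(sample_lengths)]
    # Positions restart at 0 at the start of every sample.
    position_ids = [pos for n in sample_lengths for pos in range(n)]
    return cu_seqlens, position_ids


cu_seqlens, position_ids = pack_metadata([3, 2, 4])
```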

__post_init__()#

_make_template() → megatron.bridge.data.vlm_datasets.step37_flickr8k.template.Step37MultimodalTemplate#

_resolve_special_token_ids(
template: megatron.bridge.data.vlm_datasets.step37_flickr8k.template.Step37MultimodalTemplate,
) → None#

Fill in img_start_token_id / patch_start_token_id from the tokenizer if the user left them at the sentinel value.
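The sentinel rule can be sketched as below. Everything here is illustrative: `StubTokenizer`, `TokenIds`, and `resolve` are hypothetical, the vocabulary ids are made up, and the lookup method only mirrors the Hugging Face style `convert_tokens_to_ids` call; the real method works on the provider and its template.

```python
from dataclasses import dataclass
from typing import Dict

SENTINEL = -1  # "not set yet" marker described in the field docs


class StubTokenizer:
    """Hypothetical tokenizer exposing a Hugging Face style lookup."""

    def __init__(self, vocab: Dict[str, int]):
        self._vocab = vocab

    def convert_tokens_to_ids(self, token: str) -> int:
        return self._vocab[token]


@dataclass
class TokenIds:
    img_start_token_id: int = SENTINEL
    patch_start_token_id: int = SENTINEL


def resolve(ids: TokenIds, tokenizer: StubTokenizer) -> None:
    """Fill only the fields still at the sentinel; keep explicit values."""
    if ids.img_start_token_id == SENTINEL:
        ids.img_start_token_id = tokenizer.convert_tokens_to_ids("<im_start>")
    if ids.patch_start_token_id == SENTINEL:
        ids.patch_start_token_id = tokenizer.convert_tokens_to_ids("<patch_start>")


tok = StubTokenizer({"<im_start>": 101, "<patch_start>": 102})  # made-up ids
ids = TokenIds(patch_start_token_id=7)  # user pinned one id explicitly
resolve(ids, tok)
```

The unset field is looked up from the tokenizer, while the explicitly provided value is left alone.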

_build_train_packed_dataloader() → megatron.bridge.data.vlm_datasets.step37_flickr8k.packed_dataloader.MixedPackedDataloader#

build_datasets(
context: megatron.bridge.training.config.DatasetBuildContext,
) → Tuple[Optional[Any], Optional[Any], Optional[Any]]#

Build train (packed) / valid / test datasets.

Flickr8k has no canonical val/test split here, so we return None for those two and let mbridge skip eval. (Override split=... if you want to repurpose the train split for validation instead.)