bridge.data.vlm_datasets.step37_flickr8k.pack_transform#
pack() transform for Step3.7 multimodal SFT.
Takes a list of MultimodalSFTSample (the output of
Step37Flickr8kDataset.__getitem__) and produces a single packed dict:
- tokens: concat of s.tokens[:-1] for each sample
- labels: concat of s.tokens[1:]
- loss_masks: concat of s.loss_mask[1:]
- cu_seqlens: prefix-sum of sample shifted-NTP lengths
- position_id: per-sub-seq 0..len-1 (via shared helper)
- image_paths: flat concat of all s.image_paths
A zero-padding sample is appended if the total NTP length isn’t a multiple
of seqlen_divisible_by (default 64). The padding sample is included in
cu_seqlens so the padded tail forms its own sub-seq.
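The packing and padding rules above can be sketched in plain Python. This is an illustration only, not the module's implementation: `ToySample` and `pack_sketch` are hypothetical names, plain lists stand in for tensors, and the use of 0 for padded tokens and labels is an assumption based on the "zero-padding sample" wording.

```python
from dataclasses import dataclass, field
from itertools import accumulate


@dataclass
class ToySample:
    """Hypothetical stand-in for MultimodalSFTSample (lists instead of tensors)."""
    tokens: list
    loss_mask: list
    image_paths: list = field(default_factory=list)


def pack_sketch(samples, seqlen_divisible_by=64):
    """Pack samples into one shifted next-token-prediction (NTP) sequence."""
    tokens, labels, loss_masks, image_paths, lengths = [], [], [], [], []
    for s in samples:
        tokens += s.tokens[:-1]        # inputs: drop the last token
        labels += s.tokens[1:]         # targets: drop the first token
        loss_masks += s.loss_mask[1:]  # mask is aligned with the targets
        image_paths += s.image_paths
        lengths.append(len(s.tokens) - 1)  # shifted-NTP length of this sample

    # Append a zero-padding sample if the total is not a multiple of the divisor.
    pad = -sum(lengths) % seqlen_divisible_by
    if pad:
        tokens += [0] * pad      # assumed pad token id
        labels += [0] * pad      # assumed pad label
        loss_masks += [0] * pad  # padding never contributes to the loss
        lengths.append(pad)      # the padded tail is its own sub-seq

    cu_seqlens = [0] + list(accumulate(lengths))
    position_id = [i for n in lengths for i in range(n)]
    return {
        "tokens": tokens,
        "labels": labels,
        "loss_masks": loss_masks,
        "cu_seqlens": cu_seqlens,
        "position_id": position_id,
        "image_paths": image_paths,
    }
```

With two toy samples of 4 and 3 tokens and `seqlen_divisible_by=4`, the shifted lengths are 3 and 2, so three padding positions are appended and `cu_seqlens` becomes `[0, 3, 5, 8]`.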
Module Contents#
Functions#
get_position_id_from_cu_seqlens | Per-sub-seq 0..L-1 position ids.
pack_samples | Pack a list of samples into a single next-token-prediction batch.
API#
bridge.data.vlm_datasets.step37_flickr8k.pack_transform.get_position_id_from_cu_seqlens(
    cu_seqlens: torch.Tensor,
)
Per-sub-seq 0..L-1 position ids.
Given cu_seqlens = [0, 209, 418, …, total], produces a 1-D tensor of length
total, where each sub-seq segment of length L counts 0..L-1.
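As a rough illustration of this behaviour, the sketch below computes the same kind of position ids from a plain Python list. The name `position_ids_sketch` is hypothetical, and the documented helper takes and returns a torch.Tensor rather than a list.

```python
def position_ids_sketch(cu_seqlens):
    """Return per-sub-seq position ids for cumulative boundaries cu_seqlens."""
    position_ids = []
    # Each adjacent pair (start, end) delimits one sub-seq of length end - start.
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        position_ids.extend(range(end - start))  # restart counting at 0
    return position_ids
```

For `cu_seqlens = [0, 209, 418]`, the result has length 418: index 208 holds 208, and index 209 restarts at 0.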
bridge.data.vlm_datasets.step37_flickr8k.pack_transform.pack_samples(
    pieces: list[megatron.bridge.data.vlm_datasets.step37_flickr8k.template.MultimodalSFTSample],
    *,
    seqlen_divisible_by: int = 64,
)
Pack a list of samples into a single next-token-prediction batch.