bridge.data.vlm_datasets.step37_flickr8k.pack_transform#

pack() transform for Step3.7 multimodal SFT.

Takes a list of MultimodalSFTSample (the output of Step37Flickr8kDataset.__getitem__) and produces a single packed dict:

  • tokens : concat of s.tokens[:-1] for each sample

  • labels : concat of s.tokens[1:]

  • loss_masks : concat of s.loss_mask[1:]

  • cu_seqlens : prefix sum, starting at 0, of each sample's shifted next-token-prediction (NTP) length, len(s.tokens) - 1

  • position_id : per-sub-seq 0..len-1 (via shared helper)

  • image_paths : flat concat of all s.image_paths

A zero-padding sample is appended if the total NTP length isn’t a multiple of seqlen_divisible_by (default 64). The padding sample is included in cu_seqlens so the padded tail forms its own sub-seq.
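The packing rules above can be illustrated with a minimal, dependency-free sketch. It is not the module's implementation: `Sample` and `pack_sketch` are hypothetical names, plain lists stand in for tensors, and the choice of 0 for the padded labels and loss mask is an assumption based on the "zero-padding" wording.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical stand-in for MultimodalSFTSample (only the fields used here)."""
    tokens: list[int]
    loss_mask: list[int]
    image_paths: list[str] = field(default_factory=list)


def pack_sketch(samples: list[Sample], seqlen_divisible_by: int = 64) -> dict:
    tokens, labels, loss_masks, position_id, image_paths = [], [], [], [], []
    cu_seqlens = [0]
    for s in samples:
        n = len(s.tokens) - 1           # shifted NTP length of this sample
        tokens += s.tokens[:-1]         # inputs: drop the last token
        labels += s.tokens[1:]          # targets: drop the first token
        loss_masks += s.loss_mask[1:]   # mask stays aligned with the labels
        position_id += range(n)         # positions restart at 0 per sample
        image_paths += s.image_paths
        cu_seqlens.append(cu_seqlens[-1] + n)

    # Pad up to the next multiple; the pad forms its own sub-sequence.
    pad = -cu_seqlens[-1] % seqlen_divisible_by
    if pad:
        tokens += [0] * pad
        labels += [0] * pad             # assumption: zero labels in the pad
        loss_masks += [0] * pad         # assumption: pad never contributes to the loss
        position_id += range(pad)
        cu_seqlens.append(cu_seqlens[-1] + pad)

    return {
        "tokens": tokens,
        "labels": labels,
        "loss_masks": loss_masks,
        "cu_seqlens": cu_seqlens,
        "position_id": position_id,
        "image_paths": image_paths,
    }


packed = pack_sketch(
    [Sample([1, 2, 3, 4], [0, 0, 1, 1], ["a.jpg"]),
     Sample([5, 6, 7], [0, 1, 1], ["b.jpg"])],
    seqlen_divisible_by=4,
)
print(packed["tokens"])      # [1, 2, 3, 5, 6, 0, 0, 0]
print(packed["cu_seqlens"])  # [0, 3, 5, 8]
```

In this example the two samples contribute 3 and 2 shifted positions, so 3 padding positions are added to reach a total of 8, and the final cu_seqlens entry marks the pad as a separate sub-sequence.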

Module Contents#

Functions#

get_position_id_from_cu_seqlens

Per-sub-seq 0..L-1 position ids.

pack_samples

Pack a list of samples into a single next-token-prediction batch.

API#

bridge.data.vlm_datasets.step37_flickr8k.pack_transform.get_position_id_from_cu_seqlens(
cu_seqlens: torch.Tensor,
) → torch.Tensor#

Per-sub-seq 0..L-1 position ids.

Given cu_seqlens = [0, 209, 418, …, total], produces a 1-D tensor of length total where each sub-seq segment counts 0..L-1.
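As a worked example of these semantics, the sketch below mirrors the computation with plain Python lists. The documented function takes and returns a torch.Tensor; the list-based `position_ids_from_cu_seqlens` here is a hypothetical helper that only reproduces the described behaviour, not the library's code.

```python
from __future__ import annotations


def position_ids_from_cu_seqlens(cu_seqlens: list[int]) -> list[int]:
    """Return 0..L-1 for each sub-sequence delimited by consecutive boundaries."""
    position_ids: list[int] = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        position_ids.extend(range(end - start))  # counter restarts at every boundary
    return position_ids


print(position_ids_from_cu_seqlens([0, 3, 5, 8]))  # [0, 1, 2, 0, 1, 0, 1, 2]
```

The output length always equals the last entry of cu_seqlens, so it lines up one-to-one with the packed tokens.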

bridge.data.vlm_datasets.step37_flickr8k.pack_transform.pack_samples(
pieces: list[megatron.bridge.data.vlm_datasets.step37_flickr8k.template.MultimodalSFTSample],
*,
seqlen_divisible_by: int = 64,
) → dict[str, Any]#

Pack a list of samples into a single next-token-prediction batch.