bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils#

Multimodal helpers for the SFT preprocess path.

Re-exports :class:ImageForInsert (defined on the model side) and provides the build_image_for_insert packer used by SFT preprocess plus compute_rope_args for per-image patch cu_seqlens.

Do not “improve” the arithmetic here — it must match the layout the model’s forward pass expects exactly.

Module Contents#

Functions#

_stack_images

build_image_for_insert

Pack multimodal data-transform output into language-model insert payloads.

compute_rope_args

Compute per-image patch cu_seqlens for multimodal RoPE users.

Data#

IMAGE_ITEM_TYPE

Image item type used by multimodal data transforms.

PATCH_ITEM_TYPE

Patch item type used by multimodal data transforms.

RopeArgsFn

API#

bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.IMAGE_ITEM_TYPE#

0

Image item type used by multimodal data transforms.

bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.PATCH_ITEM_TYPE#

1

Patch item type used by multimodal data transforms.

bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.RopeArgsFn#

None

bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils._stack_images(
images: collections.abc.Sequence[torch.Tensor],
*,
dtype: Optional[torch.dtype],
to_cuda: bool,
) torch.Tensor#
bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.build_image_for_insert(
images_and_types: collections.abc.Iterable[tuple[torch.Tensor, int]],
*,
patch_start_id: int,
image_start_id: int,
limit_images: Optional[int] = None,
limit_patches: Optional[int] = None,
rope_args_fn: Optional[bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.RopeArgsFn] = None,
dtype: Optional[torch.dtype] = torch.bfloat16,
to_cuda: bool = True,
) list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]#

Pack multimodal data-transform output into language-model insert payloads.

bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.compute_rope_args(
images: collections.abc.Sequence[torch.Tensor],
patch_size: int,
*,
to_cuda: bool = True,
) tuple[torch.Tensor, int]#

Compute per-image patch cu_seqlens for multimodal RoPE users.