bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils#
Multimodal helpers for the SFT preprocess path.
Re-exports the ImageForInsert class (defined on the model side) and provides
the build_image_for_insert packer used by SFT preprocess, plus
compute_rope_args for per-image patch cu_seqlens.
Do not “improve” the arithmetic here: it must exactly match the layout the model’s forward pass expects.
Module Contents#
Functions#
build_image_for_insert | Pack multimodal data-transform output into language-model insert payloads. |
compute_rope_args | Compute per-image patch cu_seqlens for multimodal RoPE users. |
Data#
IMAGE_ITEM_TYPE | Image item type used by multimodal data transforms. |
PATCH_ITEM_TYPE | Patch item type used by multimodal data transforms. |
API#
- bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.IMAGE_ITEM_TYPE#
0
Image item type used by multimodal data transforms.
- bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.PATCH_ITEM_TYPE#
1
Patch item type used by multimodal data transforms.
- bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.RopeArgsFn#
None
- bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils._stack_images(
- images: collections.abc.Sequence[torch.Tensor],
- *,
- dtype: Optional[torch.dtype],
- to_cuda: bool,
- )
- bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.build_image_for_insert(
- images_and_types: collections.abc.Iterable[tuple[torch.Tensor, int]],
- *,
- patch_start_id: int,
- image_start_id: int,
- limit_images: Optional[int] = None,
- limit_patches: Optional[int] = None,
- rope_args_fn: Optional[bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.RopeArgsFn] = None,
- dtype: Optional[torch.dtype] = torch.bfloat16,
- to_cuda: bool = True,
- )
Pack multimodal data-transform output into language-model insert payloads.
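The input to build_image_for_insert is a stream of (tensor, item type) pairs, where the item type is IMAGE_ITEM_TYPE (0) or PATCH_ITEM_TYPE (1). The following is a minimal sketch of how such a stream could be partitioned before packing. It is not the module's implementation: split_by_item_type is a hypothetical helper, and the handling of limit_images and limit_patches (silently keeping only the first N items) is an assumption, since the reference above does not say whether the real function truncates or raises.

```python
from collections.abc import Iterable
from typing import Optional, TypeVar

T = TypeVar("T")

IMAGE_ITEM_TYPE = 0  # value documented for this module
PATCH_ITEM_TYPE = 1  # value documented for this module


def split_by_item_type(
    items_and_types: Iterable[tuple[T, int]],
    *,
    limit_images: Optional[int] = None,
    limit_patches: Optional[int] = None,
) -> tuple[list[T], list[T]]:
    """Hypothetical helper: partition (item, type) pairs, preserving order.

    Assumption: a limit caps how many items of that kind are kept.
    The real packer may enforce limits differently (for example by raising).
    """
    images: list[T] = []
    patches: list[T] = []
    for item, item_type in items_and_types:
        if item_type == IMAGE_ITEM_TYPE:
            if limit_images is None or len(images) < limit_images:
                images.append(item)
        elif item_type == PATCH_ITEM_TYPE:
            if limit_patches is None or len(patches) < limit_patches:
                patches.append(item)
        else:
            raise ValueError(f"unknown item type: {item_type}")
    return images, patches
```

In the real function the items are torch.Tensor objects; the sketch is generic over the item type so that it runs without torch installed.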
- bridge.data.vlm_datasets.step37_flickr8k.multimodal_utils.compute_rope_args(
- images: collections.abc.Sequence[torch.Tensor],
- patch_size: int,
- *,
- to_cuda: bool = True,
- )
Compute per-image patch cu_seqlens for multimodal RoPE users.
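For readers unfamiliar with the term, cu_seqlens conventionally means cumulative sequence boundaries with a leading zero, so that sequence i occupies the half-open range [cu[i], cu[i+1]). The sketch below illustrates that general convention for per-image patch counts only. It is not the arithmetic of compute_rope_args, whose exact output layout is not documented here: patch_cu_seqlens is a hypothetical name, it takes (height, width) pairs instead of tensors, and it assumes each image is tiled by non-overlapping square patches with both sides divisible by patch_size.

```python
from collections.abc import Sequence
from itertools import accumulate


def patch_cu_seqlens(
    image_hw: Sequence[tuple[int, int]], patch_size: int
) -> list[int]:
    """Hypothetical illustration of cumulative per-image patch boundaries.

    Assumption: non-overlapping square patches; each image side must be
    divisible by patch_size.
    """
    counts = []
    for height, width in image_hw:
        if height % patch_size or width % patch_size:
            raise ValueError(
                f"image {height}x{width} is not divisible by {patch_size}"
            )
        # Number of patches in one image: rows of patches times columns.
        counts.append((height // patch_size) * (width // patch_size))
    # Leading 0 so that image i spans [result[i], result[i + 1]).
    return [0, *accumulate(counts)]


# A 28x28 image yields 4 patches and a 14x42 image yields 3 at patch_size=14.
print(patch_cu_seqlens([(28, 28), (14, 42)], patch_size=14))  # [0, 4, 7]
```

Any consumer of the real function should rely on its actual output, since it must match the layout the model's forward pass expects.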