bridge.data.vlm_datasets.step37_flickr8k.preprocess#

Per-step preprocess for Step3.7 multimodal SFT.

Runs once per micro-batch. Takes the packed dict from pack_samples plus the precomputed image_paths and produces the dict that is fed to the model (tensors already on CUDA, with images = list[ImageForInsert]). The steps are:

  • PIL Image.open + .convert("RGB") (zero-image fallback on error)

  • Image.resize((size, size), BILINEAR) with size = image_size (728 for IMAGE_ITEM_TYPE) or patch_image_size (504 for PATCH_ITEM_TYPE)

  • divide by 255, then normalize per channel: (x - CLIP_mean) / CLIP_std (the CLIP RGB normalization)

  • stack to [N, 3, H, W], cast to bf16, and move to CUDA via build_image_for_insert

  • attach rope_cu_seqlens via compute_rope_args (patch_size = 14)
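As a rough check on the sizes listed above, the following sketch computes the patch grid implied by patch_size = 14 and builds cumulative sequence lengths in the usual cu_seqlens style. It assumes one token per 14×14 patch and no extra tokens. The helper names are hypothetical, and the actual layout returned by compute_rope_args may differ.

```python
from itertools import accumulate

PATCH_SIZE = 14  # encoder patch size quoted in the step list


def patches_per_image(size: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patches in a square size x size image (hypothetical helper)."""
    if size % patch_size != 0:
        raise ValueError(f"{size} is not divisible by {patch_size}")
    side = size // patch_size
    return side * side


def cu_seqlens(sizes: list[int]) -> list[int]:
    """Cumulative patch counts starting at 0 (assumed cu_seqlens convention)."""
    return [0, *accumulate(patches_per_image(s) for s in sizes)]


print(patches_per_image(728))       # 52 * 52 = 2704
print(patches_per_image(504))       # 36 * 36 = 1296
print(cu_seqlens([728, 504, 504]))  # [0, 2704, 4000, 5296]
```

Both documented sizes divide evenly by 14, so no padding is needed in this simplified picture.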

Module Contents#

Functions#

_load_image

Open an RGB PIL.Image; fall back to a 224×224 zero-image on read failure so a single broken jpg does not crash the run.

_image_to_tensor

Resize → /255 → CLIP-normalize → [3, H, W] float32 CPU.

load_images

Load and preprocess images for a packed batch.

preprocess_packed_batch

Build the model input dict from a packed batch.

Data#

API#

bridge.data.vlm_datasets.step37_flickr8k.preprocess._CLIP_MEAN#

(0.48145466, 0.4578275, 0.40821073)

bridge.data.vlm_datasets.step37_flickr8k.preprocess._CLIP_STD#

(0.26862954, 0.26130258, 0.27577711)

bridge.data.vlm_datasets.step37_flickr8k.preprocess.logger#

‘getLogger(…)’

bridge.data.vlm_datasets.step37_flickr8k.preprocess._load_image(path: str) → PIL.Image.Image#

Open an RGB PIL.Image; fall back to a 224×224 zero-image on read failure so a single broken jpg does not crash the run.
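The fallback behavior can be illustrated with a minimal sketch. It is not the module's code: the function name, the broad exception handler, and the log message are assumptions, and only the 224×224 zero-image fallback comes from the description above.

```python
import logging

from PIL import Image

logger = logging.getLogger(__name__)


def load_image_or_blank(path: str) -> Image.Image:
    """Open `path` as RGB; return a black 224x224 image if reading fails."""
    try:
        with Image.open(path) as img:
            # convert() forces the lazy file read and returns a new image,
            # so the result stays valid after the file is closed.
            return img.convert("RGB")
    except Exception as exc:  # broad catch is an assumption of this sketch
        logger.warning("Could not read %s (%s); using a zero image", path, exc)
        # Image.new fills with 0 by default, i.e. an all-black RGB image.
        return Image.new("RGB", (224, 224))
```

Because every image is resized afterwards, the 224×224 placeholder does not need to match the target resolution.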

bridge.data.vlm_datasets.step37_flickr8k.preprocess._image_to_tensor(image: PIL.Image.Image, size: int) → torch.Tensor#

Resize → /255 → CLIP-normalize → [3, H, W] float32 CPU.

Arithmetic order: BILINEAR resize first, then divide by 255, then subtract the mean and divide by the std. The computation runs on a contiguous NumPy float32 array so that the rounding sequence is deterministic.
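The arithmetic order can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the module's implementation: the function name is hypothetical, the input is assumed to be an RGB image, and the final conversion to a torch.Tensor is left out to keep the example dependency-light.

```python
import numpy as np
from PIL import Image

# Constants as documented for _CLIP_MEAN and _CLIP_STD.
CLIP_MEAN = np.array((0.48145466, 0.4578275, 0.40821073), dtype=np.float32)
CLIP_STD = np.array((0.26862954, 0.26130258, 0.27577711), dtype=np.float32)


def image_to_chw(image: Image.Image, size: int) -> np.ndarray:
    """Resize -> /255 -> CLIP-normalize -> [3, H, W] float32 (sketch)."""
    resized = image.resize((size, size), Image.BILINEAR)
    # HWC uint8 -> HWC float32 in [0, 1]
    arr = np.asarray(resized, dtype=np.float32) / 255.0
    # Mean and std broadcast over the trailing channel axis.
    arr = (arr - CLIP_MEAN) / CLIP_STD
    # HWC -> CHW, made contiguous so downstream code sees a dense buffer.
    return np.ascontiguousarray(arr.transpose(2, 0, 1))


white = Image.new("RGB", (10, 10), (255, 255, 255))
out = image_to_chw(white, 4)
print(out.shape, out.dtype)  # (3, 4, 4) float32
```

For a uniformly white input, every pixel in channel c equals (1 - mean[c]) / std[c], which gives a quick sanity check on the normalization.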

bridge.data.vlm_datasets.step37_flickr8k.preprocess.load_images(
image_paths: list[tuple[str, int]],
*,
image_size: int,
patch_image_size: int,
) → list[tuple[torch.Tensor, int]]#

Load and preprocess images for a packed batch.

Returns [(tensor[3, H, W], image_type), ...], with H/W chosen per image_type: full image = image_size (default 728), multicrop patch = patch_image_size (default 504).
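The per-type size selection might look like the sketch below. The numeric values of IMAGE_ITEM_TYPE and PATCH_ITEM_TYPE, the helper name, and the file names are all hypothetical. The sketch also assumes that each image_paths entry is a (path, image_type) pair, which the type annotation suggests but the text does not state.

```python
IMAGE_ITEM_TYPE = 0  # hypothetical value, for illustration only
PATCH_ITEM_TYPE = 1  # hypothetical value, for illustration only


def target_size(image_type: int, *, image_size: int, patch_image_size: int) -> int:
    """Pick the square resize target for one item (hypothetical helper)."""
    if image_type == IMAGE_ITEM_TYPE:
        return image_size
    if image_type == PATCH_ITEM_TYPE:
        return patch_image_size
    raise ValueError(f"unknown image_type: {image_type}")


# Assumed entry layout: (path, image_type)
items = [("a.jpg", IMAGE_ITEM_TYPE), ("a_crop0.jpg", PATCH_ITEM_TYPE)]
sizes = [target_size(t, image_size=728, patch_image_size=504) for _, t in items]
print(sizes)  # [728, 504]
```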

bridge.data.vlm_datasets.step37_flickr8k.preprocess.preprocess_packed_batch(
batch: dict,
*,
img_start_token_id: int,
patch_start_token_id: int,
image_size: int,
patch_image_size: int,
encoder_patch_size: int,
only_pp_first_stage: bool = True,
) → dict[str, Any]#

Build the model input dict from a packed batch.

Moves the packed dict’s tensors to CUDA and, on the first pipeline-parallel stage (PP rank 0), loads the images and builds list[ImageForInsert]. The resulting images list is GPU-resident bf16 and ready for Step37Model.forward.

Set only_pp_first_stage to False when running outside a pipeline-parallel context. Leave it True (the default) to honor PP rank 0 gating, so that images are loaded only on the first stage.
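One plausible reading of the gating, written as a small predicate. This is an assumption about the semantics, not the module's code: the function name and the is_first_stage argument are hypothetical stand-ins for whatever pipeline-parallel query the implementation uses.

```python
def should_load_images(only_pp_first_stage: bool, is_first_stage: bool) -> bool:
    """Return True if this rank should load images (assumed semantics)."""
    # With gating disabled, every caller loads images.
    # With gating enabled, only the first pipeline stage does.
    return (not only_pp_first_stage) or is_first_stage


for gate in (True, False):
    for first in (True, False):
        print(gate, first, should_load_images(gate, first))
```

Under this reading, passing False outside a pipeline-parallel context guarantees that images are always loaded, regardless of what a rank query would report.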