`bridge.data.vlm_datasets.step37_flickr8k.preprocess`#

Per-step preprocess for Step3.7 multimodal SFT.

Runs once per micro-batch. Takes the packed dict from :func:`pack_samples` plus the precomputed `image_paths` and produces the dict fed to the model (already on CUDA, with `images = list[ImageForInsert]`). The steps are:

1. Open with PIL `Image.open` and `.convert("RGB")` (zero-image fallback on error).
2. `Image.resize((size, size), BILINEAR)`, where `size` is `image_size` (728 for `IMAGE_ITEM_TYPE`) or `patch_image_size` (504 for `PATCH_ITEM_TYPE`).
3. Divide by 255, then apply the CLIP RGB normalization: `(tensor - CLIP_mean) / CLIP_std`.
4. Stack to `[N, 3, H, W]`, bf16 on CUDA, via :func:`build_image_for_insert`.
5. Attach `rope_cu_seqlens` via :func:`compute_rope_args` (`patch_size = 14`).
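
The sizes quoted above are both multiples of the encoder patch size, so the per-image patch counts can be worked out by hand. The sketch below does that arithmetic and builds a cumulative-length list from it. The `[0, n0, n0 + n1, ...]` layout is an assumption borrowed from the common `cu_seqlens` convention; the actual output of `compute_rope_args` is not documented on this page and may differ. The helper names are hypothetical.

```python
from itertools import accumulate

PATCH_SIZE = 14  # encoder patch size stated in the module docstring


def patches_per_image(side: int, patch_size: int = PATCH_SIZE) -> int:
    """Count non-overlapping patch_size x patch_size patches in a square image."""
    if side % patch_size != 0:
        raise ValueError(f"side {side} is not a multiple of patch size {patch_size}")
    grid = side // patch_size
    return grid * grid


def cumulative_seqlens(sides: list[int]) -> list[int]:
    """Assumed layout: a leading 0 followed by running totals of patch counts."""
    counts = [patches_per_image(side) for side in sides]
    return [0, *accumulate(counts)]


# One full image (728) followed by two multicrop patches (504 each).
print(patches_per_image(728))               # 52 * 52 = 2704
print(patches_per_image(504))               # 36 * 36 = 1296
print(cumulative_seqlens([728, 504, 504]))  # [0, 2704, 4000, 5296]
```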

Module Contents#

Functions#

`_load_image`	Open an RGB `PIL.Image`; fall back to a 224×224 zero-image on read failure so a single broken jpg does not crash the run.
`_image_to_tensor`	Resize → `/255` → CLIP-normalize → `[3, H, W]` float32 CPU.
`load_images`	Load and preprocess images for a packed batch.
`preprocess_packed_batch`	Build the model input dict from a packed batch.

Data#

`_CLIP_MEAN`
`_CLIP_STD`
`logger`

API#

bridge.data.vlm_datasets.step37_flickr8k.preprocess._CLIP_MEAN#: (0.48145466, 0.4578275, 0.40821073)

bridge.data.vlm_datasets.step37_flickr8k.preprocess._CLIP_STD#: (0.26862954, 0.26130258, 0.27577711)

bridge.data.vlm_datasets.step37_flickr8k.preprocess.logger#: ‘getLogger(…)’

bridge.data.vlm_datasets.step37_flickr8k.preprocess._load_image(path: str) → PIL.Image.Image#

Open an RGB PIL.Image; fall back to a 224×224 zero-image on read failure so a single broken jpg does not crash the run.
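
A minimal sketch of the fallback behavior described above, using only Pillow. The function name `load_image_or_zeros`, the choice of caught exceptions, and the warning message are assumptions for illustration; the real `_load_image` may catch a different set of errors or log differently.

```python
import logging

from PIL import Image

logger = logging.getLogger(__name__)


def load_image_or_zeros(path: str, fallback_size: int = 224) -> Image.Image:
    """Open `path` as RGB; return an all-black square image if reading fails."""
    try:
        with Image.open(path) as img:
            # convert() forces the pixel data to load and returns a new image,
            # so the result stays valid after the file handle is closed.
            return img.convert("RGB")
    except (OSError, ValueError) as exc:  # assumed set of read failures
        logger.warning("Could not read %s (%s); using a zero image", path, exc)
        # Image.new fills with 0 by default, i.e. black for RGB.
        return Image.new("RGB", (fallback_size, fallback_size))
```

Returning a placeholder keeps the batch shape intact, at the cost of silently training on a black image, so the warning is worth monitoring.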

bridge.data.vlm_datasets.step37_flickr8k.preprocess._image_to_tensor(image: PIL.Image.Image, size: int) → torch.Tensor#

Resize → /255 → CLIP-normalize → [3, H, W] float32 CPU.

Arithmetic order: BILINEAR resize, then divide by 255 before mean/std. Operates on a contiguous numpy float32 array for a deterministic rounding sequence.
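
The arithmetic order above can be sketched with Pillow and NumPy as follows. This is an illustration under assumptions, not the module's code: the name `image_to_chw_array` is hypothetical, and the sketch stops at the NumPy array, whereas the documented function returns a `torch.Tensor` (which `torch.from_numpy` could produce from this array).

```python
import numpy as np
from PIL import Image

# Values copied from _CLIP_MEAN / _CLIP_STD documented on this page.
CLIP_MEAN = np.array((0.48145466, 0.4578275, 0.40821073), dtype=np.float32)
CLIP_STD = np.array((0.26862954, 0.26130258, 0.27577711), dtype=np.float32)


def image_to_chw_array(image: Image.Image, size: int) -> np.ndarray:
    """Resize, scale to [0, 1], CLIP-normalize, and return a [3, H, W] float32 array."""
    resized = image.resize((size, size), Image.BILINEAR)
    arr = np.asarray(resized, dtype=np.float32) / 255.0  # [H, W, 3] in [0, 1]
    arr = (arr - CLIP_MEAN) / CLIP_STD                   # broadcasts over the channel axis
    # Move channels first and make the memory layout contiguous.
    return np.ascontiguousarray(arr.transpose(2, 0, 1))


black = Image.new("RGB", (10, 10))
out = image_to_chw_array(black, size=4)
print(out.shape, out.dtype)  # (3, 4, 4) float32
```

For an all-black input every pixel in channel `c` equals `-CLIP_MEAN[c] / CLIP_STD[c]`, which gives a quick sanity check on the normalization order.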

bridge.data.vlm_datasets.step37_flickr8k.preprocess.load_images( image_paths: list[tuple[str, int]], *, image_size: int, patch_image_size: int, ) → list[tuple[torch.Tensor, int]]#

Load and preprocess images for a packed batch.

Returns [(tensor[3, H, W], image_type), ...], with H/W chosen per image_type: full image = image_size (default 728), multicrop patch = patch_image_size (default 504).
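
The per-type size selection can be sketched without touching any files. The numeric values of `IMAGE_ITEM_TYPE` and `PATCH_ITEM_TYPE` below are placeholders (the real constants are defined elsewhere and are not shown on this page), and `target_size` and `plan_shapes` are hypothetical helpers.

```python
# Placeholder type codes; the real values are not documented here.
IMAGE_ITEM_TYPE = 0
PATCH_ITEM_TYPE = 1


def target_size(image_type: int, *, image_size: int, patch_image_size: int) -> int:
    """Pick the square side length for one image based on its type."""
    if image_type == IMAGE_ITEM_TYPE:
        return image_size
    if image_type == PATCH_ITEM_TYPE:
        return patch_image_size
    raise ValueError(f"unknown image_type: {image_type}")


def plan_shapes(
    image_paths: list[tuple[str, int]], *, image_size: int, patch_image_size: int
) -> list[tuple[tuple[int, int, int], int]]:
    """Return [((3, H, W), image_type), ...]: the shapes load_images would yield."""
    plan = []
    for _path, image_type in image_paths:
        side = target_size(
            image_type, image_size=image_size, patch_image_size=patch_image_size
        )
        plan.append(((3, side, side), image_type))
    return plan


paths = [("a.jpg", IMAGE_ITEM_TYPE), ("a_crop0.jpg", PATCH_ITEM_TYPE)]
print(plan_shapes(paths, image_size=728, patch_image_size=504))
# [((3, 728, 728), 0), ((3, 504, 504), 1)]
```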

bridge.data.vlm_datasets.step37_flickr8k.preprocess.preprocess_packed_batch( batch: dict, *, img_start_token_id: int, patch_start_token_id: int, image_size: int, patch_image_size: int, encoder_patch_size: int, only_pp_first_stage: bool = True, ) → dict[str, Any]#

Build the model input dict from a packed batch.

Moves the packed dict’s tensors to CUDA and, on PP rank 0, loads the images and builds `list[ImageForInsert]`. The images list is GPU-resident bf16 and ready for `Step37Model.forward`.

Set `only_pp_first_stage` to False when running outside a pipeline-parallel context; leave it True (the default) to load images only on PP rank 0.
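
The gating rule can be read as a single predicate. The sketch below is an interpretation of the description above, not the module's code: `should_load_images` is a hypothetical helper, and `is_first_pp_stage` stands in for whatever pipeline-rank query the real implementation uses.

```python
def should_load_images(only_pp_first_stage: bool, is_first_pp_stage: bool) -> bool:
    """Load images everywhere when gating is off, else only on the first PP stage."""
    return (not only_pp_first_stage) or is_first_pp_stage


# Gating on (default): only the first pipeline stage pays the image-loading cost.
print(should_load_images(True, is_first_pp_stage=True))    # True
print(should_load_images(True, is_first_pp_stage=False))   # False
# Gating off (no pipeline parallelism): always load.
print(should_load_images(False, is_first_pp_stage=False))  # True
```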
