bridge.data.vlm_datasets.step37_flickr8k.preprocess#
Per-step preprocess for Step3.7 multimodal SFT.
Runs once per micro-batch. Takes the packed dict from :func:`pack_samples`
plus the precomputed image_paths and produces the dict that is fed to the
model (already on CUDA, with images = list[ImageForInsert]):
- PIL Image.open + .convert("RGB") (zero-image fallback on error)
- Image.resize((size, size), BILINEAR) with size = image_size (728 for IMAGE_ITEM_TYPE) or patch_image_size (504 for PATCH_ITEM_TYPE)
- /255 → tensor, then (tensor - CLIP_mean) / CLIP_std (the CLIP RGB normalization)
- stack to [N, 3, H, W] bf16 / cuda via :func:`build_image_for_insert`
- attach rope_cu_seqlens via :func:`compute_rope_args` (patch_size = 14)
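As a rough illustration of the sizes involved in the last step: with patch_size = 14, a 728-pixel side spans 52 patches and a 504-pixel side spans 36. The sketch below assumes, without confirmation from this page, that each image contributes one position per patch and that rope_cu_seqlens follows the common cumulative-sequence-length convention (prefix sums starting at 0). The real compute_rope_args may differ, and the helper names here are hypothetical.

```python
from itertools import accumulate

PATCH_SIZE = 14  # encoder patch size stated in the module docstring


def patches_per_image(side: int, patch_size: int = PATCH_SIZE) -> int:
    """Patch positions in a square image (assumption: one position per patch)."""
    if side % patch_size != 0:
        raise ValueError(f"side {side} is not a multiple of patch size {patch_size}")
    per_side = side // patch_size
    return per_side * per_side


def cumulative_seqlens(sides: list[int]) -> list[int]:
    """Prefix sums of per-image patch counts, starting at 0 (assumed layout)."""
    counts = [patches_per_image(s) for s in sides]
    return [0, *accumulate(counts)]


# One full image (728) followed by two multicrop patches (504 each).
boundaries = cumulative_seqlens([728, 504, 504])
```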
Module Contents#
Functions#
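_load_image | Open an RGB PIL.Image; fall back to a 224×224 zero-image on read failure so a single broken jpg does not crash the run.
_image_to_tensor | Resize → /255 → CLIP-normalize → [3, H, W] float32 CPU.
load_images | Load and preprocess images for a packed batch.
preprocess_packed_batch | Build the model input dict from a packed batch.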
Data#
API#
- bridge.data.vlm_datasets.step37_flickr8k.preprocess._CLIP_MEAN#
(0.48145466, 0.4578275, 0.40821073)
- bridge.data.vlm_datasets.step37_flickr8k.preprocess._CLIP_STD#
(0.26862954, 0.26130258, 0.27577711)
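To make the effect of these constants concrete, the following minimal sketch normalizes a single 8-bit RGB pixel with plain Python arithmetic. The function name normalize_pixel is hypothetical and not part of this module. One consequence worth noting is that an all-zero pixel does not map to 0 after normalization but to -mean/std in each channel.

```python
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB triple to [0, 1], then apply per-channel mean/std."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


black = normalize_pixel((0, 0, 0))        # lower bound of each channel
white = normalize_pixel((255, 255, 255))  # upper bound of each channel
```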
- bridge.data.vlm_datasets.step37_flickr8k.preprocess.logger#
‘getLogger(…)’
- bridge.data.vlm_datasets.step37_flickr8k.preprocess._load_image(path: str) → PIL.Image.Image#
Open an RGB PIL.Image; fall back to a 224×224 zero-image on read failure so a single broken jpg does not crash the run.
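A minimal sketch of this fallback behavior, assuming Pillow is installed. The name load_image_or_zeros, the set of caught exceptions, and the warning message are illustrative choices, not the module's actual implementation.

```python
import logging

from PIL import Image

logger = logging.getLogger(__name__)

FALLBACK_SIZE = (224, 224)  # zero-image size stated in the docstring


def load_image_or_zeros(path: str) -> Image.Image:
    """Return the image at `path` as RGB, or a black image if it cannot be read."""
    try:
        with Image.open(path) as img:
            # convert() forces the pixel data to load before the file is closed.
            return img.convert("RGB")
    except (OSError, ValueError) as exc:
        # Missing, truncated, and unidentifiable files all raise OSError subclasses.
        logger.warning("Failed to read %s (%s); using a zero image", path, exc)
        return Image.new("RGB", FALLBACK_SIZE)  # default fill color is black
```

Because the fallback is always 224×224, it still passes through the same resize step as a real image afterwards.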
- bridge.data.vlm_datasets.step37_flickr8k.preprocess._image_to_tensor(image: PIL.Image.Image, size: int) → torch.Tensor#
Resize → /255 → CLIP-normalize → [3, H, W] float32 CPU.
Arithmetic order: BILINEAR resize, then divide by 255 before mean/std. Operates on a contiguous numpy float32 array for a deterministic rounding sequence.
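The arithmetic order described above can be sketched with Pillow and NumPy as follows. This is an illustration under assumptions: the function name image_to_chw_array is hypothetical, and the final conversion to a torch.Tensor is left out so the example stays dependency-light.

```python
import numpy as np
from PIL import Image

CLIP_MEAN = np.array((0.48145466, 0.4578275, 0.40821073), dtype=np.float32)
CLIP_STD = np.array((0.26862954, 0.26130258, 0.27577711), dtype=np.float32)


def image_to_chw_array(image: Image.Image, size: int) -> np.ndarray:
    """Resize, scale to [0, 1], CLIP-normalize, and return a [3, H, W] float32 array."""
    # 1. Bilinear resize to a square of side `size`.
    resized = image.convert("RGB").resize((size, size), Image.BILINEAR)
    # 2. Copy into a float32 array in HWC layout, then divide by 255.
    arr = np.array(resized, dtype=np.float32) / 255.0
    # 3. Per-channel mean/std; the (3,) arrays broadcast over the last axis.
    arr = (arr - CLIP_MEAN) / CLIP_STD
    # 4. HWC -> CHW, made contiguous in memory.
    return np.ascontiguousarray(arr.transpose(2, 0, 1))


white = Image.new("RGB", (8, 8), (255, 255, 255))
out = image_to_chw_array(white, 4)
```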
- bridge.data.vlm_datasets.step37_flickr8k.preprocess.load_images(
- image_paths: list[tuple[str, int]],
- *,
- image_size: int,
- patch_image_size: int,
Load and preprocess images for a packed batch.
Returns [(tensor[3, H, W], image_type), ...], with H/W chosen per image_type: full image = image_size (default 728), multicrop patch = patch_image_size (default 504).
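The per-type size selection can be sketched in plain Python. The numeric values of IMAGE_ITEM_TYPE and PATCH_ITEM_TYPE below are placeholders (this page does not state them), and target_side and plan_resizes are hypothetical helpers that only compute the target side for each entry rather than loading any pixels.

```python
# Placeholder values; the real constants are defined elsewhere in the package.
IMAGE_ITEM_TYPE = 0
PATCH_ITEM_TYPE = 1


def target_side(image_type: int, *, image_size: int, patch_image_size: int) -> int:
    """Pick the square side length for one image based on its type."""
    if image_type == IMAGE_ITEM_TYPE:
        return image_size
    if image_type == PATCH_ITEM_TYPE:
        return patch_image_size
    raise ValueError(f"unknown image type: {image_type}")


def plan_resizes(image_paths, *, image_size: int, patch_image_size: int):
    """Map [(path, image_type), ...] to [(path, side, image_type), ...]."""
    return [
        (
            path,
            target_side(
                image_type,
                image_size=image_size,
                patch_image_size=patch_image_size,
            ),
            image_type,
        )
        for path, image_type in image_paths
    ]


plan = plan_resizes(
    [("a.jpg", IMAGE_ITEM_TYPE), ("a_crop0.jpg", PATCH_ITEM_TYPE)],
    image_size=728,
    patch_image_size=504,
)
```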
- bridge.data.vlm_datasets.step37_flickr8k.preprocess.preprocess_packed_batch(
- batch: dict,
- *,
- img_start_token_id: int,
- patch_start_token_id: int,
- image_size: int,
- patch_image_size: int,
- encoder_patch_size: int,
- only_pp_first_stage: bool = True,
Build the model input dict from a packed batch.
Moves the packed dict’s tensors to CUDA and (on PP rank 0) loads images + builds list[ImageForInsert]. The images list is GPU-resident bf16 and ready for Step37Model.forward. only_pp_first_stage should be False when running outside a pipeline-parallel context, or True to honor PP rank 0 gating.
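The gating rule implied by only_pp_first_stage can be written as a small predicate. This is a sketch of the decision logic only: how the real function determines whether it is on the first pipeline stage is not shown on this page, so that fact is passed in explicitly here as the hypothetical argument is_first_pp_stage.

```python
def should_build_images(*, only_pp_first_stage: bool, is_first_pp_stage: bool) -> bool:
    """Decide whether this rank should load images and build the images list.

    With gating disabled (only_pp_first_stage=False), every caller builds images.
    With gating enabled, only the first pipeline stage does.
    """
    return (not only_pp_first_stage) or is_first_pp_stage


# Outside a pipeline-parallel context: gating off, images are always built.
no_pp = should_build_images(only_pp_first_stage=False, is_first_pp_stage=False)

# Inside pipeline parallelism with gating on: later stages skip image loading.
later_stage = should_build_images(only_pp_first_stage=True, is_first_pp_stage=False)
```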