`bridge.training.utils.visual_inputs`

Module Contents

Classes

Qwen2_5_VLVisualInputs

Container for Qwen2/Qwen2.5-VL visual modality tensors.

API

class bridge.training.utils.visual_inputs.Qwen2_5_VLVisualInputs

Container for Qwen2/Qwen2.5-VL visual modality tensors.

Fields mirror the processor outputs for Qwen2/Qwen2.5-VL. Shapes may be normalized for model consumption via normalized_for_model().

pixel_values: Optional[torch.Tensor] = None

image_grid_thw: Optional[torch.Tensor] = None

as_model_kwargs() → dict[str, torch.Tensor]

Return a mapping of the non-None fields, suitable for passing as keyword arguments to the model's forward call.

normalized_for_model() → dict[str, torch.Tensor]

Return the non-None fields with shapes normalized to what the model expects:

pixel_values: [B, N, C, H, W] -> [B*N, C, H, W]
image_grid_thw: [B, N, 3] -> [B*N, 3]
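
The two methods above can be illustrated with a minimal sketch. `VisualInputsSketch` is a hypothetical stand-in written for this example, not the library class, and its method bodies are assumptions inferred only from the descriptions and shape mappings listed here. In particular, the rank checks (`dim() == 5` and `dim() == 3`) are a guess at how already-flattened inputs might be left untouched; the actual implementation may differ.

```python
from dataclasses import dataclass, fields
from typing import Optional

import torch


@dataclass
class VisualInputsSketch:
    """Hypothetical stand-in for Qwen2_5_VLVisualInputs (not the library class)."""

    pixel_values: Optional[torch.Tensor] = None
    image_grid_thw: Optional[torch.Tensor] = None

    def as_model_kwargs(self) -> dict[str, torch.Tensor]:
        # Keep only the fields that were actually populated.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

    def normalized_for_model(self) -> dict[str, torch.Tensor]:
        kwargs = self.as_model_kwargs()

        pixel_values = kwargs.get("pixel_values")
        # Assumption: only 5-D input needs its two leading dims merged.
        if pixel_values is not None and pixel_values.dim() == 5:
            # [B, N, C, H, W] -> [B*N, C, H, W]
            kwargs["pixel_values"] = pixel_values.flatten(0, 1)

        grid = kwargs.get("image_grid_thw")
        # Assumption: only 3-D input needs its two leading dims merged.
        if grid is not None and grid.dim() == 3:
            # [B, N, 3] -> [B*N, 3]
            kwargs["image_grid_thw"] = grid.flatten(0, 1)

        return kwargs


# Example shapes chosen for illustration: B=2, N=3, C=3, H=W=28.
batch = VisualInputsSketch(
    pixel_values=torch.zeros(2, 3, 3, 28, 28),
    image_grid_thw=torch.ones(2, 3, 3, dtype=torch.long),
)
normalized = batch.normalized_for_model()
print(tuple(normalized["pixel_values"].shape))
print(tuple(normalized["image_grid_thw"].shape))
```

With the real class, the returned dictionary would presumably be unpacked into the model call (for example `model(**inputs.normalized_for_model())`), which is why None fields are dropped rather than passed through.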