bridge.training.utils.visual_inputs#
Module Contents#
Classes#
- GenericVisualInputs: Container for visual modality tensors produced by HF processors.
- Qwen2_5_VLVisualInputs: Container for Qwen2/Qwen2.5-VL visual modality tensors.
- Qwen2AudioInputs: Container for Qwen2-Audio modality tensors.
API#
- class bridge.training.utils.visual_inputs.GenericVisualInputs#
Container for visual modality tensors produced by HF processors.
Works with any HF-encoder VLM (Gemma3-VL, Ministral3, GLM-4.5V, etc.). Compatible with
vlm_step.py iteration over __dict__ and the .normalized_for_model() call.
- pixel_values: Optional[torch.Tensor]#
None
- pixel_values_videos: Optional[torch.Tensor]#
None
- image_grid_thw: Optional[torch.Tensor]#
None
- video_grid_thw: Optional[torch.Tensor]#
None
- image_sizes: Optional[torch.Tensor]#
None
- mm_token_type_ids: Optional[torch.Tensor]#
None
- as_model_kwargs() dict[str, torch.Tensor]#
Return a mapping of non-None fields suitable for model forward kwargs.
- normalized_for_model() dict[str, torch.Tensor]#
Return non-None fields — no shape normalization needed for generic encoders.
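The non-None filtering that as_model_kwargs() describes can be sketched with a plain dataclass. This is an illustrative stand-in, not the actual implementation; the class name and the string placeholder for the tensor are assumptions:

```python
from dataclasses import dataclass, fields
from typing import Any, Optional

@dataclass
class GenericVisualInputsSketch:
    # Field names follow the documented container; types relaxed for the sketch.
    pixel_values: Optional[Any] = None
    pixel_values_videos: Optional[Any] = None
    image_grid_thw: Optional[Any] = None

    def as_model_kwargs(self) -> dict:
        # Keep only the fields the processor actually populated,
        # so None values never reach the model's forward().
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

inputs = GenericVisualInputsSketch(pixel_values="<tensor>")
print(inputs.as_model_kwargs())  # {'pixel_values': '<tensor>'}
```

Filtering at the container boundary keeps the model-forward call site free of per-field None checks.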
- class bridge.training.utils.visual_inputs.Qwen2_5_VLVisualInputs#
Container for Qwen2/Qwen2.5-VL visual modality tensors.
Fields mirror the processor outputs for Qwen2/Qwen2.5-VL. Shapes may be normalized for model consumption via normalized_for_model().
- pixel_values: Optional[torch.Tensor]#
None
- image_grid_thw: Optional[torch.Tensor]#
None
- as_model_kwargs() dict[str, torch.Tensor]#
Return a mapping of non-None fields suitable for model forward kwargs.
- normalized_for_model() dict[str, torch.Tensor]#
Return non-None fields with shapes normalized for model expectations.
pixel_values: [B, N, C, H, W] -> [B*N, C, H, W]
image_grid_thw: [B, N, 3] -> [B*N, 3]
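The shape normalization above amounts to folding the per-sample image axis into the batch axis. A minimal torch sketch, with illustrative dimensions (B=2 samples, N=3 images each):

```python
import torch

# Hypothetical processor outputs for a batch of 2 samples with 3 images each.
pixel_values = torch.zeros(2, 3, 3, 8, 8)    # [B, N, C, H, W]
image_grid_thw = torch.ones(2, 3, 3).long()  # [B, N, 3]

# Fold the per-sample image axis N into the batch axis.
pixel_values_flat = pixel_values.flatten(0, 1)      # [B*N, C, H, W]
image_grid_thw_flat = image_grid_thw.flatten(0, 1)  # [B*N, 3]

print(pixel_values_flat.shape)    # torch.Size([6, 3, 8, 8])
print(image_grid_thw_flat.shape)  # torch.Size([6, 3])
```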
- class bridge.training.utils.visual_inputs.Qwen2AudioInputs#
Container for Qwen2-Audio modality tensors.
Fields mirror the processor outputs for Qwen2-Audio. The model expects
input_features (mel spectrograms) and feature_attention_mask.
- input_features: Optional[torch.Tensor]#
None
- feature_attention_mask: Optional[torch.Tensor]#
None
- as_model_kwargs() dict[str, torch.Tensor]#
Return a mapping of non-None fields suitable for model forward kwargs.
- normalized_for_model() dict[str, torch.Tensor]#
Return non-None fields (no shape normalization needed for audio).
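For audio, normalized_for_model() reduces to the same non-None filter, with no reshape. A sketch under the same assumptions (the class name and fake_forward are illustrative stand-ins, not the real implementation or model):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Qwen2AudioInputsSketch:
    input_features: Optional[Any] = None         # mel spectrograms
    feature_attention_mask: Optional[Any] = None

    def normalized_for_model(self) -> dict:
        # No shape normalization needed for audio: just drop unset fields.
        return {k: v for k, v in self.__dict__.items() if v is not None}

def fake_forward(**kwargs):
    # Stand-in for a model forward call; reports which kwargs arrived.
    return sorted(kwargs)

audio = Qwen2AudioInputsSketch(input_features=[[0.1, 0.2]])
print(fake_forward(**audio.normalized_for_model()))  # ['input_features']
```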