bridge.training.utils.visual_inputs#

Module Contents#

Classes#

GenericVisualInputs

Container for visual modality tensors produced by HF processors.

Qwen2_5_VLVisualInputs

Container for Qwen2/Qwen2.5-VL visual modality tensors.

Qwen2AudioInputs

Container for Qwen2-Audio modality tensors.

API#

class bridge.training.utils.visual_inputs.GenericVisualInputs#

Container for visual modality tensors produced by HF processors.

Works with any HF-encoder VLM (Gemma3-VL, Ministral3, GLM-4.5V, etc.). Compatible with vlm_step.py, which iterates over __dict__ and calls .normalized_for_model().

pixel_values: Optional[torch.Tensor]#

None

pixel_values_videos: Optional[torch.Tensor]#

None

image_grid_thw: Optional[torch.Tensor]#

None

video_grid_thw: Optional[torch.Tensor]#

None

image_sizes: Optional[torch.Tensor]#

None

mm_token_type_ids: Optional[torch.Tensor]#

None

as_model_kwargs() dict[str, torch.Tensor]#

Return a mapping of non-None fields suitable for model forward kwargs.

normalized_for_model() dict[str, torch.Tensor]#

Return non-None fields (no shape normalization needed for generic encoders).
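To illustrate the container pattern described above, here is a minimal self-contained sketch of how such a dataclass might behave. The class name GenericVisualInputsSketch and the dataclass-based implementation are assumptions for illustration; only the field and method names come from the documentation above.

```python
from dataclasses import dataclass, fields
from typing import Optional

import torch


@dataclass
class GenericVisualInputsSketch:
    # Field names mirror the documented attributes; all default to None.
    pixel_values: Optional[torch.Tensor] = None
    pixel_values_videos: Optional[torch.Tensor] = None
    image_grid_thw: Optional[torch.Tensor] = None
    video_grid_thw: Optional[torch.Tensor] = None
    image_sizes: Optional[torch.Tensor] = None
    mm_token_type_ids: Optional[torch.Tensor] = None

    def as_model_kwargs(self) -> dict:
        # Keep only the fields the processor actually populated.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

    def normalized_for_model(self) -> dict:
        # Generic encoders need no shape normalization.
        return self.as_model_kwargs()


inputs = GenericVisualInputsSketch(pixel_values=torch.randn(2, 3, 8, 8))
kwargs = inputs.as_model_kwargs()
print(sorted(kwargs))  # → ['pixel_values']
```

Dropping None fields before the forward call matters because HF model forwards typically branch on the presence of a kwarg, not on its value.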

class bridge.training.utils.visual_inputs.Qwen2_5_VLVisualInputs#

Container for Qwen2/Qwen2.5-VL visual modality tensors.

Fields mirror the processor outputs for Qwen2/Qwen2.5-VL. Shapes may be normalized for model consumption via normalized_for_model().

pixel_values: Optional[torch.Tensor]#

None

image_grid_thw: Optional[torch.Tensor]#

None

as_model_kwargs() dict[str, torch.Tensor]#

Return a mapping of non-None fields suitable for model forward kwargs.

normalized_for_model() dict[str, torch.Tensor]#

Return non-None fields with shapes normalized for model expectations.

  • pixel_values: [B, N, C, H, W] -> [B*N, C, H, W]

  • image_grid_thw: [B, N, 3] -> [B*N, 3]
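The normalization above amounts to collapsing the leading batch and per-sample image dimensions into one. A sketch with toy tensor sizes (the spatial dimensions here are illustrative, not the model's real resolution):

```python
import torch

B, N = 2, 4  # batch size, images per sample
pixel_values = torch.randn(B, N, 3, 8, 8)          # [B, N, C, H, W]
image_grid_thw = torch.ones(B, N, 3, dtype=torch.long)  # [B, N, 3]

# Collapse the first two dimensions, as normalized_for_model() describes.
flat_pixels = pixel_values.flatten(0, 1)   # [B*N, C, H, W]
flat_grid = image_grid_thw.flatten(0, 1)   # [B*N, 3]

print(tuple(flat_pixels.shape))  # → (8, 3, 8, 8)
print(tuple(flat_grid.shape))    # → (8, 3)
```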

class bridge.training.utils.visual_inputs.Qwen2AudioInputs#

Container for Qwen2-Audio modality tensors.

Fields mirror the processor outputs for Qwen2-Audio. The model expects input_features (mel spectrograms) and feature_attention_mask.

input_features: Optional[torch.Tensor]#

None

feature_attention_mask: Optional[torch.Tensor]#

None

as_model_kwargs() dict[str, torch.Tensor]#

Return a mapping of non-None fields suitable for model forward kwargs.

normalized_for_model() dict[str, torch.Tensor]#

Return non-None fields (no shape normalization needed for audio).
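A minimal sketch of using this container's output, assuming the fields are built from processor output. The tensor shapes are illustrative (Whisper-style mel features of shape [B, n_mels, frames]); the actual sizes depend on the Qwen2-Audio processor configuration.

```python
import torch

# Hypothetical shapes for a single one-second clip; real values vary.
input_features = torch.randn(1, 16, 100)                # mel spectrograms
feature_attention_mask = torch.ones(1, 100, dtype=torch.long)

# as_model_kwargs() / normalized_for_model() both reduce to filtering
# out None fields, since audio needs no shape normalization.
raw_fields = {
    "input_features": input_features,
    "feature_attention_mask": feature_attention_mask,
}
model_kwargs = {k: v for k, v in raw_fields.items() if v is not None}
print(sorted(model_kwargs))  # → ['feature_attention_mask', 'input_features']

# Typical downstream use: model(**model_kwargs, input_ids=..., ...)
```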