bridge.training.utils.visual_inputs#

Module Contents#

Classes#

GenericVisualInputs

Container for visual modality tensors produced by HF processors.

Qwen2_5_VLVisualInputs

Container for Qwen2/Qwen2.5-VL visual modality tensors.

Qwen2AudioInputs

Container for Qwen2-Audio modality tensors.

API#

class bridge.training.utils.visual_inputs.GenericVisualInputs#

Container for visual modality tensors produced by HF processors.

Works with any HF-encoder VLM (Gemma3-VL, Ministral3, GLM-4.5V, etc.). Compatible with vlm_step.py, which iterates over __dict__ and calls .normalized_for_model().

pixel_values: Optional[torch.Tensor]#

None

pixel_values_videos: Optional[torch.Tensor]#

None

image_grid_thw: Optional[torch.Tensor]#

None

video_grid_thw: Optional[torch.Tensor]#

None

image_sizes: Optional[torch.Tensor]#

None

mm_token_type_ids: Optional[torch.Tensor]#

None

as_model_kwargs() dict[str, torch.Tensor]#

Return a mapping of non-None fields suitable for model forward kwargs.

normalized_for_model() dict[str, torch.Tensor]#

Return non-None fields (no shape normalization needed for generic encoders).
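To illustrate the container pattern described above, here is a minimal self-contained sketch of how such a dataclass might behave. The class name GenericVisualInputsSketch and the dataclass-based implementation are assumptions for illustration; only the field and method names come from the documentation above.

```python
from dataclasses import dataclass, fields
from typing import Optional

import torch


@dataclass
class GenericVisualInputsSketch:
    # Field names mirror the documented attributes; all default to None.
    pixel_values: Optional[torch.Tensor] = None
    pixel_values_videos: Optional[torch.Tensor] = None
    image_grid_thw: Optional[torch.Tensor] = None
    video_grid_thw: Optional[torch.Tensor] = None
    image_sizes: Optional[torch.Tensor] = None
    mm_token_type_ids: Optional[torch.Tensor] = None

    def as_model_kwargs(self) -> dict:
        # Keep only the fields the processor actually populated.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

    def normalized_for_model(self) -> dict:
        # Generic encoders need no shape normalization.
        return self.as_model_kwargs()


inputs = GenericVisualInputsSketch(pixel_values=torch.randn(2, 3, 8, 8))
kwargs = inputs.as_model_kwargs()
print(sorted(kwargs))  # → ['pixel_values']
```

Dropping None fields before the forward call matters because HF model forwards typically branch on the presence of a kwarg, not on its value.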

class bridge.training.utils.visual_inputs.Qwen2_5_VLVisualInputs#

Container for Qwen2/Qwen2.5-VL visual modality tensors.

Fields mirror the processor outputs for Qwen2/Qwen2.5-VL. Shapes may be normalized for model consumption via normalized_for_model().

pixel_values: Optional[torch.Tensor]#

None

image_grid_thw: Optional[torch.Tensor]#

None

as_model_kwargs() dict[str, torch.Tensor]#

Return a mapping of non-None fields suitable for model forward kwargs.

normalized_for_model() dict[str, torch.Tensor]#

Return non-None fields with shapes normalized for model expectations.

  • pixel_values: [B, N, C, H, W] -> [B*N, C, H, W]

  • image_grid_thw: [B, N, 3] -> [B*N, 3]
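The normalization above amounts to collapsing the leading batch and per-sample image dimensions into one. A sketch with toy tensor sizes (the spatial dimensions here are illustrative, not the model's real resolution):

```python
import torch

B, N = 2, 4  # batch size, images per sample
pixel_values = torch.randn(B, N, 3, 8, 8)          # [B, N, C, H, W]
image_grid_thw = torch.ones(B, N, 3, dtype=torch.long)  # [B, N, 3]

# Collapse the first two dimensions, as normalized_for_model() describes.
flat_pixels = pixel_values.flatten(0, 1)   # [B*N, C, H, W]
flat_grid = image_grid_thw.flatten(0, 1)   # [B*N, 3]

print(tuple(flat_pixels.shape))  # → (8, 3, 8, 8)
print(tuple(flat_grid.shape))    # → (8, 3)
```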

class bridge.training.utils.visual_inputs.Qwen2AudioInputs#

Container for Qwen2-Audio modality tensors.

Fields mirror the processor outputs for Qwen2-Audio. The model expects input_features (mel spectrograms) and feature_attention_mask.

input_features: Optional[torch.Tensor]#

None

feature_attention_mask: Optional[torch.Tensor]#

None

as_model_kwargs() dict[str, torch.Tensor]#

Return a mapping of non-None fields suitable for model forward kwargs.

normalized_for_model() dict[str, torch.Tensor]#

Return non-None fields (no shape normalization needed for audio).
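A minimal sketch of using this container's output, assuming the fields are built from processor output. The tensor shapes are illustrative (Whisper-style mel features of shape [B, n_mels, frames]); the actual sizes depend on the Qwen2-Audio processor configuration.

```python
import torch

# Hypothetical shapes for a single one-second clip; real values vary.
input_features = torch.randn(1, 16, 100)                # mel spectrograms
feature_attention_mask = torch.ones(1, 100, dtype=torch.long)

# as_model_kwargs() / normalized_for_model() both reduce to filtering
# out None fields, since audio needs no shape normalization.
raw_fields = {
    "input_features": input_features,
    "feature_attention_mask": feature_attention_mask,
}
model_kwargs = {k: v for k, v in raw_fields.items() if v is not None}
print(sorted(model_kwargs))  # → ['feature_attention_mask', 'input_features']

# Typical downstream use: model(**model_kwargs, input_ids=..., ...)
```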