`bridge.models.qwen_vl.modelling_qwen3_vl.vision_model`#

Module Contents#

Classes#

`Qwen3VLVisionModel`	Qwen3 ViT vision model.

Functions#

`_maybe_pad_vision_sequence_for_cuda_graph`	Pad vision token tensors to `max_seq_len` for fixed-shape CUDA graphs.
`_vision_forward_packed_attention_setup`	Return `(packed_seq_params, attention_mask)` for vision encoder forward.

API#

bridge.models.qwen_vl.modelling_qwen3_vl.vision_model._maybe_pad_vision_sequence_for_cuda_graph( hidden_states: torch.Tensor, rotary_pos_emb: torch.Tensor, seq_len: int, max_seq_len: int, ) → tuple[torch.Tensor, torch.Tensor, int]#

Pad vision token tensors to max_seq_len for fixed-shape CUDA graphs.

Parameters:

hidden_states – [seq_len, hidden_size].
rotary_pos_emb – [seq_len, 1, 1, dim] (same layout as after the reshape/repeat in `Qwen3VLVisionModel.forward`).
seq_len – Current sequence length (must match tensor leading size).
max_seq_len – Target length for CUDA graph capture.

Returns:

Tuple of (padded hidden_states, padded rotary_pos_emb, new seq_len).

Raises:

ValueError – If seq_len exceeds max_seq_len.
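
The padding contract above can be sketched without any dependencies. This is a minimal illustration, not the library's implementation: plain Python lists stand in for `torch.Tensor`, the rotary embedding is flattened to `[seq_len, dim]`, zero is an assumed pad value, and `pad_vision_sequence` is a hypothetical name.

```python
def pad_vision_sequence(hidden_states, rotary_pos_emb, seq_len, max_seq_len):
    """Pad two row-major sequences with zero rows up to max_seq_len.

    Assumption: zero is used as the pad value; the source does not state it.
    """
    if seq_len > max_seq_len:
        raise ValueError(f"seq_len {seq_len} exceeds max_seq_len {max_seq_len}")

    pad_rows = max_seq_len - seq_len
    hidden_width = len(hidden_states[0]) if hidden_states else 0
    rotary_width = len(rotary_pos_emb[0]) if rotary_pos_emb else 0

    # Append zero rows so both sequences share the fixed capture length.
    padded_hidden = hidden_states + [[0.0] * hidden_width for _ in range(pad_rows)]
    padded_rotary = rotary_pos_emb + [[0.0] * rotary_width for _ in range(pad_rows)]
    return padded_hidden, padded_rotary, max_seq_len


# Three real tokens padded to a fixed length of five.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
rotary = [[0.1], [0.2], [0.3]]
padded_hidden, padded_rotary, new_len = pad_vision_sequence(hidden, rotary, 3, 5)
```

Returning the new length alongside the tensors lets the caller keep the original length for later masking.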

bridge.models.qwen_vl.modelling_qwen3_vl.vision_model._vision_forward_packed_attention_setup( use_cuda_graph_padding: bool, hidden_states: torch.Tensor, original_seq_len: int, seq_len: int, grid_thw: torch.Tensor, build_packed_seq_params: collections.abc.Callable[[torch.Tensor], megatron.core.packed_seq_params.PackedSeqParams], ) → tuple[Optional[megatron.core.packed_seq_params.PackedSeqParams], Optional[torch.Tensor]]#

Return (packed_seq_params, attention_mask) for vision encoder forward.

CUDA graphs cannot take packed-sequence metadata, because it is not tensor data. On that path the encoder instead runs full attention over a fixed-length padded sequence, optionally with an additive mask so that padding tokens are ignored.

Parameters:

use_cuda_graph_padding – Whether vision CUDA graph padding path is active.
hidden_states – Vision hidden states after adding the batch dimension, shape [S, 1, H].
original_seq_len – Sequence length before padding.
seq_len – Sequence length after optional padding (equals hidden_states leading size).
grid_thw – Grid sizes per image/frame (used only when not using CUDA graph padding).
build_packed_seq_params – Callback that builds a `PackedSeqParams` from grid_thw.

Returns:

packed_seq_params (None when using CUDA graph padding) and attention_mask (additive mask for padded CUDA graph runs, else None).
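
The two return cases can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: the mask is a nested list instead of a tensor, its `[seq_len, seq_len]` layout is assumed, the mask is assumed to be skipped when no padding was added, and `vision_attention_setup` is a hypothetical name.

```python
import math


def vision_attention_setup(use_cuda_graph_padding, original_seq_len, seq_len,
                           grid_thw, build_packed_seq_params):
    """Return (packed_seq_params, attention_mask) for one of the two paths."""
    if not use_cuda_graph_padding:
        # Regular path: packed-sequence metadata, no explicit mask.
        return build_packed_seq_params(grid_thw), None

    if seq_len == original_seq_len:
        # Assumption: nothing was padded, so no mask is required.
        return None, None

    # Additive mask: 0 keeps a key position, -inf removes it after softmax.
    row = [0.0 if k < original_seq_len else -math.inf for k in range(seq_len)]
    return None, [list(row) for _ in range(seq_len)]


# CUDA graph path: two real tokens padded to four.
packed, mask = vision_attention_setup(True, 2, 4, None, lambda grid: {"grid": grid})
```

Only key positions are masked, so every query row still attends to at least one real token and the softmax stays well defined.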

class bridge.models.qwen_vl.modelling_qwen3_vl.vision_model.Qwen3VLVisionModel( transformer_config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig, transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec, patch_merger_spec: megatron.core.transformer.spec_utils.ModuleSpec, pre_process: bool = True, post_process: bool = True, pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None, )#

Bases: megatron.core.models.common.vision_module.vision_module.VisionModule

Qwen3 ViT vision model.

Parameters:

transformer_config (Qwen3VLTransformerConfig) – Transformer config.
transformer_layer_spec (ModuleSpec) – Specifies module to use for transformer layers.
patch_merger_spec (ModuleSpec) – Specifies module to use for the patch merger.

set_input_tensor(input_tensor: torch.Tensor) → None#

Sets input tensor to the model.

Parameters:

input_tensor (Tensor) – Input tensor for the model.

rot_pos_emb(grid_thw: torch.Tensor) → torch.Tensor#

fast_pos_embed_interpolate(grid_thw)#

_get_max_vision_seq_length() → int#

Get the maximum sequence length for vision encoder CUDA graphs.

_uses_vision_cuda_graph() → bool#

Check if vision encoder CUDA graphs are enabled.

forward( hidden_states: Optional[torch.Tensor], grid_thw: torch.Tensor, inference_params: Optional[megatron.core.InferenceParams] = None, extra_block_kwargs: dict = None, ) → torch.Tensor#

Forward function of the Qwen3 Vision Model. This function passes the input tensors through the embedding layer and then the transformer.

Parameters:

hidden_states (torch.Tensor) – Input image/video data of shape [n_tokens, n_dims].
grid_thw (torch.Tensor) – Size tensor giving the grid size of each image/frame.

Note: packed_seq_params (PackedSeqParams), the parameters used to build the attention mask in the backend, is not an argument of forward; see build_packed_seq_params.

Returns:

output after final transformer block of shape [b, s, h].

Return type:

torch.Tensor

build_packed_seq_params( grid_thw: Optional[torch.Tensor], ) → megatron.core.packed_seq_params.PackedSeqParams#
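
The source does not describe how this method derives its metadata from grid_thw. As an illustration only, the sketch below assumes the convention used by the Hugging Face Qwen2-VL vision tower, where each frame contributes `h * w` tokens and cumulative sequence lengths mark the frame boundaries. `cumulative_seq_lens` is a hypothetical helper, and plain lists stand in for tensors.

```python
from itertools import accumulate


def cumulative_seq_lens(grid_thw):
    """Return cumulative token counts, one boundary per frame.

    Assumption: each (t, h, w) row yields t frames of h * w tokens each.
    """
    tokens_per_frame = []
    for t, h, w in grid_thw:
        tokens_per_frame.extend([h * w] * t)
    # A leading 0 marks the start of the first frame.
    return [0] + list(accumulate(tokens_per_frame))


# One 4x4 image followed by a two-frame 2x2 video.
print(cumulative_seq_lens([[1, 4, 4], [2, 2, 2]]))  # [0, 16, 20, 24]
```

Boundaries of this kind are what variable-length attention kernels typically consume, which may be why they cannot be passed through the fixed-shape CUDA graph path.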
