bridge.models.qwen_vl.modelling_qwen3_vl.vision_model

Module Contents

Classes

Qwen3VLVisionModel

Qwen3 ViT vision model.

Functions

_maybe_pad_vision_sequence_for_cuda_graph

Pad vision token tensors to max_seq_len for fixed-shape CUDA graphs.

_vision_forward_packed_attention_setup

Return (packed_seq_params, attention_mask) for vision encoder forward.

API

bridge.models.qwen_vl.modelling_qwen3_vl.vision_model._maybe_pad_vision_sequence_for_cuda_graph(
hidden_states: torch.Tensor,
rotary_pos_emb: torch.Tensor,
seq_len: int,
max_seq_len: int,
) -> tuple[torch.Tensor, torch.Tensor, int]

Pad vision token tensors to max_seq_len for fixed-shape CUDA graphs.

Parameters:
  • hidden_states – [seq_len, hidden_size].

  • rotary_pos_emb – [seq_len, 1, 1, dim] (same layout as after the reshape/repeat in Qwen3VLVisionModel.forward).

  • seq_len – Current sequence length (must match tensor leading size).

  • max_seq_len – Target length for CUDA graph capture.

Returns:

Tuple of (padded hidden_states, padded rotary_pos_emb, new seq_len).

Raises:

ValueError – If seq_len exceeds max_seq_len.
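The padding step described above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the module's implementation: the helper name `pad_vision_sequence` is hypothetical, and zero-filling the padded rows is an assumption, since the fill value is not stated here.

```python
import torch


def pad_vision_sequence(hidden_states, rotary_pos_emb, seq_len, max_seq_len):
    """Pad both tensors along dim 0 up to max_seq_len (hypothetical helper)."""
    if seq_len > max_seq_len:
        raise ValueError(f"seq_len ({seq_len}) exceeds max_seq_len ({max_seq_len})")
    pad = max_seq_len - seq_len
    if pad == 0:
        # Already at the capture length, nothing to do.
        return hidden_states, rotary_pos_emb, seq_len
    # Assumption: padded rows are zeros. new_zeros keeps dtype and device.
    hs_pad = hidden_states.new_zeros((pad, *hidden_states.shape[1:]))
    rope_pad = rotary_pos_emb.new_zeros((pad, *rotary_pos_emb.shape[1:]))
    return (
        torch.cat([hidden_states, hs_pad], dim=0),
        torch.cat([rotary_pos_emb, rope_pad], dim=0),
        max_seq_len,
    )


hs = torch.randn(6, 16)          # [seq_len, hidden_size]
rope = torch.randn(6, 1, 1, 8)   # [seq_len, 1, 1, dim]
padded_hs, padded_rope, new_len = pad_vision_sequence(hs, rope, 6, 10)
```

Padding only the leading dimension keeps the original tokens in place, so the caller can slice the first `seq_len` rows back out after the forward pass.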

bridge.models.qwen_vl.modelling_qwen3_vl.vision_model._vision_forward_packed_attention_setup(
use_cuda_graph_padding: bool,
hidden_states: torch.Tensor,
original_seq_len: int,
seq_len: int,
grid_thw: torch.Tensor,
build_packed_seq_params: collections.abc.Callable[[torch.Tensor], megatron.core.packed_seq_params.PackedSeqParams],
) -> tuple[Optional[megatron.core.packed_seq_params.PackedSeqParams], Optional[torch.Tensor]]

Return (packed_seq_params, attention_mask) for vision encoder forward.

When CUDA graphs are used, packed sequence metadata cannot be passed because it is not a tensor. The encoder instead runs full attention over a fixed-length padded sequence, optionally with an additive mask that ignores the padding.

Parameters:
  • use_cuda_graph_padding – Whether vision CUDA graph padding path is active.

  • hidden_states – Vision hidden states after adding the batch dimension, shape [S, 1, H].

  • original_seq_len – Sequence length before padding.

  • seq_len – Sequence length after optional padding (equals hidden_states leading size).

  • grid_thw – Grid sizes per image/frame (used only when not using CUDA graph padding).

  • build_packed_seq_params – Callback to build PackedSeqParams from grid_thw.

Returns:

packed_seq_params (None when using CUDA graph padding) and attention_mask (additive mask for padded CUDA graph runs, else None).
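The two branches described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the helper name `attention_setup_sketch` is hypothetical, the mask layout `[1, 1, S, S]` and the use of the dtype minimum as the fill value are assumptions, and the callback in the usage lines is a stand-in that returns a plain dict instead of a real PackedSeqParams.

```python
from typing import Any, Callable, Optional

import torch


def attention_setup_sketch(
    use_cuda_graph_padding: bool,
    hidden_states: torch.Tensor,  # [S, 1, H]
    original_seq_len: int,
    seq_len: int,
    grid_thw: torch.Tensor,
    build_packed_seq_params: Callable[[torch.Tensor], Any],
) -> tuple[Optional[Any], Optional[torch.Tensor]]:
    if not use_cuda_graph_padding:
        # Eager path: variable-length metadata is allowed, no dense mask.
        return build_packed_seq_params(grid_thw), None
    if seq_len == original_seq_len:
        # Assumption: with no padded positions, no mask is needed.
        return None, None
    # Assumed layout [1, 1, S_query, S_key]; the mask is added to the scores.
    mask = hidden_states.new_zeros((1, 1, seq_len, seq_len))
    # Push padded key positions to the most negative representable value.
    mask[..., original_seq_len:] = torch.finfo(mask.dtype).min
    return None, mask


def fake_builder(g: torch.Tensor) -> dict:
    # Stand-in for the real callback, for illustration only.
    return {"grid_thw": g}


hs = torch.zeros(8, 1, 4)              # padded from 6 to 8 tokens
grid = torch.tensor([[1, 2, 3]])       # one image, 2 x 3 patches
params, mask = attention_setup_sketch(True, hs, 6, 8, grid, fake_builder)
eager_params, eager_mask = attention_setup_sketch(False, hs, 8, 8, grid, fake_builder)
```

Masking only the key axis means padded query rows still see valid keys, which avoids rows whose softmax input is entirely masked.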

class bridge.models.qwen_vl.modelling_qwen3_vl.vision_model.Qwen3VLVisionModel(
transformer_config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
patch_merger_spec: megatron.core.transformer.spec_utils.ModuleSpec,
pre_process: bool = True,
post_process: bool = True,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
)

Bases: megatron.core.models.common.vision_module.vision_module.VisionModule

Qwen3 ViT vision model.

Parameters:
  • transformer_config (Qwen3VLTransformerConfig) – Transformer config.

  • transformer_layer_spec (ModuleSpec) – Specifies module to use for transformer layers.

  • patch_merger_spec (ModuleSpec) – Specifies module to use for the patch merger.

Initialization

set_input_tensor(input_tensor: torch.Tensor) -> None

Sets input tensor to the model.

Parameters:

input_tensor (Tensor) – Input tensor for the model.

rot_pos_emb(grid_thw: torch.Tensor) -> torch.Tensor

fast_pos_embed_interpolate(grid_thw)

_get_max_vision_seq_length() -> int

Get the maximum sequence length for vision encoder CUDA graphs.

_uses_vision_cuda_graph() -> bool

Check if vision encoder CUDA graphs are enabled.

forward(
hidden_states: Optional[torch.Tensor],
grid_thw: torch.Tensor,
inference_params: Optional[megatron.core.InferenceParams] = None,
extra_block_kwargs: dict = None,
) -> torch.Tensor

Forward function of the Qwen3 Vision Model. This function passes the input tensors through the embedding layer and then the transformer.

Parameters:
  • hidden_states (torch.Tensor) – input image/video data of shape [n_tokens, n_dims]

  • grid_thw (torch.Tensor) – size tensor giving the grid size of each image/frame

  • packed_seq_params (PackedSeqParams) – parameters to build attention mask in the backend

Returns:

Output after the final transformer block, of shape [b, s, h].

Return type:

torch.Tensor

build_packed_seq_params(
grid_thw: Optional[torch.Tensor],
) -> megatron.core.packed_seq_params.PackedSeqParams
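
No description is given for this method, so the following is only a sketch of the kind of metadata a packed-sequence builder typically derives from grid_thw. It assumes, as in the Hugging Face Qwen2-VL vision encoder, that every frame forms its own attention segment of h * w tokens. The helper name `cu_seqlens_from_grid` is hypothetical, and the sketch stops at the cumulative boundaries rather than constructing a real PackedSeqParams.

```python
from itertools import accumulate


def cu_seqlens_from_grid(grid_thw):
    """Cumulative segment boundaries from (t, h, w) triples (hypothetical helper)."""
    seg_lens = []
    for t, h, w in grid_thw:
        # Assumption: each of the t frames is a separate segment of h * w tokens.
        seg_lens.extend([h * w] * t)
    # Prepend 0 so that segment i spans cu[i]:cu[i + 1].
    return [0, *accumulate(seg_lens)]


# One image with a 4 x 6 grid, then a 2-frame video with a 2 x 2 grid.
cu = cu_seqlens_from_grid([(1, 4, 6), (2, 2, 2)])
max_seqlen = max(b - a for a, b in zip(cu, cu[1:]))
```

Boundaries of this form are what variable-length attention kernels usually consume, together with the longest segment length.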