bridge.models.qwen_vl.modelling_qwen3_vl.vision_model
Module Contents
Classes
Qwen3VLVisionModel | Qwen3 ViT vision model.
Functions
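_maybe_pad_vision_sequence_for_cuda_graph | Pad vision token tensors to max_seq_len for fixed-shape CUDA graphs.
_vision_forward_packed_attention_setup | Return (packed_seq_params, attention_mask) for vision encoder forward.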
API
- bridge.models.qwen_vl.modelling_qwen3_vl.vision_model._maybe_pad_vision_sequence_for_cuda_graph(
- hidden_states: torch.Tensor,
- rotary_pos_emb: torch.Tensor,
- seq_len: int,
- max_seq_len: int,
Pad vision token tensors to max_seq_len for fixed-shape CUDA graphs.
- Parameters:
hidden_states – [seq_len, hidden_size].
rotary_pos_emb – [seq_len, 1, 1, dim] (same layout as after reshape/repeat in Qwen3VLVisionModel.forward).
seq_len – Current sequence length (must match tensor leading size).
max_seq_len – Target length for CUDA graph capture.
- Returns:
Tuple of (padded hidden_states, padded rotary_pos_emb, new seq_len).
- Raises:
ValueError – If seq_len exceeds max_seq_len.
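The padding contract described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the name pad_vision_sequence is hypothetical, and padding with zeros along the leading dimension is an assumption, since the reference only states the shapes, the return tuple, and the ValueError.

```python
import torch
import torch.nn.functional as F


def pad_vision_sequence(hidden_states, rotary_pos_emb, seq_len, max_seq_len):
    """Hypothetical stand-in for the documented helper (zero padding assumed)."""
    if seq_len > max_seq_len:
        raise ValueError(f"seq_len ({seq_len}) exceeds max_seq_len ({max_seq_len})")
    pad = max_seq_len - seq_len
    if pad == 0:
        return hidden_states, rotary_pos_emb, seq_len
    # F.pad lists (left, right) pairs starting from the LAST dimension,
    # so the final pair addresses dim 0 (the sequence dimension).
    hidden_states = F.pad(hidden_states, (0, 0, 0, pad))                # [max_seq_len, hidden_size]
    rotary_pos_emb = F.pad(rotary_pos_emb, (0, 0, 0, 0, 0, 0, 0, pad))  # [max_seq_len, 1, 1, dim]
    return hidden_states, rotary_pos_emb, max_seq_len


h = torch.randn(5, 8)        # [seq_len, hidden_size]
r = torch.randn(5, 1, 1, 4)  # [seq_len, 1, 1, dim]
padded_h, padded_r, new_len = pad_vision_sequence(h, r, seq_len=5, max_seq_len=7)
```

Every call then yields tensors of the same leading size, which is what a captured CUDA graph needs in order to be replayed.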
- bridge.models.qwen_vl.modelling_qwen3_vl.vision_model._vision_forward_packed_attention_setup(
- use_cuda_graph_padding: bool,
- hidden_states: torch.Tensor,
- original_seq_len: int,
- seq_len: int,
- grid_thw: torch.Tensor,
- build_packed_seq_params: collections.abc.Callable[[torch.Tensor], megatron.core.packed_seq_params.PackedSeqParams],
Return (packed_seq_params, attention_mask) for vision encoder forward.
When using CUDA graphs, packed sequence metadata (non-tensors) cannot be passed; use full attention on a fixed-length padded sequence and optionally an additive mask to ignore padding.
- Parameters:
use_cuda_graph_padding – Whether the vision CUDA graph padding path is active.
hidden_states – Vision hidden states after adding the batch dimension, shape [S, 1, H].
original_seq_len – Sequence length before padding.
seq_len – Sequence length after optional padding (equals hidden_states leading size).
grid_thw – Grid sizes per image/frame (used only when not using CUDA graph padding).
build_packed_seq_params – Callback to build PackedSeqParams from grid_thw.
- Returns:
packed_seq_params (None when using CUDA graph padding) and attention_mask (additive mask for padded CUDA graph runs, else None).
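The two return paths can be illustrated with a small sketch. It is written under assumptions and does not reproduce the library code: the name packed_attention_setup is hypothetical, and the mask shape [1, 1, 1, S] (broadcast over batch, heads, and query positions) and the use of the dtype's most negative finite value are choices made here for illustration.

```python
import torch


def packed_attention_setup(use_cuda_graph_padding, hidden_states, original_seq_len,
                           seq_len, grid_thw, build_packed_seq_params):
    """Hypothetical stand-in: returns (packed_seq_params, attention_mask)."""
    if not use_cuda_graph_padding:
        # Regular path: packed-sequence metadata, no explicit mask.
        return build_packed_seq_params(grid_thw), None
    if seq_len == original_seq_len:
        # Fixed-length run with nothing padded: full attention, no mask.
        return None, None
    # Additive mask (assumed shape [1, 1, 1, S]): 0 for real tokens,
    # a large negative value for padded key positions.
    mask = torch.zeros(1, 1, 1, seq_len, dtype=hidden_states.dtype,
                       device=hidden_states.device)
    mask[..., original_seq_len:] = torch.finfo(hidden_states.dtype).min
    return None, mask


states = torch.zeros(6, 1, 4)  # [S, 1, H], with 4 real tokens and 2 padded
params, mask = packed_attention_setup(True, states, 4, 6, None, lambda g: g)
# Adding the mask to attention scores removes the padded keys after softmax.
weights = torch.softmax(torch.zeros(1, 1, 6, 6) + mask, dim=-1)
```

Because the mask is an ordinary tensor of fixed shape, it can be passed through a captured graph, whereas the non-tensor packed-sequence metadata cannot.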
- class bridge.models.qwen_vl.modelling_qwen3_vl.vision_model.Qwen3VLVisionModel(
- transformer_config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
- transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
- patch_merger_spec: megatron.core.transformer.spec_utils.ModuleSpec,
- pre_process: bool = True,
- post_process: bool = True,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
Bases: megatron.core.models.common.vision_module.vision_module.VisionModule
Qwen3 ViT vision model.
- Parameters:
transformer_config (Qwen3VLTransformerConfig) – Transformer config.
transformer_layer_spec (ModuleSpec) – Specifies module to use for transformer layers.
patch_merger_spec (ModuleSpec) – Specifies module to use for the patch merger.
Initialization
- set_input_tensor(input_tensor: torch.Tensor) → None
Sets input tensor to the model.
- Parameters:
input_tensor (Tensor) – The input tensor for the model.
- rot_pos_emb(grid_thw: torch.Tensor) → torch.Tensor
- fast_pos_embed_interpolate(grid_thw)
- _get_max_vision_seq_length() → int
Get the maximum sequence length for vision encoder CUDA graphs.
- _uses_vision_cuda_graph() → bool
Check if vision encoder CUDA graphs are enabled.
- forward(
- hidden_states: Optional[torch.Tensor],
- grid_thw: torch.Tensor,
- inference_params: Optional[megatron.core.InferenceParams] = None,
- extra_block_kwargs: dict = None,
Forward function of the Qwen3 Vision Model. This function passes the input tensors through the embedding layer and then the transformer.
- Parameters:
hidden_states (torch.Tensor) – input image/video data of shape [n_tokens, n_dims]
grid_thw (torch.Tensor) – size tensor giving the grid size (t, h, w) of each image/frame
packed_seq_params (PackedSeqParams) – parameters to build attention mask in the backend
- Returns:
output after final transformer block of shape [b, s, h].
- Return type:
x (torch.Tensor)
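To make the relationship between grid_thw and n_tokens concrete, the sketch below uses plain Python lists in place of tensors. It assumes the convention of the Hugging Face Qwen2-VL vision encoder, where each row of grid_thw is (t, h, w) in patch units and each temporal frame forms its own attention segment; whether build_packed_seq_params follows the same convention is not stated in this reference, and the helper name frame_cu_seqlens is hypothetical.

```python
from itertools import accumulate


def frame_cu_seqlens(grid_thw):
    """Cumulative sequence lengths with one segment per frame (assumed convention)."""
    seqlens = []
    for t, h, w in grid_thw:
        # Each of the t frames contributes h * w patch tokens.
        seqlens.extend([h * w] * t)
    return [0, *accumulate(seqlens)]


# One image with a 4x6 patch grid and one two-frame video with a 2x2 patch grid.
grid_thw = [(1, 4, 6), (2, 2, 2)]
n_tokens = sum(t * h * w for t, h, w in grid_thw)  # leading size of hidden_states
cu_seqlens = frame_cu_seqlens(grid_thw)            # segment boundaries in the packed sequence
```

Here n_tokens is 32 and the boundaries are [0, 24, 28, 32], so the last boundary always equals the total token count.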
- build_packed_seq_params(
- grid_thw: Optional[torch.Tensor],