bridge.models.qwen_vl.modelling_qwen3_vl.vision_model#

Module Contents#

Classes#

Qwen3VLVisionModel

Qwen3 ViT vision model.

API#

class bridge.models.qwen_vl.modelling_qwen3_vl.vision_model.Qwen3VLVisionModel(
transformer_config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
patch_merger_spec: megatron.core.transformer.spec_utils.ModuleSpec,
pre_process: bool = True,
post_process: bool = True,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
)#

Bases: megatron.core.models.common.vision_module.vision_module.VisionModule

Qwen3 ViT vision model.

Parameters:
  • transformer_config (Qwen3VLTransformerConfig) – Transformer config for the vision model.

  • transformer_layer_spec (ModuleSpec) – Specifies module to use for transformer layers.

  • patch_merger_spec (ModuleSpec) – Specifies module to use for the patch merger.

  • pre_process (bool) – Include the embedding stage of the computation (first pipeline stage).

  • post_process (bool) – Include the output stage of the computation (last pipeline stage).

  • pg_collection (ProcessGroupCollection, optional) – Process groups for distributed communication.

Initialization

set_input_tensor(input_tensor: torch.Tensor) → None#

Sets input tensor to the model.

Parameters:

input_tensor (Tensor) – Sets the input tensor for the model.

rot_pos_emb(grid_thw: torch.Tensor) → torch.Tensor#

Computes rotary position embeddings for the patches described by grid_thw.
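As a rough illustration of how per-patch rotary position indices can be derived from grid sizes (a sketch only, not the Megatron implementation — the real code works on tensors and also accounts for the spatial merge size):

```python
# Sketch: build one (row, col) position pair per patch from a list of
# (t, h, w) grids; these pairs would index a rotary embedding table.
def rot_pos_ids(grid_thw):
    pos_ids = []
    for t, h, w in grid_thw:
        # one (row, col) pair per spatial patch of a single frame
        frame = [(r, c) for r in range(h) for c in range(w)]
        # repeat the spatial layout for every temporal frame
        pos_ids.extend(frame * t)
    return pos_ids

rot_pos_ids([(1, 2, 2)])  # -> [(0, 0), (0, 1), (1, 0), (1, 1)]
```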
fast_pos_embed_interpolate(grid_thw)#

Interpolates the learned position embeddings to the grid sizes in grid_thw.
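The core of such an interpolation is mapping each destination position to its two nearest source positions and linear weights. A minimal 1D sketch (the actual method interpolates a 2D embedding grid; the function below is illustrative only):

```python
# Sketch: for each destination index, return the two nearest source
# indices and their linear-interpolation weights.
def interp_indices_weights(src_len, dst_len):
    out = []
    for i in range(dst_len):
        # map the destination coordinate into source coordinate space
        x = i * (src_len - 1) / (dst_len - 1) if dst_len > 1 else 0.0
        lo = int(x)
        hi = min(lo + 1, src_len - 1)
        frac = x - lo
        out.append(((lo, hi), (1.0 - frac, frac)))
    return out

interp_indices_weights(3, 5)[1]  # -> ((0, 1), (0.5, 0.5))
```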
forward(
hidden_states: Optional[torch.Tensor],
grid_thw: torch.Tensor,
inference_params: Optional[megatron.core.InferenceParams] = None,
extra_block_kwargs: dict = None,
) → torch.Tensor#

Forward function of the Qwen3 Vision Model. This function passes the input tensors through the embedding layer and then the transformer.

Parameters:
  • hidden_states (torch.Tensor) – input image/video data of shape [n_tokens, n_dims]

  • grid_thw (torch.Tensor) – tensor of (t, h, w) grid sizes for each image/video

  • inference_params (InferenceParams, optional) – inference-time parameters

  • extra_block_kwargs (dict, optional) – additional keyword arguments (e.g. packed_seq_params used to build the attention mask in the backend) passed to the transformer block

Returns:

output after final transformer block of shape [b, s, h].

Return type:

hidden_states (torch.Tensor)
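The vision sequence length follows directly from the grid sizes: the ViT sees one token per patch, and the patch merger then collapses each merge_size × merge_size spatial block into a single token. A small sketch of that arithmetic (merge_size=2 is an assumption here, not read from the config):

```python
# Sketch: token counts before the ViT and after the patch merger,
# assuming the merger reduces each merge_size x merge_size block of
# spatial patches to one token.
def vision_token_counts(grid_thw, merge_size=2):
    pre = sum(t * h * w for t, h, w in grid_thw)
    post = sum(t * (h // merge_size) * (w // merge_size) for t, h, w in grid_thw)
    return pre, post

vision_token_counts([(1, 32, 32)])  # -> (1024, 256)
```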

build_packed_seq_params(
grid_thw: Optional[torch.Tensor],
) → megatron.core.packed_seq_params.PackedSeqParams#

Builds the PackedSeqParams used for packed-sequence attention from the image/video grid sizes.
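PackedSeqParams is built around cumulative sequence lengths (cu_seqlens) so that several variable-length images/videos can share one packed attention call. A plain-Python sketch of that computation, assuming one sequence per image/video (the actual implementation may split sequences per frame):

```python
# Sketch: cumulative sequence lengths from (t, h, w) grids, the format
# packed-sequence attention backends consume (leading zero, then a
# running total of tokens per image/video).
def cu_seqlens(grid_thw):
    out = [0]
    for t, h, w in grid_thw:
        out.append(out[-1] + t * h * w)
    return out

cu_seqlens([(1, 4, 4), (1, 2, 2)])  # -> [0, 16, 20]
```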