bridge.models.qwen_vl.modelling_qwen3_vl.vision_model#
Module Contents#
Classes#
Qwen3VLVisionModel | Qwen3 ViT vision model.
API#
- class bridge.models.qwen_vl.modelling_qwen3_vl.vision_model.Qwen3VLVisionModel(
- transformer_config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
- transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
- patch_merger_spec: megatron.core.transformer.spec_utils.ModuleSpec,
- pre_process: bool = True,
- post_process: bool = True,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
- )
Bases:
megatron.core.models.common.vision_module.vision_module.VisionModule

Qwen3 ViT vision model.
- Parameters:
transformer_config (TransformerConfig) – Transformer config.
transformer_layer_spec (ModuleSpec) – Specifies module to use for transformer layers.
patch_merger_spec (ModuleSpec) – Specifies module to use for the patch merger.
Initialization
- set_input_tensor(input_tensor: torch.Tensor) → None#
Sets input tensor to the model.
- Parameters:
input_tensor (Tensor) – Sets the input tensor for the model.
- rot_pos_emb(grid_thw: torch.Tensor) → torch.Tensor#
- fast_pos_embed_interpolate(grid_thw)#
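In Qwen-VL-style vision towers, rotary position ids are derived per patch from the `grid_thw` tensor: each image or frame of grid size (t, h, w) yields one (row, column) id pair per patch, repeated over the temporal dimension. The sketch below is a pure-Python illustration of that idea only; the actual `rot_pos_emb` (e.g. its patch-merge reordering) may differ:

```python
def rot_pos_ids(grid_thw):
    """Emit one (row, col) position-id pair per patch for each
    (t, h, w) grid, repeating the spatial ids across t frames.
    Hypothetical helper, not the library implementation."""
    ids = []
    for t, h, w in grid_thw:
        # Spatial ids for one frame, in row-major patch order.
        frame = [(r, c) for r in range(h) for c in range(w)]
        ids.extend(frame * t)
    return ids
```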
- forward(
- hidden_states: Optional[torch.Tensor],
- grid_thw: torch.Tensor,
- inference_params: Optional[megatron.core.InferenceParams] = None,
- extra_block_kwargs: dict = None,
- )
Forward function of the Qwen3 Vision Model. This function passes the input tensors through the embedding layer and then the transformer.
- Parameters:
hidden_states (torch.Tensor) – input image/video data of shape [n_tokens, n_dims]
grid_thw (torch.Tensor) – the size tensor indicates grid size of each image/frame
packed_seq_params (PackedSeqParams) – parameters to build attention mask in the backend
- Returns:
output after final transformer block of shape [b, s, h].
- Return type:
torch.Tensor
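Because all images and video frames are packed into a single [n_tokens, n_dims] sequence, the attention backend needs cumulative sequence lengths marking where each image's patches start and end; each (t, h, w) grid contributes t·h·w patch tokens. The helper below is a hypothetical illustration of that bookkeeping, not the library's `build_packed_seq_params`:

```python
def cu_seqlens_from_grid_thw(grid_thw):
    """Cumulative sequence lengths for packed attention:
    each image/video of grid size (t, h, w) contributes
    t * h * w patch tokens. Hypothetical helper."""
    cu = [0]
    for t, h, w in grid_thw:
        cu.append(cu[-1] + t * h * w)
    return cu
```

For example, one 4×4 image followed by a 2-frame 2×3 video packs into 28 tokens, with image boundaries at offsets 0, 16, and 28.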
- build_packed_seq_params(
- grid_thw: Optional[torch.Tensor],