bridge.models.qwen_vl.modelling_qwen3_vl.model#

Module Contents#

Classes#

Qwen3VLModel

Qwen3VL multi-modal model.

API#

class bridge.models.qwen_vl.modelling_qwen3_vl.model.Qwen3VLModel(
language_transformer_config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
language_transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
vision_transformer_config: transformers.models.qwen3_vl.configuration_qwen3_vl.Qwen3VLConfig,
parallel_output: bool = True,
pre_process: bool = True,
post_process: bool = True,
add_encoder: bool = True,
add_decoder: bool = True,
pg_collection: megatron.core.process_groups_config.ProcessGroupCollection = None,
)#

Bases: megatron.core.transformer.MegatronModule

Qwen3VL multi-modal model.

Parameters:
  • language_transformer_config (TransformerConfig) – Transformer config for the language model.

  • language_transformer_layer_spec (ModuleSpec) – Specifies the module to use for the transformer layers of the language model.

  • vision_transformer_config (Qwen3VLConfigHF) – HF config for the vision model.

  • parallel_output (bool) – Do not gather the outputs, keep them split across tensor parallel ranks. This is typically True for training and False for inference.

  • language_rotary_percent (float) – Percent of rotary dimension to use for rotary position embeddings in the language model. Defaults to 1.0.

  • pre_process (bool) – Include the embedding layer in the gpt decoder (used with pipeline parallelism). Defaults to True.

  • post_process (bool) – Include an output layer and a layernorm in the gpt decoder (used with pipeline parallelism). Defaults to True.

  • add_encoder (bool) – Construct the encoder module (used with pipeline parallelism). Defaults to True. When we use pipelining, the encoder will live on only a subset of the pipeline stages (specifically, only the first stage).

  • add_decoder (bool) – Construct the decoder module (used with pipeline parallelism). Defaults to True. When we use pipelining, the decoder will live on only a subset of the pipeline stages (specifically, every stage after the first one).
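The add_encoder / add_decoder placement described above can be sketched as a small helper. This is a hypothetical illustration; in practice these flags are chosen by Megatron's pipeline-parallel setup, not computed by the model itself:

```python
def module_flags(pipeline_rank: int, pipeline_size: int) -> tuple[bool, bool]:
    """Illustrative sketch of which modules a pipeline stage constructs.

    Per the docstring above: with pipelining, the vision encoder lives only
    on the first stage, and the language decoder lives on every stage after
    the first one. Without pipelining, a single rank holds both.
    """
    if pipeline_size == 1:
        return True, True  # (add_encoder, add_decoder)
    add_encoder = pipeline_rank == 0
    add_decoder = pipeline_rank > 0
    return add_encoder, add_decoder
```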

Initialization

shared_embedding_or_output_weight()#

This is a convenience method to surface the language model’s word embeddings, which is necessary for finalize_model_grads._allreduce_word_embedding_grads.

set_input_tensor(input_tensor) → None#
freeze(
freeze_language_model: bool,
freeze_vision_model: bool,
freeze_vision_projection: bool,
)#

Freeze model modules.

Make specific modules non-trainable by setting requires_grad to False for the module’s parameters.

Parameters:
  • freeze_language_model (bool) – Freeze the language model module.

  • freeze_vision_model (bool) – Freeze the vision model module.

  • freeze_vision_projection (bool) – Freeze the vision projection module.

forward(
input_ids: torch.Tensor,
position_ids: torch.Tensor = None,
attention_mask: torch.Tensor = None,
labels: torch.Tensor = None,
loss_mask: torch.Tensor = None,
inference_params: megatron.core.InferenceParams = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
extra_block_kwargs: dict = None,
pixel_values: torch.Tensor = None,
pixel_values_videos: torch.Tensor = None,
image_grid_thw: torch.Tensor = None,
video_grid_thw: torch.Tensor = None,
image_input_mask: torch.Tensor = None,
video_input_mask: torch.Tensor = None,
cp_img_num: list[int] = None,
images_padded: list[bool] = None,
inference_context: object | None = None,
runtime_gather_output: bool | None = None,
**kwargs,
) → torch.Tensor#

Forward function of the Qwen3VL model.

Note: there is a workaround here to support sequence packing together with context parallelism. A context-parallel split applied to a packed sequence would lose vision-token information, so the original input_ids are kept and packed only after the vision embeddings have been computed (cooperating with verl's models/mcore/model_forward.py). The combined_embeddings are packed to THD at this point; packed_seq_params is checked for None to decide whether packing is needed.

This function expects position_ids and attention_mask in BSHD format, whether or not sequence packing is used.
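The "keep the original input_ids, merge vision embeddings, then pack" idea can be illustrated with a toy scatter. This is a pure-Python stand-in, not the model's implementation: the real model uses image_input_mask / video_input_mask to place vision-encoder outputs into the combined embedding sequence before any THD packing:

```python
def merge_vision_embeddings(text_embeddings, vision_embeddings, image_input_mask):
    """Replace masked positions in the text embedding sequence with vision
    embeddings, in order. Plain lists stand in for tensors."""
    # There must be exactly one vision embedding per masked position.
    assert sum(image_input_mask) == len(vision_embeddings)
    merged, vis_iter = [], iter(vision_embeddings)
    for emb, is_image in zip(text_embeddings, image_input_mask):
        merged.append(next(vis_iter) if is_image else emb)
    return merged
```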

Parameters:
  • image_data (torch.Tensor) – Input image of shape [total_thw_size, n_features].

  • input_ids (torch.Tensor) – Input text ids [batch, text_seq_len].

  • position_ids (torch.Tensor) – Input text position ids [batch, text_seq_len].

  • attention_mask (torch.Tensor) – Attention mask for the language model [batch, 1, combined_seq_len, combined_seq_len].

  • labels (torch.Tensor) – Optional target text labels [batch, combined_seq_len].

  • inference_params (InferenceParams) – Inference-time parameters including the KV cache.

Returns:

Loss of shape [b, s] if labels are provided, otherwise logits of shape [b, s, vocab_size].

Return type:

output (torch.Tensor)