bridge.models.qwen_vl.modelling_qwen3_vl.model#

Module Contents#

Classes#

Qwen3VLModel

Qwen3VL multi-modal model.

API#

class bridge.models.qwen_vl.modelling_qwen3_vl.model.Qwen3VLModel(
language_transformer_config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
language_transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
vision_transformer_config: transformers.models.qwen3_vl.configuration_qwen3_vl.Qwen3VLConfig,
parallel_output: bool = True,
pre_process: bool = True,
post_process: bool = True,
add_encoder: bool = True,
add_decoder: bool = True,
)#

Bases: megatron.core.transformer.MegatronModule

Qwen3VL multi-modal model.

Parameters:
  • language_transformer_config (TransformerConfig) – Transformer config for the language model.

  • language_transformer_layer_spec (ModuleSpec) – Specifies the module to use for the transformer layers of the language model.

  • vision_transformer_config (Qwen3VLConfig) – Transformer config for the vision model, copied from the HF config.

  • parallel_output (bool) – Do not gather the outputs, keep them split across tensor parallel ranks. This is typically True for training and False for inference.

  • language_rotary_percent (float) – Percent of rotary dimension to use for rotary position embeddings in the language model. Defaults to 1.0.

  • pre_process (bool) – Include the embedding layer in the gpt decoder (used with pipeline parallelism). Defaults to True.

  • post_process (bool) – Include an output layer and a layernorm in the gpt decoder (used with pipeline parallelism). Defaults to True.

  • add_encoder (bool) – Construct the encoder module (used with pipeline parallelism). Defaults to True. When we use pipelining, the encoder will live on only a subset of the pipeline stages (specifically, only the first stage).

  • add_decoder (bool) – Construct the decoder module (used with pipeline parallelism). Defaults to True. When we use pipelining, the decoder will live on only a subset of the pipeline stages (specifically, every stage after the first one).

Initialization
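
A minimal construction sketch follows. The layer-spec helper, config field values, and HF checkpoint name are illustrative assumptions, not a recipe taken from this module; the language config may require additional fields in your Megatron Bridge version.

```python
from transformers import AutoConfig

from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.model import Qwen3VLModel
from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config import (
    Qwen3VLTransformerConfig,
)
from megatron.core.models.gpt.gpt_layer_specs import (
    get_gpt_layer_with_transformer_engine_spec,
)

# Language-side Megatron config; the field values here are placeholders.
language_config = Qwen3VLTransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
)

# Vision-side config is taken directly from the HF checkpoint config
# (a Qwen3VLConfig instance); the checkpoint name is hypothetical.
vision_config = AutoConfig.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

model = Qwen3VLModel(
    language_transformer_config=language_config,
    language_transformer_layer_spec=get_gpt_layer_with_transformer_engine_spec(),
    vision_transformer_config=vision_config,
    parallel_output=True,   # keep logits split across tensor-parallel ranks (training)
    pre_process=True,       # this pipeline stage owns the embedding layer
    post_process=True,      # this pipeline stage owns the output layer / final layernorm
    add_encoder=True,       # build the vision encoder on this stage (first stage only)
    add_decoder=True,       # build the language decoder on this stage
)
```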

shared_embedding_or_output_weight()#

This is a convenience method to surface the language model’s word embeddings, which is necessary for finalize_model_grads._allreduce_word_embedding_grads.
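
A short usage sketch, assuming this rank holds the embedding or output layer (on other pipeline stages the method may not return a usable weight):

```python
# Surface the language model's word-embedding weight, as consumed by
# Megatron's finalize_model_grads._allreduce_word_embedding_grads.
weight = model.shared_embedding_or_output_weight()
print(tuple(weight.shape))  # typically (padded_vocab_size, hidden_size)
```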

set_input_tensor(input_tensor) → None#

freeze(
freeze_language_model: bool,
freeze_vision_model: bool,
freeze_vision_projection: bool,
)#

Freeze model modules.

Make specific modules non-trainable by setting requires_grad to False.

Parameters:
  • freeze_language_model (bool) – Freeze the language model module.

  • freeze_vision_model (bool) – Freeze the vision model module (patch_embed, blocks, pos_embed).

  • freeze_vision_projection (bool) – Freeze the vision projection modules (merger and deepstack_merger_list).
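
A hypothetical fine-tuning setup, freezing the vision tower while keeping the language model and the vision projection trainable:

```python
model.freeze(
    freeze_language_model=False,
    freeze_vision_model=True,       # patch_embed, blocks, pos_embed
    freeze_vision_projection=False, # merger and deepstack_merger_list stay trainable
)

# Sanity check (sketch): frozen parameters no longer require gradients.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```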

forward(
input_ids: torch.Tensor,
position_ids: torch.Tensor = None,
attention_mask: torch.Tensor = None,
labels: torch.Tensor = None,
inference_params: megatron.core.InferenceParams = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
extra_block_kwargs: dict = None,
pixel_values: torch.Tensor = None,
pixel_values_videos: torch.Tensor = None,
image_grid_thw: torch.Tensor = None,
video_grid_thw: torch.Tensor = None,
image_input_mask: torch.Tensor = None,
) → torch.Tensor#

Forward function of the Qwen3VL model.

Parameters:
  • pixel_values (torch.Tensor) – flattened input image patches of shape [total_thw_size, n_features].

  • input_ids (torch.Tensor) – input text ids [batch, text_seq_len].

  • position_ids (torch.Tensor) – input text position ids [batch, text_seq_len].

  • attention_mask (torch.Tensor) – attention mask for the language model [batch, 1, combined_seq_len, combined_seq_len].

  • labels (torch.Tensor) – Optional target text labels [batch, combined_seq_len].

  • inference_params (InferenceParams) – Inference-time parameters including KV cache.

  • video_start_index – index separating video and image visual tokens: 0 means all tokens are video, len(video_seq) means all tokens are images, and any other value indicates a mixture of the two.

  • *_input_mask – masks marking the image/video token positions in the input sequence; must not be None on the first pipeline-parallel stage.

Returns:

Loss of shape [b, s] if labels are provided, otherwise logits of shape [b, s, vocab_size].

Return type:

output (torch.Tensor)
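
A call-signature sketch follows. Shapes, the image-token id, and the grid values are placeholders; in practice input_ids, pixel_values, image_grid_thw, and the input mask come from the Qwen3VL processor, which keeps them mutually consistent.

```python
import torch

batch, seq_len = 2, 128
input_ids = torch.randint(0, 1000, (batch, seq_len), device="cuda")
position_ids = torch.arange(seq_len, device="cuda").unsqueeze(0).expand(batch, -1)
labels = input_ids.clone()

# Flattened image patches [total_thw_size, n_features] and one (t, h, w)
# grid per image; values are illustrative only.
pixel_values = torch.randn(512, 1176, device="cuda")
image_grid_thw = torch.tensor([[1, 16, 16], [1, 16, 16]], device="cuda")
image_input_mask = input_ids == 151655  # hypothetical image-placeholder token id

output = model(
    input_ids=input_ids,
    position_ids=position_ids,
    attention_mask=None,        # assumption: a causal mask is built internally when None
    labels=labels,              # with labels, returns per-token loss [batch, seq_len]
    pixel_values=pixel_values,
    image_grid_thw=image_grid_thw,
    image_input_mask=image_input_mask,
)
```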