bridge.models.qwen_vl.modelling_qwen3_vl.model#

Module Contents#

Classes#

Qwen3VLModel

Qwen3VL multi-modal model.

API#

class bridge.models.qwen_vl.modelling_qwen3_vl.model.Qwen3VLModel(
language_transformer_config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
language_transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
vision_transformer_config: transformers.models.qwen3_vl.configuration_qwen3_vl.Qwen3VLConfig,
parallel_output: bool = True,
pre_process: bool = True,
post_process: bool = True,
add_encoder: bool = True,
add_decoder: bool = True,
)#

Bases: megatron.core.transformer.MegatronModule

Qwen3VL multi-modal model.

Parameters:
  • language_transformer_config (TransformerConfig) – Transformer config for the language model.

  • language_transformer_layer_spec (ModuleSpec) – Specifies the module to use for the transformer layers of the language model.

  • vision_transformer_config (Qwen3VLConfig) – Transformer config for the vision model, copied from the HF config.

  • parallel_output (bool) – Do not gather the outputs, keep them split across tensor parallel ranks. This is typically True for training and False for inference.

  • language_rotary_percent (float) – Percent of rotary dimension to use for rotary position embeddings in the language model. Defaults to 1.0.

  • pre_process (bool) – Include the embedding layer in the gpt decoder (used with pipeline parallelism). Defaults to True.

  • post_process (bool) – Include an output layer and a layernorm in the gpt decoder (used with pipeline parallelism). Defaults to True.

  • add_encoder (bool) – Construct the encoder module (used with pipeline parallelism). Defaults to True. When we use pipelining, the encoder will live on only a subset of the pipeline stages (specifically, only the first stage).

  • add_decoder (bool) – Construct the decoder module (used with pipeline parallelism). Defaults to True. When we use pipelining, the decoder will live on only a subset of the pipeline stages (specifically, every stage after the first one).

Initialization
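
A minimal construction sketch follows. The layer-spec helper, config field values, and HF checkpoint name are illustrative assumptions, not a recipe taken from this module; the language config may require additional fields in your Megatron Bridge version.

```python
from transformers import AutoConfig

from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.model import Qwen3VLModel
from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config import (
    Qwen3VLTransformerConfig,
)
from megatron.core.models.gpt.gpt_layer_specs import (
    get_gpt_layer_with_transformer_engine_spec,
)

# Language-side Megatron config; the field values here are placeholders.
language_config = Qwen3VLTransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
)

# Vision-side config is taken directly from the HF checkpoint config
# (a Qwen3VLConfig instance); the checkpoint name is hypothetical.
vision_config = AutoConfig.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

model = Qwen3VLModel(
    language_transformer_config=language_config,
    language_transformer_layer_spec=get_gpt_layer_with_transformer_engine_spec(),
    vision_transformer_config=vision_config,
    parallel_output=True,   # keep logits split across tensor-parallel ranks (training)
    pre_process=True,       # this pipeline stage owns the embedding layer
    post_process=True,      # this pipeline stage owns the output layer / final layernorm
    add_encoder=True,       # build the vision encoder on this stage (first stage only)
    add_decoder=True,       # build the language decoder on this stage
)
```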

shared_embedding_or_output_weight()#

This is a convenience method to surface the language model’s word embeddings, which is necessary for finalize_model_grads._allreduce_word_embedding_grads.
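
A short usage sketch, assuming this rank holds the embedding or output layer (on other pipeline stages the method may not return a usable weight):

```python
# Surface the language model's word-embedding weight, as consumed by
# Megatron's finalize_model_grads._allreduce_word_embedding_grads.
weight = model.shared_embedding_or_output_weight()
print(tuple(weight.shape))  # typically (padded_vocab_size, hidden_size)
```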

set_input_tensor(input_tensor) → None#

freeze(
freeze_language_model: bool,
freeze_vision_model: bool,
freeze_vision_projection: bool,
)#

Freeze model modules.

Make specific modules non-trainable by setting requires_grad to False.

Parameters:
  • freeze_language_model (bool) – Freeze the language model module.

  • freeze_vision_model (bool) – Freeze the vision model module (patch_embed, blocks, pos_embed).

  • freeze_vision_projection (bool) – Freeze the vision projection modules (merger and deepstack_merger_list).
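
A hypothetical fine-tuning setup, freezing the vision tower while keeping the language model and the vision projection trainable:

```python
model.freeze(
    freeze_language_model=False,
    freeze_vision_model=True,       # patch_embed, blocks, pos_embed
    freeze_vision_projection=False, # merger and deepstack_merger_list stay trainable
)

# Sanity check (sketch): frozen parameters no longer require gradients.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```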

forward(
input_ids: torch.Tensor,
position_ids: torch.Tensor = None,
attention_mask: torch.Tensor = None,
labels: torch.Tensor = None,
inference_params: megatron.core.InferenceParams = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
extra_block_kwargs: dict = None,
pixel_values: torch.Tensor = None,
pixel_values_videos: torch.Tensor = None,
image_grid_thw: torch.Tensor = None,
video_grid_thw: torch.Tensor = None,
image_input_mask: torch.Tensor = None,
) → torch.Tensor#

Forward function of the Qwen3VL model.

Parameters:
  • pixel_values (torch.Tensor) – flattened input image patches of shape [total_thw_size, n_features].

  • input_ids (torch.Tensor) – input text ids [batch, text_seq_len].

  • position_ids (torch.Tensor) – input text position ids [batch, text_seq_len].

  • attention_mask (torch.Tensor) – attention mask for the language model [batch, 1, combined_seq_len, combined_seq_len].

  • labels (torch.Tensor) – Optional target text labels [batch, combined_seq_len].

  • inference_params (InferenceParams) – Inference-time parameters including KV cache.

  • video_start_index – index separating video and image visual tokens: 0 means all tokens are video, len(video_seq) means all tokens are images, and any other value indicates a mixture of the two.

  • *_input_mask – masks marking the image/video token positions in the input sequence; must not be None on the first pipeline-parallel stage.

Returns:

Loss of shape [b, s] if labels are provided, otherwise logits of shape [b, s, vocab_size].

Return type:

output (torch.Tensor)
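
A call-signature sketch follows. Shapes, the image-token id, and the grid values are placeholders; in practice input_ids, pixel_values, image_grid_thw, and the input mask come from the Qwen3VL processor, which keeps them mutually consistent.

```python
import torch

batch, seq_len = 2, 128
input_ids = torch.randint(0, 1000, (batch, seq_len), device="cuda")
position_ids = torch.arange(seq_len, device="cuda").unsqueeze(0).expand(batch, -1)
labels = input_ids.clone()

# Flattened image patches [total_thw_size, n_features] and one (t, h, w)
# grid per image; values are illustrative only.
pixel_values = torch.randn(512, 1176, device="cuda")
image_grid_thw = torch.tensor([[1, 16, 16], [1, 16, 16]], device="cuda")
image_input_mask = input_ids == 151655  # hypothetical image-placeholder token id

output = model(
    input_ids=input_ids,
    position_ids=position_ids,
    attention_mask=None,        # assumption: a causal mask is built internally when None
    labels=labels,              # with labels, returns per-token loss [batch, seq_len]
    pixel_values=pixel_values,
    image_grid_thw=image_grid_thw,
    image_input_mask=image_input_mask,
)
```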