bridge.models.qwen_vl.modelling_qwen3_vl.model#

Module Contents#

Classes#

Qwen3VLModel

Qwen3VL multi-modal model.

API#

class bridge.models.qwen_vl.modelling_qwen3_vl.model.Qwen3VLModel(
language_transformer_config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
language_transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
vision_transformer_config: transformers.models.qwen3_vl.configuration_qwen3_vl.Qwen3VLConfig,
parallel_output: bool = True,
pre_process: bool = True,
post_process: bool = True,
add_encoder: bool = True,
add_decoder: bool = True,
pg_collection: megatron.core.process_groups_config.ProcessGroupCollection = None,
)#

Bases: megatron.core.transformer.MegatronModule

Qwen3VL multi-modal model.

Parameters:
  • language_transformer_config (TransformerConfig) – Transformer config for the language model.

  • language_transformer_layer_spec (ModuleSpec) – Specifies the module to use for the transformer layers of the language model.

  • vision_transformer_config (Qwen3VLConfigHF) – HF config for the vision model.

  • parallel_output (bool) – Do not gather the outputs, keep them split across tensor parallel ranks. This is typically True for training and False for inference.

  • language_rotary_percent (float) – Percent of rotary dimension to use for rotary position embeddings in the language model. Defaults to 1.0.

  • pre_process (bool) – Include the embedding layer in the gpt decoder (used with pipeline parallelism). Defaults to True.

  • post_process (bool) – Include an output layer and a layernorm in the gpt decoder (used with pipeline parallelism). Defaults to True.

  • add_encoder (bool) – Construct the encoder module (used with pipeline parallelism). Defaults to True. When we use pipelining, the encoder will live on only a subset of the pipeline stages (specifically, only the first stage).

  • add_decoder (bool) – Construct the decoder module (used with pipeline parallelism). Defaults to True. When we use pipelining, the decoder will live on only a subset of the pipeline stages (specifically, every stage after the first one).
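The add_encoder / add_decoder placement described above can be sketched as a small helper. This is a hypothetical illustration; in practice these flags are chosen by Megatron's pipeline-parallel setup, not computed by the model itself:

```python
def module_flags(pipeline_rank: int, pipeline_size: int) -> tuple[bool, bool]:
    """Illustrative sketch of which modules a pipeline stage constructs.

    Per the docstring above: with pipelining, the vision encoder lives only
    on the first stage, and the language decoder lives on every stage after
    the first one. Without pipelining, a single rank holds both.
    """
    if pipeline_size == 1:
        return True, True  # (add_encoder, add_decoder)
    add_encoder = pipeline_rank == 0
    add_decoder = pipeline_rank > 0
    return add_encoder, add_decoder
```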

Initialization

shared_embedding_or_output_weight()#

This is a convenience method to surface the language model’s word embeddings, which is necessary for finalize_model_grads._allreduce_word_embedding_grads.

set_input_tensor(input_tensor) → None#
freeze(
freeze_language_model: bool,
freeze_vision_model: bool,
freeze_vision_projection: bool,
)#

Freeze model modules.

Make specific modules non-trainable by setting requires_grad to False for the module’s parameters.

Parameters:
  • freeze_language_model (bool) – Freeze the language model module.

  • freeze_vision_model (bool) – Freeze the vision model module.

  • freeze_vision_projection (bool) – Freeze the vision projection module.

forward(
input_ids: torch.Tensor,
position_ids: torch.Tensor = None,
attention_mask: torch.Tensor = None,
labels: torch.Tensor = None,
loss_mask: torch.Tensor = None,
inference_params: megatron.core.InferenceParams = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
extra_block_kwargs: dict = None,
pixel_values: torch.Tensor = None,
pixel_values_videos: torch.Tensor = None,
image_grid_thw: torch.Tensor = None,
video_grid_thw: torch.Tensor = None,
image_input_mask: torch.Tensor = None,
video_input_mask: torch.Tensor = None,
cp_img_num: list[int] = None,
images_padded: list[bool] = None,
inference_context: object | None = None,
runtime_gather_output: bool | None = None,
**kwargs,
) → torch.Tensor#

Forward function of the Qwen3VL model.

Note: there is a workaround here to support sequence packing together with context parallelism. A context-parallel split applied to a packed sequence would lose vision-token information, so the original input_ids are kept and packed only after the vision embeddings have been computed (cooperating with verl's models/mcore/model_forward.py). The combined_embeddings are packed to THD at this point; packed_seq_params is checked for None to decide whether packing is needed.

This function expects position_ids and attention_mask in BSHD format, whether or not sequence packing is used.
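The "keep the original input_ids, merge vision embeddings, then pack" idea can be illustrated with a toy scatter. This is a pure-Python stand-in, not the model's implementation: the real model uses image_input_mask / video_input_mask to place vision-encoder outputs into the combined embedding sequence before any THD packing:

```python
def merge_vision_embeddings(text_embeddings, vision_embeddings, image_input_mask):
    """Replace masked positions in the text embedding sequence with vision
    embeddings, in order. Plain lists stand in for tensors."""
    # There must be exactly one vision embedding per masked position.
    assert sum(image_input_mask) == len(vision_embeddings)
    merged, vis_iter = [], iter(vision_embeddings)
    for emb, is_image in zip(text_embeddings, image_input_mask):
        merged.append(next(vis_iter) if is_image else emb)
    return merged
```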

Parameters:
  • image_data (torch.Tensor) – Input image of shape [total_thw_size, n_features].

  • input_ids (torch.Tensor) – Input text ids [batch, text_seq_len].

  • position_ids (torch.Tensor) – Input text position ids [batch, text_seq_len].

  • attention_mask (torch.Tensor) – Attention mask for the language model [batch, 1, combined_seq_len, combined_seq_len].

  • labels (torch.Tensor) – Optional target text labels [batch, combined_seq_len].

  • inference_params (InferenceParams) – Inference-time parameters including the KV cache.

Returns:

Loss of shape [b, s] if labels are provided, otherwise logits of shape [b, s, vocab_size].

Return type:

output (torch.Tensor)