bridge.models.qwen_vl.modelling_qwen3_vl.model#
Module Contents#
Classes#
Qwen3VLModel – Qwen3VL multi-modal model.
API#
- class bridge.models.qwen_vl.modelling_qwen3_vl.model.Qwen3VLModel(
- language_transformer_config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
- language_transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
- vision_transformer_config: transformers.models.qwen3_vl.configuration_qwen3_vl.Qwen3VLConfig,
- parallel_output: bool = True,
- pre_process: bool = True,
- post_process: bool = True,
- add_encoder: bool = True,
- add_decoder: bool = True,
- )#
Bases:
megatron.core.transformer.MegatronModule

Qwen3VL multi-modal model.
- Parameters:
language_transformer_config (TransformerConfig) – Transformer config for the language model.
language_transformer_layer_spec (ModuleSpec) – Specifies the module to use for the transformer layers of the language model.
vision_transformer_config (Qwen3VLConfig) – Transformer config for the vision model, copied from the HF config.
parallel_output (bool) – Do not gather the outputs; keep them split across tensor parallel ranks. This is typically True for training and False for inference.
language_rotary_percent (float) – Percent of the rotary dimension to use for rotary position embeddings in the language model. Defaults to 1.0.
pre_process (bool) – Include the embedding layer in the GPT decoder (used with pipeline parallelism). Defaults to True.
post_process (bool) – Include an output layer and a layernorm in the GPT decoder (used with pipeline parallelism). Defaults to True.
add_encoder (bool) – Construct the encoder module (used with pipeline parallelism). Defaults to True. With pipelining, the encoder lives on only a subset of the pipeline stages (specifically, only the first stage).
add_decoder (bool) – Construct the decoder module (used with pipeline parallelism). Defaults to True. With pipelining, the decoder lives on only a subset of the pipeline stages (specifically, every stage after the first one).
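Putting the constructor parameters together, here is a minimal construction sketch. It assumes Megatron's model-parallel state is already initialized, uses megatron.core's standard get_gpt_layer_with_transformer_engine_spec helper for the layer spec, and treats the Qwen3VLTransformerConfig field values and the HF checkpoint name as illustrative placeholders rather than a verified recipe.

```python
# Hypothetical construction sketch; assumes megatron.core.parallel_state
# has been initialized (e.g. via initialize_model_parallel).
from transformers import AutoConfig

from megatron.core.models.gpt.gpt_layer_specs import (
    get_gpt_layer_with_transformer_engine_spec,
)
from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.model import Qwen3VLModel
from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config import (
    Qwen3VLTransformerConfig,
)

# The vision config is taken verbatim from the HF checkpoint ("copied from
# the HF config" above); the checkpoint name here is a placeholder.
hf_config = AutoConfig.from_pretrained("Qwen/Qwen3-VL-<size>-Instruct")

# Language-side config; the field values are illustrative, and the fields
# shown are the standard megatron.core TransformerConfig ones.
language_config = Qwen3VLTransformerConfig(
    num_layers=36,
    hidden_size=4096,
    num_attention_heads=32,
)

model = Qwen3VLModel(
    language_transformer_config=language_config,
    language_transformer_layer_spec=get_gpt_layer_with_transformer_engine_spec(),
    vision_transformer_config=hf_config,
    parallel_output=True,  # keep logits split across tensor-parallel ranks
    pre_process=True,
    post_process=True,
    add_encoder=True,
    add_decoder=True,
)
```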
Initialization
- shared_embedding_or_output_weight()#
This is a convenience method to surface the language model's word embeddings, which is necessary for finalize_model_grads._allreduce_word_embedding_grads.
- set_input_tensor(input_tensor) → None#
- freeze(
- freeze_language_model: bool,
- freeze_vision_model: bool,
- freeze_vision_projection: bool,
- )#
Freeze model modules.
Make specific modules non-trainable by setting requires_grad to False.
- Parameters:
freeze_language_model (bool) – Freeze the language model module.
freeze_vision_model (bool) – Freeze the vision model module (patch_embed, blocks, pos_embed).
freeze_vision_projection (bool) – Freeze the vision projection modules (merger and deepstack_merger_list).
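For example, to fine-tune only the vision projection while keeping both backbones fixed, a call might look like the following sketch (the sanity check at the end is plain PyTorch, not part of this API):

```python
# Train only the vision projection (merger and deepstack_merger_list);
# both the language and vision backbones stay fixed.
model.freeze(
    freeze_language_model=True,
    freeze_vision_model=True,
    freeze_vision_projection=False,
)

# Optional sanity check: count parameter tensors that still require grads.
num_trainable = sum(1 for p in model.parameters() if p.requires_grad)
print(f"trainable parameter tensors: {num_trainable}")
```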
- forward(
- input_ids: torch.Tensor,
- position_ids: torch.Tensor = None,
- attention_mask: torch.Tensor = None,
- labels: torch.Tensor = None,
- inference_params: megatron.core.InferenceParams = None,
- packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
- extra_block_kwargs: dict = None,
- pixel_values: torch.Tensor = None,
- pixel_values_videos: torch.Tensor = None,
- image_grid_thw: torch.Tensor = None,
- video_grid_thw: torch.Tensor = None,
- image_input_mask: torch.Tensor = None,
- )#
Forward function of the Qwen3VL model.
- Parameters:
image_data (torch.Tensor) – input image of shape [total_thw_size, n_features].
input_ids (torch.Tensor) – input text ids [batch, text_seq_len].
position_ids (torch.Tensor) – input text position ids [batch, text_seq_len].
attention_mask (torch.Tensor) – attention mask for the language model [batch, 1, combined_seq_len, combined_seq_len].
labels (torch.Tensor) – optional target text labels [batch, combined_seq_len].
inference_params (InferenceParams) – inference-time parameters including KV cache.
video_start_index – 0 means all inputs are videos, len(video_seq) means all inputs are images, and any other value means a mixture of the two.
*_input_mask – must not be None on the first pipeline-parallel stage.
- Returns:
Loss of shape [b, s] if labels are provided, otherwise logits of shape [b, s, vocab_size].
- Return type:
output (torch.Tensor)
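A usage sketch of a training-style forward pass with a single image, assuming the model constructed above. The tensor shapes follow the parameter list; in practice pixel_values, image_grid_thw, and image_input_mask come from the HF Qwen3-VL processor, and the feature width 1176 used here is an illustrative placeholder.

```python
import torch

batch, seq_len = 1, 128
device = "cuda"

input_ids = torch.randint(0, 1000, (batch, seq_len), device=device)
position_ids = torch.arange(seq_len, device=device).unsqueeze(0)
labels = input_ids.clone()

# Dummy image features: [total_thw_size, n_features] plus the (t, h, w)
# patch grid for the single image; real values come from the HF processor.
image_grid_thw = torch.tensor([[1, 16, 16]], device=device)
pixel_values = torch.randn(int(image_grid_thw.prod()), 1176, device=device)

# Marks which token positions are image placeholders; must be provided on
# the first pipeline-parallel stage (see the note above).
image_input_mask = torch.zeros(batch, seq_len, dtype=torch.bool, device=device)

loss = model(
    input_ids=input_ids,
    position_ids=position_ids,
    attention_mask=None,  # assuming causal masking is applied internally
    labels=labels,
    pixel_values=pixel_values,
    image_grid_thw=image_grid_thw,
    image_input_mask=image_input_mask,
)
# With labels provided, the output is a per-token loss of shape [b, s];
# without labels it is logits of shape [b, s, vocab_size].
```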