bridge.models.qwen_vl.modelling_qwen3_vl.text_model#
Copied from https://github.com/Thaurun/mbridge/blob/4462d1e284626d2ed9d3e3e3e5a40f2ee42a2c74/mbridge/models/qwen3_vl/gpt_model.py
Module Contents#
Classes#
| Qwen3VLGPTModel | Qwen3-VL GPT model with vision-language capabilities. |
API#
- class bridge.models.qwen_vl.modelling_qwen3_vl.text_model.Qwen3VLGPTModel(
- config: megatron.bridge.models.transformer_config.TransformerConfig,
- transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
- vocab_size: int,
- max_sequence_length: int,
- pre_process: bool = True,
- post_process: bool = True,
- fp16_lm_cross_entropy: bool = False,
- parallel_output: bool = True,
- share_embeddings_and_output_weights: bool = False,
- position_embedding_type: Literal['learned_absolute', 'rope', 'mrope', 'none'] = 'learned_absolute',
- rotary_percent: float = 1.0,
- rotary_base: int = 10000,
- rope_scaling: bool = False,
- rope_scaling_factor: float = 8.0,
- scatter_embedding_sequence_parallel: bool = True,
- seq_len_interpolation_factor: Optional[float] = None,
- mtp_block_spec: Optional[megatron.core.transformer.spec_utils.ModuleSpec] = None,
- vp_stage: Optional[int] = None,
- )
Bases:
megatron.core.models.gpt.gpt_model.GPTModel

Qwen3-VL GPT model with vision-language capabilities.
Initialization
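The constructor mirrors megatron-core's `GPTModel`. Below is a minimal construction sketch, not the library's own recipe: the config sizes, vocabulary size, and the generic megatron-core layer-spec helper are illustrative assumptions, and a real run additionally requires megatron-core model-parallel state to be initialized first.

```python
# Minimal construction sketch (illustrative values, not values from this module).
from megatron.bridge.models.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from bridge.models.qwen_vl.modelling_qwen3_vl.text_model import Qwen3VLGPTModel

# NOTE: megatron-core model-parallel state must already be initialized before
# building the model; that setup is omitted here.
config = TransformerConfig(
    num_layers=2,            # toy sizes, for illustration only
    hidden_size=128,
    num_attention_heads=4,
)

model = Qwen3VLGPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),  # assumed generic decoder-layer spec
    vocab_size=32_000,       # placeholder vocabulary size
    max_sequence_length=4096,
    position_embedding_type="mrope",  # one of the supported options listed above
)
```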
- forward(
- input_ids: torch.Tensor,
- position_ids: torch.Tensor,
- attention_mask: torch.Tensor,
- decoder_input: torch.Tensor = None,
- labels: torch.Tensor = None,
- inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
- packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
- extra_block_kwargs: dict = None,
- runtime_gather_output: Optional[bool] = None,
- *,
- inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
- loss_mask: Optional[torch.Tensor] = None,
- visual_pos_masks: Optional[torch.Tensor] = None,
- deepstack_visual_embeds: Optional[list[torch.Tensor]] = None,
- )
Forward function of the GPT model. This function passes the input tensors through the embedding layer, then the decoder, and finally into the post-processing layer (optional).
The forward pass is overridden to add support for deepstack visual embeddings.
It returns the loss values if labels are given, or the final hidden units otherwise.
- Parameters:
runtime_gather_output (bool) – Gather output at runtime. Default None means the `parallel_output` arg in the constructor will be used.
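A hedged usage sketch of the overridden forward follows. The tensor shapes, the number of deepstack levels, and the assumption that each deepstack tensor holds one row per masked visual position are placeholders for illustration; in practice the inputs come from the tokenizer and the vision encoder, and the multimodal rope variant may expect a different `position_ids` layout.

```python
# Illustrative call into the overridden forward; shapes and deepstack layout are assumptions.
import torch

batch, seq_len, hidden = 1, 16, model.config.hidden_size

input_ids = torch.randint(0, 32_000, (batch, seq_len))
position_ids = torch.arange(seq_len).unsqueeze(0).expand(batch, -1)
attention_mask = None  # assume the model builds its causal mask internally

# Mark which token positions hold visual tokens (assumed layout).
visual_pos_masks = torch.zeros(batch, seq_len, dtype=torch.bool)
visual_pos_masks[:, 2:10] = True

# One tensor per deepstack level, covering the masked visual positions.
num_visual = int(visual_pos_masks.sum())
deepstack_visual_embeds = [torch.randn(num_visual, hidden) for _ in range(3)]

output = model(
    input_ids=input_ids,
    position_ids=position_ids,
    attention_mask=attention_mask,
    visual_pos_masks=visual_pos_masks,
    deepstack_visual_embeds=deepstack_visual_embeds,
)
# Per the docstring above: with `labels` passed, the return is the loss;
# without labels it is the final hidden units.
```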