bridge.models.qwen_vl.modelling_qwen3_vl.text_model#

Copied from https://github.com/Thaurun/mbridge/blob/4462d1e284626d2ed9d3e3e3e5a40f2ee42a2c74/mbridge/models/qwen3_vl/gpt_model.py

Module Contents#

Classes#

Qwen3VLGPTModel

Qwen3-VL GPT model with vision-language capabilities.

API#

class bridge.models.qwen_vl.modelling_qwen3_vl.text_model.Qwen3VLGPTModel(
config: megatron.bridge.models.transformer_config.TransformerConfig,
transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
vocab_size: int,
max_sequence_length: int,
pre_process: bool = True,
post_process: bool = True,
fp16_lm_cross_entropy: bool = False,
parallel_output: bool = True,
share_embeddings_and_output_weights: bool = False,
position_embedding_type: Literal['learned_absolute', 'rope', 'mrope', 'none'] = 'learned_absolute',
rotary_percent: float = 1.0,
rotary_base: int = 10000,
rope_scaling: bool = False,
rope_scaling_factor: float = 8.0,
scatter_embedding_sequence_parallel: bool = True,
seq_len_interpolation_factor: Optional[float] = None,
mtp_block_spec: Optional[megatron.core.transformer.spec_utils.ModuleSpec] = None,
vp_stage: Optional[int] = None,
)#

Bases: megatron.core.models.gpt.gpt_model.GPTModel

Qwen3-VL GPT model with vision-language capabilities.

Initialization
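
The following is a minimal construction sketch, not a documented recipe: the config field values, the vocabulary size, the layer-spec helper, and the exact import paths (taken from this page's title and the signature above) are assumptions made for illustration.

```python
# Minimal construction sketch; values are illustrative, not Qwen3-VL defaults.
# Assumes the bridge TransformerConfig exposes the same core fields as
# megatron.core's TransformerConfig, and that the module is importable under
# the path shown in this page's title.
from megatron.core.models.gpt.gpt_layer_specs import (
    get_gpt_layer_with_transformer_engine_spec,
)
from megatron.bridge.models.transformer_config import TransformerConfig
from bridge.models.qwen_vl.modelling_qwen3_vl.text_model import Qwen3VLGPTModel

config = TransformerConfig(
    num_layers=2,             # toy depth, just for the sketch
    hidden_size=128,
    num_attention_heads=4,
)

model = Qwen3VLGPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_with_transformer_engine_spec(),
    vocab_size=151_936,               # illustrative Qwen3 vocabulary size
    max_sequence_length=4096,
    position_embedding_type="mrope",  # multimodal RoPE, typical for Qwen-VL
)
```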

forward(
input_ids: torch.Tensor,
position_ids: torch.Tensor,
attention_mask: torch.Tensor,
decoder_input: torch.Tensor = None,
labels: torch.Tensor = None,
inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
extra_block_kwargs: dict = None,
runtime_gather_output: Optional[bool] = None,
*,
inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
loss_mask: Optional[torch.Tensor] = None,
visual_pos_masks: Optional[torch.Tensor] = None,
deepstack_visual_embeds: Optional[list[torch.Tensor]] = None,
) → torch.Tensor#

Forward function of the GPT model. This function passes the input tensors through the embedding layer, then the decoder, and finally into the post-processing layer (optional).

The forward pass is overridden to add support for deepstack visual embeddings.

It returns the loss values if labels are given; otherwise it returns the final hidden units.

Parameters:

runtime_gather_output (bool) – Gather output at runtime. Default is None, which means the parallel_output argument from the constructor will be used.
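
The sketch below shows one way the deepstack-related keyword arguments might be supplied in a forward call. The tensor shapes, the assumed layout of visual tokens, the three deepstack levels, and the use of a None attention mask are illustrative assumptions, not the library's contract.

```python
import torch

# Illustrative forward call with deepstack visual embeddings; shapes and the
# three-level deepstack depth are assumptions made for this sketch.
batch, seq_len = 2, 128
hidden = model.config.hidden_size

input_ids = torch.randint(0, 151_936, (batch, seq_len), device="cuda")
position_ids = torch.arange(seq_len, device="cuda").unsqueeze(0).expand(batch, -1)

# Boolean mask marking which token positions hold visual tokens
# (assumed layout: the first 32 positions of every sample are image tokens).
visual_pos_masks = torch.zeros(batch, seq_len, dtype=torch.bool, device="cuda")
visual_pos_masks[:, :32] = True

# One tensor of vision-encoder hidden states per deepstack level,
# one row per masked visual position across the batch.
deepstack_visual_embeds = [
    torch.randn(int(visual_pos_masks.sum()), hidden, device="cuda")
    for _ in range(3)
]

output = model(
    input_ids=input_ids,
    position_ids=position_ids,
    attention_mask=None,  # fall back to the model's default causal masking
    visual_pos_masks=visual_pos_masks,
    deepstack_visual_embeds=deepstack_visual_embeds,
)
# With labels=None the call returns the final hidden units rather than a loss.
```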