bridge.models.qwen_vl.qwen3_vl_provider#

Qwen3-VL and Qwen3-VL MoE model provider configurations for Megatron-Core.

This module provides configuration classes for Qwen3-VL and Qwen3-VL MoE (Mixture of Experts) multimodal models, compatible with HuggingFace's Qwen3-VL and Qwen3-VL-MoE model configurations. Reference: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct
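A minimal usage sketch (illustrative only): the import path assumes the module lives under megatron.bridge.models, consistent with the base-class paths shown in this reference, and that any required fields inherited from the base providers are defaulted or supplied by the caller.

```python
# Illustrative sketch, not a definitive recipe. The import path is assumed
# from the base-class paths in this reference (megatron.bridge.models...).
from megatron.bridge.models.qwen_vl.qwen3_vl_provider import (
    Qwen3VLModelProvider,
    Qwen3VLMoEModelProvider,
)

# Dense Qwen3-VL provider with the documented defaults.
dense_provider = Qwen3VLModelProvider()

# MoE provider; pretrained_model_name defaults to "Qwen/Qwen3-VL-30B-A3B-Instruct".
moe_provider = Qwen3VLMoEModelProvider()
```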

Module Contents#

Classes#

Qwen3VLModelProvider

Base model provider for Qwen3-VL models. Inherits language model configuration from Qwen3ModelProvider.

Qwen3VLMoEModelProvider

Base model provider for Qwen3-VL MoE models. Inherits language model MoE configuration from Qwen3MoEModelProvider.

API#

class bridge.models.qwen_vl.qwen3_vl_provider.Qwen3VLModelProvider#

Bases: megatron.bridge.models.Qwen3ModelProvider

Base model provider for Qwen3-VL models. Inherits language model configuration from Qwen3ModelProvider.

Note: num_query_groups in the parent class corresponds to num_key_value_heads in the HF config. The default value of 8 enables GQA (Grouped Query Attention).
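A hedged sketch of the GQA mapping described above; the attribute names come from this reference, and the construction assumes any required fields inherited from the base provider are defaulted or supplied elsewhere.

```python
# num_query_groups (Megatron) plays the role of num_key_value_heads (HF config).
# Illustrative only; defaults for inherited fields are assumed.
provider = Qwen3VLModelProvider(num_query_groups=8)  # GQA: 8 key/value groups

assert provider.head_dim == 128                     # documented default
assert provider.position_embedding_type == "mrope"  # documented default
```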

head_dim: int#

128

hidden_size: int#

2048

language_max_sequence_length: int#

2048

patch_size: int#

14

temporal_patch_size: int#

2

in_channels: int#

3

spatial_merge_size: int#

2

num_position_embeddings: int#

2304

out_hidden_size: int#

2304

apply_rotary_pos_emb_in_fp32: bool#

False

deepstack_visual_indexes: List[int]#

‘field(…)’

fp16_lm_cross_entropy: bool#

False

rotary_percent: float#

1.0

apply_rope_fusion: bool#

False

vision_config: transformers.models.qwen3_vl.configuration_qwen3_vl.Qwen3VLVisionConfig#

‘field(…)’

hf_text_config: Optional[transformers.models.qwen3_vl.configuration_qwen3_vl.Qwen3VLTextConfig]#

None

image_token_id: int#

151655

video_token_id: int#

151656

vision_start_token_id: int#

151652

vision_end_token_id: int#

151653

bos_token_id: int#

151643

eos_token_id: int#

151645

position_embedding_type: str#

‘mrope’

attention_dropout: float#

0.0

attention_softmax_in_fp32: bool#

True

mrope_section: List[int]#

‘field(…)’

rotary_base: float#

5000000.0

scatter_embedding_sequence_parallel: bool#

False

freeze_language_model: bool#

False

freeze_vision_model: bool#

False

freeze_vision_projection: bool#

False

sequence_parallel: bool#

False

qk_layernorm: bool#

True

provide(pre_process=None, post_process=None, vp_stage=None)#

Provide a Qwen3VL model instance with vision and language components.

provide_language_model(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.core.models.gpt.GPTModel#

Provide just the language model component without vision.

Parameters:
  • pre_process – Whether this is the first stage in pipeline parallelism

  • post_process – Whether this is the last stage in pipeline parallelism

  • vp_stage – Virtual pipeline stage number

Returns:

MCoreGPTModel instance (language model only)
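A hedged usage sketch, given an already configured provider instance: provide() builds the full vision + language model, while provide_language_model() returns only the GPT language component; pre_process/post_process mark the first/last pipeline-parallel stage.

```python
# Sketch only: `provider` is assumed to be a configured Qwen3VLModelProvider.
# Full multimodal model (vision + language) for a single-stage setup.
full_model = provider.provide(pre_process=True, post_process=True)

# Language-only components for the first and last pipeline stages.
first_stage_lm = provider.provide_language_model(pre_process=True, post_process=False)
last_stage_lm = provider.provide_language_model(pre_process=False, post_process=True)
```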

class bridge.models.qwen_vl.qwen3_vl_provider.Qwen3VLMoEModelProvider#

Bases: megatron.bridge.models.Qwen3MoEModelProvider

Base model provider for Qwen3-VL MoE models. Inherits language model MoE configuration from Qwen3MoEModelProvider.

Key MoE Parameters (inherited from Qwen3MoEModelProvider):

  • num_moe_experts: Number of total experts (default 128)

  • moe_router_topk: Number of experts selected per token (default 8)

  • moe_router_load_balancing_type: Load balancing strategy (default “aux_loss”)

  • moe_aux_loss_coeff: Auxiliary loss coefficient (default 1e-3)

  • moe_grouped_gemm: Use grouped GEMM for efficiency (default True)

Note: num_query_groups in the parent class corresponds to num_key_value_heads in the HF config.
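An illustrative sketch overriding a few of the documented knobs; the parameter names appear in this reference, the values are examples rather than recommendations, and required inherited fields are assumed to be defaulted or supplied.

```python
# Illustrative overrides of documented MoE and freezing options.
moe_provider = Qwen3VLMoEModelProvider(
    moe_router_topk=8,                     # experts routed per token (documented default)
    moe_aux_loss_coeff=1e-3,               # aux load-balancing loss weight (documented default)
    moe_token_dispatcher_type="alltoall",  # documented default
    freeze_vision_model=False,             # override the documented default (True)
)
```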

vision_config: transformers.models.qwen3_vl.configuration_qwen3_vl.Qwen3VLVisionConfig#

‘field(…)’

hf_text_config: Optional[transformers.models.qwen3_vl_moe.configuration_qwen3_vl_moe.Qwen3VLMoeTextConfig]#

None

pretrained_model_name: str#

‘Qwen/Qwen3-VL-30B-A3B-Instruct’

image_token_id: int#

151655

video_token_id: int#

151656

vision_start_token_id: int#

151652

vision_end_token_id: int#

151653

bos_token_id: int#

151643

eos_token_id: int#

151645

head_dim: int#

128

qk_layernorm: bool#

True

attention_softmax_in_fp32: bool#

True

attention_dropout: float#

0.0

position_embedding_type: str#

‘mrope’

mrope_section: List[int]#

‘field(…)’

rotary_base: float#

5000000.0

spatial_merge_size: int#

2

temporal_patch_size: int#

2

patch_size: int#

16

scatter_embedding_sequence_parallel: bool#

False

moe_router_pre_softmax: bool#

False

moe_router_dtype: str#

‘fp32’

moe_router_score_function: str#

‘softmax’

moe_router_bias_update_rate: float#

0.001

moe_permute_fusion: bool#

True

moe_token_dispatcher_type: str#

‘alltoall’

mlp_only_layers: List[int]#

‘field(…)’

decoder_sparse_step: int#

1

freeze_language_model: bool#

True

freeze_vision_model: bool#

True

freeze_vision_projection: bool#

False

language_max_sequence_length: int#

2048

persist_layer_norm: bool#

True

bias_activation_fusion: bool#

True

bias_dropout_fusion: bool#

True

masked_softmax_fusion: bool#

False

deallocate_pipeline_outputs: bool#

True

async_tensor_model_parallel_allreduce: bool#

True

distribute_saved_activations: bool#

False

cp_comm_type: str#

‘p2p’

finalize() → None#

provide(pre_process=None, post_process=None, vp_stage=None)#

Provide a Qwen3VL MoE model instance with vision and language components.

provide_language_model(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.core.models.gpt.GPTModel#

Provide just the language MoE model component without vision.

Parameters:
  • pre_process – Whether this is the first stage in pipeline parallelism

  • post_process – Whether this is the last stage in pipeline parallelism

  • vp_stage – Virtual pipeline stage number

Returns:

MCoreGPTModel instance (MoE language model only)
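A hedged end-to-end sketch for the MoE provider: call finalize() (documented above; its exact role is not described here and is assumed to validate or derive the configuration), then request only the MoE language component for the last pipeline stage.

```python
# Sketch only: finalize() is assumed to validate/derive the configuration
# before model construction; required inherited fields are assumed defaulted.
moe_provider = Qwen3VLMoEModelProvider()
moe_provider.finalize()

moe_lm = moe_provider.provide_language_model(
    pre_process=False,  # not the first pipeline-parallel stage
    post_process=True,  # last stage: includes the output/loss head
    vp_stage=None,      # no virtual pipeline stage
)
```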