bridge.models.qwen_vl.qwen35_vl_provider#
Qwen3.5 VL Model Provider configurations for Megatron-Core.
Qwen3.5 is a family of vision-language models that combine:
A hybrid Gated DeltaNet (GDN) + Gated Attention language model (like Qwen3-Next)
A vision encoder (similar to Qwen3-VL)
Dense MLP or Mixture of Experts (MoE) with shared experts
This module provides two model providers:
Qwen35VLModelProvider: Dense variant (e.g., Qwen3.5-27B). Reference: https://huggingface.co/Qwen/Qwen3.5-27B
Qwen35VLMoEModelProvider: MoE variant (e.g., Qwen3.5-397B-A17B). Reference: https://huggingface.co/Qwen/Qwen3.5-397B-A17B
Module Contents#
Classes#
Qwen35VLModelProvider: Model provider for Qwen3.5 VL Dense (Vision-Language) Models.
Qwen35VLMoEModelProvider: Model provider for Qwen3.5 VL MoE (Vision-Language) Models.
Functions#
_check_qwen3_5_available: Raise a clear error if transformers doesn’t have qwen3_5 (dense) support.
_check_qwen3_5_moe_available: Raise a clear error if transformers doesn’t have qwen3_5_moe support.
_patch_standard_attention_specs: Selectively replace the self_attention module on standard attention layer specs.
Data#
API#
- bridge.models.qwen_vl.qwen35_vl_provider._TRANSFORMERS_HAS_QWEN3_5_MOE#
None
- bridge.models.qwen_vl.qwen35_vl_provider._check_qwen3_5_available() → None#
Raise a clear error if transformers doesn’t have qwen3_5 (dense) support.
- bridge.models.qwen_vl.qwen35_vl_provider._check_qwen3_5_moe_available() → None#
Raise a clear error if transformers doesn’t have qwen3_5_moe support.
- class bridge.models.qwen_vl.qwen35_vl_provider.Qwen35VLModelProvider#
Bases:
megatron.bridge.models.gpt_provider.GPTModelProvider
Model provider for Qwen3.5 VL Dense (Vision-Language) Models.
Qwen3.5 dense combines a hybrid GDN (Gated DeltaNet) + Gated Attention language model architecture with a vision encoder (similar to Qwen3-VL) and a standard dense MLP (no Mixture of Experts).
Key Architecture Details (27B):
64 layers: 16 groups × (3 GDN + 1 Attention)
Hidden dim: 5120, Intermediate dim: 17408
GDN: 16 QK heads, 48 V heads, head_dim=128
Gated Attention: 24 Q heads, 4 KV heads, head_dim=256
Vision: depth=27, hidden=1152, no deepstack
mRoPE with sections [11, 11, 10], rope_theta=10,000,000
partial_rotary_factor=0.25
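The 16-group layout above follows from linear_attention_freq = 4: every fourth layer uses standard gated attention, and the rest use GDN. A minimal sketch of that layer pattern in plain Python (illustrative only; the actual spec construction lives in Megatron-Core, and the interpretation of linear_attention_freq as "one attention layer per group of this size" is an assumption consistent with the 16 × (3 GDN + 1 Attention) layout):

```python
# Sketch: derive the hybrid layer pattern for the 27B dense config.
NUM_LAYERS = 64
LINEAR_ATTENTION_FREQ = 4  # one standard-attention layer per group of 4

def layer_types(num_layers, freq):
    """Return 'gdn' or 'attention' for each layer index."""
    return [
        "attention" if (i + 1) % freq == 0 else "gdn"
        for i in range(num_layers)
    ]

pattern = layer_types(NUM_LAYERS, LINEAR_ATTENTION_FREQ)
assert pattern.count("attention") == 16            # one per group
assert pattern.count("gdn") == 48                  # three per group
assert pattern[:4] == ["gdn", "gdn", "gdn", "attention"]
```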
- transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec | Callable[[megatron.bridge.models.gpt_provider.GPTModelProvider], megatron.core.transformer.spec_utils.ModuleSpec]#
None
- layernorm_zero_centered_gamma: bool#
True
- attention_output_gate: bool#
True
- experimental_attention_variant: str#
‘gated_delta_net’
- linear_attention_freq: int | list[int]#
4
- linear_conv_kernel_dim: int#
4
- linear_key_head_dim: int#
128
- linear_value_head_dim: int#
128
- linear_num_key_heads: int#
16
- linear_num_value_heads: int#
48
- normalization: str#
‘RMSNorm’
- gated_linear_unit: bool#
True
- add_bias_linear: bool#
False
- add_qkv_bias: bool#
False
- qk_layernorm: bool#
True
- kv_channels: int | None#
256
- num_query_groups: int#
4
- hidden_dropout: float#
0.0
- attention_dropout: float#
0.0
- attention_softmax_in_fp32: bool#
True
- rotary_base: float#
10000000.0
- rotary_percent: float#
0.25
- seq_length: int#
262144
- vision_config: Any#
‘field(…)’
- position_embedding_type: str#
‘mrope’
- mrope_section: List[int]#
‘field(…)’
- apply_rotary_pos_emb_in_fp32: bool#
False
- image_token_id: int#
248056
- video_token_id: int#
248057
- vision_start_token_id: int#
248053
- vision_end_token_id: int#
248054
- bos_token_id: int#
248045
- eos_token_id: int#
248044
- spatial_merge_size: int#
2
- temporal_patch_size: int#
2
- patch_size: int#
16
- language_max_sequence_length: int#
2048
- scatter_embedding_sequence_parallel: bool#
False
- freeze_language_model: bool#
False
- freeze_vision_model: bool#
False
- freeze_vision_projection: bool#
False
- bias_activation_fusion: bool#
True
- use_hf_vision_model: bool#
False
- vision_dp_when_cp: bool#
False
- hetereogenous_dist_checkpoint: bool#
True
- mtp_num_layers: Optional[int]#
None
- __post_init__()#
- provide(pre_process=None, post_process=None, vp_stage=None)#
Provide a Qwen3.5 VL dense model instance with vision and language components.
- provide_language_model(pre_process=None, post_process=None, vp_stage=None)#
Provide just the language model component without vision.
- class bridge.models.qwen_vl.qwen35_vl_provider.Qwen35VLMoEModelProvider#
Bases:
megatron.bridge.models.gpt_provider.GPTModelProvider
Model provider for Qwen3.5 VL MoE (Vision-Language) Models.
Qwen3.5 combines a hybrid GDN (Gated DeltaNet) + Gated Attention language model architecture (like Qwen3-Next) with a vision encoder (similar to Qwen3-VL) and Mixture of Experts (MoE) with shared experts.
Key Architecture Details (397B-A17B):
60 layers: 15 groups × (3 GDN-MoE + 1 Attention-MoE)
Hidden dim: 4096, Token Embedding: 248320
GDN: 16 QK heads, 64 V heads, head_dim=128
Gated Attention: 32 Q heads, 2 KV heads, head_dim=256
MoE: 512 experts, 10 routed + 1 shared, expert dim=1024
mRoPE with sections [11, 11, 10], rope_theta=10,000,000
partial_rotary_factor=0.25
Note: num_query_groups corresponds to num_key_value_heads in HF config (for standard Gated Attention layers). GDN layers have separate head counts.
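The mrope_section values above are consistent with the partial-rotary setup: with head_dim = 256 (kv_channels) and partial_rotary_factor = 0.25, 64 dimensions receive rotary embeddings, i.e. 32 cos/sin frequency pairs, which the three sections split as 11 + 11 + 10 across the temporal/height/width axes. A quick arithmetic check (an illustrative sketch assuming the Qwen2-VL-style convention that mRoPE sections partition the frequency pairs, not library code):

```python
# Check that the mRoPE sections exactly cover the rotary frequency pairs.
HEAD_DIM = 256                  # kv_channels
PARTIAL_ROTARY_FACTOR = 0.25    # rotary_percent
MROPE_SECTION = [11, 11, 10]    # temporal, height, width

rotary_dim = int(HEAD_DIM * PARTIAL_ROTARY_FACTOR)  # 64 rotary dims
freq_pairs = rotary_dim // 2                        # 32 cos/sin pairs
assert sum(MROPE_SECTION) == freq_pairs
```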
- transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec | Callable[[megatron.bridge.models.gpt_provider.GPTModelProvider], megatron.core.transformer.spec_utils.ModuleSpec]#
None
- layernorm_zero_centered_gamma: bool#
True
- attention_output_gate: bool#
True
- experimental_attention_variant: str#
‘gated_delta_net’
- linear_attention_freq: int | list[int]#
4
- linear_conv_kernel_dim: int#
4
- linear_key_head_dim: int#
128
- linear_value_head_dim: int#
128
- linear_num_key_heads: int#
16
- linear_num_value_heads: int#
64
- num_moe_experts: int#
512
- moe_router_topk: int#
10
True
- moe_router_dtype: str#
‘fp32’
- moe_router_load_balancing_type: str#
‘global_aux_loss’
- moe_router_pre_softmax: bool#
False
- moe_grouped_gemm: bool#
True
- moe_token_dispatcher_type: str#
‘alltoall’
- moe_permute_fusion: bool#
True
- moe_aux_loss_coeff: float#
0.001
- normalization: str#
‘RMSNorm’
- gated_linear_unit: bool#
True
- add_bias_linear: bool#
False
- add_qkv_bias: bool#
False
- qk_layernorm: bool#
True
- kv_channels: int | None#
256
- num_query_groups: int#
2
- hidden_dropout: float#
0.0
- attention_dropout: float#
0.0
- attention_softmax_in_fp32: bool#
True
- rotary_base: float#
10000000.0
- rotary_percent: float#
0.25
- seq_length: int#
262144
- vision_config: Any#
‘field(…)’
- position_embedding_type: str#
‘mrope’
- mrope_section: List[int]#
‘field(…)’
- apply_rotary_pos_emb_in_fp32: bool#
False
- image_token_id: int#
248056
- video_token_id: int#
248057
- vision_start_token_id: int#
248053
- vision_end_token_id: int#
248054
- bos_token_id: int#
248045
- eos_token_id: int#
248046
- spatial_merge_size: int#
2
- temporal_patch_size: int#
2
- patch_size: int#
16
- language_max_sequence_length: int#
2048
- scatter_embedding_sequence_parallel: bool#
False
- freeze_language_model: bool#
False
- freeze_vision_model: bool#
False
- freeze_vision_projection: bool#
False
- bias_activation_fusion: bool#
True
- use_hf_vision_model: bool#
False
- vision_dp_when_cp: bool#
False
- hetereogenous_dist_checkpoint: bool#
True
- mtp_num_layers: Optional[int]#
None
- __post_init__()#
- provide(pre_process=None, post_process=None, vp_stage=None)#
Provide a Qwen3.5 VL model instance with vision and language components.
Qwen3.5 uses a hybrid architecture (GDN + standard attention). The key challenge is that Qwen3VLModel.__init__ does:
language_transformer_layer_spec.submodules.self_attention.module = Qwen3VLSelfAttention
which assumes a single ModuleSpec and patches ALL layers uniformly. For Qwen3.5, only the standard attention layers (every 4th layer) should get the Qwen3VLSelfAttention override; GDN layers must be left alone.
Solution: build the hybrid TransformerBlockSubmodules spec, selectively patch only the standard attention layer specs, then pass it to Qwen3VLModel. Because GPTModel → TransformerBlock already accepts TransformerBlockSubmodules, we just need to bypass the uniform patch in Qwen3VLModel.__init__ by calling MegatronModule.__init__ directly and constructing the internals ourselves.
- provide_language_model(pre_process=None, post_process=None, vp_stage=None)#
Provide just the language model component without vision.
- bridge.models.qwen_vl.qwen35_vl_provider._patch_standard_attention_specs(block_spec: megatron.core.transformer.transformer_block.TransformerBlockSubmodules, attention_cls)#
Selectively replace the self_attention module on standard attention layer specs.
In a hybrid block spec, each layer spec has a different self_attention submodule:
Standard attention layers have a SelfAttention-like module.
GDN layers have a GatedDeltaNet-like module.
This function patches only the standard attention layers with attention_cls (e.g. Qwen3VLSelfAttention for mRoPE support), leaving GDN layers unchanged.
Detection heuristic: GDN layer specs have GatedDeltaNet (or similar) as the self_attention module, which does NOT have a linear_qkv submodule. Standard attention specs DO have linear_qkv. We use this to distinguish them.
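The heuristic can be illustrated with simplified stand-in spec objects. The classes below are hypothetical minimal mock-ups, not Megatron-Core's actual ModuleSpec/TransformerBlockSubmodules machinery; only the "patch iff the spec has a linear_qkv submodule" logic mirrors the description above:

```python
from dataclasses import dataclass

# Illustrative stand-ins for Megatron-Core's spec objects (not real classes).
@dataclass
class FakeSubmodules:
    linear_qkv: object = None  # present on standard attention, absent on GDN

@dataclass
class FakeSpec:
    module: type
    submodules: object = None

class SelfAttention: ...
class GatedDeltaNet: ...
class Qwen3VLSelfAttention: ...

def patch_standard_attention_specs(attention_specs, attention_cls):
    """Replace `module` only on specs whose submodules expose linear_qkv."""
    for spec in attention_specs:
        sub = getattr(spec, "submodules", None)
        if sub is not None and getattr(sub, "linear_qkv", None) is not None:
            spec.module = attention_cls  # standard attention -> patch
        # GDN specs (no linear_qkv submodule) are left unchanged

specs = [
    FakeSpec(GatedDeltaNet),                                       # GDN layer
    FakeSpec(SelfAttention, FakeSubmodules(linear_qkv=object())),  # attention
]
patch_standard_attention_specs(specs, Qwen3VLSelfAttention)
assert specs[0].module is GatedDeltaNet          # untouched
assert specs[1].module is Qwen3VLSelfAttention   # patched
```

In the real function, the iteration walks block_spec.layer_specs and inspects each layer spec's self_attention entry; the sketch flattens that to keep the detection rule in focus.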