bridge.models.qwen_vl.qwen35_vl_provider#

Qwen3.5 VL Model Provider configurations for Megatron-Core.

Qwen3.5 is a family of vision-language models that combine:

  • A hybrid Gated DeltaNet (GDN) + Gated Attention language model (like Qwen3-Next)

  • A vision encoder (similar to Qwen3-VL)

  • Dense MLP or Mixture of Experts (MoE) with shared experts

This module provides two model providers:

  • Qwen35VLModelProvider: Dense variant (e.g., Qwen3.5-27B) Reference: https://huggingface.co/Qwen/Qwen3.5-27B

  • Qwen35VLMoEModelProvider: MoE variant (e.g., Qwen3.5-397B-A17B) Reference: https://huggingface.co/Qwen/Qwen3.5-397B-A17B

Module Contents#

Classes#

Qwen35VLModelProvider

Model provider for Qwen3.5 VL Dense (Vision-Language) Models.

Qwen35VLMoEModelProvider

Model provider for Qwen3.5 VL MoE (Vision-Language) Models.

Functions#

_check_qwen3_5_available

Raise a clear error if transformers doesn’t have qwen3_5 (dense) support.

_check_qwen3_5_moe_available

Raise a clear error if transformers doesn’t have qwen3_5_moe support.

_patch_standard_attention_specs

Selectively replace the self_attention module on standard attention layer specs.

Data#

API#

bridge.models.qwen_vl.qwen35_vl_provider._TRANSFORMERS_HAS_QWEN3_5_MOE#

None

bridge.models.qwen_vl.qwen35_vl_provider._check_qwen3_5_available() → None#

Raise a clear error if transformers doesn’t have qwen3_5 (dense) support.

bridge.models.qwen_vl.qwen35_vl_provider._check_qwen3_5_moe_available() → None#

Raise a clear error if transformers doesn’t have qwen3_5_moe support.

class bridge.models.qwen_vl.qwen35_vl_provider.Qwen35VLModelProvider#

Bases: megatron.bridge.models.gpt_provider.GPTModelProvider

Model provider for Qwen3.5 VL Dense (Vision-Language) Models.

Qwen3.5 dense combines a hybrid GDN (Gated DeltaNet) + Gated Attention language model architecture with a vision encoder (similar to Qwen3-VL) and a standard dense MLP (no Mixture of Experts).

Key Architecture Details (27B):

  • 64 layers: 16 groups × (3 GDN + 1 Attention)

  • Hidden dim: 5120, Intermediate dim: 17408

  • GDN: 16 QK heads, 48 V heads, head_dim=128

  • Gated Attention: 24 Q heads, 4 KV heads, head_dim=256

  • Vision: depth=27, hidden=1152, no deepstack

  • mRoPE with sections [11, 11, 10], rope_theta=10,000,000

  • partial_rotary_factor=0.25

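The hybrid layer pattern above follows from linear_attention_freq=4: every 4th layer uses standard gated attention, the rest use GDN. A minimal sketch (illustrative only, not the provider's actual code) of deriving that pattern:

```python
# Illustrative sketch: derive the hybrid layer pattern implied by
# linear_attention_freq=4 (every 4th layer is standard gated attention,
# the remaining layers are GDN linear attention).
def hybrid_layer_pattern(num_layers: int, linear_attention_freq: int) -> list:
    return [
        "attention" if (i + 1) % linear_attention_freq == 0 else "gdn"
        for i in range(num_layers)
    ]

pattern = hybrid_layer_pattern(num_layers=64, linear_attention_freq=4)
assert pattern[:4] == ["gdn", "gdn", "gdn", "attention"]  # one group
assert pattern.count("attention") == 16  # 16 groups x (3 GDN + 1 Attention)
assert pattern.count("gdn") == 48
```

The same function reproduces the MoE variant's 60-layer layout (15 groups) when called with num_layers=60.
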
transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec | Callable[[megatron.bridge.models.gpt_provider.GPTModelProvider], megatron.core.transformer.spec_utils.ModuleSpec]#

None

layernorm_zero_centered_gamma: bool#

True

attention_output_gate: bool#

True

experimental_attention_variant: str#

‘gated_delta_net’

linear_attention_freq: int | list[int]#

4

linear_conv_kernel_dim: int#

4

linear_key_head_dim: int#

128

linear_value_head_dim: int#

128

linear_num_key_heads: int#

16

linear_num_value_heads: int#

48

normalization: str#

‘RMSNorm’

gated_linear_unit: bool#

True

add_bias_linear: bool#

False

add_qkv_bias: bool#

False

qk_layernorm: bool#

True

kv_channels: int | None#

256

num_query_groups: int#

4

hidden_dropout: float#

0.0

attention_dropout: float#

0.0

attention_softmax_in_fp32: bool#

True

rotary_base: float#

10000000.0

rotary_percent: float#

0.25

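A quick arithmetic check of the defaults above (an illustrative sketch, assuming mrope_section entries count cos/sin frequency pairs, i.e. half the rotated dimensions, which is the usual mRoPE convention):

```python
# Consistency check (assumption: mrope_section entries count cos/sin
# frequency pairs, i.e. half the rotated dimensions per head).
kv_channels = 256            # per-head dim of the gated attention layers
rotary_percent = 0.25        # partial_rotary_factor in the HF config
mrope_section = [11, 11, 10]

rotary_dim = int(kv_channels * rotary_percent)  # 64 rotated dims per head
assert rotary_dim == 64
assert sum(mrope_section) == rotary_dim // 2    # 32 frequency pairs
```
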
seq_length: int#

262144

vision_config: Any#

‘field(…)’

position_embedding_type: str#

‘mrope’

mrope_section: List[int]#

‘field(…)’

apply_rotary_pos_emb_in_fp32: bool#

False

image_token_id: int#

248056

video_token_id: int#

248057

vision_start_token_id: int#

248053

vision_end_token_id: int#

248054

bos_token_id: int#

248045

eos_token_id: int#

248044

spatial_merge_size: int#

2

temporal_patch_size: int#

2

patch_size: int#

16

language_max_sequence_length: int#

2048

scatter_embedding_sequence_parallel: bool#

False

freeze_language_model: bool#

False

freeze_vision_model: bool#

False

freeze_vision_projection: bool#

False

bias_activation_fusion: bool#

True

use_hf_vision_model: bool#

False

vision_dp_when_cp: bool#

False

hetereogenous_dist_checkpoint: bool#

True

mtp_num_layers: Optional[int]#

None

__post_init__()#

provide(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.bridge.models.qwen_vl.modelling_qwen3_vl.model.Qwen3VLModel#

Provide a Qwen3.5 VL dense model instance with vision and language components.

provide_language_model(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.core.models.gpt.GPTModel#

Provide just the language model component without vision.

class bridge.models.qwen_vl.qwen35_vl_provider.Qwen35VLMoEModelProvider#

Bases: megatron.bridge.models.gpt_provider.GPTModelProvider

Model provider for Qwen3.5 VL MoE (Vision-Language) Models.

Qwen3.5 MoE combines a hybrid GDN (Gated DeltaNet) + Gated Attention language model architecture (like Qwen3-Next) with a vision encoder (similar to Qwen3-VL) and a Mixture of Experts (MoE) MLP with shared experts.

Key Architecture Details (397B-A17B):

  • 60 layers: 15 groups × (3 GDN-MoE + 1 Attention-MoE)

  • Hidden dim: 4096, Token Embedding: 248320

  • GDN: 16 QK heads, 64 V heads, head_dim=128

  • Gated Attention: 32 Q heads, 2 KV heads, head_dim=256

  • MoE: 512 experts, 10 routed + 1 shared, expert dim=1024

  • mRoPE with sections [11, 11, 10], rope_theta=10,000,000

  • partial_rotary_factor=0.25

Note: num_query_groups corresponds to num_key_value_heads in HF config (for standard Gated Attention layers). GDN layers have separate head counts.

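The routing behavior implied by the defaults below (num_moe_experts=512, moe_router_topk=10, moe_router_pre_softmax=False, so softmax is applied over the selected k logits rather than all 512) can be sketched in plain Python; this is an illustrative stand-in, not the Megatron-Core router, and the always-active shared expert is omitted:

```python
import math
import random

def topk_softmax_route(logits, k):
    """Pick the top-k experts by logit, then softmax-normalize over
    just those k (moe_router_pre_softmax=False style routing)."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    mx = max(logits[e] for e in top)
    exps = {e: math.exp(logits[e] - mx) for e in top}
    z = sum(exps.values())
    return {e: w / z for e, w in exps.items()}

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(512)]  # one token's router logits
weights = topk_softmax_route(logits, k=10)
assert len(weights) == 10                      # 10 routed experts per token
assert abs(sum(weights.values()) - 1.0) < 1e-9 # weights are normalized
```

In the real model the shared expert runs for every token in addition to the 10 routed experts, gated by moe_shared_expert_gate.
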
transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec | Callable[[megatron.bridge.models.gpt_provider.GPTModelProvider], megatron.core.transformer.spec_utils.ModuleSpec]#

None

layernorm_zero_centered_gamma: bool#

True

attention_output_gate: bool#

True

experimental_attention_variant: str#

‘gated_delta_net’

linear_attention_freq: int | list[int]#

4

linear_conv_kernel_dim: int#

4

linear_key_head_dim: int#

128

linear_value_head_dim: int#

128

linear_num_key_heads: int#

16

linear_num_value_heads: int#

64

num_moe_experts: int#

512

moe_router_topk: int#

10

moe_shared_expert_gate: bool#

True

moe_router_dtype: str#

‘fp32’

moe_router_load_balancing_type: str#

‘global_aux_loss’

moe_router_pre_softmax: bool#

False

moe_grouped_gemm: bool#

True

moe_token_dispatcher_type: str#

‘alltoall’

moe_permute_fusion: bool#

True

moe_aux_loss_coeff: float#

0.001

normalization: str#

‘RMSNorm’

gated_linear_unit: bool#

True

add_bias_linear: bool#

False

add_qkv_bias: bool#

False

qk_layernorm: bool#

True

kv_channels: int | None#

256

num_query_groups: int#

2

hidden_dropout: float#

0.0

attention_dropout: float#

0.0

attention_softmax_in_fp32: bool#

True

rotary_base: float#

10000000.0

rotary_percent: float#

0.25

seq_length: int#

262144

vision_config: Any#

‘field(…)’

position_embedding_type: str#

‘mrope’

mrope_section: List[int]#

‘field(…)’

apply_rotary_pos_emb_in_fp32: bool#

False

image_token_id: int#

248056

video_token_id: int#

248057

vision_start_token_id: int#

248053

vision_end_token_id: int#

248054

bos_token_id: int#

248045

eos_token_id: int#

248046

spatial_merge_size: int#

2

temporal_patch_size: int#

2

patch_size: int#

16

language_max_sequence_length: int#

2048

scatter_embedding_sequence_parallel: bool#

False

freeze_language_model: bool#

False

freeze_vision_model: bool#

False

freeze_vision_projection: bool#

False

bias_activation_fusion: bool#

True

use_hf_vision_model: bool#

False

vision_dp_when_cp: bool#

False

hetereogenous_dist_checkpoint: bool#

True

mtp_num_layers: Optional[int]#

None

__post_init__()#

provide(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.bridge.models.qwen_vl.modelling_qwen3_vl.model.Qwen3VLModel#

Provide a Qwen3.5 VL model instance with vision and language components.

Qwen3.5 uses a hybrid architecture (GDN + standard attention). The key challenge is that Qwen3VLModel.__init__ does:

language_transformer_layer_spec.submodules.self_attention.module = Qwen3VLSelfAttention

which assumes a single ModuleSpec and patches ALL layers uniformly. For Qwen3.5, only the standard attention layers (every 4th layer) should get the Qwen3VLSelfAttention override; GDN layers must be left alone.

Solution: build the hybrid TransformerBlockSubmodules spec, selectively patch only the standard attention layer specs, then pass the result to Qwen3VLModel. Because GPTModel → TransformerBlock already accepts TransformerBlockSubmodules, we only need to bypass the uniform patch in Qwen3VLModel.__init__ by calling MegatronModule.__init__ directly and constructing the internals ourselves.

provide_language_model(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.core.models.gpt.GPTModel#

Provide just the language model component without vision.

bridge.models.qwen_vl.qwen35_vl_provider._patch_standard_attention_specs(
block_spec: megatron.core.transformer.transformer_block.TransformerBlockSubmodules,
attention_cls,
) → None#

Selectively replace the self_attention module on standard attention layer specs.

In a hybrid block spec, each layer spec has a different self_attention submodule:

  • Standard attention layers have a SelfAttention-like module.

  • GDN layers have a GatedDeltaNet-like module.

This function patches only the standard attention layers with attention_cls (e.g. Qwen3VLSelfAttention for mRoPE support), leaving GDN layers unchanged.

Detection heuristic: GDN layer specs have GatedDeltaNet (or similar) as the self_attention module, which does NOT have a linear_qkv submodule. Standard attention specs DO have linear_qkv. We use this to distinguish them.