bridge.models.qwen_omni.qwen25_omni_provider#

Qwen2.5 Omni Model Provider configurations for Megatron-Core.

This module provides configuration classes for Qwen2.5 Omni multimodal models (audio+vision+text), compatible with HuggingFace's Qwen2.5-Omni model configurations. Reference: https://huggingface.co/Qwen/Qwen2.5-Omni-7B

Module Contents#

Classes#

Qwen25OmniModelProvider

Base model provider for Qwen2.5 Omni Models. Inherits language model configuration from Qwen2ModelProvider (dense, Qwen2 architecture).

API#

class bridge.models.qwen_omni.qwen25_omni_provider.Qwen25OmniModelProvider#

Bases: megatron.bridge.models.Qwen2ModelProvider

Base model provider for Qwen2.5 Omni Models. Inherits language model configuration from Qwen2ModelProvider (dense, Qwen2 architecture).

Key differences from Qwen3OmniMoeModelProvider:

  • Dense LLM (Qwen2), not MoE

  • Has QKV bias (Qwen2 specific), no QK layernorm

  • mrope_section: [16, 24, 24] (not [24, 20, 20])

  • position_id_per_seconds: 25 (not 13)

  • seconds_per_chunk: 2 for audio-in-video

  • patch_size: 14 (not 16)

  • Uses HF vision model directly (ReplicatedMapping)

thinker_config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig#

'field(…)'

talker_config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniTalkerConfig | None#

None

token2wav_config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniToken2WavConfig | None#

None

pretrained_model_name: str#

'Qwen/Qwen2.5-Omni-7B'

image_token_id: int#

151655

video_token_id: int#

151656

audio_token_id: int#

151646

vision_start_token_id: int#

151652

vision_end_token_id: int#

151653

audio_start_token_id: int#

151647

audio_end_token_id: int#

151648

bos_token_id: int#

151643

eos_token_id: int#

151645
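The special token ids above delimit multimodal content in the token stream (e.g. image tokens are wrapped between `vision_start_token_id` and `vision_end_token_id`). As an illustration of how these documented ids can be used, here is a minimal sketch of a span scanner; `find_vision_spans` is a hypothetical helper for this example, not part of the provider API:

```python
# Special token ids as documented for Qwen/Qwen2.5-Omni-7B.
BOS_TOKEN_ID = 151643
EOS_TOKEN_ID = 151645
IMAGE_TOKEN_ID = 151655
VISION_START_TOKEN_ID = 151652
VISION_END_TOKEN_ID = 151653


def find_vision_spans(token_ids):
    """Return (start, end) index pairs for each vision_start ... vision_end span.

    Illustrative helper only; the real preprocessing lives in the
    HuggingFace processor / Megatron data pipeline.
    """
    spans = []
    start = None
    for i, tok in enumerate(token_ids):
        if tok == VISION_START_TOKEN_ID:
            start = i
        elif tok == VISION_END_TOKEN_ID and start is not None:
            spans.append((start, i))
            start = None
    return spans


# One image (two placeholder image tokens) wrapped in vision markers.
ids = [BOS_TOKEN_ID, VISION_START_TOKEN_ID, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID,
       VISION_END_TOKEN_ID, EOS_TOKEN_ID]
print(find_vision_spans(ids))  # [(1, 4)]
```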

head_dim: int#

128

add_qkv_bias: bool#

True

qk_layernorm: bool#

False

attention_softmax_in_fp32: bool#

True

attention_dropout: float#

0.0

position_embedding_type: str#

'mrope'

apply_rotary_pos_emb_in_fp32: bool#

False

mrope_section: list[int]#

'field(…)'
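The default here is `[16, 24, 24]` (see the class notes above). In Qwen-style multimodal RoPE, the three sections partition the rotary half-dimension among temporal, height, and width axes, so they must sum to `head_dim // 2`. A quick sanity check with the documented defaults:

```python
# Documented defaults for Qwen2.5 Omni.
head_dim = 128
mrope_section = [16, 24, 24]  # temporal, height, width frequency bands

# The sections partition the rotary half-dimension (rotary pairs).
assert sum(mrope_section) == head_dim // 2  # 16 + 24 + 24 == 64
```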

rotary_base: float#

1000000

spatial_merge_size: int#

2

temporal_patch_size: int#

2

patch_size: int#

14

scatter_embedding_sequence_parallel: bool#

False

position_id_per_seconds: int#

25

seconds_per_chunk: int#

2
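Assuming temporal position ids for audio-in-video advance at `position_id_per_seconds` positions per second (the interpretation suggested by the field names; not stated explicitly in this reference), each `seconds_per_chunk` chunk spans:

```python
position_id_per_seconds = 25  # documented default
seconds_per_chunk = 2         # documented default for audio-in-video

# Temporal positions covered by one audio-in-video chunk.
positions_per_chunk = position_id_per_seconds * seconds_per_chunk
print(positions_per_chunk)  # 50
```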

freeze_language_model: bool#

False

freeze_vision_model: bool#

False

freeze_audio_model: bool#

False

language_max_sequence_length: int#

2048

persist_layer_norm: bool#

True

bias_activation_fusion: bool#

True

bias_dropout_fusion: bool#

True

masked_softmax_fusion: bool#

False

deallocate_pipeline_outputs: bool#

True

async_tensor_model_parallel_allreduce: bool#

True

distribute_saved_activations: bool#

False

cp_comm_type: str#

'p2p'

provide(pre_process=None, post_process=None, vp_stage=None)#

Provide a Qwen2.5 Omni model instance with vision, audio, and language components.

provide_language_model(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.core.models.gpt.GPTModel#

Provide just the language model component without vision/audio.