bridge.models.qwen_omni.qwen25_omni_provider#
Qwen2.5 Omni Model Provider configurations for Megatron-Core.
This module provides configuration classes for Qwen2.5 Omni multimodal models (audio+vision+text), compatible with HuggingFace’s Qwen2.5-Omni model configurations. Reference: https://huggingface.co/Qwen/Qwen2.5-Omni-7B
Module Contents#
Classes#
Qwen25OmniModelProvider: Base model provider for Qwen2.5 Omni Models. Inherits language model configuration from Qwen2ModelProvider (dense, Qwen2 architecture).
API#
- class bridge.models.qwen_omni.qwen25_omni_provider.Qwen25OmniModelProvider#
Bases: megatron.bridge.models.Qwen2ModelProvider

Base model provider for Qwen2.5 Omni Models. Inherits language model configuration from Qwen2ModelProvider (dense, Qwen2 architecture).
Key differences from Qwen3OmniMoeModelProvider:

- Dense LLM (Qwen2), not MoE
- Has QKV bias (Qwen2 specific); no QK layernorm
- mrope_section: [16, 24, 24] (not [24, 20, 20])
- position_id_per_seconds: 25 (not 13)
- seconds_per_chunk: 2 for audio-in-video
- patch_size: 14 (not 16)
- Uses the HF vision model directly (ReplicatedMapping)
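The mrope_section values above can be sanity-checked against this provider's head_dim. A minimal sketch (illustrative only, not the library's code) showing the assumption that the three entries partition the rotary-embedding channels (head_dim // 2 of them) across the temporal, height, and width position axes of multimodal RoPE:

```python
# Illustrative sketch (assumption, not this module's code): mrope_section
# splits the rotary channels across the temporal/height/width axes.
head_dim = 128                 # from this provider's defaults
mrope_section = [16, 24, 24]   # Qwen2.5 Omni (vs [24, 20, 20] for Qwen3 Omni MoE)

rotary_channels = head_dim // 2
assert sum(mrope_section) == rotary_channels  # 16 + 24 + 24 == 64

# Each rotary channel is assigned to one axis: channels 0..15 -> temporal,
# 16..39 -> height, 40..63 -> width.
axis_of_channel = []
for axis, n in zip(("t", "h", "w"), mrope_section):
    axis_of_channel.extend([axis] * n)
print(axis_of_channel[0], axis_of_channel[16], axis_of_channel[40])  # t h w
```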
- thinker_config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig#
'field(…)'
- talker_config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniTalkerConfig | None#
None
- token2wav_config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniToken2WavConfig | None#
None
- pretrained_model_name: str#
'Qwen/Qwen2.5-Omni-7B'
- image_token_id: int#
151655
- video_token_id: int#
151656
- audio_token_id: int#
151646
- vision_start_token_id: int#
151652
- vision_end_token_id: int#
151653
- audio_start_token_id: int#
151647
- audio_end_token_id: int#
151648
- bos_token_id: int#
151643
- eos_token_id: int#
151645
- head_dim: int#
128
- add_qkv_bias: bool#
True
- qk_layernorm: bool#
False
- attention_softmax_in_fp32: bool#
True
- attention_dropout: float#
0.0
- position_embedding_type: str#
'mrope'
- apply_rotary_pos_emb_in_fp32: bool#
False
- mrope_section: list[int]#
'field(…)'
- rotary_base: float#
1000000
- spatial_merge_size: int#
2
- temporal_patch_size: int#
2
- patch_size: int#
14
- scatter_embedding_sequence_parallel: bool#
False
- position_id_per_seconds: int#
25
- seconds_per_chunk: int#
2
- freeze_language_model: bool#
False
- freeze_vision_model: bool#
False
- freeze_audio_model: bool#
False
- language_max_sequence_length: int#
2048
- persist_layer_norm: bool#
True
- bias_activation_fusion: bool#
True
- bias_dropout_fusion: bool#
True
- masked_softmax_fusion: bool#
False
- deallocate_pipeline_outputs: bool#
True
- async_tensor_model_parallel_allreduce: bool#
True
- distribute_saved_activations: bool#
False
- cp_comm_type: str#
'p2p'
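The vision patching attributes above (patch_size, spatial_merge_size, temporal_patch_size) interact in a Qwen2-VL-style way. A rough sketch of that arithmetic, under the assumption of standard Qwen-VL patching (the function name and rounding here are illustrative, not taken from this module):

```python
# Illustrative sketch (assumed Qwen2-VL-style patching, not this module's code):
# how patch_size, spatial_merge_size, and temporal_patch_size turn pixels
# into vision tokens.
patch_size = 14          # 14x14-pixel patches (vs 16 in Qwen3 Omni MoE)
spatial_merge_size = 2   # 2x2 neighboring patches merged into one token
temporal_patch_size = 2  # two consecutive frames grouped per temporal patch

def vision_token_count(height, width, num_frames=1):
    """Hypothetical rough token count for an image (num_frames=1) or video."""
    patches_h = height // patch_size
    patches_w = width // patch_size
    patches_t = max(1, num_frames // temporal_patch_size)
    return patches_t * (patches_h // spatial_merge_size) * (patches_w // spatial_merge_size)

# A 448x448 image: 32x32 patches -> 16x16 merged tokens = 256.
print(vision_token_count(448, 448))  # 256
```

With patch_size 16 (the Qwen3 Omni MoE value), the same 448x448 image would yield 28x28 patches and 14x14 = 196 merged tokens, which is why this attribute matters when comparing the two providers.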
- provide(pre_process=None, post_process=None, vp_stage=None)#
Provide a Qwen2.5 Omni model instance with vision, audio, and language components.
- provide_language_model(pre_process=None, post_process=None, vp_stage=None)#

Provide just the language model component without vision/audio.
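For intuition on the temporal attributes listed earlier, here is a hedged sketch (an assumption about their meaning, not code from this module) of how position_id_per_seconds = 25 and seconds_per_chunk = 2 could map media timestamps to temporal position ids and to audio-in-video interleaving chunks:

```python
# Illustrative sketch (assumption, not this module's code): relating
# wall-clock time in a video to temporal position ids and to the 2-second
# audio/video interleaving chunks.
position_id_per_seconds = 25  # temporal position ids advance 25 per second
seconds_per_chunk = 2         # audio-in-video is interleaved in 2 s chunks

def temporal_position_id(t_seconds):
    """Hypothetical temporal position id for a frame sampled at t_seconds."""
    return int(t_seconds * position_id_per_seconds)

def chunk_index(t_seconds):
    """Hypothetical audio/video interleaving chunk for a timestamp."""
    return int(t_seconds // seconds_per_chunk)

# Frames at 0 s, 1 s, and 3.5 s:
print([temporal_position_id(t) for t in (0, 1, 3.5)])  # [0, 25, 87]
print([chunk_index(t) for t in (0, 1, 3.5)])           # [0, 0, 1]
```

Note the contrast with Qwen3OmniMoeModelProvider, where position_id_per_seconds is 13, so the same timestamps would map to roughly half the position-id range.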