Qwen2.5-Omni

Qwen2.5-Omni is a multimodal Qwen model for image, video, audio, and text understanding. Megatron Bridge supports it through the Qwen Omni bridge.

Supported Variants

Qwen2.5-Omni-7B: https://huggingface.co/Qwen/Qwen2.5-Omni-7B

Architecture Notes

Dense Qwen2 language backbone with multimodal RoPE.
Vision and audio inputs are routed through Qwen2.5-Omni multimodal components.
Video-with-audio inference depends on qwen-omni-utils[decord] and an available ffmpeg binary.
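The video-with-audio dependencies noted above can be checked before launching inference. The following is a minimal sketch, not part of Megatron Bridge: the helper name check_video_audio_deps is hypothetical, and it assumes that qwen-omni-utils installs the importable module qwen_omni_utils and that the decord extra installs a module named decord.

```python
import importlib.util
import shutil


def check_video_audio_deps(modules=("qwen_omni_utils", "decord"), binaries=("ffmpeg",)):
    """Return a list of missing dependencies (empty when everything is found).

    The default module names are assumptions about what
    qwen-omni-utils[decord] installs; adjust them if your environment differs.
    """
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            missing.append(f"python module: {name}")
    for exe in binaries:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(exe) is None:
            missing.append(f"binary: {exe}")
    return missing


if __name__ == "__main__":
    problems = check_video_audio_deps()
    if problems:
        print("missing: " + ", ".join(problems))
    else:
        print("all dependencies found")
```

Running the check first surfaces a missing package or a missing ffmpeg binary as a short message, which is usually easier to act on than a failure partway through video decoding.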

Examples

For checkpoint import/export, round-trip validation, multimodal inference, and dependency notes, see the Qwen2.5-Omni examples README.