Qwen2.5-Omni
Qwen2.5-Omni is a multimodal Qwen model for image, video, audio, and text understanding. Megatron Bridge supports it through the Qwen Omni bridge.
Supported Variants
Qwen2.5-Omni-7B: https://huggingface.co/Qwen/Qwen2.5-Omni-7B
Architecture Notes
Dense Qwen2 language backbone with multimodal RoPE.
Vision and audio inputs are routed through Qwen2.5-Omni multimodal components.
Video-with-audio inference depends on qwen-omni-utils[decord] and an available ffmpeg binary.
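The "multimodal RoPE" note can be made concrete with a small sketch of how Qwen2-VL-style multimodal RoPE assigns three-axis (temporal, height, width) position IDs to an interleaved text-and-image sequence. This is an illustrative assumption, not the Megatron Bridge or Qwen2.5-Omni implementation. The Qwen2.5-Omni report describes a time-aligned variant (TMRoPE) that also ties the temporal axis to audio and video timestamps, and this sketch omits that. All helper names here are hypothetical.

```python
from typing import List, Sequence, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width)


def text_positions(start: int, n: int) -> List[Position]:
    # Text tokens: all three axes share the same sequential index.
    return [(start + i, start + i, start + i) for i in range(n)]


def image_positions(start: int, grid_h: int, grid_w: int) -> List[Position]:
    # Image patches: a single temporal step; height/width follow the patch grid.
    return [(start, start + h, start + w) for h in range(grid_h) for w in range(grid_w)]


def next_start(positions: List[Position]) -> int:
    # The next segment continues after the largest index used on any axis.
    return max(max(p) for p in positions) + 1


def build_positions(segments: Sequence[tuple]) -> List[Position]:
    """Hypothetical helper: segments are ("text", n) or ("image", grid_h, grid_w)."""
    out: List[Position] = []
    start = 0
    for kind, *shape in segments:
        if kind == "text":
            seg = text_positions(start, shape[0])
        elif kind == "image":
            seg = image_positions(start, shape[0], shape[1])
        else:
            raise ValueError(f"unknown segment kind: {kind}")
        out.extend(seg)
        start = next_start(seg)
    return out


if __name__ == "__main__":
    # Two text tokens, a 2x3 grid of image patches, then one more text token.
    for pos in build_positions([("text", 2), ("image", 2, 3), ("text", 1)]):
        print(pos)
```

The point of the three axes is that image patches keep their 2D layout in the rotary encoding, while text tokens degenerate to ordinary 1D positions because all three components are equal.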
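Because video-with-audio inference needs both Python packages and an external binary, a quick preflight check can save a failed run. The following is a minimal sketch that uses only the Python standard library. It assumes the import names qwen_omni_utils and decord for the qwen-omni-utils and decord packages, and the function name check_video_audio_deps is hypothetical.

```python
import importlib.util
import shutil
from typing import Dict


def check_video_audio_deps() -> Dict[str, bool]:
    """Hypothetical helper: report whether video-with-audio prerequisites are present."""
    return {
        # Assumed import names for the qwen-omni-utils[decord] install.
        "qwen_omni_utils": importlib.util.find_spec("qwen_omni_utils") is not None,
        "decord": importlib.util.find_spec("decord") is not None,
        # ffmpeg must be an executable on PATH, not a Python package.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    status = check_video_audio_deps()
    missing = [name for name, ok in status.items() if not ok]
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all video-with-audio dependencies found")
```

find_spec only locates a module without importing it, so the check is cheap and does not trigger heavy package initialization.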
Examples
For checkpoint import/export, round-trip validation, multimodal inference, and dependency notes, see the Qwen2.5-Omni examples README.