Qwen3-Omni#
Qwen3-Omni is a multimodal Qwen family model with text, image, video, and audio inputs. Megatron Bridge support for Qwen3-Omni reuses the existing Qwen3-VL language and vision path, and adds Qwen3-Omni-specific audio handling and checkpoint mappings.
The current implementation focuses on checkpoint conversion, training-oriented multimodal forward paths, and smoke-level validation. It includes an end-to-end example workflow (Hugging Face to Megatron conversion, Megatron to Hugging Face export, and single-rank inference) and a multi-node training recipe entrypoint.
Current Support#
Hugging Face to Megatron Bridge checkpoint conversion for Qwen/Qwen3-Omni-30B-A3B-Instruct
Megatron Bridge to Hugging Face export for the same model family
Text, image, video, and audio multimodal forward paths
Qwen3-Omni-specific multimodal RoPE handling for Megatron Bridge runtime
Single-GPU smoke validation with a vertically trimmed checkpoint
Multi-node training recipe entrypoint (see Qwen3-Omni examples)
L0 conversion test coverage for Qwen3-Omni
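The round trip listed above (Hugging Face to Megatron Bridge and back) is usually checked by comparing the re-exported weights against the original ones. The sketch below is not the project's test code and does not call any Megatron Bridge API. It is a minimal, standard-library-only illustration of that comparison, assuming both checkpoints have already been loaded into flat name-to-values mappings. The function name `diff_state_dicts`, the parameter names, and the use of plain float lists in place of tensors are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def diff_state_dicts(
    original: Dict[str, Sequence[float]],
    exported: Dict[str, Sequence[float]],
    atol: float = 0.0,
) -> Dict[str, List[str]]:
    """Report parameters that differ between two flat checkpoints.

    Both arguments map a parameter name to a flat sequence of floats.
    Real checkpoints hold tensors; flat lists are a stand-in here.
    """
    original_names = set(original)
    exported_names = set(exported)

    # Names present on one side only point at a missing or extra mapping.
    missing = sorted(original_names - exported_names)
    unexpected = sorted(exported_names - original_names)

    # Shared names must agree in length and in every value within atol.
    mismatched = []
    for name in sorted(original_names & exported_names):
        a, b = original[name], exported[name]
        if len(a) != len(b) or any(abs(x - y) > atol for x, y in zip(a, b)):
            mismatched.append(name)

    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


if __name__ == "__main__":
    # Hypothetical parameter names, chosen only for this example.
    hf_original = {"thinker.embed.weight": [0.1, 0.2], "thinker.norm.weight": [1.0]}
    hf_exported = {"thinker.embed.weight": [0.1, 0.2], "thinker.norm.weight": [1.5]}
    print(diff_state_dicts(hf_original, hf_exported))
```

An empty report in all three fields suggests the export reproduced the original weights. A non-zero `atol` may be appropriate if the conversion changes dtype along the way.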
Known Limitations#
Megatron inference with inference_params is not implemented yet
packed_seq_params is not implemented yet
Automated validation coverage remains single-rank; multi-node training requires user execution
Functional smoke tests require user-provided local multimodal assets
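Because the functional smoke tests depend on local multimodal assets that the user supplies, it can help to verify those files exist before launching anything. The following is a minimal preflight sketch using only the Python standard library. The function name `find_missing_assets`, the modality keys, and the file paths are hypothetical and are not taken from the Qwen3-Omni examples.

```python
from pathlib import Path
from typing import Dict, List


def find_missing_assets(assets: Dict[str, str]) -> List[str]:
    """Return the modality names whose path is not an existing file."""
    return sorted(
        modality for modality, path in assets.items() if not Path(path).is_file()
    )


if __name__ == "__main__":
    # Hypothetical local paths; replace with your own assets.
    assets = {
        "image": "assets/sample.jpg",
        "video": "assets/sample.mp4",
        "audio": "assets/sample.wav",
    }
    missing = find_missing_assets(assets)
    if missing:
        print("Missing assets for: " + ", ".join(missing))
    else:
        print("All multimodal assets found.")
```

Running a check like this first gives a clear message about a missing file instead of a failure partway through model loading.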
Hugging Face Model Cards#
Qwen3-Omni-30B-A3B-Instruct: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
Examples#
Qwen3-Omni examples are maintained here:
examples/models/vlm/qwen3_omni/README.md
examples/models/vlm/qwen3_omni/conversion.sh
examples/models/vlm/qwen3_omni/inference.sh
examples/models/vlm/qwen3_omni/local_train_thinker_4node_tp2_ep8_sp.sh