Qwen3-Omni#

Qwen3-Omni is a multimodal Qwen family model with text, image, video, and audio inputs. Megatron Bridge support for Qwen3-Omni reuses the existing Qwen3-VL language and vision path, and adds Qwen3-Omni-specific audio handling and checkpoint mappings.

The current implementation focuses on checkpoint conversion, training-oriented multimodal forward paths, and smoke-level validation. It includes a full example workflow (HF -> Megatron -> HF export, single-rank inference) and a multi-node training recipe entrypoint.

Current Support#

  • Hugging Face to Megatron Bridge checkpoint conversion for Qwen/Qwen3-Omni-30B-A3B-Instruct

  • Megatron Bridge to Hugging Face export for the same model family

  • Text, image, video, and audio multimodal forward paths

  • Qwen3-Omni-specific multimodal RoPE handling for Megatron Bridge runtime

  • Single-GPU smoke validation with a vertically trimmed checkpoint

  • Multi-node training recipe entrypoint (see Qwen3-Omni examples)

  • L0 conversion test coverage for Qwen3-Omni

Known Limitations#

  • Megatron inference with inference_params is not implemented yet

  • packed_seq_params is not implemented yet

  • Automated validation coverage remains single-rank; multi-node training requires user execution

  • Functional smoke tests require user-provided local multimodal assets

Hugging Face Model Cards#

  • Qwen3-Omni-30B-A3B-Instruct: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct

Examples#

Qwen3-Omni examples are maintained here:

  • examples/models/vlm/qwen3_omni/README.md

  • examples/models/vlm/qwen3_omni/conversion.sh

  • examples/models/vlm/qwen3_omni/inference.sh

  • examples/models/vlm/qwen3_omni/local_train_thinker_4node_tp2_ep8_sp.sh