Xiaomi-MiMo

Xiaomi-MiMo models use a Qwen2-style causal language model backbone with Multi-Token Prediction (MTP) support. Megatron Bridge supports MiMo causal language models through MimoBridge, which extends the Qwen2 bridge and adds MTP weight mappings.

Supported Variants

Megatron Bridge supports Hugging Face checkpoints that use the MiMoForCausalLM architecture and the mimo model type.
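
As a quick pre-flight check, the sketch below tests whether a checkpoint's config matches the two identifiers named above. It is a minimal illustration, not part of Megatron Bridge: the helper name is_mimo_causal_lm is hypothetical, and it assumes the identifiers appear under the standard Hugging Face config.json keys architectures and model_type.

```python
import json


def is_mimo_causal_lm(config: dict) -> bool:
    """Return True if a Hugging Face config dict looks like a MiMo causal LM.

    Hypothetical helper for illustration; assumes the standard
    `architectures` and `model_type` keys of a Hugging Face config.json.
    """
    architectures = config.get("architectures") or []
    return "MiMoForCausalLM" in architectures and config.get("model_type") == "mimo"


# Trimmed-down, made-up config.json contents used only for this example.
example_config = json.loads(
    '{"architectures": ["MiMoForCausalLM"], "model_type": "mimo"}'
)

print(is_mimo_causal_lm(example_config))
```

In practice the dict would come from reading the checkpoint's config.json rather than from an inline string.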

Architecture Notes

  • Qwen2-style attention behavior with QKV bias enabled.

  • Optional MTP layers are enabled from num_nextn_predict_layers in the Hugging Face config.

  • The bridge maps MTP token/hidden layernorms, projection layers, attention weights, and gated MLP weights.

  • Input projection halves are swapped during import/export to match Megatron and Hugging Face layouts.
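
Two of the notes above can be illustrated with a small, framework-free sketch: reading the MTP layer count from a config, and swapping the two halves of a projection weight. Everything here is an assumption made for illustration rather than the bridge's actual code: the helper names mtp_layer_count and swap_halves are hypothetical, a missing or null num_nextn_predict_layers is treated as zero MTP layers, and the split is taken along the input (column) dimension of a weight stored as nested lists instead of a tensor.

```python
from typing import List


def mtp_layer_count(config: dict) -> int:
    """Number of MTP layers requested by a config dict.

    Assumption: a missing or null `num_nextn_predict_layers` means no MTP layers.
    """
    return int(config.get("num_nextn_predict_layers") or 0)


def swap_halves(weight: List[List[float]]) -> List[List[float]]:
    """Swap the first and second half of every row of a 2D weight.

    Assumption: the two halves lie along the input (column) dimension.
    """
    swapped = []
    for row in weight:
        if len(row) % 2 != 0:
            raise ValueError("row length must be even to split into two halves")
        half = len(row) // 2
        # Second half moves to the front, first half moves to the back.
        swapped.append(row[half:] + row[:half])
    return swapped


# Toy 2x4 weight: columns 0-1 form one half, columns 2-3 the other.
toy_weight = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

imported = swap_halves(toy_weight)  # one direction of the conversion
exported = swap_halves(imported)    # the reverse direction

print(mtp_layer_count({"num_nextn_predict_layers": 1}))
print(imported)
print(exported == toy_weight)
```

Because swapping two halves is its own inverse, the same operation serves both import and export, which is why a round trip recovers the original weight.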

Examples

General MiMo and heterogeneous multimodal training examples live under examples/megatron_mimo.