Xiaomi-MiMo#
Xiaomi-MiMo models use a Qwen2-style causal language model backbone with Multi-Token Prediction (MTP) support. Megatron Bridge supports MiMo causal language models through MimoBridge, which extends the Qwen2 bridge and adds MTP weight mappings.
Supported Variants#
Megatron Bridge supports Hugging Face checkpoints whose config declares the MiMoForCausalLM architecture and the mimo model type.
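As a rough illustration of that criterion, the sketch below checks a Hugging Face config dictionary for the architecture and model type named above. The helper name `is_mimo_checkpoint` and the sample config are hypothetical and are not part of Megatron Bridge; the actual bridge selection logic may differ.

```python
import json


def is_mimo_checkpoint(hf_config: dict) -> bool:
    """Return True if the config names the MiMo causal LM architecture and model type.

    Hypothetical helper for illustration; not a Megatron Bridge API.
    """
    architectures = hf_config.get("architectures") or []
    return "MiMoForCausalLM" in architectures and hf_config.get("model_type") == "mimo"


# Hypothetical config.json contents, trimmed to the two fields that matter here.
sample_config = json.loads(
    '{"architectures": ["MiMoForCausalLM"], "model_type": "mimo"}'
)
qwen2_config = {"architectures": ["Qwen2ForCausalLM"], "model_type": "qwen2"}

mimo_supported = is_mimo_checkpoint(sample_config)
qwen2_supported = is_mimo_checkpoint(qwen2_config)
```

In practice the dictionary would come from the checkpoint's `config.json`; both fields must match for the check to pass.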
Architecture Notes#
Qwen2-style attention behavior with QKV bias enabled.
Optional MTP layers are enabled from num_nextn_predict_layers in the Hugging Face config.
The bridge maps MTP token/hidden layernorms, projection layers, attention weights, and gated MLP weights.
Input projection halves are swapped during import/export to match Megatron and Hugging Face layouts.
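The MTP switch and the half swap in the notes above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the bridge's implementation: the helper names `mtp_layer_count` and `swap_input_halves` are hypothetical, a missing or zero `num_nextn_predict_layers` is assumed to mean MTP is disabled, and the projection is assumed to be split into two equal halves along its input (column) axis. The real code operates on tensors and may split along a different axis.

```python
from typing import List

Matrix = List[List[float]]


def mtp_layer_count(hf_config: dict) -> int:
    """Number of MTP layers requested by the config.

    Assumption: a missing or null value means MTP is disabled (0 layers).
    """
    return int(hf_config.get("num_nextn_predict_layers") or 0)


def swap_input_halves(weight: Matrix) -> Matrix:
    """Swap the two halves of every row along the input (column) axis.

    Assumption: the weight is laid out as [out_features][2 * hidden].
    """
    swapped = []
    for row in weight:
        if len(row) % 2 != 0:
            raise ValueError("input dimension must be even to split into halves")
        half = len(row) // 2
        swapped.append(row[half:] + row[:half])
    return swapped


# Toy weight: 2 output rows, input dimension 4 (two halves of width 2).
hf_weight = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
megatron_weight = swap_input_halves(hf_weight)   # import direction
round_trip = swap_input_halves(megatron_weight)  # export direction
```

Because the swap is its own inverse, the same operation can serve both import and export: applying it twice returns the original layout.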
Examples#
General MiMo and heterogeneous multimodal training examples live under examples/megatron_mimo.