MiMo-V2-Flash
MiMo-V2-Flash is Xiaomi’s hybrid attention Mixture-of-Experts language model. It alternates full and sliding-window attention layers, uses a sigmoid_with_bias router with group-limited expert routing, and ships as an FP8 HF checkpoint.
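To make the difference between the two layer types concrete, here is a minimal sketch of the causal masks that full and sliding-window attention layers would use. The function names, the per-layer flag list, and the window size are illustrative assumptions; this page does not state the actual layer pattern or window size of MiMo-V2-Flash.

```python
def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len boolean mask; True means query i may attend to key j.

    window=None gives full causal attention. An integer window limits each
    query to the most recent `window` positions, itself included.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look at future positions
            if window is not None:
                visible = visible and (i - j) < window  # drop keys that are too old
            row.append(visible)
        mask.append(row)
    return mask


def layer_masks(layer_is_swa, seq_len, window):
    """Build one mask per layer from a hypothetical list of per-layer is_swa flags."""
    return [causal_mask(seq_len, window if is_swa else None) for is_swa in layer_is_swa]
```

For example, `layer_masks([False, True, True], seq_len=6, window=3)` yields one full causal mask followed by two windowed masks. The windowed layers only ever need the last `window` keys, which is what keeps their attention cost and KV cache bounded as the sequence grows.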
Available Models
- MiMo-V2-Flash: hybrid full/sliding-window attention with FP8 weights.
Architecture
- MiMoV2FlashForCausalLM
- Sliding-window attention via the MiMoV2FlashAttention(is_swa=True) path.
- MoE blocks use the shared nemo_automodel.components.moe.layers.MoE with score_func="sigmoid_with_bias" and gate_precision=fp32 so routing decisions stay numerically stable when activations are bf16.
- FP8 round-trip in MiMoV2FlashStateDictAdapter covers the bulk of attention/expert weights; layer norms, the gate, lm_head, and embed_tokens stay in bf16 per NON_QUANTIZED_KEY_PATTERNS.
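The routing scheme named above can be sketched for a single token in plain Python. This is an assumption-laden illustration of sigmoid scoring with a selection bias and group-limited top-k, following the scheme popularized by DeepSeek-V3; it is not the code in nemo_automodel.components.moe.layers.MoE, and the function name `route`, its parameters, and the "sum of the two best scores" group ranking are assumptions made for the example.

```python
import math


def route(logits, bias, n_groups, topk_groups, topk_experts):
    """Pick experts for one token.

    logits, bias: one float per expert. Experts are split into n_groups
    contiguous, equally sized groups. Returns (expert_ids, weights).
    """
    n_experts = len(logits)
    assert n_experts % n_groups == 0, "experts must divide evenly into groups"
    group_size = n_experts // n_groups

    # Sigmoid affinity per expert (independent scores, unlike softmax).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # The bias only influences which experts are chosen, not their weights.
    biased = [s + b for s, b in zip(scores, bias)]

    def group_score(g):
        # Rank a group by the sum of its two best biased scores (assumed rule).
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        return sum(members[:2])

    # Group-limited routing: only experts in the best groups stay eligible.
    kept_groups = sorted(range(n_groups), key=group_score, reverse=True)[:topk_groups]
    candidates = [
        e
        for g in kept_groups
        for e in range(g * group_size, (g + 1) * group_size)
    ]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:topk_experts]

    # Combine weights come from the unbiased scores, normalized to sum to 1.
    total = sum(scores[e] for e in chosen)
    weights = [scores[e] / total for e in chosen]
    return chosen, weights
```

Because expert selection depends on small differences between scores, computing the gate in fp32 rather than bf16 reduces the chance that rounding flips which experts win, which is the stability concern the gate_precision setting addresses.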
Example HF Models
Example Recipes
Try with NeMo AutoModel
1. Install (full instructions):
2. Clone the repo to get the example recipes:
3. Run the recipe from inside the repo:
Run with Docker
1. Pull the container and mount a checkpoint directory:
2. Navigate to the AutoModel directory:
3. Run the recipe:
See the Installation Guide and LLM Fine-Tuning Guide.
Fine-Tuning
See the LLM Fine-Tuning Guide.