bridge.models.mimo_v2_flash.mimo_v2_flash_provider#
MiMo-V2-Flash Model Provider with dual-base RoPE.
The hybrid attention pattern (full attention vs. sliding-window attention, SWA, chosen per layer) and the per-layer switching of KV head counts are handled by storing the relevant configuration on the provider.
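As a rough illustration of what "storing config on the provider" makes possible, the sketch below resolves per-layer settings from the documented fields. It is not the library's implementation: `LayerAttentionSettings` and `resolve_layer_settings` are hypothetical names, and the conventions used (pattern value 1 means SWA and 0 means full attention; the first `rotary_base` entry is the SWA base and the second is the full-attention base) are assumptions that should be checked against the actual module.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LayerAttentionSettings:
    """Hypothetical container for values a custom attention module might read."""

    is_full_attention: bool
    num_query_groups: int
    rotary_base: float
    window_size: Optional[int]


def resolve_layer_settings(
    layer_idx: int,
    hybrid_attention_pattern: List[int],
    rotary_base: Tuple[float, float] = (10000, 5000000),
    full_attn_num_query_groups: int = 4,
    swa_num_query_groups: int = 8,
    window_size: Optional[int] = 128,
) -> LayerAttentionSettings:
    """Pick the KV group count, RoPE base, and window for one layer."""
    # ASSUMPTION: 1 marks a sliding-window layer, 0 marks a full-attention layer.
    is_swa = hybrid_attention_pattern[layer_idx] == 1
    # ASSUMPTION: rotary_base is ordered (SWA base, full-attention base).
    swa_base, full_base = rotary_base
    if is_swa:
        return LayerAttentionSettings(False, swa_num_query_groups, swa_base, window_size)
    # Full-attention layers attend over the whole sequence, so no window applies.
    return LayerAttentionSettings(True, full_attn_num_query_groups, full_base, None)


pattern = [0, 1, 1]  # illustrative pattern, not taken from a real checkpoint
full_layer = resolve_layer_settings(0, pattern)
swa_layer = resolve_layer_settings(1, pattern)
```

With the documented defaults, the full-attention layer in this sketch gets 4 query groups and the larger RoPE base, while the SWA layer gets 8 query groups, the smaller base, and a window of 128.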
Module Contents#
Classes#
MiMoV2FlashModelProvider | Configuration and provider for MiMo-V2-Flash models.
API#
- class bridge.models.mimo_v2_flash.mimo_v2_flash_provider.MiMoV2FlashModelProvider#
Bases: megatron.bridge.models.gpt_provider.GPTModelProvider
Configuration and provider for MiMo-V2-Flash models.
Extends GPTModelProvider with MiMo-V2-Flash-specific fields that need to persist in run_config.yaml and be accessible to custom modules.
The hybrid attention pattern, per-layer KV heads, and dual RoPE bases are stored here. The provide() override replaces the standard RoPE with a dual-base version (same pattern as Gemma3ModelProvider).
- transformer_layer_spec: Union[megatron.core.transformer.ModuleSpec, Callable[[bridge.models.mimo_v2_flash.mimo_v2_flash_provider.MiMoV2FlashModelProvider], megatron.core.transformer.ModuleSpec]]#
‘field(…)’
- hybrid_attention_pattern: Optional[List[int]]#
None
- window_size: Union[int, tuple, None]#
128
- rotary_base: Union[int, float, tuple]#
(10000, 5000000)
- full_attn_num_query_groups: int#
4
- swa_num_query_groups: int#
8
- v_head_dim: int#
128
- attention_value_scale: Optional[float]#
None
- normalization: str#
‘RMSNorm’
- gated_linear_unit: bool#
True
- add_bias_linear: bool#
False
- position_embedding_type: str#
‘rope’
False
- provide(pre_process=None, post_process=None, vp_stage=None)
Configure and instantiate a Megatron Core GPT model for MiMo-V2-Flash.
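To make the "dual-base" idea concrete, the following minimal sketch computes two sets of standard RoPE inverse frequencies, one per base, from the documented default `rotary_base = (10000, 5000000)`. The helper names, the `rotary_dim` value, and the mapping of the first base to SWA layers and the second to full-attention layers are assumptions for illustration. The actual `provide()` override builds Megatron Core modules rather than plain lists.

```python
from typing import Dict, List, Tuple


def rope_inv_freq(base: float, rotary_dim: int) -> List[float]:
    """Standard RoPE inverse frequencies: base ** (-2i / rotary_dim)."""
    return [base ** (-(2 * i) / rotary_dim) for i in range(rotary_dim // 2)]


def dual_base_inv_freq(
    rotary_base: Tuple[float, float], rotary_dim: int
) -> Dict[str, List[float]]:
    """Build one frequency table per attention type (hypothetical helper)."""
    # ASSUMPTION: rotary_base is ordered (SWA base, full-attention base).
    swa_base, full_base = rotary_base
    return {
        "swa": rope_inv_freq(swa_base, rotary_dim),
        "full": rope_inv_freq(full_base, rotary_dim),
    }


# rotary_dim=64 is an illustrative value, not taken from the model config.
inv_freq = dual_base_inv_freq((10000, 5000000), rotary_dim=64)
```

A larger base lowers every non-zero-index frequency, which stretches the rotation wavelengths; that is the usual motivation for giving long-range (full-attention) layers a larger base than local (sliding-window) layers.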