nemo_automodel.components.models.minimax_m3_vl.layers
MiniMax M3 VL text-backbone layers.
Stage 1 covers the dense + MoE text path (no sparse-attention index branch and
no MTP). Mirrors the canonical sglang reference
sglang.srt.models.minimax_m3 (MiniMaxM3Attention / MiniMaxM3MLP /
MiniMaxM3MoE / MiniMaxM3DecoderLayer):
- per-head Gemma RMSNorm on Q/K (qk_norm_type='per_head', use_gemma_norm=True),
- partial RoPE (rotary_dim=64 of head_dim=128) reusing the gpt_oss rotary utilities (as the existing minimax_m2 backbone does),
- SwiGLU-OAI activation gate * sigmoid(alpha * gate) * (up + 1) with gate clamped max=limit and up clamped +/-limit for dense and shared experts,
- per-layer dense-vs-MoE selection from moe_layer_freq.
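Partial RoPE, as listed above, rotates only the first rotary_dim entries of each head vector and passes the rest through unchanged. The following is a minimal plain-Python sketch for one head vector, not the module's implementation. The function name apply_partial_rope, the base of 10000.0, and the half-split pairing of rotated dimensions are assumptions made for illustration.

```python
import math


def apply_partial_rope(x, position, rotary_dim=64, base=10000.0):
    """Rotate the first `rotary_dim` entries of one head vector.

    `x` is a list of head_dim floats. Entries past `rotary_dim` are
    returned untouched. `base` and the half-split pairing (entry i is
    paired with entry i + rotary_dim // 2) are assumptions of this sketch.
    """
    half = rotary_dim // 2
    rot, rest = x[:rotary_dim], x[rotary_dim:]
    out = [0.0] * rotary_dim
    for i in range(half):
        # Frequency decreases geometrically with the pair index.
        theta = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = rot[i], rot[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out + list(rest)


# head_dim=128 with rotary_dim=64: the last 64 entries are not rotated.
head = [float(i) for i in range(128)]
rotated = apply_partial_rope(head, position=5)
print(rotated[64:] == head[64:])
```

Because each pair is rotated by a plain 2-D rotation, the norm of the rotated slice is preserved and position 0 leaves the vector unchanged.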
Module Contents
Classes
Functions
API
Bases: Module
MiniMax M3 decoder block: attention + (dense MLP or MoE) with Gemma norms.
moe_layer_freq[layer_idx] == 0 -> dense MiniMaxM3MLP (with
dense_intermediate_size); otherwise a routed MoE plus a separate
SwiGLU-OAI shared expert (kept M3-local rather than using MoE’s built-in
shared expert, whose generic MLP does not implement SwiGLU-OAI).
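The per-layer selection rule can be sketched as a small lookup. The helper name ffn_kind and the example schedule are hypothetical; only the moe_layer_freq[layer_idx] == 0 test comes from the description above.

```python
def ffn_kind(layer_idx, moe_layer_freq):
    """Return which feed-forward variant a decoder layer would use.

    A zero entry selects the dense MLP; any other value selects the
    routed MoE together with its separate SwiGLU-OAI shared expert.
    """
    if moe_layer_freq[layer_idx] == 0:
        return "dense"
    return "moe"


# Hypothetical schedule: one dense layer followed by three MoE layers.
moe_layer_freq = [0, 1, 1, 1]
for idx in range(len(moe_layer_freq)):
    print(idx, ffn_kind(idx, moe_layer_freq))
```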
Bases: Module
MiniMax M3 GQA attention with per-head Gemma Q/K norm and partial RoPE.
When is_sparse_attention_layer is set, an additional lightning indexer
(index_q/k_proj + per-head Gemma norm) selects, per query, the top-k key
blocks to attend to (block-level DeepSeek-style sparse attention). M3 sets
disable_index_value=True so the index branch is selection-only.
Bases: Module
Lightning indexer (selection-only) for MiniMax M3 sparse-attention layers.
Projects hidden states to num_index_heads index queries and a single
shared index key (disable_index_value=True for M3, so there is no index
value/output projection). Per-head Gemma RMSNorm + partial RoPE mirror the
main attention. The produced idx_q/idx_k feed
build_block_sparse_attn_bias to select which key blocks each query
attends to.
Bases: Module
Dense / shared-expert MLP with SwiGLU-OAI activation (separate gate/up/down).
Bases: Module
RMSNorm with optional Gemma-style zero-centered gamma (x_normed * (1 + w)).
When gemma=True the learnable weight is centered at 0 and the effective
scale is 1 + weight (matching HF GemmaRMSNorm and the sglang M3
reference). Used both for hidden-size norms and, with dim=head_dim, for
per-head Q/K normalization (the input is normalized over its last dim, so a
[..., num_heads, head_dim] tensor is normalized independently per head).
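The zero-centered scale and the per-head use can be sketched in plain Python as follows. This is an illustrative sketch, not the module's code: the function names and the eps value are assumptions, and a real implementation would operate on tensors.

```python
import math


def rms_norm(x, weight, eps=1e-6, gemma=True):
    """Normalize one vector over its last (only) dimension.

    With gemma=True the effective scale is 1 + weight, so a weight of
    all zeros is the identity scale. `eps` is an assumed default.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    scale = [(1.0 + w) if gemma else w for w in weight]
    return [v * inv_rms * s for v, s in zip(x, scale)]


def per_head_norm(heads, weight, eps=1e-6):
    """Apply the same head_dim-sized norm independently to every head."""
    return [rms_norm(h, weight, eps) for h in heads]


# Two heads with very different magnitudes are normalized separately.
heads = [[3.0, 4.0], [30.0, 40.0]]
print(per_head_norm(heads, weight=[0.0, 0.0]))
```

Since each head is normalized on its own, the two heads in the usage line end up with nearly identical values despite the tenfold difference in input scale.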
Convert an incoming attention mask to an additive key bias broadcastable to ref.
Accepts a 2-D [B, T] keep-mask (1/True = attend) or an already-additive
float mask; returns 0 where attended and -inf where masked.
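A minimal plain-Python sketch of this conversion on nested lists is shown below. The function name is hypothetical, and telling the two input kinds apart by Python type (bool/int versus float) is an assumption standing in for a tensor dtype check. Broadcasting the result against ref is omitted.

```python
def mask_to_additive_bias(mask):
    """Convert a [B, T] keep-mask to an additive bias.

    Rows of bools or ints are treated as keep-masks (1/True = attend) and
    mapped to 0.0 / -inf. Rows containing floats are assumed to be
    additive already and are copied through unchanged.
    """
    neg_inf = float("-inf")
    out = []
    for row in mask:
        if all(isinstance(v, (bool, int)) for v in row):
            out.append([0.0 if v else neg_inf for v in row])
        else:
            out.append(list(row))
    return out


print(mask_to_additive_bias([[1, 1, 0], [True, False, True]]))
```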
Build the additive block-sparse causal attention bias from index q/k.
Mirrors the sglang minimax_sparse selection (block_size_q=1 ->
per-query-position): the index score for (query i, key j) is
(idx_q[i] . idx_k[j]) * idx_dim**-0.5 with causal masking; keys are
grouped into blocks of block_size and reduced per block (max or
lse). For each query, the current block (local_blocks) and the first
init_blocks are always kept and the remaining budget is filled with the
highest-scoring causal blocks, up to min(topk_blocks, valid_blocks).
Parameters:
idx_q: [B, T, H_idx, D] index queries (post norm + RoPE).
idx_k: [B, T, 1, D] shared index key (post norm + RoPE).
num_q_heads: number of main attention heads; the per-idx-head bias is
expanded num_q_heads // H_idx times (GQA, repeat-interleave).
Returns: torch.Tensor
[B, num_q_heads, T, T] float bias (0 where attended, -inf where masked).
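The selection rule can be sketched in plain Python for one batch element and one index head. This is a reading of the description under stated assumptions, not the library's implementation: the function name block_sparse_bias is hypothetical, only the max reduction is shown, local_blocks is taken to mean the most recent causal blocks ending at the current one, forced blocks are assumed to count against the top-k budget, and the GQA expansion to num_q_heads is left out.

```python
def block_sparse_bias(idx_q, idx_k, block_size, topk_blocks,
                      init_blocks=1, local_blocks=1):
    """Return a [T, T] additive bias (0.0 = attend, -inf = masked).

    idx_q and idx_k are lists of T vectors with D floats each.
    """
    neg_inf = float("-inf")
    seq_len, dim = len(idx_q), len(idx_q[0])
    scale = dim ** -0.5
    bias = [[neg_inf] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        # Blocks that contain at least one causal key (j <= i).
        valid_blocks = i // block_size + 1
        # Max-reduce the scaled causal scores inside each block.
        block_score = []
        for b in range(valid_blocks):
            keys = range(b * block_size, min((b + 1) * block_size, i + 1))
            block_score.append(max(
                scale * sum(q * k for q, k in zip(idx_q[i], idx_k[j]))
                for j in keys))
        # Always keep the first init_blocks and the last local_blocks.
        forced = set(range(min(init_blocks, valid_blocks)))
        forced |= set(range(max(0, valid_blocks - local_blocks), valid_blocks))
        # Fill what is left of the budget with the best remaining blocks.
        budget = min(topk_blocks, valid_blocks)
        ranked = sorted((b for b in range(valid_blocks) if b not in forced),
                        key=lambda b: block_score[b], reverse=True)
        kept = forced | set(ranked[:max(0, budget - len(forced))])
        for j in range(i + 1):
            if j // block_size in kept:
                bias[i][j] = 0.0
    return bias


# 8 positions, blocks of 2 keys, budget of 3 blocks per query.
idx_q = [[1.0] for _ in range(8)]
idx_k = [[0.0], [0.0], [5.0], [5.0], [1.0], [1.0], [0.0], [0.0]]
for row in block_sparse_bias(idx_q, idx_k, block_size=2, topk_blocks=3):
    print(row)
```

In the usage example the last query keeps block 0 (initial), block 3 (current), and block 1 (highest score), so only block 2 is masked among its causal keys.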
GPT-OSS / MiniMax-M3 SwiGLU-OAI: gate * sigmoid(alpha * gate) * (up + 1).
Gate is clamped max=limit and up is clamped +/-limit (when
limit > 0), computed in fp32 and cast back. Equivalent to sglang’s
swiglu_no_interleaved_with_alpha_and_limit.
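A scalar sketch of the activation follows, written in plain Python to make the clamping order explicit. The function names and the default values alpha=1.702 and limit=7.0 are assumptions for illustration; Python floats are double precision, so the fp32 compute-and-cast step is not reproduced.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swiglu_oai(gate, up, alpha=1.702, limit=7.0):
    """gate * sigmoid(alpha * gate) * (up + 1) with optional clamping.

    When limit > 0 the gate is clamped from above only, while up is
    clamped to [-limit, limit]. Defaults here are assumed values.
    """
    if limit > 0:
        gate = min(gate, limit)
        up = max(-limit, min(up, limit))
    return gate * _sigmoid(alpha * gate) * (up + 1.0)


# Inputs beyond the limit behave like inputs at the limit.
print(swiglu_oai(100.0, 100.0) == swiglu_oai(7.0, 7.0))
```

The gate has no lower clamp, so large negative gates still reach the sigmoid, which is why the sketch uses a numerically stable form.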