nemo_automodel.components.models.minimax_m3_vl.layers

MiniMax M3 VL text-backbone layers.

Stage 1 covers the dense + MoE text path (no sparse-attention index branch and no MTP). It mirrors the canonical sglang reference sglang.srt.models.minimax_m3 (MiniMaxM3Attention / MiniMaxM3MLP / MiniMaxM3MoE / MiniMaxM3DecoderLayer) in the following respects:

- per-head Gemma RMSNorm on Q/K (qk_norm_type='per_head', use_gemma_norm=True),
- partial RoPE (rotary_dim=64 of head_dim=128) reusing the gpt_oss rotary utilities (as the existing minimax_m2 backbone does),
- SwiGLU-OAI activation `gate * sigmoid(alpha * gate) * (up + 1)`, with gate clamped to max=limit and up clamped to +/-limit, for dense and shared experts,
- per-layer dense-vs-MoE selection from moe_layer_freq.
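
As a rough illustration of the partial RoPE item above, the sketch below rotates only the first rotary_dim entries of a single head vector and passes the remaining entries through unchanged. It is plain Python on one vector, not the module's tensor code. The function name `partial_rope`, the half-split pairing of rotated dimensions, and the frequency base of 10000 are assumptions made for illustration; the gpt_oss rotary utilities may pair dimensions or scale frequencies differently.

```python
import math


def partial_rope(vec, position, rotary_dim, base=10000.0):
    """Rotate the first `rotary_dim` entries of one head vector (hypothetical helper).

    Assumes half-split pairing: entry i is paired with entry i + rotary_dim // 2.
    Entries at index >= rotary_dim are returned untouched.
    """
    half = rotary_dim // 2
    rot, rest = vec[:rotary_dim], vec[rotary_dim:]
    out = [0.0] * rotary_dim
    for i in range(half):
        # Assumed frequency schedule: base ** (-2i / rotary_dim).
        angle = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = rot[i], rot[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out + list(rest)


# head_dim=128 with rotary_dim=64, as stated in the list above.
head = [float(i + 1) for i in range(128)]
rotated = partial_rope(head, position=3, rotary_dim=64)
```

With rotary_dim=64 and head_dim=128, the last 64 entries of each head carry no positional rotation, and the rotation preserves the norm of the first 64.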

Module Contents

Classes

Name	Description
`Block`	MiniMax M3 decoder block: attention + (dense MLP or MoE) with Gemma norms.
`MiniMaxM3Attention`	MiniMax M3 GQA attention with per-head Gemma Q/K norm and partial RoPE.
`MiniMaxM3Indexer`	Lightning indexer (selection-only) for MiniMax M3 sparse-attention layers.
`MiniMaxM3MLP`	Dense / shared-expert MLP with SwiGLU-OAI activation (separate gate/up/down).
`MiniMaxM3RMSNorm`	RMSNorm with optional Gemma-style zero-centered gamma (`x_normed * (1 + w)`).

Functions

Name	Description
`_padding_mask_to_additive_bias`	Convert an incoming attention mask to an additive key bias broadcastable to `ref`.
`build_block_sparse_attn_bias`	Build the additive block-sparse causal attention bias from index q/k.
`swiglu_oai`	GPT-OSS / MiniMax-M3 SwiGLU-OAI: `gate * sigmoid(alpha * gate) * (up + 1)`.

API

class nemo_automodel.components.models.minimax_m3_vl.layers.Block(
    layer_idx: int,
    config: typing.Any,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

MiniMax M3 decoder block: attention + (dense MLP or MoE) with Gemma norms.

When moe_layer_freq[layer_idx] == 0 the block uses a dense MiniMaxM3MLP (sized by dense_intermediate_size); otherwise it uses a routed MoE plus a separate SwiGLU-OAI shared expert. The shared expert is kept M3-local rather than using MoE’s built-in shared expert, whose generic MLP does not implement SwiGLU-OAI.
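
The selection rule can be sketched in a few lines. The helper name `layer_kinds` and the example frequency list are hypothetical; only the `== 0` test comes from the description above.

```python
def layer_kinds(moe_layer_freq):
    """Label each layer index as 'dense' or 'moe' (hypothetical helper).

    A zero entry selects the dense MLP; any other value selects the MoE path.
    """
    return ["dense" if freq == 0 else "moe" for freq in moe_layer_freq]


# Hypothetical pattern: first layer dense, remaining layers MoE.
kinds = layer_kinds([0, 1, 1, 1])
```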

input_layernorm

is_moe_layer

mlp

= MoE(moe_config, backend)

post_attention_layernorm

self_attn

shared_experts

= MiniMaxM3MLP(config, shared_inter, backend)

nemo_automodel.components.models.minimax_m3_vl.layers.Block.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor

nemo_automodel.components.models.minimax_m3_vl.layers.Block.init_weights(
    buffer_device: torch.device
)

class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Attention(
    config: typing.Any,
    backend: nemo_automodel.components.models.common.BackendConfig,
    is_sparse_attention_layer: bool = False
)

Bases: Module

MiniMax M3 GQA attention with per-head Gemma Q/K norm and partial RoPE.

When is_sparse_attention_layer is set, an additional lightning indexer (index_q/k_proj + per-head Gemma norm) selects, per query, the top-k key blocks to attend to (block-level DeepSeek-style sparse attention). M3 sets disable_index_value=True so the index branch is selection-only.

head_dim

indexer

k_norm

k_proj

num_heads

= config.num_attention_heads

num_kv_heads

= config.num_key_value_heads

o_proj

q_norm

q_proj

use_qk_norm

= getattr(config, 'use_qk_norm', False)

v_proj

nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Attention.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor

nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Attention.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
)

class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Indexer(
    config: typing.Any,
    sparse_cfg: dict,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Lightning indexer (selection-only) for MiniMax M3 sparse-attention layers.

Projects hidden states to num_index_heads index queries and a single shared index key (disable_index_value=True for M3, so there is no index value/output projection). Per-head Gemma RMSNorm + partial RoPE mirror the main attention. The produced idx_q/idx_k feed build_block_sparse_attn_bias to select which key blocks each query attends to.

block_size

= sparse_cfg['sparse_block_size']

index_head_dim

= sparse_cfg['sparse_index_dim']

index_k_norm

index_k_proj

index_q_norm

index_q_proj

init_blocks

= sparse_cfg.get('sparse_init_block', 0)

local_blocks

= sparse_cfg.get('sparse_local_block', 1)

num_index_heads

= sparse_cfg['sparse_num_index_heads']

score_type

= sparse_cfg.get('sparse_score_type', 'max')

topk_blocks

= sparse_cfg['sparse_topk_blocks']

nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Indexer.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    num_q_heads: int,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor

nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Indexer.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
)

class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3MLP(
    config: typing.Any,
    intermediate_size: int,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Dense / shared-expert MLP with SwiGLU-OAI activation (separate gate/up/down).

alpha

= float(getattr(config, 'swiglu_alpha', 1.702))

down_proj

gate_proj

limit

= float(getattr(config, 'swiglu_limit', 7.0))

up_proj

nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3MLP.forward(
    x: torch.Tensor
) -> torch.Tensor

nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3MLP.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None

class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3RMSNorm(
    dim: int,
    eps: float = 1e-06,
    gemma: bool = True
)

Bases: Module

RMSNorm with optional Gemma-style zero-centered gamma (x_normed * (1 + w)).

When gemma=True the learnable weight is centered at 0 and the effective scale is 1 + weight (matching HF GemmaRMSNorm and the sglang M3 reference). Used both for hidden-size norms and, with dim=head_dim, for per-head Q/K normalization (the input is normalized over its last dim, so a [..., num_heads, head_dim] tensor is normalized independently per head).
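
A minimal plain-Python sketch of the behaviour described above, operating on lists rather than tensors. The function names are hypothetical. Placing eps inside the square root follows the usual Gemma-style formulation and is an assumption here.

```python
import math


def rms_norm(x, weight, eps=1e-6, gemma=True):
    """Normalize one vector by its RMS, then scale (hypothetical list-based sketch)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # assumption: eps is added before the sqrt
    # Gemma style: the stored weight is centered at 0, so the effective scale is 1 + w.
    scale = [(1.0 + w) if gemma else w for w in weight]
    return [v * inv_rms * s for v, s in zip(x, scale)]


def per_head_norm(heads, weight, eps=1e-6):
    """Apply the same head_dim-sized weight to each head independently."""
    return [rms_norm(h, weight, eps) for h in heads]


# Two heads with very different magnitudes normalize to roughly the same values.
normed = per_head_norm([[1.0, 1.0], [10.0, 10.0]], weight=[0.0, 0.0])
```

With a zero-initialized weight, gemma=True yields an identity scale, whereas gemma=False would zero the output; this is why the Gemma variant centers the weight at 0.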

weight

nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3RMSNorm.forward(
    x: torch.Tensor
) -> torch.Tensor

nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3RMSNorm.reset_parameters() -> None

nemo_automodel.components.models.minimax_m3_vl.layers._padding_mask_to_additive_bias(
    attention_mask: torch.Tensor,
    ref: torch.Tensor
) -> torch.Tensor

Convert an incoming attention mask to an additive key bias broadcastable to ref.

Accepts a 2-D [B, T] keep-mask (1/True = attend) or an already-additive float mask; returns 0 where attended and -inf where masked.
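
The keep-mask case can be illustrated on a single row in plain Python. This sketch covers only the 1/True-to-0 and 0/False-to--inf mapping; it does not reproduce the broadcasting to ref or the handling of already-additive masks. Both helper names are hypothetical.

```python
import math


def keep_mask_to_additive_bias(mask_row):
    """Map 1/True -> 0.0 (attend) and 0/False -> -inf (masked) for one [T] row."""
    return [0.0 if keep else float("-inf") for keep in mask_row]


def masked_softmax(scores, bias):
    """Softmax over scores + bias; assumes at least one key is not masked."""
    biased = [s + b for s, b in zip(scores, bias)]
    m = max(biased)
    exps = [math.exp(v - m) for v in biased]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


bias = keep_mask_to_additive_bias([1, 1, 0])      # last key is padding
probs = masked_softmax([1.0, 2.0, 3.0], bias)     # padded key gets probability 0
```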

nemo_automodel.components.models.minimax_m3_vl.layers.build_block_sparse_attn_bias(
    idx_q: torch.Tensor,
    idx_k: torch.Tensor,
    block_size: int,
    topk_blocks: int,
    init_blocks: int,
    local_blocks: int,
    num_q_heads: int,
    score_type: str = 'max'
) -> torch.Tensor

Build the additive block-sparse causal attention bias from index q/k.

Mirrors the sglang minimax_sparse selection (block_size_q=1 -> per-query-position): the index score for (query i, key j) is (idx_q[i] . idx_k[j]) * idx_dim**-0.5 with causal masking; keys are grouped into blocks of block_size and reduced per block (max or lse). For each query, the current block (local_blocks) and the first init_blocks are always kept and the remaining budget is filled with the highest-scoring causal blocks, up to min(topk_blocks, valid_blocks).

Parameters:

idx_q

torch.Tensor

[B, T, H_idx, D] index queries (post norm + RoPE).

idx_k

torch.Tensor

[B, T, 1, D] shared index key (post norm + RoPE).

num_q_heads

int

number of main attention heads; the per-idx-head bias is expanded num_q_heads // H_idx times (GQA, repeat-interleave).

Returns: torch.Tensor

[B, num_q_heads, T, T] float bias (0 where attended, -inf where masked).
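
The selection described above can be sketched in plain Python for a single batch element and a single index head. This is one reading of the description, not the module's implementation: the helper names are hypothetical, local_blocks is assumed to mean the last local_blocks causal blocks ending at the query's own block, forced blocks are assumed to count against the min(topk_blocks, valid_blocks) budget, and the expansion to num_q_heads is omitted.

```python
import math

NEG_INF = float("-inf")


def block_sparse_keep(idx_q, idx_k, block_size, topk_blocks,
                      init_blocks, local_blocks, score_type="max"):
    """Return keep[i][j] (True = query i may attend key j) for one index head."""
    T, D = len(idx_q), len(idx_q[0])
    scale = D ** -0.5
    keep = [[False] * T for _ in range(T)]
    for i in range(T):
        valid = i // block_size + 1  # blocks holding at least one causal key
        block_scores = []
        for b in range(valid):
            lo, hi = b * block_size, min((b + 1) * block_size, i + 1)
            s = [scale * sum(q * k for q, k in zip(idx_q[i], idx_k[j]))
                 for j in range(lo, hi)]
            if score_type == "max":
                block_scores.append(max(s))
            else:  # "lse": log-sum-exp over the causal keys of the block
                m = max(s)
                block_scores.append(m + math.log(sum(math.exp(v - m) for v in s)))
        # Always-kept blocks: the first init_blocks and the trailing local_blocks.
        forced = set(range(min(init_blocks, valid)))
        forced |= set(range(max(0, valid - local_blocks), valid))
        budget = min(topk_blocks, valid)
        ranked = sorted((b for b in range(valid) if b not in forced),
                        key=lambda b: block_scores[b], reverse=True)
        selected = forced | set(ranked[:max(0, budget - len(forced))])
        for j in range(i + 1):  # causal: keys after the query stay masked
            keep[i][j] = (j // block_size) in selected
    return keep


def to_additive_bias(keep):
    """Turn the boolean keep matrix into a 0 / -inf additive bias."""
    return [[0.0 if k else NEG_INF for k in row] for row in keep]


# 8 positions, blocks of 2 keys; key 2 scores highest, so block 1 wins the free slot.
key_strength = [0.1, 0.2, 5.0, 0.3, 0.1, 0.1, 0.2, 0.2]
idx_q = [[1.0, 0.0] for _ in range(8)]
idx_k = [[v, 0.0] for v in key_strength]
keep = block_sparse_keep(idx_q, idx_k, block_size=2, topk_blocks=2,
                         init_blocks=0, local_blocks=1)
bias = to_additive_bias(keep)
```

In this example the last query keeps its own block (keys 6 and 7) plus the highest-scoring earlier block (keys 2 and 3).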

nemo_automodel.components.models.minimax_m3_vl.layers.swiglu_oai(
    gate: torch.Tensor,
    up: torch.Tensor,
    alpha: float,
    limit: float
) -> torch.Tensor

GPT-OSS / MiniMax-M3 SwiGLU-OAI: gate * sigmoid(alpha * gate) * (up + 1).

Gate is clamped max=limit and up is clamped +/-limit (when limit > 0), computed in fp32 and cast back. Equivalent to sglang’s swiglu_no_interleaved_with_alpha_and_limit.
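
A scalar plain-Python sketch of the formula and clamping described above. It ignores the fp32 cast and tensor handling; the helper names are hypothetical, and the default alpha and limit are taken from the MiniMaxM3MLP attribute defaults listed earlier.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function for a Python float."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swiglu_oai_scalar(gate, up, alpha=1.702, limit=7.0):
    """gate * sigmoid(alpha * gate) * (up + 1) with the clamps described above."""
    if limit > 0:
        gate = min(gate, limit)             # gate: upper clamp only
        up = max(-limit, min(up, limit))    # up: clamped on both sides
    return gate * _sigmoid(alpha * gate) * (up + 1.0)


# Inputs beyond the limit saturate: both calls produce the same value.
saturated = swiglu_oai_scalar(100.0, 100.0)
at_limit = swiglu_oai_scalar(7.0, 7.0)
```

Because only the upper side of gate is clamped, large negative gate values still drive the output toward zero through the sigmoid term.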