nemo_automodel.components.models.minimax_m3_vl.layers

MiniMax M3 VL text-backbone layers.

Stage 1 covers the dense + MoE text path (no sparse-attention index branch and no MTP). Mirrors the canonical sglang reference sglang.srt.models.minimax_m3 (MiniMaxM3Attention / MiniMaxM3MLP / MiniMaxM3MoE / MiniMaxM3DecoderLayer):

  • per-head Gemma RMSNorm on Q/K (qk_norm_type='per_head', use_gemma_norm=True),
  • partial RoPE (rotary_dim=64 of head_dim=128) reusing the gpt_oss rotary utilities (as the existing minimax_m2 backbone does),
  • SwiGLU-OAI activation gate * sigmoid(alpha * gate) * (up + 1), with gate clamped to at most limit and up clamped to [-limit, limit], for dense and shared experts,
  • per-layer dense-vs-MoE selection from moe_layer_freq.
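The partial-RoPE bullet can be illustrated with a small plain-Python sketch. It is not the module's code: `apply_partial_rope` is a hypothetical name, and the sketch assumes that the rotated slice is the leading `rotary_dim` entries of each head vector, that pairs are formed with the half-split convention, and that the frequency base is 10000. The gpt_oss rotary utilities the module reuses may differ on any of these points.

```python
import math

def apply_partial_rope(vec, pos, rotary_dim, base=10000.0):
    """Rotate the first `rotary_dim` entries of `vec`; pass the rest through.

    Assumptions (not confirmed by the source): leading-slice rotation,
    half-split pairing (i, i + rotary_dim // 2), and base=10000.
    """
    rot, rest = vec[:rotary_dim], vec[rotary_dim:]
    half = rotary_dim // 2
    out = list(rot)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = rot[i], rot[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out + list(rest)

# Toy sizes standing in for rotary_dim=64 of head_dim=128.
head = [1.0] * 8
rotated = apply_partial_rope(head, pos=3, rotary_dim=4)
```

Each rotated pair is a plain 2-D rotation, so the vector norm is preserved and the trailing `head_dim - rotary_dim` entries carry no positional information.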

Module Contents

Classes

| Name | Description |
| --- | --- |
| Block | MiniMax M3 decoder block: attention + (dense MLP or MoE) with Gemma norms. |
| MiniMaxM3Attention | MiniMax M3 GQA attention with per-head Gemma Q/K norm and partial RoPE. |
| MiniMaxM3Indexer | Lightning indexer (selection-only) for MiniMax M3 sparse-attention layers. |
| MiniMaxM3MLP | Dense / shared-expert MLP with SwiGLU-OAI activation (separate gate/up/down). |
| MiniMaxM3RMSNorm | RMSNorm with optional Gemma-style zero-centered gamma (x_normed * (1 + w)). |

Functions

| Name | Description |
| --- | --- |
| _padding_mask_to_additive_bias | Convert an incoming attention mask to an additive key bias broadcastable to ref. |
| build_block_sparse_attn_bias | Build the additive block-sparse causal attention bias from index q/k. |
| swiglu_oai | GPT-OSS / MiniMax-M3 SwiGLU-OAI: gate * sigmoid(alpha * gate) * (up + 1). |

API

class nemo_automodel.components.models.minimax_m3_vl.layers.Block(
layer_idx: int,
config: typing.Any,
moe_config: nemo_automodel.components.moe.layers.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

MiniMax M3 decoder block: attention + (dense MLP or MoE) with Gemma norms.

If moe_layer_freq[layer_idx] == 0, the block uses a dense MiniMaxM3MLP (sized by dense_intermediate_size); otherwise it uses a routed MoE plus a separate SwiGLU-OAI shared expert. The shared expert is kept M3-local rather than using MoE’s built-in shared expert, whose generic MLP does not implement SwiGLU-OAI.
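The selection rule can be sketched in a few lines of plain Python. The helper below and the example pattern are illustrative assumptions, not part of the module; only the `== 0` test comes from the description above.

```python
def is_moe_layer(moe_layer_freq, layer_idx):
    """True when the layer at `layer_idx` should use the routed-MoE path."""
    return moe_layer_freq[layer_idx] != 0

# Hypothetical 4-layer pattern: a dense first layer followed by MoE layers.
example_freq = [0, 1, 1, 1]
layout = ["moe" if is_moe_layer(example_freq, i) else "dense"
          for i in range(len(example_freq))]
print(layout)  # ['dense', 'moe', 'moe', 'moe']
```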

input_layernorm
is_moe_layer
mlp = MoE(moe_config, backend)
post_attention_layernorm
self_attn
shared_experts = MiniMaxM3MLP(config, shared_inter, backend)
nemo_automodel.components.models.minimax_m3_vl.layers.Block.forward(
x: torch.Tensor,
freqs_cis: torch.Tensor,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
attn_kwargs: typing.Any = {}
) -> torch.Tensor
nemo_automodel.components.models.minimax_m3_vl.layers.Block.init_weights(
buffer_device: torch.device
)
class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Attention(
config: typing.Any,
backend: nemo_automodel.components.models.common.BackendConfig,
is_sparse_attention_layer: bool = False
)

Bases: Module

MiniMax M3 GQA attention with per-head Gemma Q/K norm and partial RoPE.

When is_sparse_attention_layer is set, an additional lightning indexer (index_q/k_proj + per-head Gemma norm) selects, per query, the top-k key blocks to attend to (block-level DeepSeek-style sparse attention). M3 sets disable_index_value=True so the index branch is selection-only.

head_dim
indexer
k_norm
k_proj
num_heads = config.num_attention_heads
num_kv_heads = config.num_key_value_heads
o_proj
q_norm
q_proj
use_qk_norm = getattr(config, 'use_qk_norm', False)
v_proj
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Attention.forward(
x: torch.Tensor,
freqs_cis: torch.Tensor,
attention_mask: torch.Tensor | None = None,
attn_kwargs: typing.Any = {}
) -> torch.Tensor
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Attention.init_weights(
buffer_device: torch.device,
init_std: float = 0.02
)
class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Indexer(
config: typing.Any,
sparse_cfg: dict,
backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Lightning indexer (selection-only) for MiniMax M3 sparse-attention layers.

Projects hidden states to num_index_heads index queries and a single shared index key (disable_index_value=True for M3, so there is no index value/output projection). Per-head Gemma RMSNorm + partial RoPE mirror the main attention. The produced idx_q/idx_k feed build_block_sparse_attn_bias to select which key blocks each query attends to.

block_size = sparse_cfg['sparse_block_size']
index_head_dim = sparse_cfg['sparse_index_dim']
index_k_norm
index_k_proj
index_q_norm
index_q_proj
init_blocks = sparse_cfg.get('sparse_init_block', 0)
local_blocks = sparse_cfg.get('sparse_local_block', 1)
num_index_heads = sparse_cfg['sparse_num_index_heads']
score_type = sparse_cfg.get('sparse_score_type', 'max')
topk_blocks = sparse_cfg['sparse_topk_blocks']
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Indexer.forward(
x: torch.Tensor,
freqs_cis: torch.Tensor,
num_q_heads: int,
attn_kwargs: typing.Any = {}
) -> torch.Tensor
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Indexer.init_weights(
buffer_device: torch.device,
init_std: float = 0.02
)
class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3MLP(
config: typing.Any,
intermediate_size: int,
backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Dense / shared-expert MLP with SwiGLU-OAI activation (separate gate/up/down).

alpha = float(getattr(config, 'swiglu_alpha', 1.702))
down_proj
gate_proj
limit = float(getattr(config, 'swiglu_limit', 7.0))
up_proj
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3MLP.forward(
x: torch.Tensor
) -> torch.Tensor
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3MLP.init_weights(
buffer_device: torch.device,
init_std: float = 0.02
) -> None
class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3RMSNorm(
dim: int,
eps: float = 1e-06,
gemma: bool = True
)

Bases: Module

RMSNorm with optional Gemma-style zero-centered gamma (x_normed * (1 + w)).

When gemma=True the learnable weight is centered at 0 and the effective scale is 1 + weight (matching HF GemmaRMSNorm and the sglang M3 reference). Used both for hidden-size norms and, with dim=head_dim, for per-head Q/K normalization (the input is normalized over its last dim, so a [..., num_heads, head_dim] tensor is normalized independently per head).
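A plain-Python sketch of the scaling rule may help. It assumes the common x / sqrt(mean(x^2) + eps) form of RMSNorm, which the text above does not spell out, and the function name `rms_norm` is hypothetical. It works on a single vector; per-head normalization is the same call applied to each head's slice.

```python
import math

def rms_norm(x, weight, eps=1e-6, gemma=True):
    """Normalize a 1-D list of floats over its only (last) dimension.

    With gemma=True the effective scale is 1 + weight, so a zero weight
    leaves the normalized values unscaled.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    scale = [(1.0 + w) if gemma else w for w in weight]
    return [v * inv_rms * s for v, s in zip(x, scale)]

# Per-head use: each [head_dim] slice is normalized independently.
heads = [[3.0, 4.0], [0.0, 2.0]]
normed = [rms_norm(h, [0.0, 0.0]) for h in heads]
```

A zero-initialized Gemma weight therefore behaves like a ones-initialized standard RMSNorm weight.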

weight
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3RMSNorm.forward(
x: torch.Tensor
) -> torch.Tensor
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3RMSNorm.reset_parameters() -> None
nemo_automodel.components.models.minimax_m3_vl.layers._padding_mask_to_additive_bias(
attention_mask: torch.Tensor,
ref: torch.Tensor
) -> torch.Tensor

Convert an incoming attention mask to an additive key bias broadcastable to ref.

Accepts a 2-D [B, T] keep-mask (1/True = attend) or an already-additive float mask; returns 0 where attended and -inf where masked.
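As a rough illustration, the 2-D keep-mask case can be sketched with nested lists. The function name is hypothetical, and the sketch covers neither the already-additive input case nor the broadcast to `ref`. The small softmax only shows why a -inf bias removes a key from attention.

```python
import math

def keep_mask_to_additive_bias(keep_mask):
    """Map a [B][T] keep-mask (1/True = attend) to 0.0 / -inf biases."""
    neg_inf = float("-inf")
    return [[0.0 if keep else neg_inf for keep in row] for row in keep_mask]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

bias = keep_mask_to_additive_bias([[1, 1, 0]])
scores = [0.5, 0.5, 3.0]  # the last key is padding despite its high score
probs = softmax([s + b for s, b in zip(scores, bias[0])])
print(probs)  # [0.5, 0.5, 0.0]
```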

nemo_automodel.components.models.minimax_m3_vl.layers.build_block_sparse_attn_bias(
idx_q: torch.Tensor,
idx_k: torch.Tensor,
block_size: int,
topk_blocks: int,
init_blocks: int,
local_blocks: int,
num_q_heads: int,
score_type: str = 'max'
) -> torch.Tensor

Build the additive block-sparse causal attention bias from index q/k.

Mirrors the sglang minimax_sparse selection (block_size_q=1 -> per-query-position): the index score for (query i, key j) is (idx_q[i] . idx_k[j]) * idx_dim**-0.5 with causal masking; keys are grouped into blocks of block_size and reduced per block (max or lse). For each query, the current block (local_blocks) and the first init_blocks are always kept and the remaining budget is filled with the highest-scoring causal blocks, up to min(topk_blocks, valid_blocks).

Parameters:

idx_q
torch.Tensor

[B, T, H_idx, D] index queries (post norm + RoPE).

idx_k
torch.Tensor

[B, T, 1, D] shared index key (post norm + RoPE).

num_q_heads
int

Number of main attention heads; the per-idx-head bias is expanded num_q_heads // H_idx times (GQA, repeat-interleave).

Returns: torch.Tensor

[B, num_q_heads, T, T] float bias (0 where attended, -inf where masked).
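The selection described above can be sketched in plain Python for a single sequence and a single index head. This is a reading of the docstring, not the module's implementation, and it makes several assumptions: only the 'max' reduction is shown, the always-kept blocks count toward the top-k budget, local_blocks is taken to mean the trailing blocks ending at the query's own block, and ties are broken by block order. Both function names are hypothetical.

```python
def block_sparse_bias(idx_q, idx_k, block_size, topk_blocks,
                      init_blocks=0, local_blocks=1):
    """Return a T x T additive bias (0.0 = attend, -inf = masked)."""
    T, D = len(idx_q), len(idx_q[0])
    neg_inf = float("-inf")
    bias = [[neg_inf] * T for _ in range(T)]
    for i in range(T):
        valid = i // block_size + 1            # causal blocks 0..current
        # Per-block score: max over the causal keys inside each block.
        block_score = []
        for b in range(valid):
            keys = range(b * block_size, min((b + 1) * block_size, i + 1))
            block_score.append(max(
                sum(q * k for q, k in zip(idx_q[i], idx_k[j])) * D ** -0.5
                for j in keys))
        # Always-kept blocks: leading init blocks and trailing local blocks.
        keep = set(range(min(init_blocks, valid)))
        keep.update(range(max(0, valid - local_blocks), valid))
        # Fill the remaining budget with the best-scoring other blocks.
        budget = min(topk_blocks, valid)
        rest = sorted((b for b in range(valid) if b not in keep),
                      key=lambda b: block_score[b], reverse=True)
        keep.update(rest[:max(0, budget - len(keep))])
        for j in range(i + 1):
            if j // block_size in keep:
                bias[i][j] = 0.0
    return bias

def expand_heads(per_idx_head, num_q_heads):
    """Repeat-interleave per-index-head items up to num_q_heads."""
    reps = num_q_heads // len(per_idx_head)
    return [h for h in per_idx_head for _ in range(reps)]

# Toy input: 8 positions, 4 blocks of 2 keys; block 1 scores highest.
q = [[1.0]] * 8
k = [[0.0], [0.0], [9.0], [9.0], [1.0], [1.0], [2.0], [2.0]]
bias = block_sparse_bias(q, k, block_size=2, topk_blocks=2)
```

For the last query, the current block (3) is kept unconditionally and the one remaining slot goes to block 1, so keys 2, 3, 6 and 7 receive a bias of 0.0 and every other key is masked.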

nemo_automodel.components.models.minimax_m3_vl.layers.swiglu_oai(
gate: torch.Tensor,
up: torch.Tensor,
alpha: float,
limit: float
) -> torch.Tensor

GPT-OSS / MiniMax-M3 SwiGLU-OAI: gate * sigmoid(alpha * gate) * (up + 1).

When limit > 0, gate is clamped to at most limit and up is clamped to [-limit, limit]. The activation is computed in fp32 and cast back. Equivalent to sglang’s swiglu_no_interleaved_with_alpha_and_limit.
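A scalar reference in plain Python makes the clamping asymmetry concrete. It is a sketch for checking values by hand, not the tensor implementation: the function names are hypothetical and the fp32 compute / cast-back step is omitted.

```python
import math

def _sigmoid(x):
    """Overflow-safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swiglu_oai_scalar(gate, up, alpha, limit):
    """gate * sigmoid(alpha * gate) * (up + 1), with optional clamping."""
    if limit > 0:
        gate = min(gate, limit)              # gate: upper clamp only
        up = max(-limit, min(up, limit))     # up: symmetric clamp
    return gate * _sigmoid(alpha * gate) * (up + 1.0)

# Defaults documented on MiniMaxM3MLP: alpha=1.702, limit=7.0.
clamped = swiglu_oai_scalar(100.0, 100.0, alpha=1.702, limit=7.0)
at_limit = swiglu_oai_scalar(7.0, 7.0, alpha=1.702, limit=7.0)
```

Because gate has no lower clamp, large negative gates still decay toward zero through the sigmoid instead of saturating at -limit.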