# nemo_automodel.components.models.minimax_m3_vl.layers

MiniMax M3 VL text-backbone layers.

Stage 1 covers the dense + MoE text path (no sparse-attention index branch and
no MTP). It mirrors the canonical sglang reference
`sglang.srt.models.minimax_m3` (`MiniMaxM3Attention` / `MiniMaxM3MLP` /
`MiniMaxM3MoE` / `MiniMaxM3DecoderLayer`):

* per-head **Gemma** RMSNorm on Q/K (`qk_norm_type='per_head'`,
  `use_gemma_norm=True`),
* partial RoPE (`rotary_dim=64` of `head_dim=128`) reusing the gpt\_oss
  rotary utilities (as the existing `minimax_m2` backbone does),
* SwiGLU-OAI activation `gate * sigmoid(alpha * gate) * (up + 1)`, with `gate`
  clamped from above at `limit` and `up` clamped to `[-limit, limit]`, for dense
  and shared experts,
* per-layer dense-vs-MoE selection from `moe_layer_freq`.

## Module Contents

### Classes

| Name                                                                                              | Description                                                                   |
| ------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| [`Block`](#nemo_automodel-components-models-minimax_m3_vl-layers-Block)                           | MiniMax M3 decoder block: attention + (dense MLP or MoE) with Gemma norms.    |
| [`MiniMaxM3Attention`](#nemo_automodel-components-models-minimax_m3_vl-layers-MiniMaxM3Attention) | MiniMax M3 GQA attention with per-head Gemma Q/K norm and partial RoPE.       |
| [`MiniMaxM3Indexer`](#nemo_automodel-components-models-minimax_m3_vl-layers-MiniMaxM3Indexer)     | Lightning indexer (selection-only) for MiniMax M3 sparse-attention layers.    |
| [`MiniMaxM3MLP`](#nemo_automodel-components-models-minimax_m3_vl-layers-MiniMaxM3MLP)             | Dense / shared-expert MLP with SwiGLU-OAI activation (separate gate/up/down). |
| [`MiniMaxM3RMSNorm`](#nemo_automodel-components-models-minimax_m3_vl-layers-MiniMaxM3RMSNorm)     | RMSNorm with optional Gemma-style zero-centered gamma (`x_normed * (1 + w)`). |

### Functions

| Name                                                                                                                      | Description                                                                        |
| ------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| [`_padding_mask_to_additive_bias`](#nemo_automodel-components-models-minimax_m3_vl-layers-_padding_mask_to_additive_bias) | Convert an incoming attention mask to an additive key bias broadcastable to `ref`. |
| [`build_block_sparse_attn_bias`](#nemo_automodel-components-models-minimax_m3_vl-layers-build_block_sparse_attn_bias)     | Build the additive block-sparse causal attention bias from index q/k.              |
| [`swiglu_oai`](#nemo_automodel-components-models-minimax_m3_vl-layers-swiglu_oai)                                         | GPT-OSS / MiniMax-M3 SwiGLU-OAI: `gate * sigmoid(alpha * gate) * (up + 1)`.        |

### API

```python
class nemo_automodel.components.models.minimax_m3_vl.layers.Block(
    layer_idx: int,
    config: typing.Any,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

MiniMax M3 decoder block: attention + (dense MLP or MoE) with Gemma norms.

`moe_layer_freq[layer_idx] == 0` -> dense `MiniMaxM3MLP` (with
`dense_intermediate_size`); otherwise a routed `MoE` plus a separate
SwiGLU-OAI shared expert (kept M3-local rather than using `MoE`'s built-in
shared expert, whose generic `MLP` does not implement SwiGLU-OAI).
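
The selection rule above can be sketched in a few lines. This is an illustration only: `layer_kinds` and the four-layer schedule are hypothetical, and the real `Block` constructs modules rather than returning labels.

```python
def layer_kinds(moe_layer_freq):
    """Label each layer 'dense' or 'moe' from its moe_layer_freq entry.

    An entry of 0 selects the dense MLP; any other value selects the
    routed MoE (plus the separate shared expert).
    """
    return ["dense" if flag == 0 else "moe" for flag in moe_layer_freq]


# Hypothetical schedule: one dense layer followed by three MoE layers.
kinds = layer_kinds([0, 1, 1, 1])
print(kinds)  # ['dense', 'moe', 'moe', 'moe']
```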

```python
nemo_automodel.components.models.minimax_m3_vl.layers.Block.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.models.minimax_m3_vl.layers.Block.init_weights(
    buffer_device: torch.device
)
```

```python
class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Attention(
    config: typing.Any,
    backend: nemo_automodel.components.models.common.BackendConfig,
    is_sparse_attention_layer: bool = False
)
```

**Bases:** `Module`

MiniMax M3 GQA attention with per-head Gemma Q/K norm and partial RoPE.

When `is_sparse_attention_layer` is set, an additional lightning indexer
(`index_q/k_proj` + per-head Gemma norm) selects, per query, the top-k key
*blocks* to attend to (block-level DeepSeek-style sparse attention). M3 sets
`disable_index_value=True` so the index branch is selection-only.
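
As a rough illustration of the partial RoPE named in the class summary, the plain-Python sketch below rotates only the first `rotary_dim` entries of one head vector and passes the rest through unchanged. The helper name `partial_rope`, the half-split pairing, and the `base` value are assumptions made for illustration. The actual module consumes precomputed `freqs_cis` through the gpt_oss rotary utilities, which are not reproduced here.

```python
import math


def partial_rope(vec, position, rotary_dim, base=10000.0):
    """Rotate the first rotary_dim entries of vec; leave the rest as-is.

    Assumption: half-split pairing, i.e. entry k is paired with entry
    k + rotary_dim // 2. The real utilities may pair entries differently.
    """
    half = rotary_dim // 2
    out = list(vec)
    for k in range(half):
        theta = position * base ** (-2.0 * k / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[k], vec[k + half]
        out[k] = a * c - b * s
        out[k + half] = a * s + b * c
    return out


# Toy sizes (head_dim=8, rotary_dim=4) standing in for 128 and 64.
head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = partial_rope(head, position=3, rotary_dim=4)
print(rotated[4:])  # the non-rotary tail is untouched
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and only the rotary slice carries position information.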

```python
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Attention.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Attention.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
)
```

```python
class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Indexer(
    config: typing.Any,
    sparse_cfg: dict,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

Lightning indexer (selection-only) for MiniMax M3 sparse-attention layers.

Projects hidden states to `num_index_heads` index queries and a single
shared index key (`disable_index_value=True` for M3, so there is no index
value/output projection). Per-head Gemma RMSNorm + partial RoPE mirror the
main attention. The produced `idx_q`/`idx_k` feed
`build_block_sparse_attn_bias` to select which key blocks each query
attends to.

```python
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Indexer.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    num_q_heads: int,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3Indexer.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
)
```

```python
class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3MLP(
    config: typing.Any,
    intermediate_size: int,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

Dense / shared-expert MLP with SwiGLU-OAI activation (separate gate/up/down).

```python
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3MLP.forward(
    x: torch.Tensor
) -> torch.Tensor
```

```python
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3MLP.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

```python
class nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3RMSNorm(
    dim: int,
    eps: float = 1e-06,
    gemma: bool = True
)
```

**Bases:** `Module`

RMSNorm with optional Gemma-style zero-centered gamma (`x_normed * (1 + w)`).

When `gemma=True` the learnable weight is centered at 0 and the effective
scale is `1 + weight` (matching HF `GemmaRMSNorm` and the sglang M3
reference). Used both for hidden-size norms and, with `dim=head_dim`, for
per-head Q/K normalization (the input is normalized over its last dim, so a
`[..., num_heads, head_dim]` tensor is normalized independently per head).
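
The arithmetic can be sketched without any framework. The version below works on plain lists and assumes the HF `GemmaRMSNorm` convention of adding `eps` to the mean square before the square root. The helper names `gemma_rms_norm` and `per_head_norm` are hypothetical, and dtype handling is omitted.

```python
import math


def gemma_rms_norm(x, weight, eps=1e-6, gemma=True):
    """Normalize a vector by its RMS, then apply the learnable scale.

    With gemma=True the effective scale is 1 + weight, so an all-zero
    weight leaves the normalized vector unchanged.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    scale = [(1.0 + w) if gemma else w for w in weight]
    return [v * inv_rms * s for v, s in zip(x, scale)]


def per_head_norm(heads, weight, eps=1e-6):
    """Apply one head_dim-sized norm to each head independently."""
    return [gemma_rms_norm(h, weight, eps) for h in heads]


# Two heads with very different magnitudes but the same direction.
heads = [[3.0, 4.0], [30.0, 40.0]]
normed = per_head_norm(heads, weight=[0.0, 0.0])
print(normed)  # both heads come out with RMS close to 1
```

Normalizing over the last dimension is what makes the same module serve both roles: a `dim=hidden_size` instance normalizes whole token vectors, while a `dim=head_dim` instance normalizes each head on its own.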

```python
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3RMSNorm.forward(
    x: torch.Tensor
) -> torch.Tensor
```

```python
nemo_automodel.components.models.minimax_m3_vl.layers.MiniMaxM3RMSNorm.reset_parameters() -> None
```

```python
nemo_automodel.components.models.minimax_m3_vl.layers._padding_mask_to_additive_bias(
    attention_mask: torch.Tensor,
    ref: torch.Tensor
) -> torch.Tensor
```

Convert an incoming attention mask to an additive key bias broadcastable to `ref`.

Accepts a 2-D `[B, T]` keep-mask (1/True = attend) or an already-additive
float mask; returns `0` where attended and `-inf` where masked.
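
A minimal sketch of the keep-mask branch, using nested lists instead of tensors. The helper names are hypothetical, and neither the already-additive input path nor the broadcasting to `ref` is modeled here.

```python
import math

NEG_INF = float("-inf")


def keep_mask_to_additive_bias(keep_mask):
    """Map a [B][T] keep-mask (1/True = attend) to 0.0 / -inf biases."""
    return [[0.0 if keep else NEG_INF for keep in row] for row in keep_mask]


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One sequence of three keys; the last key is padding.
bias = keep_mask_to_additive_bias([[1, 1, 0]])
scores = [1.0, 2.0, 3.0]
weights = softmax([s + b for s, b in zip(scores, bias[0])])
print(weights)  # the padded key receives exactly zero weight
```

Adding `-inf` before the softmax removes a key from the distribution without changing the relative weights of the remaining keys.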

```python
nemo_automodel.components.models.minimax_m3_vl.layers.build_block_sparse_attn_bias(
    idx_q: torch.Tensor,
    idx_k: torch.Tensor,
    block_size: int,
    topk_blocks: int,
    init_blocks: int,
    local_blocks: int,
    num_q_heads: int,
    score_type: str = 'max'
) -> torch.Tensor
```

Build the additive block-sparse causal attention bias from index q/k.

Mirrors the sglang `minimax_sparse` selection (`block_size_q=1` ->
per-query-position): the index score for (query `i`, key `j`) is
`(idx_q[i] . idx_k[j]) * idx_dim**-0.5` with causal masking; keys are
grouped into blocks of `block_size` and reduced per block (`max` or
`lse`). For each query, the current block (`local_blocks`) and the first
`init_blocks` are always kept and the remaining budget is filled with the
highest-scoring causal blocks, up to `min(topk_blocks, valid_blocks)`.

**Parameters:**

`idx_q`: `[B, T, H_idx, D]` index queries (post norm + RoPE).

`idx_k`: `[B, T, 1, D]` shared index key (post norm + RoPE).

`num_q_heads`: number of main attention heads; the per-idx-head bias is
expanded `num_q_heads // H_idx` times (GQA, repeat-interleave).

**Returns:** `torch.Tensor`

`[B, num_q_heads, T, T]` float bias (`0` where attended, `-inf` where masked).
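
To make the selection procedure concrete, here is a plain-Python sketch for a single batch element and a single index head. It rests on several assumptions that the description does not settle: `local_blocks` is read as the most recent blocks including the current one, always-kept blocks count against the `topk_blocks` budget, and every causal key inside a kept block is opened. The function name `block_sparse_bias` is hypothetical, and the GQA expansion to `num_q_heads` is left out.

```python
import math

NEG_INF = float("-inf")


def block_sparse_bias(idx_q, idx_k, block_size, topk_blocks,
                      init_blocks, local_blocks, score_type="max"):
    """Return a T x T additive bias: 0.0 where attended, -inf elsewhere.

    idx_q and idx_k are lists of T vectors, each a list of D floats.
    """
    T, D = len(idx_q), len(idx_q[0])
    scale = D ** -0.5
    bias = [[NEG_INF] * T for _ in range(T)]
    for i in range(T):
        # Scaled index scores against the causal keys 0..i only.
        scores = [scale * sum(a * b for a, b in zip(idx_q[i], idx_k[j]))
                  for j in range(i + 1)]
        valid_blocks = i // block_size + 1  # blocks holding a causal key
        # Reduce the causal scores inside each block.
        block_scores = []
        for b in range(valid_blocks):
            chunk = scores[b * block_size:(b + 1) * block_size]
            m = max(chunk)
            if score_type == "max":
                block_scores.append(m)
            else:  # "lse": stable log-sum-exp
                block_scores.append(
                    m + math.log(sum(math.exp(s - m) for s in chunk)))
        # Always-kept blocks: the first init_blocks and the latest local_blocks.
        keep = set(range(min(init_blocks, valid_blocks)))
        keep.update(range(max(0, valid_blocks - local_blocks), valid_blocks))
        # Fill the remaining budget with the highest-scoring other blocks.
        budget = min(topk_blocks, valid_blocks)
        ranked = sorted((b for b in range(valid_blocks) if b not in keep),
                        key=lambda b: block_scores[b], reverse=True)
        for b in ranked:
            if len(keep) >= budget:
                break
            keep.add(b)
        # Open the causal keys of every kept block.
        for b in keep:
            for j in range(b * block_size, min((b + 1) * block_size, i + 1)):
                bias[i][j] = 0.0
    return bias


# Eight positions in blocks of two; keys in block 1 score highest.
key_values = [0.0, 0.0, 5.0, 5.0, 1.0, 1.0, 0.0, 0.0]
idx_q = [[1.0, 0.0] for _ in key_values]
idx_k = [[v, 0.0] for v in key_values]
bias = block_sparse_bias(idx_q, idx_k, block_size=2, topk_blocks=3,
                         init_blocks=1, local_blocks=1)
print(bias[7])  # blocks 0, 1 and 3 are open; block 2 is masked
```

For the last query, block 0 (initial) and block 3 (local) are kept unconditionally, which leaves one slot in a budget of three. Block 1 outscores block 2 and takes it.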

```python
nemo_automodel.components.models.minimax_m3_vl.layers.swiglu_oai(
    gate: torch.Tensor,
    up: torch.Tensor,
    alpha: float,
    limit: float
) -> torch.Tensor
```

GPT-OSS / MiniMax-M3 SwiGLU-OAI: `gate * sigmoid(alpha * gate) * (up + 1)`.

Gate is clamped `max=limit` and up is clamped `+/-limit` (when
`limit > 0`), computed in fp32 and cast back. Equivalent to sglang's
`swiglu_no_interleaved_with_alpha_and_limit`.
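
A scalar reference for the formula and clamping rules, written in plain Python. The name `swiglu_oai_scalar` is hypothetical, the `alpha` and `limit` values in the usage lines are example numbers rather than confirmed M3 settings, and the fp32 upcast and cast-back are omitted.

```python
import math


def _sigmoid(z):
    """Sigmoid that avoids overflow for large negative inputs."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swiglu_oai_scalar(gate, up, alpha, limit):
    """gate * sigmoid(alpha * gate) * (up + 1), with optional clamping."""
    if limit > 0:
        gate = min(gate, limit)           # gate: clamped from above only
        up = max(-limit, min(up, limit))  # up: clamped to [-limit, limit]
    return gate * _sigmoid(alpha * gate) * (up + 1.0)


# Example values only; inputs beyond the limit are clamped first.
clamped = swiglu_oai_scalar(10.0, -20.0, alpha=1.702, limit=7.0)
reference = swiglu_oai_scalar(7.0, -7.0, alpha=1.702, limit=7.0)
print(clamped == reference)  # True
```

Note the asymmetry: `gate` has no lower clamp, so large negative gates are suppressed by the sigmoid rather than by clipping, while the `+ 1` on `up` keeps the output from vanishing when `up` is near zero.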