# nemo_automodel.components.models.nemotron_v3.layers

## Module Contents

### Classes

| Name                                                                                                              | Description                                                               |
| ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| [`NemotronV3Attention`](#nemo_automodel-components-models-nemotron_v3-layers-NemotronV3Attention)                 | GQA attention for NemotronV3 (no RoPE), compatible with TE/SDPA backends. |
| [`NemotronV3Block`](#nemo_automodel-components-models-nemotron_v3-layers-NemotronV3Block)                         | NemotronV3 decoder block (training-only, simplified).                     |
| [`NemotronV3Mamba2Mixer`](#nemo_automodel-components-models-nemotron_v3-layers-NemotronV3Mamba2Mixer)             | Mamba2 mixer for NemotronV3 (training-only, uses CUDA kernels).           |
| [`NemotronV3MambaRMSNormGated`](#nemo_automodel-components-models-nemotron_v3-layers-NemotronV3MambaRMSNormGated) | Gated RMSNorm for Mamba layers.                                           |

### API

```python
class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None
)
```

**Bases:** `Module`

GQA attention for NemotronV3 (no RoPE), compatible with TE/SDPA backends.
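In grouped-query attention (GQA), several query heads share one key/value head. The sketch below illustrates that head mapping only. The function name and the head counts are hypothetical, chosen for illustration, and are not taken from the NemotronV3 config.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key/value head shared by the given query head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per shared KV head
    return q_head // group_size


# Hypothetical sizes: 8 query heads sharing 2 key/value heads.
mapping = [kv_head_for_query_head(h, num_q_heads=8, num_kv_heads=2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With `num_kv_heads == num_q_heads` the mapping reduces to standard multi-head attention, and with `num_kv_heads == 1` it reduces to multi-query attention.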

```python
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention.forward(
    hidden_states: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    past_key_values = None,
    layer_idx: int | None = None,
    attn_kwargs = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention.init_weights(
    num_hidden_layers: int,
    rescale_prenorm_residual: bool = True,
    buffer_device: torch.device | None = None
) -> None
```

```python
class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block(
    config,
    layer_idx: int,
    moe_config = None,
    backend = None,
    block_type: str | None = None
)
```

**Bases:** `Module`

NemotronV3 decoder block (training-only, simplified).

Pre-norm architecture: norm → mixer → residual add.

Supports hybrid layer types: Mamba, Attention, MLP, MoE.
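The pre-norm ordering (norm → mixer → residual add) can be sketched in plain Python as follows. This is an illustration of the data flow only, not the block's implementation: `rms_norm`, `pre_norm_block`, and `zero_mixer` are hypothetical names, the choice of an RMS-style norm without learned weights is an assumption, and vectors are plain lists to keep the example dependency-free.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Assumed normalization: scale x by the inverse of its root mean square."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in x]


def pre_norm_block(x: Vector, mixer: Callable[[Vector], Vector]) -> Vector:
    """Apply norm, then the mixer, then add the result to the untouched input."""
    normed = rms_norm(x)                       # 1. norm
    mixed = mixer(normed)                      # 2. mixer (Mamba, attention, MLP, or MoE)
    return [a + b for a, b in zip(x, mixed)]   # 3. residual add


def zero_mixer(v: Vector) -> Vector:
    """Stand-in mixer that contributes nothing, so the block acts as identity."""
    return [0.0 for _ in v]


out = pre_norm_block([1.0, -2.0, 3.0], zero_mixer)
```

Because the residual path bypasses the norm, a mixer that outputs zeros leaves the input unchanged, which is the property that makes pre-norm stacks easy to train at depth.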

For compatibility with the MoE parallelizer, the block also:

* maps block\_type to the MoE parallelizer's layer\_type convention;
* returns the mixer for MoE blocks;
* exposes an alias for the mixer.

```python
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block.forward(
    hidden_states: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    past_key_values = None,
    cache_position: torch.LongTensor | None = None,
    attn_kwargs = {}
) -> torch.Tensor
```

Forward pass through the block.

**Parameters:**

* `hidden_states`: Input tensor of shape (batch, seq\_len, hidden\_size).
* `attention_mask`: Mask tensor whose type depends on the layer:
  * For attention: 4D causal mask \[batch, 1, seq\_len, seq\_len]
  * For mamba: 2D padding mask \[batch, seq\_len]
  * For mlp/moe: None
* `past_key_values`: Optional NemotronHybridCache for KV/SSM caching.
* `cache_position`: Token position indices for cache updates.
* `attn_kwargs`: Additional keyword arguments forwarded to attention layers only (e.g. cu\_seqlens, cp\_size, cp\_rank for Context Parallelism).

**Returns:** `torch.Tensor`

Output tensor of shape (batch, seq\_len, hidden\_size)
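The per-layer mask rules and the attention-only forwarding of `attn_kwargs` described for `forward` can be summarized in a small sketch. The helper names and the lowercase layer-type strings (`"attention"`, `"mamba"`, `"mlp"`, `"moe"`) are assumptions for illustration; the actual `block_type` values are not given here.

```python
from typing import Any, Dict, Optional, Tuple


def expected_mask_shape(block_type: str, batch: int, seq_len: int) -> Optional[Tuple[int, ...]]:
    """Return the mask shape each layer type expects, or None if it takes no mask."""
    if block_type == "attention":
        return (batch, 1, seq_len, seq_len)  # 4D causal mask
    if block_type == "mamba":
        return (batch, seq_len)              # 2D padding mask
    if block_type in ("mlp", "moe"):
        return None                          # no mask is used
    raise ValueError(f"unknown block_type: {block_type!r}")


def forwarded_kwargs(block_type: str, attn_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Extra keyword arguments reach attention layers only; other layers get none."""
    return dict(attn_kwargs) if block_type == "attention" else {}


for kind in ("attention", "mamba", "mlp", "moe"):
    print(kind, expected_mask_shape(kind, batch=2, seq_len=16))
```

A caller driving a hybrid stack would therefore prepare both mask variants once and hand each block the one that matches its layer type.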

```python
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block.init_weights(
    buffer_device: torch.device | None = None
) -> None
```

Initialize block weights following NemotronV3 spec.

**Parameters:**

* `buffer_device`: Device for buffer initialization (used by MLP/MoE).

```python
class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer(
    config,
    layer_idx: int
)
```

**Bases:** `Module`

Mamba2 mixer for NemotronV3 (training-only, uses CUDA kernels).

This implementation uses the fused mamba\_split\_conv1d\_scan\_combined kernel
for maximum training efficiency. Does not support inference caching.

Requires mamba\_ssm and causal\_conv1d packages.

```python
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer.forward(
    hidden_states: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    past_key_values = None,
    cache_position: torch.LongTensor | None = None,
    kwargs = {}
) -> torch.Tensor
```

Forward pass with three code paths.

* Path A (training): past\_key\_values is None → fully-fused kernel.
* Path B (prefill): past\_key\_values present, seq\_len > 1 → unfused scan + cache init.
* Path C (decode): past\_key\_values present, seq\_len == 1, has\_previous\_state → single-step update.
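The path selection above can be expressed as a small decision function. This is a sketch of the stated conditions, not the mixer's code: `select_mamba_path` is a hypothetical name, `has_previous_state` is passed as a plain argument rather than read from the cache, and treating a cached single-token call without previous state as prefill is an assumption, since that case is not specified here.

```python
from typing import Optional


def select_mamba_path(
    past_key_values: Optional[object],
    seq_len: int,
    has_previous_state: bool = False,
) -> str:
    """Return which of the three forward paths applies to the given inputs."""
    if past_key_values is None:
        return "training"   # Path A: fully-fused kernel
    if seq_len == 1 and has_previous_state:
        return "decode"     # Path C: single-step update
    # Path B: unfused scan + cache init. Assumption: any other cached call lands here.
    return "prefill"


cache = object()  # stand-in for a cache instance
print(select_mamba_path(None, seq_len=128))
print(select_mamba_path(cache, seq_len=128))
print(select_mamba_path(cache, seq_len=1, has_previous_state=True))
```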

**Parameters:**

* `hidden_states`: Input tensor of shape (batch, seq\_len, hidden\_size).
* `attention_mask`: Optional attention mask (applied to padding).
* `past_key_values`: Optional NemotronHybridCache instance.
* `cache_position`: Token positions for cache updates.

**Returns:** `torch.Tensor`

Output tensor of shape (batch, seq\_len, hidden\_size)

```python
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer.init_weights(
    num_hidden_layers: int,
    rescale_prenorm_residual: bool = True,
    buffer_device: torch.device | None = None
) -> None
```

Initialize Mamba2Mixer weights following NemotronV3 spec.

```python
class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3MambaRMSNormGated(
    hidden_size: int,
    group_size: int,
    eps: float = 1e-05
)
```

**Bases:** `Module`

Gated RMSNorm for Mamba layers.

Uses the fused Triton kernel from mamba\_ssm for efficiency.

```python
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3MambaRMSNormGated.forward(
    hidden_states: torch.Tensor,
    gate: torch.Tensor | None = None
) -> torch.Tensor
```
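As a rough reference for what a gated, grouped RMSNorm computes, the sketch below works on a single feature vector in plain Python. It is not the fused kernel and makes several assumptions that are not confirmed here: the gate is applied through SiLU before normalization, the norm is computed independently over consecutive chunks of `group_size` features, and a learned per-feature `weight` scales the result. The function names are hypothetical.

```python
import math
from typing import List, Optional


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def gated_rms_norm(
    x: List[float],
    gate: Optional[List[float]],
    weight: List[float],
    group_size: int,
    eps: float = 1e-5,
) -> List[float]:
    """Gate x (assumed: multiply by SiLU(gate)), then RMS-normalize each group."""
    if gate is not None:
        x = [xi * silu(gi) for xi, gi in zip(x, gate)]
    out: List[float] = []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        mean_sq = sum(v * v for v in group) / len(group)
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)
        out.extend(v * inv_rms for v in group)
    # Per-feature scale applied after normalization.
    return [o * w for o, w in zip(out, weight)]


y = gated_rms_norm([3.0, 4.0], gate=None, weight=[1.0, 1.0], group_size=2)
```

With unit weights and no gate, each group of the output has a mean square close to 1, which is a quick sanity check for any reimplementation.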