# nemo_automodel.components.models.deepseek_v32.layers

DeepSeek V3.2 Layers.

Contains `DeepseekV32Indexer`, which performs top-k sparse attention selection,
and `DeepseekV32MLA`, which integrates the indexer with Multi-head Latent Attention.

## Module Contents

### Classes

| Name                                                                                             | Description                                                    |
| ------------------------------------------------------------------------------------------------ | -------------------------------------------------------------- |
| [`DeepseekV32Indexer`](#nemo_automodel-components-models-deepseek_v32-layers-DeepseekV32Indexer) | Indexer for top-k sparse attention selection.                  |
| [`DeepseekV32MLA`](#nemo_automodel-components-models-deepseek_v32-layers-DeepseekV32MLA)         | Multi-head Latent Attention with Indexer for sparse attention. |

### Functions

| Name                                                                                                         | Description                                                                   |
| ------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------- |
| [`_rotate_activation`](#nemo_automodel-components-models-deepseek_v32-layers-_rotate_activation)             | Apply Hadamard rotation activation.                                           |
| [`hadamard_transform`](#nemo_automodel-components-models-deepseek_v32-layers-hadamard_transform)             | Fallback hadamard\_transform when fast\_hadamard\_transform is not available. |
| [`hadamard_transform_torch`](#nemo_automodel-components-models-deepseek_v32-layers-hadamard_transform_torch) | Multiply H\_n @ u where H\_n is the Hadamard matrix of dimension n x n.       |

### Data

[`_FAST_HADAMARD_AVAILABLE`](#nemo_automodel-components-models-deepseek_v32-layers-_FAST_HADAMARD_AVAILABLE)

### API

```python
class nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32Indexer(
    config: nemo_automodel.components.models.deepseek_v32.config.DeepseekV32Config,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

Indexer for top-k sparse attention selection.

Based on the official DeepSeek V3.2 training implementation. Computes attention
scores between queries and keys with per-head weights, applies ReLU activation,
then selects the top-k positions to attend to.

Key features:

* Uses LayerNorm (not RMSNorm) for key normalization
* Has a weights\_proj that learns per-head importance weights
* Optional Hadamard transform (rotate\_activation) on Q and K
* ReLU activation on attention scores before weighting
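
The scoring steps listed above can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the module's implementation: the helper name `sketch_topk_indices`, the tensor shapes, and the causal masking are hypothetical, and the key LayerNorm, RoPE, and optional Hadamard rotation are left out for brevity.

```python
import torch


def sketch_topk_indices(q, k, head_weights, topk):
    """Illustrative top-k selection (hypothetical helper, not the library API).

    q:            [B, S, H, D] per-head index queries
    k:            [B, S, D]    index keys shared across heads
    head_weights: [B, S, H]    per-head importance weights
    Returns indices of shape [B, S, min(topk, S)].
    """
    # Per-head query/key scores: [B, S_q, H, S_k]
    scores = torch.einsum("bqhd,bkd->bqhk", q, k)
    # ReLU on the scores before weighting
    scores = torch.relu(scores)
    # Weight each head, then sum over heads: [B, S_q, S_k]
    index_scores = (scores * head_weights.unsqueeze(-1)).sum(dim=2)
    # Assumption: causal masking, so a query only selects itself or earlier positions
    seq_len = q.shape[1]
    causal = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
    index_scores = index_scores.masked_fill(~causal, float("-inf"))
    # Early queries have fewer than topk valid positions, so their trailing
    # indices point at masked (-inf) entries in this simplified sketch
    return index_scores.topk(min(topk, seq_len), dim=-1).indices


torch.manual_seed(0)
B, S, H, D = 2, 8, 4, 16
q = torch.randn(B, S, H, D)
k = torch.randn(B, S, D)
w = torch.randn(B, S, H).softmax(dim=-1)
idx = sketch_topk_indices(q, k, w, topk=3)
print(idx.shape)  # torch.Size([2, 8, 3])
```

In this sketch the first query position always ranks index 0 first, because it is the only position with a finite score.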

```python
nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32Indexer.forward(
    x: torch.Tensor,
    q_resid: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor
```

Compute top-k indices for sparse attention.

**Parameters:**

* `x`: Hidden states \[B, S, hidden], or \[T, hidden] for thd format
* `q_resid`: Q LoRA residual from MLA \[B, S, q\_lora\_rank] or \[T, q\_lora\_rank]
* `freqs_cis`: RoPE frequencies
* `attention_mask`: Optional attention mask
* `attn_kwargs`: Additional attention kwargs (cu\_seqlens, etc.)

**Returns:** `torch.Tensor`

Indices of top-k positions \[B, S, topk] or \[T, topk]

```python
nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32Indexer.init_weights(
    init_std: float = 0.02
)
```

```python
class nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA(
    config: nemo_automodel.components.models.deepseek_v32.config.DeepseekV32Config,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

Multi-head Latent Attention with Indexer for sparse attention.

This extends the V3 MLA with an Indexer module that performs
top-k selection for sparse attention. The indexer uses the
q\_lora residual and hidden states to compute which positions
to attend to.

```python
nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA._build_sparse_mask(
    topk_indices: torch.Tensor,
    seq_len: int,
    qkv_format: str,
    bsz: int = 1,
    n_heads: int = 1,
    dtype: torch.dtype = torch.bfloat16,
    attention_mask: torch.Tensor | None = None,
    union_across_batches: bool = False,
    as_bool: bool = False
) -> torch.Tensor
```

Build a sparse attention mask/bias from top-k indices.

Creates either an additive mask where non-top-k positions are set to
`-inf` or a boolean keep-mask. TE consumes the additive mask as
`core_attention_bias`; SDPA consumes the boolean mask to avoid bf16
additive-mask leakage in fused kernels.

Uses the same efficient pattern as the official DeepSeek inference code:
`torch.full(..., -inf).scatter_(-1, topk_indices, 0)`
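
The scatter pattern quoted above can be tried on a toy input. The following is a small sketch, assuming hypothetical index values and shapes; it shows the additive bias, a boolean keep-mask derived from it, and an expansion over heads, and it does not reproduce the method's handling of `qkv_format`, `attention_mask`, or `union_across_batches`.

```python
import torch

# Hypothetical top-k indices for 1 batch, 4 query positions, topk=2: [B, S, topk]
topk_indices = torch.tensor([[[0, 0], [0, 1], [1, 2], [0, 3]]])
bsz, seq_len = 1, 4

# Additive bias: start fully masked (-inf), then set selected positions to 0
bias = torch.full((bsz, seq_len, seq_len), float("-inf"))
bias.scatter_(-1, topk_indices, 0.0)

# Boolean keep-mask variant: True means "attend"
keep = torch.isfinite(bias)

# Broadcast over a head dimension: [B, n_heads, S, S]
n_heads = 2
bias_per_head = bias.unsqueeze(1).expand(-1, n_heads, -1, -1)

print(keep[0])
```

Each row of `keep` marks the key positions that the corresponding query selected; duplicate indices in a row (as in the first query here) simply mark the same position once.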

**Parameters:**

* `topk_indices`: Indices of top-k positions \[B, S, topk] or \[T, topk]
* `seq_len`: Sequence length
* `qkv_format`: 'bshd' or 'thd'
* `bsz`: Batch size (only used for bshd format)
* `n_heads`: Number of attention heads to expand to
* `dtype`: Data type for the output tensor
* `attention_mask`: Optional attention mask to combine with (for SDPA)
* `union_across_batches`: If True, union top-k across batches (for TE); if False, keep per-batch masks (for SDPA)
* `as_bool`: If True, return a boolean keep-mask (True = attend).

**Returns:** `torch.Tensor`

Mask tensor with shape:

* \[1, n\_heads, S, S] if union\_across\_batches=True
* \[B, n\_heads, S, S] if union\_across\_batches=False (bshd)
* \[1, n\_heads, T, T] for thd format

```python
nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
)
```

```python
nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA.init_weights(
    _buffer_device: torch.device,
    init_std: float = 0.02
)
```

```python
nemo_automodel.components.models.deepseek_v32.layers._rotate_activation(
    x: torch.Tensor
) -> torch.Tensor
```

Apply Hadamard rotation activation.

**Parameters:**

* `x`: Input tensor (must be bfloat16).

**Returns:** `torch.Tensor`

Rotated tensor.

```python
nemo_automodel.components.models.deepseek_v32.layers.hadamard_transform(
    x: torch.Tensor,
    scale: float
) -> torch.Tensor
```

Fallback hadamard\_transform when fast\_hadamard\_transform is not available.

```python
nemo_automodel.components.models.deepseek_v32.layers.hadamard_transform_torch(
    u,
    scale: float,
    normalize = False
)
```

Multiply H\_n @ u, where H\_n is the Hadamard matrix of dimension n x n.
n must be a power of 2.

**Parameters:**

* `u`: Tensor of shape (..., n)
* `normalize`: If True, divide the result by 2^(m/2), where m = log\_2(n).

**Returns:**

* `product`: Tensor of shape (..., n)
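
To make the product H\_n @ u concrete, here is a minimal butterfly-style sketch in PyTorch. The helper `fwht_sketch` is hypothetical and assumes the Sylvester ordering of the Hadamard matrix; it omits the `scale` argument of the documented function and is not necessarily how the library computes the transform.

```python
import math

import torch


def fwht_sketch(u, normalize=False):
    """Illustrative fast Walsh-Hadamard transform over the last dimension.

    Hypothetical helper: computes H_n @ u for the Sylvester-ordered Hadamard
    matrix, where n = u.shape[-1] must be a power of 2.
    """
    n = u.shape[-1]
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("last dimension must be a power of 2")
    out = u.clone()
    h = 1
    while h < n:
        # View as (..., blocks, 2, h) and combine the two halves of each block
        blocks = out.reshape(*out.shape[:-1], n // (2 * h), 2, h)
        a, b = blocks[..., 0, :], blocks[..., 1, :]
        out = torch.stack((a + b, a - b), dim=-2).reshape(*u.shape)
        h *= 2
    if normalize:
        out = out / math.sqrt(n)  # 2^(m/2) with m = log2(n)
    return out


u = torch.tensor([1.0, 2.0, 3.0, 4.0])
print(fwht_sketch(u).tolist())  # [10.0, -2.0, -4.0, 0.0]
```

Applying the unnormalized transform twice returns n times the input, which is a quick sanity check for any implementation.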

```python
nemo_automodel.components.models.deepseek_v32.layers._FAST_HADAMARD_AVAILABLE = True
```