nemo_automodel.components.models.deepseek_v32.layers

DeepSeek V3.2 Layers.

Contains DeepseekV32Indexer, which selects the top-k positions for sparse attention, and DeepseekV32MLA, which integrates the indexer with Multi-head Latent Attention.

Module Contents

Classes

Name	Description
`DeepseekV32Indexer`	Indexer for top-k sparse attention selection.
`DeepseekV32MLA`	Multi-head Latent Attention with Indexer for sparse attention.

Functions

Name	Description
`_rotate_activation`	Apply Hadamard rotation activation.
`hadamard_transform`	Fallback hadamard_transform when fast_hadamard_transform is not available.
`hadamard_transform_torch`	Multiply H_n @ u where H_n is the Hadamard matrix of dimension n x n.

Data

_FAST_HADAMARD_AVAILABLE

API

class nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32Indexer(
    config: nemo_automodel.components.models.deepseek_v32.config.DeepseekV32Config,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Indexer for top-k sparse attention selection.

Based on the official DeepSeek V3.2 training implementation. Computes attention scores between queries and keys with per-head weights, applies ReLU activation, then selects the top-k positions to attend to.

Key features:

Uses LayerNorm (not RMSNorm) for key normalization
Has a weights_proj that learns per-head importance weights
Optional Hadamard transform (rotate_activation) on Q and K
ReLU activation on attention scores before weighting
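
The scoring steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's implementation: the function name `toy_index_topk`, the tensor shapes, the causal masking, and the point at which the scale is applied are all hypothetical, and RoPE, key LayerNorm, and the optional Hadamard rotation are omitted.

```python
import torch


def toy_index_topk(q, k, head_weights, topk, scale):
    """Hypothetical sketch of indexer-style scoring (not the library code).

    q:            [S, H, D] per-head index queries
    k:            [S, D]    index keys shared across heads
    head_weights: [S, H]    per-head importance weights
    """
    # Per-head query/key scores: [S, H, S]
    scores = torch.einsum("shd,td->sht", q, k) * scale
    # ReLU is applied to the scores before the per-head weighting
    scores = torch.relu(scores)
    # Weighted sum over heads gives one score per (query, key) pair: [S, S]
    index_scores = torch.einsum("sht,sh->st", scores, head_weights)
    # Assumption: causal masking so a query cannot select future positions
    seq_len = q.size(0)
    causal = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
    index_scores = index_scores.masked_fill(~causal, float("-inf"))
    # Keep the highest-scoring key positions for each query
    return index_scores.topk(min(topk, seq_len), dim=-1).indices


torch.manual_seed(0)
S, H, D = 6, 4, 8
q = torch.randn(S, H, D)
k = torch.randn(S, D)
w = torch.randn(S, H)
idx = toy_index_topk(q, k, w, topk=3, scale=D ** -0.5)
```

Early queries have fewer than `topk` causally valid keys, so some of their selected slots point at masked positions. A real implementation has to handle that case, for example by combining the result with the attention mask.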

head_dim

= config.index_head_dim

hidden_size

= config.hidden_size

index_topk

= config.index_topk

k_norm

= nn.LayerNorm(self.head_dim, dtype=dtype)

num_heads

= config.index_n_heads

q_lora_rank

= config.q_lora_rank

qk_nope_head_dim

= self.head_dim - self.qk_rope_head_dim

qk_rope_head_dim

= config.qk_rope_head_dim

softmax_scale

= self.head_dim ** -0.5

weights_proj

wq_b

nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32Indexer.forward(
    x: torch.Tensor,
    q_resid: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor

Compute top-k indices for sparse attention.

Parameters:

x

torch.Tensor

Hidden states [B, S, hidden] or [T, hidden] for thd format

q_resid

torch.Tensor

Q lora residual from MLA [B, S, q_lora_rank] or [T, q_lora_rank]

freqs_cis

torch.Tensor

RoPE frequencies

attention_mask

torch.Tensor | None, defaults to None

Optional attention mask

**attn_kwargs

Any, defaults to {}

Additional attention kwargs (cu_seqlens, etc.)

Returns: torch.Tensor

Indices of top-k positions [B, S, topk] or [T, topk]

nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32Indexer.init_weights(
    init_std: float = 0.02
)

class nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA(
    config: nemo_automodel.components.models.deepseek_v32.config.DeepseekV32Config,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Multi-head Latent Attention with Indexer for sparse attention.

This extends the V3 MLA with an Indexer module that performs top-k selection for sparse attention. The indexer uses the q_lora residual and hidden states to compute which positions to attend to.

index_topk

= config.index_topk

indexer

= DeepseekV32Indexer(config, backend)

kv_a_layernorm

kv_a_proj_with_mqa

kv_b_proj

kv_lora_rank

= config.kv_lora_rank

n_heads

= config.num_attention_heads

o_proj

q_a_layernorm

q_a_proj

q_b_proj

q_lora_rank

= config.q_lora_rank

qk_head_dim

qk_nope_head_dim

= config.qk_nope_head_dim

qk_rope_head_dim

= config.qk_rope_head_dim

rope_fusion

= backend.rope_fusion

softmax_scale

= self.qk_head_dim ** -0.5

v_head_dim

= config.v_head_dim

nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA._build_sparse_mask(
    topk_indices: torch.Tensor,
    seq_len: int,
    qkv_format: str,
    bsz: int = 1,
    n_heads: int = 1,
    dtype: torch.dtype = torch.bfloat16,
    attention_mask: torch.Tensor | None = None,
    union_across_batches: bool = False,
    as_bool: bool = False
) -> torch.Tensor

Build a sparse attention mask/bias from top-k indices.

Creates either an additive mask where non-top-k positions are set to -inf or a boolean keep-mask. TE consumes the additive mask as core_attention_bias; SDPA consumes the boolean mask to avoid bf16 additive-mask leakage in fused kernels.

Uses the same efficient pattern as the official DeepSeek inference code: torch.full(..., -inf).scatter_(-1, topk_indices, 0)

Parameters:

topk_indices

torch.Tensor

Indices of top-k positions [B, S, topk] or [T, topk]

seq_len

int

Sequence length

qkv_format

str

'bshd' or 'thd'

bsz

int, defaults to 1

Batch size (only used for bshd format)

n_heads

int, defaults to 1

Number of attention heads to expand to

dtype

torch.dtype, defaults to torch.bfloat16

Data type for the output tensor

attention_mask

torch.Tensor | None, defaults to None

Optional attention mask to combine with (for SDPA)

union_across_batches

bool, defaults to False

If True, union top-k across batches (for TE); if False, keep per-batch masks (for SDPA)

as_bool

bool, defaults to False

If True, return a boolean keep-mask (True = attend).

Returns: torch.Tensor

Mask tensor with shape:

[1, n_heads, S, S] if union_across_batches=True
[B, n_heads, S, S] if union_across_batches=False (bshd)
[1, n_heads, T, T] for thd format
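
The `torch.full(..., -inf).scatter_(-1, topk_indices, 0)` pattern described above can be sketched as follows. This is a simplified illustration, not the method itself: `toy_sparse_mask` is a hypothetical helper that handles only the per-batch `[B, S, topk]` case and leaves out head expansion, the `thd` layout, the batch union, and combining with an existing attention mask. The index values are made up for the example.

```python
import torch
import torch.nn.functional as F


def toy_sparse_mask(topk_indices, seq_len, as_bool=False, dtype=torch.float32):
    """Hypothetical helper: [B, S, topk] indices -> [B, S, seq_len] mask."""
    bsz, q_len, _ = topk_indices.shape
    # Start fully masked, then open the selected key positions with 0
    mask = torch.full((bsz, q_len, seq_len), float("-inf"), dtype=dtype)
    mask.scatter_(-1, topk_indices, 0.0)
    # Boolean keep-mask: True = attend
    return mask == 0 if as_bool else mask


# B=1, S=4, topk=2; duplicate indices (row 0) are harmless for scatter_
topk_indices = torch.tensor([[[0, 0], [0, 1], [1, 2], [0, 3]]])
additive = toy_sparse_mask(topk_indices, seq_len=4)
keep = toy_sparse_mask(topk_indices, seq_len=4, as_bool=True)

# The boolean form can be passed to SDPA; [B, 1, S, S] broadcasts over heads
torch.manual_seed(0)
q, k, v = (torch.randn(1, 2, 4, 8) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v, attn_mask=keep.unsqueeze(1))
```

In this toy input the first query keeps only key 0, so its attention output equals the value vector at position 0. Every query row must keep at least one position; a row that is fully masked would produce NaNs in the softmax.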

nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
)

nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA.init_weights(
    _buffer_device: torch.device,
    init_std: float = 0.02
)

nemo_automodel.components.models.deepseek_v32.layers._rotate_activation(
    x: torch.Tensor
) -> torch.Tensor

Apply Hadamard rotation activation.

Parameters:

x

torch.Tensor

Input tensor (must be bfloat16).

Returns: torch.Tensor

Rotated tensor.

nemo_automodel.components.models.deepseek_v32.layers.hadamard_transform(
    x: torch.Tensor,
    scale: float
) -> torch.Tensor

Fallback hadamard_transform when fast_hadamard_transform is not available.

nemo_automodel.components.models.deepseek_v32.layers.hadamard_transform_torch(
    u,
    scale: float,
    normalize = False
)

Multiply H_n @ u where H_n is the Hadamard matrix of dimension n x n. n must be a power of 2.

Parameters:

u

Tensor of shape (…, n)

normalize

If True, divide the result by 2^{m/2} where m = log_2(n).

Returns: product

Tensor of shape (…, n)
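
For intuition, the product H_n @ u can be computed with log_2(n) butterfly passes instead of materializing the n x n matrix. The sketch below assumes the Sylvester (natural) ordering of the Hadamard matrix. `toy_hadamard` is a hypothetical name, and the code is not taken from this module.

```python
import torch


def toy_hadamard(u, scale=1.0):
    """Hypothetical sketch: fast Walsh-Hadamard transform over the last dim."""
    n = u.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dimension must be a power of 2"
    x = u.reshape(-1, n)
    h = 1
    while h < n:
        # Split into blocks of size 2*h and combine the two halves of each
        blocks = x.reshape(-1, n // (2 * h), 2, h)
        a, b = blocks[:, :, 0, :], blocks[:, :, 1, :]
        x = torch.stack((a + b, a - b), dim=2).reshape(-1, n)
        h *= 2
    return x.reshape(u.shape) * scale


u = torch.tensor([0.0, 1.0, 2.0, 3.0])
out = toy_hadamard(u)

# With scale = n ** -0.5 the transform is orthonormal and its own inverse
roundtrip = toy_hadamard(toy_hadamard(u, scale=0.5), scale=0.5)
```

Choosing `scale = n ** -0.5` corresponds to dividing by 2^{m/2} with m = log_2(n), which is the normalization described above.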

nemo_automodel.components.models.deepseek_v32.layers._FAST_HADAMARD_AVAILABLE = True