nemo_automodel.components.models.deepseek_v32.layers

DeepSeek V3.2 Layers.

Contains the DeepseekV32Indexer for top-k sparse attention selection and DeepseekV32MLA which integrates the indexer with Multi-head Latent Attention.

Module Contents

Classes

| Name | Description |
| --- | --- |
| DeepseekV32Indexer | Indexer for top-k sparse attention selection. |
| DeepseekV32MLA | Multi-head Latent Attention with Indexer for sparse attention. |

Functions

| Name | Description |
| --- | --- |
| _rotate_activation | Apply Hadamard rotation activation. |
| hadamard_transform | Fallback hadamard_transform when fast_hadamard_transform is not available. |
| hadamard_transform_torch | Multiply H_n @ u where H_n is the Hadamard matrix of dimension n x n. |

Data

_FAST_HADAMARD_AVAILABLE

API

class nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32Indexer(
config: nemo_automodel.components.models.deepseek_v32.config.DeepseekV32Config,
backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Indexer for top-k sparse attention selection.

Based on the official DeepSeek V3.2 training implementation. Computes attention scores between queries and keys with per-head weights, applies ReLU activation, then selects the top-k positions to attend to.

Key features:

  • Uses LayerNorm (not RMSNorm) for key normalization
  • Has a weights_proj that learns per-head importance weights
  • Optional Hadamard transform (rotate_activation) on Q and K
  • ReLU activation on attention scores before weighting
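
The scoring steps listed above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: the helper name `index_scores_topk`, the tensor shapes, and the causal constraint are assumptions, and the projections, LayerNorm, RoPE, and Hadamard rotation are left out.

```python
import torch

def index_scores_topk(q, k, head_weights, topk, softmax_scale):
    """Hypothetical helper: per-head ReLU scores, weighted over heads, then top-k.

    q:            [B, S, H, D] indexer queries
    k:            [B, S, D]    indexer keys (shared across heads)
    head_weights: [B, S, H]    per-head importance weights
    """
    # Per-head query/key dot products: [B, S_q, H, S_k]
    scores = torch.einsum("bqhd,bkd->bqhk", q, k) * softmax_scale
    # ReLU is applied before the heads are combined
    scores = torch.relu(scores)
    # Weighted sum over heads: [B, S_q, S_k]
    index_scores = (scores * head_weights.unsqueeze(-1)).sum(dim=2)
    # Causal constraint (an assumption of this sketch): hide future positions
    seq_len = q.shape[1]
    causal = torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device).tril()
    index_scores = index_scores.masked_fill(~causal, float("-inf"))
    # Keep the top-k key positions for every query position
    return index_scores.topk(min(topk, seq_len), dim=-1).indices

torch.manual_seed(0)
B, S, H, D = 2, 8, 4, 16
q = torch.randn(B, S, H, D)
k = torch.randn(B, S, D)
w = torch.randn(B, S, H)
topk_indices = index_scores_topk(q, k, w, topk=3, softmax_scale=D ** -0.5)
print(topk_indices.shape)  # torch.Size([2, 8, 3])
```

Early query positions see fewer than `topk` valid keys, so some of their returned indices point at masked positions; a later masking step has to account for that.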

head_dim = config.index_head_dim
hidden_size = config.hidden_size
index_topk = config.index_topk
k_norm = nn.LayerNorm(self.head_dim, dtype=dtype)
num_heads = config.index_n_heads
q_lora_rank = config.q_lora_rank
qk_nope_head_dim = self.head_dim - self.qk_rope_head_dim
qk_rope_head_dim = config.qk_rope_head_dim
softmax_scale = self.head_dim ** -0.5
weights_proj
wk
wq_b

nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32Indexer.forward(
x: torch.Tensor,
q_resid: torch.Tensor,
freqs_cis: torch.Tensor,
attention_mask: torch.Tensor | None = None,
**attn_kwargs: typing.Any
) -> torch.Tensor

Compute top-k indices for sparse attention.

Parameters:

x
torch.Tensor

Hidden states [B, S, hidden] or [T, hidden] for thd format

q_resid
torch.Tensor

Q lora residual from MLA [B, S, q_lora_rank] or [T, q_lora_rank]

freqs_cis
torch.Tensor

RoPE frequencies

attention_mask
torch.Tensor | None, defaults to None

Optional attention mask

**attn_kwargs
Any, defaults to {}

Additional attention kwargs (cu_seqlens, etc.)

Returns: torch.Tensor

Indices of top-k positions [B, S, topk] or [T, topk]

nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32Indexer.init_weights(
init_std: float = 0.02
)

class nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA(
config: nemo_automodel.components.models.deepseek_v32.config.DeepseekV32Config,
backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Multi-head Latent Attention with Indexer for sparse attention.

This extends the V3 MLA with an Indexer module that performs top-k selection for sparse attention. The indexer uses the q_lora residual and hidden states to compute which positions to attend to.

index_topk = config.index_topk
indexer = DeepseekV32Indexer(config, backend)
kv_a_layernorm
kv_a_proj_with_mqa
kv_b_proj
kv_lora_rank = config.kv_lora_rank
n_heads = config.num_attention_heads
o_proj
q_a_layernorm
q_a_proj
q_b_proj
q_lora_rank = config.q_lora_rank
qk_head_dim
qk_nope_head_dim = config.qk_nope_head_dim
qk_rope_head_dim = config.qk_rope_head_dim
rope_fusion = backend.rope_fusion
softmax_scale = self.qk_head_dim ** -0.5
v_head_dim = config.v_head_dim

nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA._build_sparse_mask(
topk_indices: torch.Tensor,
seq_len: int,
qkv_format: str,
bsz: int = 1,
n_heads: int = 1,
dtype: torch.dtype = torch.bfloat16,
attention_mask: torch.Tensor | None = None,
union_across_batches: bool = False,
as_bool: bool = False
) -> torch.Tensor

Build a sparse attention mask/bias from top-k indices.

Creates either an additive mask, in which non-top-k positions are set to -inf and selected positions to 0, or a boolean keep-mask. Transformer Engine (TE) consumes the additive mask as core_attention_bias; PyTorch scaled dot-product attention (SDPA) consumes the boolean mask, which avoids bf16 additive-mask leakage in fused kernels.

Uses the same efficient pattern as the official DeepSeek inference code: torch.full(..., -inf).scatter_(-1, topk_indices, 0)
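
A minimal sketch of that fill-then-scatter pattern for a single sequence. The index values below are made up for illustration, and head expansion and dtype handling are omitted.

```python
import torch

seq_len, topk = 6, 2
# Hypothetical top-k indices for one sequence: [B=1, S, topk]
topk_indices = torch.tensor([[[0, 0], [0, 1], [1, 2], [0, 3], [2, 4], [3, 5]]])

# Start fully masked, then open the selected key positions with a zero bias
bias = torch.full((1, seq_len, seq_len), float("-inf"))
bias.scatter_(-1, topk_indices, 0.0)

print(bias[0, 2].tolist())  # [-inf, 0.0, 0.0, -inf, -inf, -inf]
```

Duplicate indices in a row (as in the first query above) are harmless, since every write stores the same value.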

Parameters:

topk_indices
torch.Tensor

Indices of top-k positions [B, S, topk] or [T, topk]

seq_len
int

Sequence length

qkv_format
str

Either 'bshd' or 'thd'

bsz
int, defaults to 1

Batch size (only used for bshd format)

n_heads
int, defaults to 1

Number of attention heads to expand to

dtype
torch.dtype, defaults to torch.bfloat16

Data type for the output tensor

attention_mask
torch.Tensor | None, defaults to None

Optional attention mask to combine with (for SDPA)

union_across_batches
bool, defaults to False

If True, union top-k across batches (for TE); if False, keep per-batch masks (for SDPA)

as_bool
bool, defaults to False

If True, return a boolean keep-mask (True = attend).

Returns: torch.Tensor

Mask tensor with shape:

  • [1, n_heads, S, S] if union_across_batches=True
  • [B, n_heads, S, S] if union_across_batches=False (bshd)
  • [1, n_heads, T, T] for thd format
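
For the boolean variant, a rough sketch of a per-batch keep-mask expanded to [B, n_heads, S, S] and handed to PyTorch SDPA. The index values are illustrative, and the causal mask stands in for the optional attention_mask; neither is taken from the module itself.

```python
import torch
import torch.nn.functional as F

B, S, n_heads, topk = 2, 4, 3, 2
# Hypothetical per-batch top-k indices: [B, S, topk]
topk_indices = torch.tensor([
    [[0, 0], [0, 1], [1, 2], [2, 3]],
    [[0, 0], [0, 1], [0, 2], [1, 3]],
])

# Boolean keep-mask: True marks a key position the query may attend to
keep = torch.zeros(B, S, S, dtype=torch.bool)
keep.scatter_(-1, topk_indices, True)

# Combine with a causal mask (stand-in for the optional attention_mask)
causal = torch.ones(S, S, dtype=torch.bool).tril()
keep = keep & causal

# Broadcast over heads without copying: [B, n_heads, S, S]
keep = keep.unsqueeze(1).expand(B, n_heads, S, S)
print(keep.shape)  # torch.Size([2, 3, 4, 4])

# SDPA accepts a boolean attn_mask where True means "take part in attention"
q = torch.randn(B, n_heads, S, 8)
out = F.scaled_dot_product_attention(q, q, q, attn_mask=keep)
print(out.shape)  # torch.Size([2, 3, 4, 8])
```

Every query row here keeps at least one key; a row with no True entries would produce NaNs after the softmax.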
nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA.forward(
x: torch.Tensor,
freqs_cis: torch.Tensor,
attention_mask: torch.Tensor | None = None,
**attn_kwargs: typing.Any
)

nemo_automodel.components.models.deepseek_v32.layers.DeepseekV32MLA.init_weights(
_buffer_device: torch.device,
init_std: float = 0.02
)

nemo_automodel.components.models.deepseek_v32.layers._rotate_activation(
x: torch.Tensor
) -> torch.Tensor

Apply Hadamard rotation activation.

Parameters:

x
torch.Tensor

Input tensor (must be bfloat16).

Returns: torch.Tensor

Rotated tensor.
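
A sketch of why such a rotation is safe to apply to both Q and K. It assumes the rotation is a Hadamard transform scaled by n ** -0.5, where n is the last dimension; the scale is an assumption here, not something this page states. The helper names are hypothetical, and float32 is used instead of bfloat16 so the comparison is tight.

```python
import torch

def hadamard_matrix(n: int) -> torch.Tensor:
    """Sylvester construction of the n x n Hadamard matrix (n a power of 2)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    h = torch.ones(1, 1)
    base = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
    while h.shape[0] < n:
        h = torch.kron(base, h)
    return h

def rotate(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for a Hadamard rotation with scale n ** -0.5."""
    n = x.shape[-1]
    return (x @ hadamard_matrix(n).T) * n ** -0.5

torch.manual_seed(0)
q = torch.randn(4, 16)
k = torch.randn(4, 16)

# The scaled transform is orthonormal, so q . k is unchanged when both are rotated
before = (q * k).sum(-1)
after = (rotate(q) * rotate(k)).sum(-1)
print(torch.allclose(before, after, atol=1e-4))  # True
```

Under that scaling the rotation leaves the query/key scores unchanged up to rounding while spreading each vector's energy across all coordinates.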

nemo_automodel.components.models.deepseek_v32.layers.hadamard_transform(
x: torch.Tensor,
scale: float
) -> torch.Tensor

Fallback hadamard_transform when fast_hadamard_transform is not available.

nemo_automodel.components.models.deepseek_v32.layers.hadamard_transform_torch(
u,
scale: float,
normalize = False
)

Multiply H_n @ u where H_n is the Hadamard matrix of dimension n x n. n must be a power of 2.

Parameters:

u

Tensor of shape (…, n)

normalize

If True, divide the result by 2^{m/2} where m = log_2(n).

Returns:

product: Tensor of shape (…, n)
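
The product H_n @ u can be computed in O(n log n) with a butterfly recursion instead of materializing H_n. The function `fwht` below is a hypothetical reference written for this page, not the library function; it omits the `scale` argument and assumes the Sylvester (natural) ordering of H_n.

```python
import math
import torch

def fwht(u: torch.Tensor, normalize: bool = False) -> torch.Tensor:
    """Hypothetical reference: fast Walsh-Hadamard transform over the last dim."""
    n = u.shape[-1]
    m = int(math.log2(n))
    assert n == 1 << m, "n must be a power of 2"
    x = u.reshape(-1, n)
    h = 1
    while h < n:
        # View each row as blocks of size 2h and combine the two halves of a block
        x = x.reshape(-1, n // (2 * h), 2, h)
        a, b = x[:, :, 0, :], x[:, :, 1, :]
        x = torch.stack((a + b, a - b), dim=2).reshape(-1, n)
        h *= 2
    if normalize:
        # Divide by 2^{m/2} = sqrt(n) so the transform becomes orthonormal
        x = x / 2 ** (m / 2)
    return x.reshape(u.shape)

# Check against the explicit matrix built by the Sylvester construction
torch.manual_seed(0)
u = torch.randn(3, 8)
H = torch.ones(1, 1)
for _ in range(3):
    H = torch.kron(torch.tensor([[1.0, 1.0], [1.0, -1.0]]), H)
print(torch.allclose(fwht(u), u @ H.T, atol=1e-5))  # True
```

With normalize=True the transform is its own inverse, so applying it twice returns the input up to rounding.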

nemo_automodel.components.models.deepseek_v32.layers._FAST_HADAMARD_AVAILABLE = True