nemo_automodel.components.models.deepseek_v32.layers
DeepSeek V3.2 Layers.
Contains DeepseekV32Indexer, which selects the top-k positions for sparse attention, and DeepseekV32MLA, which integrates the indexer with Multi-head Latent Attention (MLA).
Module Contents
Classes
Functions
Data
API
Bases: Module
Indexer for top-k sparse attention selection.
Based on the official DeepSeek V3.2 training implementation. Computes attention scores between queries and keys with per-head weights, applies ReLU activation, then selects the top-k positions to attend to.
Key features:
- Uses LayerNorm (not RMSNorm) for key normalization
- Has a weights_proj that learns per-head importance weights
- Optional Hadamard transform (rotate_activation) on Q and K
- ReLU activation on attention scores before weighting
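The scoring and selection steps described above can be sketched in plain Python. This is a minimal illustration and not the module's implementation: the function names `indexer_scores` and `select_topk` are hypothetical, each position is assumed to have one key shared by all indexer heads, and key normalization, the Hadamard rotation, RoPE, and any attention mask are left out.

```python
def relu(x):
    """ReLU on a single scalar."""
    return x if x > 0.0 else 0.0


def indexer_scores(q, k, w):
    """Weighted ReLU scores (hypothetical helper).

    q[t][h] is the query vector of head h at position t,
    k[s] is the key vector at position s (assumed shared across heads),
    w[t][h] is the per-head weight for position t.
    Returns scores[t][s] = sum_h w[t][h] * relu(q[t][h] . k[s]).
    """
    scores = []
    for t in range(len(q)):
        row = []
        for s in range(len(k)):
            total = 0.0
            for h in range(len(q[t])):
                dot = sum(a * b for a, b in zip(q[t][h], k[s]))
                total += w[t][h] * relu(dot)
            row.append(total)
        scores.append(row)
    return scores


def select_topk(scores, topk):
    """Return, for each query position, the indices of the topk largest scores."""
    selected = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda s: row[s], reverse=True)
        selected.append(sorted(order[:topk]))
    return selected


# Toy input: 2 query positions, 2 heads, 3 key positions, head dim 2.
q = [[[1, 0], [0, 1]], [[1, 1], [-1, 0]]]
k = [[1, 0], [0, 2], [-1, -1]]
w = [[0.5, 0.5], [1.0, 2.0]]
print(select_topk(indexer_scores(q, k, w), topk=2))
```

Because ReLU is applied per head before the weighted sum, a head whose dot product is negative contributes nothing to a position's score instead of cancelling another head's positive contribution.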
Compute top-k indices for sparse attention.
Parameters:
Hidden states [B, S, hidden] or [T, hidden] for thd format
Q lora residual from MLA [B, S, q_lora_rank] or [T, q_lora_rank]
RoPE frequencies
Optional attention mask
Additional attention kwargs (cu_seqlens, etc.)
Returns: torch.Tensor
Indices of top-k positions [B, S, topk] or [T, topk]
Bases: Module
Multi-head Latent Attention with Indexer for sparse attention.
This extends the V3 MLA with an Indexer module that performs top-k selection for sparse attention. The indexer uses the q_lora residual and hidden states to compute which positions to attend to.
Build a sparse attention mask/bias from top-k indices.
Creates either an additive mask, in which non-top-k positions are set to -inf and top-k positions to 0, or a boolean keep-mask. Transformer Engine (TE) consumes the additive mask as core_attention_bias; SDPA (scaled dot-product attention) consumes the boolean mask to avoid bf16 additive-mask leakage in fused kernels.
The additive mask uses the same efficient pattern as the official DeepSeek inference code:
torch.full(..., -inf).scatter_(-1, topk_indices, 0)
Parameters:
Indices of top-k positions [B, S, topk] or [T, topk]
Sequence length
‘bshd’ or ‘thd’
Batch size (only used for bshd format)
Number of attention heads to expand to
Data type for the output tensor
Optional attention mask to combine with (for SDPA)
If True, union top-k across batches (for TE); if False, keep per-batch masks (for SDPA)
If True, return a boolean keep-mask (True = attend).
Returns: torch.Tensor
Mask tensor with shape:
- [1, n_heads, S, S] if union_across_batches=True
- [B, n_heads, S, S] if union_across_batches=False (bshd)
- [1, n_heads, T, T] for thd format
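To make the two mask flavours concrete, the sketch below spells out the fill-then-scatter pattern with nested Python lists so it runs without PyTorch. It is only an illustration under simplifying assumptions: the helper names `additive_mask`, `keep_mask`, and `union_batches` are hypothetical, the head dimension is omitted (the real output is expanded to n_heads), and combining with an existing attention mask is not shown.

```python
NEG_INF = float("-inf")


def additive_mask(topk_indices, seq_len):
    """Start from an all -inf row and write 0 at each selected position.

    topk_indices[b][t] lists the key positions query t may attend to.
    Returns mask[b][t][s] with 0.0 for kept positions, -inf elsewhere.
    """
    masks = []
    for batch in topk_indices:
        rows = []
        for picks in batch:
            row = [NEG_INF] * seq_len
            for s in picks:
                row[s] = 0.0
            rows.append(row)
        masks.append(rows)
    return masks


def keep_mask(topk_indices, seq_len):
    """Boolean variant: True means the position is attended to."""
    return [[[s in set(picks) for s in range(seq_len)] for picks in batch]
            for batch in topk_indices]


def union_batches(bool_mask):
    """OR the per-batch keep-masks together, giving a single [1][S][S] mask."""
    seq_len = len(bool_mask[0])
    merged = [[any(b[t][s] for b in bool_mask) for s in range(seq_len)]
              for t in range(seq_len)]
    return [merged]


# Toy input: batch of 2, sequence length 3.
idx = [[[0], [0, 1], [1, 2]], [[0], [1], [0, 2]]]
print(additive_mask(idx, 3)[0])
print(union_batches(keep_mask(idx, 3)))
```

The union variant trades precision for a single shared mask: a position kept by any batch element is kept for all of them, which is why the per-batch form is the more exact of the two.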
Apply Hadamard rotation activation.
Parameters:
Input tensor (must be bfloat16).
Returns: torch.Tensor
Rotated tensor.
Fallback hadamard_transform when fast_hadamard_transform is not available.
Multiply H_n @ u, where H_n is the Hadamard matrix of dimension n x n. n must be a power of 2.
Parameters:
u: Tensor of shape (…, n)
normalize: if True, divide the result by 2^{m/2}, where m = log_2(n)
Returns:
product: Tensor of shape (…, n)
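As a rough illustration of what such a fallback computes, the sketch below applies the fast Walsh-Hadamard butterfly to a single vector given as a Python list. It assumes the Sylvester (natural) ordering of H_n and is not the module's code; a real fallback would operate on the last dimension of a batched tensor.

```python
def hadamard_transform(u, normalize=False):
    """Return H_n @ u for a vector u whose length n is a power of 2.

    Uses the O(n log n) butterfly instead of building the n x n matrix.
    With normalize=True the result is divided by 2**(m/2), m = log2(n).
    """
    n = len(u)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    out = [float(x) for x in u]
    h = 1
    while h < n:
        # Combine pairs of elements that are h apart within blocks of size 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    if normalize:
        m = n.bit_length() - 1  # log2(n) for a power of 2
        scale = 2.0 ** (m / 2)
        out = [x / scale for x in out]
    return out


print(hadamard_transform([1, 2, 3, 4]))                  # [10.0, -2.0, -4.0, 0.0]
print(hadamard_transform([1, 2, 3, 4], normalize=True))  # [5.0, -1.0, -2.0, 0.0]
```

With normalization the transform is its own inverse, so applying it twice returns the original vector.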