nemo_automodel.components.models.deepseek_v4.layers
DeepSeek V4 Attention Layer.
Architecture (from official inference/model.py):
KV path (K = V, single latent):
x -> wkv [hidden -> head_dim] # single KV head, K = V = kv
-> kv_norm (RMSNorm on head_dim)
-> apply_rotary_emb on last rope_head_dim dims
K = V = kv (one latent vector serves both key and value)
Output path (grouped):
o [bsz, seq, n_heads, head_dim]
-> reshape [bsz, seq, n_groups, n_heads_per_group * head_dim]
-> wo_a einsum per group: [n_heads_per_group * head_dim] -> [o_lora_rank]
-> reshape [bsz, seq, n_groups * o_lora_rank]
-> wo_b [n_groups * o_lora_rank -> hidden]
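The shape walk of the grouped output path can be traced with a few lines of PyTorch. This is a minimal sketch under assumed sizes: the tensors wo_a and wo_b below are hypothetical stand-ins for the real parameters, and the einsum subscripts are one possible way to express the per-group projection.

```python
import torch

# Illustrative sizes only; none of these values come from the real config.
bsz, seq, n_heads, head_dim = 2, 5, 8, 16
n_groups, o_lora_rank, hidden = 4, 6, 32
heads_per_group = n_heads // n_groups

o = torch.randn(bsz, seq, n_heads, head_dim)
# Hypothetical parameter tensors standing in for wo_a and wo_b.
wo_a = torch.randn(n_groups, o_lora_rank, heads_per_group * head_dim)
wo_b = torch.randn(hidden, n_groups * o_lora_rank)

# [bsz, seq, n_heads, head_dim] -> [bsz, seq, n_groups, heads_per_group * head_dim]
o_grouped = o.reshape(bsz, seq, n_groups, heads_per_group * head_dim)
# Per-group down-projection: group g only ever sees wo_a[g].
low_rank = torch.einsum("bsgd,grd->bsgr", o_grouped, wo_a)
# [bsz, seq, n_groups, o_lora_rank] -> [bsz, seq, n_groups * o_lora_rank]
low_rank = low_rank.reshape(bsz, seq, n_groups * o_lora_rank)
# Dense up-projection back to the hidden size.
out = low_rank @ wo_b.t()
```

The point of the grouping is that the first projection costs n_groups small matmuls instead of one [n_heads * head_dim, n_groups * o_lora_rank] dense matmul.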
attn_sink: learnable per-head scalar used as the score (logit) of the attention-sink position in the softmax.
HC (Hyper-Connections):
Each Block maintains hc_mult=4 copies of the hidden state.
hc_pre reduces [bsz, seq, hc_mult, dim] -> [bsz, seq, dim] via Sinkhorn mixing.
hc_post expands [bsz, seq, dim] -> [bsz, seq, hc_mult, dim].
See DeepseekV4HyperConnection.compute_weights and
optimized_kernels.dsv4_sinkhorn_normalize for the torch reference and
optional TileKernels Sinkhorn path.
Compress-ratio attention (Compressor + Indexer) is wired into DeepseekV4Attention.forward for layers with compress_ratio > 0. All layers share the same sliding-window causal mask on the local KV path.
Module Contents
Classes
Functions
API
Bases: Module
Sliding-window attention + Compressor + Indexer + attention sink.
Single-head KV (num_key_value_heads=1), grouped low-rank output via
:class:DeepseekV4GroupedLinear. compress_ratio == 0 layers skip
the compressor / indexer and run pure SWA.
Model-owned context-parallel hook, called by moe.parallelizer.apply_cp.
DSV4 runs Miles-style CP (contiguous query shard + all-gathered K/V), so there
is no TE DotProductAttention to configure — we just record the CP process
group that forward uses to all-gather K/V across CP ranks.
Bases: Module
HF PR 45616 port. Long-range KV branch. Pools compress_ratio tokens
into one compressed KV; when ratio == 4 the Indexer narrows the pool.
Bases: Module
Callable holder for fp32 tensors that need their own FSDP unit.
Bases: Linear
Block-diagonal grouped linear (HF PR 45616 port).
weight parameter has the standard nn.Linear shape
[out_features, in_features_per_group] so quantizers keyed on
nn.Linear.weight still find it; forward does per-group bmm.
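A block-diagonal grouped linear with a 2D, nn.Linear-shaped weight can be sketched as follows. The function name grouped_linear and the [tokens, groups, in_features_per_group] input layout are assumptions made for illustration, not the signature of DeepseekV4GroupedLinear.

```python
import torch

n_groups, in_per_group, out_per_group = 3, 4, 2
out_features = n_groups * out_per_group
weight = torch.randn(out_features, in_per_group)  # nn.Linear-style 2D weight
x = torch.randn(5, n_groups, in_per_group)        # [tokens, groups, in_per_group]

def grouped_linear(x, weight, n_groups):
    # View the 2D weight as one [out_per_group, in_per_group] block per group.
    w = weight.view(n_groups, -1, weight.shape[-1])
    # bmm batches over groups: [G, T, in] @ [G, in, out] -> [G, T, out]
    y = torch.bmm(x.transpose(0, 1), w.transpose(1, 2))
    return y.transpose(0, 1)                      # [T, G, out_per_group]

y = grouped_linear(x, weight, n_groups)

# Reference: the same blocks placed on the diagonal of one dense matrix.
dense = torch.block_diag(*weight.view(n_groups, out_per_group, in_per_group))
y_dense = x.reshape(5, -1) @ dense.t()
```

Both paths compute the same result; the grouped form simply never stores or multiplies the zero off-diagonal blocks.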
Bases: Module
Per-site HyperConnection mixer (attention or FFN). Ported from
transformers/src/transformers/models/deepseek_v4/modular_deepseek_v4.py
class DeepseekV4HyperConnection.
Owns fn (packed linear), base (bias), and scale (scalar
per-head gains). compute_weights produces three mixer tensors:
pre [B, S, H]: sigmoid-gated collapse weights
post [B, S, H]: sigmoid-gated expand weights
comb [B, S, H, H]: doubly-stochastic combination matrix from Sinkhorn-normalising sigmoid gates
All math runs in fp32 regardless of the outer cast policy; parameters
cast themselves via .float() on each forward. HF lists these params
in _keep_in_fp32_modules_strict — the KAutomodel adapter does the
same via submodule-name matching.
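Sinkhorn normalisation of the comb gates can be illustrated with plain alternating row and column normalisation in fp32. This is a sketch only: the helper name sinkhorn_normalize, the iteration count, and the eps value are assumptions, and the actual torch reference or TileKernels path may differ (for example by iterating in log space).

```python
import torch

def sinkhorn_normalize(gates, n_iters=30, eps=1e-6):
    """Alternately normalise rows and columns of a positive [..., H, H] tensor."""
    m = gates.float() + eps
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # make every row sum to 1
        m = m / m.sum(dim=-2, keepdim=True)  # make every column sum to 1
    return m

B, S, H = 2, 3, 4
# Bounded logits keep this demo well conditioned so few iterations suffice.
logits = 4 * torch.rand(B, S, H, H) - 2
comb = sinkhorn_normalize(torch.sigmoid(logits))
```

After enough iterations both the row sums and the column sums are close to 1, which is what "doubly stochastic" means here.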
Bases: Module
Final HC-stream collapse before the shared RMSNorm + lm_head.
Ported from modular_deepseek_v4.py class DeepseekV4HyperHead.
Sigmoid-weighted sum over the hc_mult streams (no Sinkhorn). Used
once at the end of DeepseekV4Model.forward to go from
[B, S, H, D] back to [B, S, D].
Bases: Module
HF PR 45616 port. Picks the top-k compressed positions per query when
compress_ratio == 4. Owns its own pool at index_head_dim plus a
query projection + weights_proj head-mixer.
Bases: Module
V4 rotary embedding. Produces (cos, sin) sized to qk_rope_head_dim
(via partial_rotary_factor = qk_rope_head_dim / head_dim), matching HF.
YaRN: when rope_scaling is a YaRN-typed dict
({"type": "yarn", "factor": F, "original_max_position_embeddings": L0, "beta_fast": ..., "beta_slow": ...}), modify inv_freq per
dsv4flash/inference/model.py:precompute_freqs_cis — frequency
interpolation with a smooth linear ramp between beta_fast/beta_slow
correction dims. Used by the compress-rope (theta=160000) on layers
with compress_ratio > 0. The main rope (theta=10000, used only on
sliding-window layers) gets rope_scaling=None because the reference
builds it with original_seq_len=0 for those layers.
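The YaRN adjustment of inv_freq described above can be sketched like this, following the publicly known YaRN formulation with a linear ramp between two correction dims. The function name yarn_inv_freq, the default beta values, and the demo sizes are illustrative assumptions; unlike the reference, this sketch always applies the scaling rather than gating it on sequence length.

```python
import math
import torch

def yarn_inv_freq(dim, base, factor, original_max_pos, beta_fast=32, beta_slow=1):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

    def correction_dim(num_rotations):
        # Dim index whose wavelength completes num_rotations over the original context.
        return dim * math.log(original_max_pos / (num_rotations * 2 * math.pi)) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # avoid a zero-width ramp
    ramp = ((torch.arange(dim // 2, dtype=torch.float32) - low) / (high - low)).clamp(0, 1)
    smooth = 1.0 - ramp
    # smooth == 1: keep the original frequency; smooth == 0: divide it by factor.
    return inv_freq / factor * (1.0 - smooth) + inv_freq * smooth

plain = 1.0 / (160000.0 ** (torch.arange(0, 64, 2, dtype=torch.float32) / 64))
scaled = yarn_inv_freq(dim=64, base=160000.0, factor=4.0, original_max_pos=4096)
```

High-frequency dims (small index) are left untouched, low-frequency dims (large index) are fully interpolated, and the dims in between are blended linearly.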
Training-only cache shim mirroring the three methods DeepseekV4Compressor
/ DeepseekV4Indexer call on DeepseekV4Cache.
KAutomodel training forward is stateless — we never persist KV or compressor
windows across calls. Each DeepseekV4Attention.forward creates a fresh
cache instance, which holds per-layer scratch dicts for the duration of the
call. When a full window hasn’t accumulated yet we return an empty tensor
and let the downstream code handle it.
Split x along its last dim into nope (first) and rope (last
rope_head_dim) slices, rotate only the rope slice with INTERLEAVED
pair-RoPE (pairs (2k, 2k+1)), concat back.
The DSV4-Flash released checkpoint uses interleaved RoPE end-to-end
(see dsv4flash/inference/model.py:apply_rotary_emb — complex
multiplication on view_as_complex of pairs). HF transformers PR
45616 / PR 45643 ship a Llama-style rotate_half here instead, which
pairs (d, d+rd/2). Same algebra but a different dim-to-frequency
mapping — the released weights expect the interleaved layout, so the
Llama-style helper produces wrong activations on the released checkpoint
(verified empirically: kv_post_rope cosine drops from 0.9999 to 0.866
after one block under Llama-style; matches at >0.999 under interleaved).
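The difference between the two pairings can be made concrete with a small experiment. The two helpers below are simplified illustrations written for this note (full-width rotation, no nope slice); they are not the module's functions.

```python
import torch

def rope_interleaved(x, angles):
    # Pairs (2k, 2k+1) share angle k.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

def rope_rotate_half(x, angles):
    # Pairs (d, d + rd/2) share angle d (Llama / GPT-NeoX layout).
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    x1, x2 = x.chunk(2, dim=-1)
    return x * cos + torch.cat((-x2, x1), dim=-1) * sin

rd, seq = 8, 5
x = torch.randn(seq, rd)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, rd, 2, dtype=torch.float32) / rd))
angles = torch.outer(torch.arange(seq, dtype=torch.float32), inv_freq)  # [seq, rd/2]

a = rope_interleaved(x, angles)
b = rope_rotate_half(x, angles)

# De-interleaving permutation: even indices first, then odd indices.
perm = torch.cat((torch.arange(0, rd, 2), torch.arange(1, rd, 2)))
c = rope_rotate_half(x[..., perm], angles)
```

Here a and b differ on the same input, while c equals a[..., perm]: the two layouts agree only after permuting the feature dims, which is why weights trained under one layout cannot be used with the other helper unchanged.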
Interleaved RoPE on the last rope_head_dim dims of x (pairs are
(2k, 2k+1)). Matches the DeepSeek inference reference’s complex-mul
formulation in dsv4flash/inference/model.py:apply_rotary_emb: the
released DSV4-Flash weights were trained with this layout, NOT the
Llama-style rotate_half layout HF transformers PR 45616/45643 still
uses (pairs (d, d+rd/2)).
Inverse rotation: pass -sin instead of sin (caller’s
responsibility — same as our existing inverse-rope call site).
Parameters:
x: [..., rope_head_dim] (or larger trailing dim with rope on the
last rope_head_dim slice). Typical attention-layout shapes:
[B, H, S, D] for q/k or [B, 1, S, D] for shared-KV.
cos, sin: shape [B, S, rope_head_dim] produced by the Llama-style
cat([freqs, freqs], -1) rotary; we take the first half, which
contains the unique per-pair frequencies (the second half is a
duplicate that the Llama-style helper needs and we don't).
rope_head_dim: Must be even.
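Putting the parameter conventions together, a partial interleaved RoPE that consumes Llama-style cos/sin and supports the inverse rotation might look like the sketch below. The function name apply_interleaved_rope and its exact argument order are assumptions for illustration.

```python
import torch

def apply_interleaved_rope(x, cos, sin, rope_head_dim):
    """x: [B, H, S, D]; cos/sin: [B, S, rope_head_dim] in cat([freqs, freqs]) layout."""
    assert rope_head_dim % 2 == 0
    half = rope_head_dim // 2
    # Keep the unique per-pair values and broadcast over the head axis.
    cos = cos[..., :half].unsqueeze(1)  # [B, 1, S, half]
    sin = sin[..., :half].unsqueeze(1)
    nope, rope = x[..., :-rope_head_dim], x[..., -rope_head_dim:]
    x1, x2 = rope[..., 0::2], rope[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return torch.cat((nope, rotated), dim=-1)

B, H, S, D, rd = 2, 3, 4, 16, 8
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, rd, 2, dtype=torch.float32) / rd))
freqs = torch.outer(torch.arange(S, dtype=torch.float32), inv_freq)  # [S, rd/2]
emb = torch.cat((freqs, freqs), dim=-1).expand(B, S, rd)             # Llama-style layout
cos, sin = emb.cos(), emb.sin()

q = torch.randn(B, H, S, D)
q_rot = apply_interleaved_rope(q, cos, sin, rd)
q_back = apply_interleaved_rope(q_rot, cos, -sin, rd)  # inverse rotation via -sin
```

The nope slice passes through unchanged, and applying the function a second time with -sin recovers the original tensor up to floating-point error.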
Build the additive compressed-position mask for Indexer-selected pool IDs.
Use TileLang DSV4 kernels only when the attention backend requests them.
Reshape [B, S, ratio, 2*head_dim] -> [B, S, 2*ratio, head_dim] with the
cross-window overlap from the DeepSeek inference reference (Compressor.overlap_transform
in dsv4flash/inference/model.py:307-314).
Window 0 has no previous block, so its [:ratio] slice is left at fill_value
(0 for the kv tensor, -inf for the score tensor so softmax masks it out).
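The overlap reshape can be sketched as below. The shapes and the fill_value behaviour follow the description above; which half of the 2*head_dim channel feeds the current window versus the following one is an assumption of this sketch, as is the free-function form (the reference is a method on the compressor).

```python
import torch

def overlap_transform(t, head_dim, fill_value=0.0):
    """[B, W, ratio, 2*head_dim] -> [B, W, 2*ratio, head_dim], W = number of windows."""
    b, w, ratio, _ = t.shape
    out = t.new_full((b, w, 2 * ratio, head_dim), fill_value)
    # Assumed layout: the second channel half belongs to the current window ...
    out[:, :, ratio:] = t[..., head_dim:]
    # ... and the first channel half is carried over from the previous window.
    out[:, 1:, :ratio] = t[:, :-1, :, :head_dim]
    return out

B, W, ratio, d = 1, 3, 4, 2
kv = torch.randn(B, W, ratio, 2 * d)
score = torch.randn(B, W, ratio, 2 * d)
kv_o = overlap_transform(kv, d, fill_value=0.0)
score_o = overlap_transform(score, d, fill_value=float("-inf"))
weights = score_o.softmax(dim=2)  # softmax over the 2*ratio axis
```

Because window 0 keeps -inf in its first ratio score slots, the softmax assigns those slots exactly zero weight, so the zero-filled kv entries never contribute.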
Softmax-gated sum-pool over ratio consecutive tokens.
Non-overlap mode (HF PR 45616 layout, ratio==128 in V4-Flash):
Input kv/gate of shape [B, length, head_dim].
Reshape to [B, length/ratio, ratio, head_dim] and pool over the ratio axis.
Overlap mode (DeepSeek inference reference layout, ratio==4 in V4-Flash):
Input kv/gate of shape [B, length, 2*head_dim] (wkv/wgate
project to 2*head_dim so each window can carry both its own kv and a
half-overlap into the next window).
Reshape to [B, length/ratio, ratio, 2*head_dim], apply :func:_overlap_transform
to remap to [B, length/ratio, 2*ratio, head_dim], then pool over the 2*ratio
axis. Each compressed token thus aggregates 2*ratio = 8 raw tokens — the
ratio tokens of the current window plus the ratio tokens of the previous
window — giving smoother compression boundaries that the released checkpoint
was trained under.
HF PR 45616 omits the overlap path entirely; the released DSV4-Flash safetensors
have ape/wkv/wgate shapes that only match the overlap layout ([ratio, 2*head_dim] and [2*head_dim, hidden]), so we must support it here to load
the released weights.
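For the non-overlap mode, the softmax-gated sum-pool reduces to a reshape, a softmax over the ratio axis, and a weighted sum. The sketch below assumes whole windows only and omits the ape term and any cache handling; the name gated_pool is hypothetical.

```python
import torch

def gated_pool(kv, gate, ratio):
    """kv, gate: [B, length, head_dim] -> [B, length // ratio, head_dim]."""
    b, length, d = kv.shape
    assert length % ratio == 0, "sketch assumes whole windows only"
    kv = kv.view(b, length // ratio, ratio, d)
    gate = gate.view(b, length // ratio, ratio, d)
    weights = gate.softmax(dim=2)  # normalise over the ratio axis, per channel
    return (weights * kv).sum(dim=2)

B, length, d, ratio = 2, 8, 3, 4
kv = torch.randn(B, length, d)
pooled = gated_pool(kv, torch.randn(B, length, d), ratio)
# With a constant gate the softmax is uniform, so pooling reduces to a mean.
mean_pooled = gated_pool(kv, torch.zeros(B, length, d), ratio)
```

The overlap mode uses the same pooling step, only applied over the 2*ratio axis after the overlap reshape.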
RMS-normalize the last dim without materializing an x.square() tensor.
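One possible way to avoid an explicit x.square() call is to obtain the sum of squares from a norm reduction. This is a sketch of that idea only; the actual helper may use a different trick, and the name rms_normalize and the eps default are assumptions.

```python
import torch

def rms_normalize(x, eps=1e-6):
    # Reduce straight to the L2 norm instead of calling x.square() first.
    norm = torch.linalg.vector_norm(x, ord=2, dim=-1, keepdim=True, dtype=torch.float32)
    mean_sq = norm * norm / x.shape[-1]  # tiny [..., 1] tensor, not x-sized
    return (x.float() * torch.rsqrt(mean_sq + eps)).to(x.dtype)

x = torch.randn(4, 16)
y = rms_normalize(x)
```

The result matches the textbook form x * rsqrt(mean(x^2) + eps) up to floating-point error.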
Rotate half the hidden dims of the input (Llama / GPT-NeoX style).
Port of transformers.models.llama.modeling_llama.apply_rotary_pos_emb.
Build a 4D additive causal+padding (+optional sliding-window) mask
compatible with eager_attention_with_sink.
Mirrors HF’s create_sliding_window_causal_mask (used in
DeepseekV4Model.forward): each query at position i attends only to
keys at positions [max(0, i - sliding_window + 1), i]. The DSV4-Flash
weights were trained with this banding on every layer, so dropping it makes
the softmax see a different distribution than training and degrades loss.
Returns:
[B, 1, S, S] additive mask of type dtype (0 at positions to keep, a large
negative value at masked positions).
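The banding rule can be sketched directly from the position range given above. The function name, argument names, and the convention that padding_mask is True for real tokens are assumptions of this sketch, not the module's signature.

```python
import torch

def sliding_window_causal_mask(batch, seq_len, sliding_window, dtype=torch.float32, padding_mask=None):
    pos = torch.arange(seq_len)
    q, k = pos[:, None], pos[None, :]
    # Query i keeps keys in [max(0, i - sliding_window + 1), i].
    keep = (k <= q) & (k > q - sliding_window)  # [S, S]
    keep = keep[None, None].expand(batch, 1, seq_len, seq_len)
    if padding_mask is not None:                # [B, S], True = real token
        keep = keep & padding_mask[:, None, None, :].bool()
    mask = torch.zeros(batch, 1, seq_len, seq_len, dtype=dtype)
    return mask.masked_fill(~keep, torch.finfo(dtype).min)

mask = sliding_window_causal_mask(batch=1, seq_len=5, sliding_window=3)
```

With sliding_window=3, the query at position 4 keeps only keys 2, 3, and 4; everything else in that row holds the large negative fill value.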
Build a 4D additive block-causal mask from packed-sequence lengths.
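A block-causal mask for packed sequences can be built by tagging each token with the index of the sequence it belongs to. This is a sketch under assumptions: the name block_causal_mask, the plain list-of-lengths input, and the [1, 1, T, T] output shape are chosen for illustration.

```python
import torch

def block_causal_mask(seq_lens, dtype=torch.float32):
    lens = torch.as_tensor(seq_lens)
    total = int(lens.sum())
    # doc_id[i] = index of the packed sequence that token i belongs to.
    doc_id = torch.repeat_interleave(torch.arange(len(lens)), lens)
    pos = torch.arange(total)
    # Keep a key only if it is in the same sequence and not in the future.
    keep = (doc_id[:, None] == doc_id[None, :]) & (pos[None, :] <= pos[:, None])
    mask = torch.zeros(1, 1, total, total, dtype=dtype)
    return mask.masked_fill(~keep, torch.finfo(dtype).min)

mask = block_causal_mask([2, 3])
```

For lengths [2, 3], tokens 0 and 1 form one causal block and tokens 2 to 4 form another, with no attention across the boundary.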
Eager attention with per-head sink: appends an extra softmax column
whose logit is module.sinks[h] and whose value-slot is zero. Ported
verbatim from HF PR 45616.
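The sink mechanism can be sketched as an extra logit column that takes part in the softmax and is then discarded. The function below is an illustration with a hypothetical name and signature; it takes already-expanded k and v and is not the ported HF function.

```python
import torch

def attention_with_sink(q, k, v, sinks, mask=None):
    """q: [B, H, S, D]; k, v: [B, H, T, D]; sinks: [H]; mask: additive, broadcastable to [B, H, S, T]."""
    scale = q.shape[-1] ** -0.5
    logits = torch.matmul(q, k.transpose(-1, -2)) * scale
    if mask is not None:
        logits = logits + mask
    b, h, s, _ = logits.shape
    sink_col = sinks.view(1, h, 1, 1).expand(b, h, s, 1)
    probs = torch.cat((logits, sink_col), dim=-1).softmax(dim=-1)
    # The sink's value slot is zero, so dropping its column equals multiplying by a zero row.
    return torch.matmul(probs[..., :-1], v), probs

B, H, S, D = 1, 2, 3, 4
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
sinks = torch.zeros(H)
out, probs = attention_with_sink(q, k, v, sinks)
```

The sink absorbs part of the probability mass, so the weights over real keys sum to less than 1 and a query with no useful key is not forced to attend strongly to anything.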
Port of transformers.models.llama.modeling_llama.repeat_kv.