nemo_automodel.components.models.deepseek_v4.model
DeepSeek V4 Model.
Key architectural points (from official inference/model.py):
HC (Hyper-Connections): Every transformer block maintains hc_mult=4 copies of the hidden state. The embedding output is expanded: [B,S,dim] -> [B,S,hc_mult,dim]. hc_pre reduces [B,S,hc_mult,dim] -> [B,S,dim] before attn/ffn. hc_post expands [B,S,dim] -> [B,S,hc_mult,dim] after attn/ffn. Full HC requires the hc_split_sinkhorn CUDA kernel. Current fallback: mean-pooling for hc_pre, broadcast add for hc_post.
HC parameters (ALL layers, stored in float32), where mix_hc = (2 + hc_mult) * hc_mult = 24:
hc_attn_fn: [mix_hc, hc_mult * dim]
hc_attn_base: [mix_hc]
hc_attn_scale: [3]
hc_ffn_fn: [mix_hc, hc_mult * dim]
hc_ffn_base: [mix_hc]
hc_ffn_scale: [3]
Gate hash layers (layer_idx < num_hash_layers): Instead of score-based routing, the gate uses a fixed token-id -> expert-id lookup table (tid2eid: [vocab_size, n_activated_experts]).
All layers use MoE FFN (no dense layers). Compress-ratio sliding-window attention is not yet implemented.
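The Hyper-Connections points above (parameter shapes and the fallback data flow) can be sketched per token in plain Python. This is a minimal illustration, not the module's tensor code: the helper names are hypothetical, it assumes the embedding is expanded by copying, and it assumes the hc_post "broadcast add" adds the sublayer output onto each of the hc_mult copies.

```python
def mix_hc(hc_mult):
    # mix_hc = (2 + hc_mult) * hc_mult, e.g. 24 for hc_mult = 4
    return (2 + hc_mult) * hc_mult


def hc_param_shapes(hc_mult, dim):
    # Shapes of the six HC parameters listed above, per layer.
    m = mix_hc(hc_mult)
    shapes = {}
    for site in ("attn", "ffn"):
        shapes[f"hc_{site}_fn"] = (m, hc_mult * dim)
        shapes[f"hc_{site}_base"] = (m,)
        shapes[f"hc_{site}_scale"] = (3,)
    return shapes


def expand_embedding(h, hc_mult):
    # [dim] -> [hc_mult, dim]; copying is an assumption of this sketch.
    return [list(h) for _ in range(hc_mult)]


def hc_pre_fallback(copies):
    # [hc_mult, dim] -> [dim] by mean-pooling over the hc_mult axis.
    n = len(copies)
    return [sum(column) / n for column in zip(*copies)]


def hc_post_fallback(copies, sublayer_out):
    # [hc_mult, dim] and [dim] -> [hc_mult, dim] by broadcast add.
    return [[c + o for c, o in zip(copy, sublayer_out)] for copy in copies]


copies = expand_embedding([1.0, 2.0], hc_mult=4)
pooled = hc_pre_fallback(copies)             # input handed to attn/ffn
updated = hc_post_fallback(copies, [0.5, -1.0])
```

In the real model these operations act on [B,S,hc_mult,dim] tensors; the per-token lists here only make the shape contract visible.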
Module Contents
Classes
Data
API
Bases: Module
Single transformer block for DeepSeek V4.
Uses HuggingFace transformers PR 45616’s HyperConnection decoder-layer
pattern: two DeepseekV4HyperConnection modules own the collapse /
expand mixer weights at the attention and FFN sites respectively.
The checkpoint’s flat hc_attn_* / hc_ffn_* keys are routed into
attn_hc.* / ffn_hc.* by the state-dict adapter.
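The key routing described above might look like the following sketch. It is not the adapter's actual code: the function name route_hc_key, the "layers.N." prefix in the example keys, and the assumption that the suffix after hc_attn_ / hc_ffn_ is reused unchanged as the sub-module parameter name are all illustrative.

```python
import re

# Matches a trailing "hc_attn_<name>" or "hc_ffn_<name>" key component.
_HC_KEY = re.compile(r"(^|\.)hc_(attn|ffn)_(\w+)$")


def route_hc_key(key):
    # "…hc_attn_fn" -> "…attn_hc.fn"; keys that do not match are returned as-is.
    return _HC_KEY.sub(
        lambda m: f"{m.group(1)}{m.group(2)}_hc.{m.group(3)}", key
    )


flat_keys = ["layers.0.hc_attn_fn", "layers.0.hc_ffn_scale", "layers.0.attn.wq.weight"]
routed = [route_hc_key(k) for k in flat_keys]
```

Only keys ending in an hc_attn_* or hc_ffn_* component are rewritten, so the rest of the state dict passes through untouched.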
Enable block-local checkpointing that avoids replaying MoE dispatch.
Bases: CausalLMOutputWithPast
Output of DeepseekV4ForCausalLM.forward.
Subclasses transformers.modeling_outputs.CausalLMOutputWithPast so the
standard logits / hidden_states fields are present (the recipe’s
fused cross-entropy path requires "hidden_states" in out and reads the
final hidden states off the output) while the DSV4-specific MTP fields are
carried as declared dataclass fields. As required by ModelOutput, every
field after the first declares a None default.
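The None-default rule can be illustrated with a stdlib-only stand-in. OutputBase below mimics the kind of check ModelOutput performs on its dataclass fields; it is not the transformers class, and the field name mtp_logits is a hypothetical placeholder for the DSV4-specific MTP fields.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


class OutputBase:
    """Stand-in for the ModelOutput field check (not the real class)."""

    def __post_init__(self):
        # Every field after the first must default to None.
        if not all(f.default is None for f in fields(self)[1:]):
            raise ValueError(
                f"{type(self).__name__} should not have more than one required field."
            )


@dataclass
class CausalLMOutputSketch(OutputBase):
    loss: Optional[Any] = None
    logits: Optional[Any] = None
    hidden_states: Optional[Any] = None


@dataclass
class DSV4OutputSketch(CausalLMOutputSketch):
    mtp_logits: Optional[Any] = None  # hypothetical MTP field, None default


@dataclass
class BadOutputSketch(CausalLMOutputSketch):
    mtp_count: int = 0  # non-None default: rejected by the check above


ok = DSV4OutputSketch(logits=[0.1, 0.9])
```

Constructing DSV4OutputSketch succeeds, while constructing BadOutputSketch raises ValueError because one of its trailing fields has a non-None default.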
Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin
Keep DSV4 non-layer PP dependencies with the stages that need them.
Return PP input/output meta tensors for DSV4’s HC and MTP contract.
Model-owned context-parallel batch prep (Miles-style contiguous shard).
Returns the keys cp_utils.make_cp_batch_and_ctx needs to delegate CP
sharding back to this model: a _cp_make_batch_fn callable (with the
config-derived per-rank shard multiple bound) plus a flag asking the recipe
to keep the full logits in the autograd graph so every CP rank’s backward
reaches all parameters even when its local loss is fully masked. DSV4 embeds
internally, so (unlike VLM models) this does not pre-embed — it leaves
input_ids for the sharding callable.
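A rough sketch of the contract described above, under stated assumptions: contiguous_cp_shard and prepare_cp_batch_keys are hypothetical names, the flag key keep_full_logits_in_graph is invented for illustration (the source does not name it), and padding the sequence so that each rank's slice is a multiple of the shard multiple is one plausible reading of "per-rank shard multiple". Only the _cp_make_batch_fn key comes from the text.

```python
from functools import partial


def contiguous_cp_shard(input_ids, cp_rank, cp_size, shard_multiple, pad_id=0):
    # Pad so each rank's slice length is a multiple of shard_multiple,
    # then hand rank r the r-th contiguous slice of the sequence.
    chunk = cp_size * shard_multiple
    padded_len = -(-len(input_ids) // chunk) * chunk  # ceil to a multiple of chunk
    padded = list(input_ids) + [pad_id] * (padded_len - len(input_ids))
    per_rank = padded_len // cp_size
    return padded[cp_rank * per_rank:(cp_rank + 1) * per_rank]


def prepare_cp_batch_keys(shard_multiple):
    # The sharding callable has the config-derived multiple bound in advance.
    return {
        "_cp_make_batch_fn": partial(contiguous_cp_shard, shard_multiple=shard_multiple),
        "keep_full_logits_in_graph": True,  # hypothetical flag name
    }


keys = prepare_cp_batch_keys(shard_multiple=4)
make_batch = keys["_cp_make_batch_fn"]
shards = [make_batch(list(range(1, 11)), rank, 2) for rank in range(2)]
```

Because the model embeds internally, the callable in this sketch operates on raw input_ids rather than on pre-computed embeddings.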
Bases: Module
Hash gate for the first num_hash_layers layers: routes tokens via a fixed lookup table.
Instead of computing routing scores, the gate uses tid2eid[token_id] to pre-assign expert indices. The routing weight is still computed from the gate weight but the selection is deterministic per token id.
tid2eid shape: [vocab_size, n_activated_experts] (int64 runtime, non-trainable)
The signature matches components.moe.layers.Gate, that is, forward(x, token_mask, cp_mesh)
returning (weights, indices, aux_loss), so the generic MoE
module can call it interchangeably. The per-forward input_ids needed
for the tid2eid lookup is stashed on the module by the enclosing Block via
set_input_ids immediately before the MoE call.
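The stash-then-forward pattern can be sketched with plain Python lists. TinyHashGate is a hypothetical stand-in, not the module itself: the softmax scoring, the renormalisation over the selected experts, and the None aux_loss are assumptions, since the source only says that weights come from the gate weight and indices come from tid2eid.

```python
import math


class TinyHashGate:
    def __init__(self, gate_weight, tid2eid):
        self.gate_weight = gate_weight  # [n_experts, dim]
        self.tid2eid = tid2eid          # [vocab_size, n_activated_experts]
        self._input_ids = None

    def set_input_ids(self, input_ids):
        # Stash the token ids for the next forward call.
        self._input_ids = list(input_ids)

    def update_bias(self, *args, **kwargs):
        # No-op, kept for callers that walk MoE gates.
        return None

    def forward(self, x, token_mask=None, cp_mesh=None):
        weights, indices = [], []
        for token_vec, token_id in zip(x, self._input_ids):
            experts = self.tid2eid[token_id]  # fixed per token id
            logits = [sum(w * v for w, v in zip(row, token_vec)) for row in self.gate_weight]
            peak = max(logits)
            exps = [math.exp(l - peak) for l in logits]
            total = sum(exps)
            picked = [exps[e] / total for e in experts]  # scores of the chosen experts
            norm = sum(picked)
            weights.append([p / norm for p in picked])
            indices.append(list(experts))
        return weights, indices, None  # aux_loss assumed None in this sketch


gate = TinyHashGate(
    gate_weight=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    tid2eid=[[0, 1], [1, 2], [2, 0]],
)
gate.set_input_ids([2, 0])
weights, indices, aux_loss = gate.forward([[0.5, -1.0], [2.0, 0.3]])
```

The selected experts depend only on the token ids, while the weights still vary with the hidden state x.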
Stash the current batch’s input_ids for the next forward call.
No-op for compat with callers that walk MoE gates and call update_bias.
Bases: Module