nemo_automodel.components.models.deepseek_v4.state_dict_adapter
State dict adapter for DeepSeek V4.
The HF V4 checkpoint uses different key names than V3/V3.2. This adapter renames them, in addition to performing the standard FP8 dequantization and per-expert weight aggregation.
Key mapping (HF -> internal):
    embed.weight -> model.embed_tokens.weight
    norm.weight -> model.norm.weight
    head.weight -> lm_head.weight
    layers.{i}.attn_norm.weight -> model.layers.{i}.input_layernorm.weight
    layers.{i}.ffn_norm.weight -> model.layers.{i}.post_attention_layernorm.weight
    layers.{i}.attn.* -> model.layers.{i}.self_attn.*
    layers.{i}.ffn.gate.weight -> model.layers.{i}.mlp.gate.weight
    layers.{i}.ffn.gate.bias -> model.layers.{i}.mlp.gate.e_score_correction_bias
    layers.{i}.ffn.gate.tid2eid -> model.layers.{i}.mlp.gate.tid2eid (hash layers only)
    layers.{i}.ffn.shared_experts.w1.* -> model.layers.{i}.mlp.shared_experts.gate_proj.*
    layers.{i}.ffn.shared_experts.w3.* -> model.layers.{i}.mlp.shared_experts.up_proj.*
    layers.{i}.ffn.shared_experts.w2.* -> model.layers.{i}.mlp.shared_experts.down_proj.*
    layers.{i}.ffn.experts.{j}.w1.weight -> aggregated into model.layers.{i}.mlp.experts.gate_and_up_projs
    layers.{i}.ffn.experts.{j}.w3.weight -> aggregated into model.layers.{i}.mlp.experts.gate_and_up_projs
    layers.{i}.ffn.experts.{j}.w2.weight -> aggregated into model.layers.{i}.mlp.experts.down_projs
    layers.{i}.hc_attn_base/fn/scale -> model.layers.{i}.hc_attn_base/fn/scale
    layers.{i}.hc_ffn_base/fn/scale -> model.layers.{i}.hc_ffn_base/fn/scale
FP8 note: HF V4 stores scale as <key>.scale (not <key>.weight_scale_inv like V3).
Both suffixes are handled by the dequantization step.
Module Contents
Classes
Functions
Data
API
Bases: StateDictAdapter
State dict adapter for DeepSeek V4.
Aggregate per-expert weights (w1/w2/w3) into stacked gate_and_up/down tensors.
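The bookkeeping that precedes stacking can be sketched as follows. This is a minimal illustration, not the adapter's code: it only groups HF-side per-expert keys by layer and projection and orders them by expert index, so a later stack sees a stable order. The helper name `group_expert_keys` is hypothetical, and the actual tensor stacking (and the in-memory layout of `gate_and_up_projs`) is left out because the documentation above does not specify it.

```python
import re
from collections import defaultdict

# HF-side per-expert key pattern, taken from the key mapping in the module docstring.
_EXPERT_KEY = re.compile(r"^layers\.(\d+)\.ffn\.experts\.(\d+)\.(w1|w2|w3)\.weight$")


def group_expert_keys(keys):
    """Group per-expert keys by (layer, projection), ordered by expert index.

    Hypothetical helper: the real adapter's internals may differ.
    """
    grouped = defaultdict(dict)
    for key in keys:
        match = _EXPERT_KEY.match(key)
        if match is None:
            continue  # not a routed-expert weight
        layer = int(match.group(1))
        expert = int(match.group(2))
        proj = match.group(3)
        grouped[(layer, proj)][expert] = key
    # Sort numerically so expert 10 comes after expert 2.
    return {
        group: [by_expert[j] for j in sorted(by_expert)]
        for group, by_expert in grouped.items()
    }
```

Under this sketch, the w1 and w3 groups of a layer would feed the gate_and_up stack and the w2 group would feed the down stack.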
Return (int8 packed weight, float8_e8m0fnu scale) placeholders whose shapes / dtypes match the on-disk V4 Flash routed-expert layout.
The in-memory value is the dequantized bf16 tensor with shape [out, in];
the checkpoint tensor is int8 [out, in // 2] with an e8m0 scale
[out, in // 32]. DCP uses these placeholders only for shape/dtype
validation and as the destination buffer. Their contents are overwritten
on load, so the adapter builds empty tensors instead of re-packing real data.
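The shape arithmetic described above can be written out as a small helper. This is a sketch only: the function name `fp4_placeholder_shapes` and the divisibility check are assumptions made for illustration, and it returns plain shape tuples rather than allocating tensors.

```python
def fp4_placeholder_shapes(out_features, in_features, group_size=32):
    """Return (packed_weight_shape, scale_shape) for a logical [out, in] weight.

    Hypothetical helper. The divisibility check is an assumption of this
    sketch, not a documented requirement of the adapter.
    """
    if in_features % group_size != 0:
        raise ValueError("in_features must be a multiple of the scale group size")
    # Two FP4 values are packed into each int8 byte.
    packed_shape = (out_features, in_features // 2)
    # One e8m0 scale covers 32 consecutive columns of a row.
    scale_shape = (out_features, in_features // group_size)
    return packed_shape, scale_shape
```

For a logical weight of shape [4, 64] this gives a packed shape of (4, 32) and a scale shape of (4, 2).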
Return placeholders for the DeepSeek V4 Base routed-expert FP8 layout.
Read num_hash_layers directly from the checkpoint’s config.json.
We cannot rely on self.config.num_hash_layers alone: a YAML can
legitimately override the model’s hash-layer count to 0 (e.g. to
disable hash routing in the forward path), but the on-disk checkpoint
still has its original value and therefore still omits gate.bias for
the first num_hash_layers layers. To decide what to drop at load
time we must know the checkpoint’s own value.
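Reading the value from disk could look roughly like this. It is a sketch under assumptions: the function name `read_num_hash_layers` is hypothetical, and falling back to 0 when the file or the field is missing is a choice made here for illustration, not documented behavior.

```python
import json
from pathlib import Path


def read_num_hash_layers(checkpoint_dir, default=0):
    """Read num_hash_layers from <checkpoint_dir>/config.json.

    Hypothetical helper. Returning `default` for a missing file or field
    is an assumption of this sketch.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return default
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return int(config.get("num_hash_layers", default))
```

The point of the design is that this value comes from the checkpoint itself, so a YAML override of the model config cannot change which bias keys are treated as absent on disk.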
Dequantize FP8 weights. Handles both .scale and _scale_inv suffixes.
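One way to locate the scale tensor for either naming scheme is sketched below. The helper name `find_scale_key` and the preference for the V4 suffix when both exist are assumptions; the sketch only does the key lookup and does not perform the dequantization itself.

```python
def find_scale_key(weight_key, state_dict):
    """Return the scale key stored alongside weight_key, or None.

    Hypothetical helper covering both suffix conventions:
      V4 style: "<prefix>.weight" -> "<prefix>.scale"
      V3 style: "<prefix>.weight" -> "<prefix>.weight_scale_inv"
    """
    candidates = []
    if weight_key.endswith(".weight"):
        # V4 style: swap the trailing "weight" for "scale".
        candidates.append(weight_key[: -len("weight")] + "scale")
    # V3 style: append "_scale_inv" to the full weight key.
    candidates.append(weight_key + "_scale_inv")
    for candidate in candidates:
        if candidate in state_dict:
            return candidate
    return None  # no scale found: treat the weight as unquantized
```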
Unpack FP4 e2m1 packed-int8 weight and apply the per-row / 32-col e8m0 scale.
Packed layout: weight.int8 holds two FP4 values per byte. The low nibble
maps to the even column index and the high nibble to the following odd
column, so the logical shape is [out, weight.size(-1) * 2].
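The unpacking can be illustrated in pure Python on nested lists. This is a sketch, not the adapter's tensor implementation: the function names are hypothetical, it assumes the standard E2M1 value table (sign bit plus magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and the usual E8M0 decoding of 2 ** (byte - 127), and it does not special-case the E8M0 NaN encoding (255).

```python
# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code to a Python float."""
    magnitude = E2M1_VALUES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def decode_e8m0(byte):
    """E8M0 is exponent-only: value = 2 ** (byte - 127). NaN (255) is not handled."""
    return 2.0 ** ((byte & 0xFF) - 127)


def unpack_fp4_rows(packed_rows, scale_rows, group_size=32):
    """Unpack rows of int8-packed FP4 values and apply per-row, per-32-column scales.

    packed_rows: list of rows, each a list of int8 values (signed or unsigned).
    scale_rows:  list of rows, each a list of E8M0 bytes, one per 32 columns.
    """
    result = []
    for packed, scales in zip(packed_rows, scale_rows):
        row = []
        for byte in packed:
            byte &= 0xFF  # reinterpret a signed int8 as an unsigned byte
            row.append(decode_e2m1(byte & 0x0F))  # low nibble -> even column
            row.append(decode_e2m1(byte >> 4))    # high nibble -> odd column
        result.append(
            [value * decode_e8m0(scales[col // group_size]) for col, value in enumerate(row)]
        )
    return result
```

For example, the byte 0x21 holds code 1 (0.5) in the low nibble and code 2 (1.0) in the high nibble, so with a scale byte of 128 (a factor of 2) it unpacks to the pair 1.0, 2.0.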
The first num_hash_layers layers use hash-clustering routing and
their HF checkpoint has no ffn.gate.bias / e_score_correction_bias
tensor. The model side, however, creates the bias parameter uniformly
for every layer (Automodel’s generic Gate always materializes it when
gate_bias_update_factor > 0). Drop those bias keys before load so
DCP does not raise Missing key in checkpoint state_dict for them.
scope selects which key format to match — the pre-rename internal
form (model.layers.{i}.mlp.gate.e_score_correction_bias) or the
post-rename HF form (layers.{i}.ffn.gate.bias).
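The filtering step might be sketched like this. The enum name `KeyScope`, its members, and the function signature are hypothetical stand-ins; only the two key formats come from the description above.

```python
from enum import Enum


class KeyScope(Enum):
    """Hypothetical stand-in for the adapter's scope enum."""

    INTERNAL = "model.layers.{i}.mlp.gate.e_score_correction_bias"
    HF = "layers.{i}.ffn.gate.bias"


def drop_hash_layer_gate_bias(state_dict, num_hash_layers, scope):
    """Return a copy of state_dict without gate-bias keys for the first hash layers."""
    to_drop = {scope.value.format(i=i) for i in range(num_hash_layers)}
    return {key: value for key, value in state_dict.items() if key not in to_drop}
```

Bias keys of layers at index num_hash_layers and above are left untouched, since those layers do carry the tensor in the checkpoint.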
Scale shape for an FP4 routed-expert weight tensor.
The weight argument should be the unpacked tensor (in the model-side
state dict, experts are already materialized at full dtype), so its
last dim is the true in dim and the scale has in // 32 columns.
Apply the HF->internal rename table to every key.
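A regex-driven version of part of the rename table could look like the sketch below. It covers only a subset of the documented mapping (top-level tensors, norms, attention, and the gate), and the names `_RENAME_RULES`, `rename_key`, and `rename_state_dict` are hypothetical. Keys that match no rule are returned unchanged.

```python
import re

# Subset of the documented HF -> internal mapping; the first matching rule wins.
_RENAME_RULES = [
    (re.compile(r"^embed\.weight$"), "model.embed_tokens.weight"),
    (re.compile(r"^norm\.weight$"), "model.norm.weight"),
    (re.compile(r"^head\.weight$"), "lm_head.weight"),
    (re.compile(r"^layers\.(\d+)\.attn_norm\.weight$"),
     r"model.layers.\1.input_layernorm.weight"),
    (re.compile(r"^layers\.(\d+)\.ffn_norm\.weight$"),
     r"model.layers.\1.post_attention_layernorm.weight"),
    (re.compile(r"^layers\.(\d+)\.attn\.(.+)$"), r"model.layers.\1.self_attn.\2"),
    (re.compile(r"^layers\.(\d+)\.ffn\.gate\.bias$"),
     r"model.layers.\1.mlp.gate.e_score_correction_bias"),
    (re.compile(r"^layers\.(\d+)\.ffn\.gate\.(weight|tid2eid)$"),
     r"model.layers.\1.mlp.gate.\2"),
]


def rename_key(key):
    """Apply the first matching rule; return the key unchanged if none matches."""
    for pattern, replacement in _RENAME_RULES:
        if pattern.match(key):
            return pattern.sub(replacement, key)
    return key


def rename_state_dict(state_dict):
    """Rename every key in a state dict, keeping values as-is."""
    return {rename_key(key): value for key, value in state_dict.items()}
```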
Inverse of expert aggregation: split gate_and_up/down stacks into per-expert keys.
Handles DTensor inputs (EP-sharded) by working on the local shard only, emitting keys only for the experts owned by the current rank.
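Which keys a given rank emits can be sketched without any tensors. The example assumes experts are split into equal contiguous blocks across expert-parallel ranks, which is an assumption of this sketch rather than something the documentation states; the function name `local_expert_keys` is also hypothetical. Keys are shown in their final HF form for readability.

```python
def local_expert_keys(layer, num_experts, ep_rank, ep_size):
    """List the per-expert HF keys a rank would emit for one layer.

    Assumes an even, contiguous partition of experts over EP ranks
    (an assumption of this sketch).
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    experts_per_rank = num_experts // ep_size
    first_expert = ep_rank * experts_per_rank
    keys = []
    for local_index in range(experts_per_rank):
        expert = first_expert + local_index  # global expert id for this local slot
        for proj in ("w1", "w2", "w3"):
            keys.append(f"layers.{layer}.ffn.experts.{expert}.{proj}.weight")
    return keys
```

With 8 experts over 4 ranks, rank 1 would emit keys for experts 2 and 3 only.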
Convert HF checkpoint to internal format.
Convert internal state dict to HF V4 format.
Splits stacked expert weights back to per-expert w1/w2/w3 tensors, applies key renaming in reverse, and optionally quantizes to FP8.
Bases: enum.Enum
On-disk routed-expert quantization layout for DeepSeek V4 checkpoints.
Bases: enum.Enum
Key-format scope for DeepSeekV4StateDictAdapter._drop_hash_layer_gate_bias.
Apply simple rename rules; returns the key unchanged if no rule matches.