nemo_automodel.components.models.deepseek_v4.state_dict_adapter#
State dict adapter for DeepSeek V4.
HF V4 uses different key names compared to V3/V3.2. This adapter performs the necessary renaming on top of the standard FP8 dequantization and per-expert weight aggregation.
Key mapping (HF -> internal):
- embed.weight -> model.embed_tokens.weight
- norm.weight -> model.norm.weight
- head.weight -> lm_head.weight
- layers.{i}.attn_norm.weight -> model.layers.{i}.input_layernorm.weight
- layers.{i}.ffn_norm.weight -> model.layers.{i}.post_attention_layernorm.weight
- layers.{i}.attn.* -> model.layers.{i}.self_attn.*
- layers.{i}.ffn.gate.weight -> model.layers.{i}.mlp.gate.weight
- layers.{i}.ffn.gate.bias -> model.layers.{i}.mlp.gate.e_score_correction_bias
- layers.{i}.ffn.gate.tid2eid -> model.layers.{i}.mlp.gate.tid2eid (hash layers only)
- layers.{i}.ffn.shared_experts.w1.* -> model.layers.{i}.mlp.shared_experts.gate_proj.*
- layers.{i}.ffn.shared_experts.w3.* -> model.layers.{i}.mlp.shared_experts.up_proj.*
- layers.{i}.ffn.shared_experts.w2.* -> model.layers.{i}.mlp.shared_experts.down_proj.*
- layers.{i}.ffn.experts.{j}.w1.weight -> aggregated into model.layers.{i}.mlp.experts.gate_and_up_projs
- layers.{i}.ffn.experts.{j}.w3.weight -> aggregated into model.layers.{i}.mlp.experts.gate_and_up_projs
- layers.{i}.ffn.experts.{j}.w2.weight -> aggregated into model.layers.{i}.mlp.experts.down_projs
- layers.{i}.hc_attn_base/fn/scale -> model.layers.{i}.hc_attn_base/fn/scale
- layers.{i}.hc_ffn_base/fn/scale -> model.layers.{i}.hc_ffn_base/fn/scale
FP8 note: HF V4 stores scale as <key>.scale (not <key>.weight_scale_inv like V3).
Both suffixes are handled by the dequantization step.
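As an illustration of how such a table-driven rename works, here is a minimal sketch using a few of the rules above. The pattern strings are paraphrased from the mapping and are not the module's actual _HF_TO_INTERNAL_RENAMES entries.

```python
import re

# Illustrative only: a handful of HF -> internal rules from the mapping above,
# expressed as (compiled pattern, replacement) pairs.
_EXAMPLE_RENAMES = [
    (re.compile(r"^embed\.weight$"), "model.embed_tokens.weight"),
    (re.compile(r"^norm\.weight$"), "model.norm.weight"),
    (re.compile(r"^head\.weight$"), "lm_head.weight"),
    (re.compile(r"^layers\.(\d+)\.attn_norm\.weight$"),
     r"model.layers.\1.input_layernorm.weight"),
    (re.compile(r"^layers\.(\d+)\.ffn\.gate\.weight$"),
     r"model.layers.\1.mlp.gate.weight"),
]

def rename_example(key: str) -> str:
    """Return the internal key for `key`, or `key` unchanged if no rule matches."""
    for pattern, replacement in _EXAMPLE_RENAMES:
        if pattern.match(key):
            return pattern.sub(replacement, key)
    return key

assert rename_example("layers.3.attn_norm.weight") == "model.layers.3.input_layernorm.weight"
assert rename_example("some.other.key") == "some.other.key"
```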
Module Contents#
Classes#
| Class | Summary |
|---|---|
| _HashBiasScope | Key-format scope for DeepSeekV4StateDictAdapter._drop_hash_layer_gate_bias. |
| _ExpertQuantLayout | On-disk routed-expert quantization layout for DeepSeek V4 checkpoints. |
| DeepSeekV4StateDictAdapter | State dict adapter for DeepSeek V4. |
Functions#
| Function | Summary |
|---|---|
| _rename_hf_key | Apply simple rename rules; returns the key unchanged if no rule matches. |
Data#
API#
- nemo_automodel.components.models.deepseek_v4.state_dict_adapter.FP4_COL_BLOCK#
32
- nemo_automodel.components.models.deepseek_v4.state_dict_adapter._FP4_E2M1_TABLE#
‘tensor(…)’
- nemo_automodel.components.models.deepseek_v4.state_dict_adapter._HF_TO_INTERNAL_RENAMES: list[tuple[re.Pattern, str]]#
[(), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), ()]
- nemo_automodel.components.models.deepseek_v4.state_dict_adapter._EXPERT_PATTERN#
‘compile(…)’
- class nemo_automodel.components.models.deepseek_v4.state_dict_adapter._HashBiasScope(*args, **kwds)#
Bases: enum.Enum
Key-format scope for DeepSeekV4StateDictAdapter._drop_hash_layer_gate_bias.
Initialization
- INTERNAL#
‘compile(…)’
- HF#
‘compile(…)’
- class nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout(*args, **kwds)#
Bases: enum.Enum
On-disk routed-expert quantization layout for DeepSeek V4 checkpoints.
Initialization
- FP4#
‘fp4’
- FP8#
‘fp8’
- nemo_automodel.components.models.deepseek_v4.state_dict_adapter._rename_hf_key(key: str) str#
Apply simple rename rules; returns the key unchanged if no rule matches.
- class nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter(
- config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig,
- backend: nemo_automodel.components.models.common.BackendConfig,
- dtype: torch.dtype = torch.float32,
Bases: nemo_automodel.components.checkpoint.state_dict_adapter.StateDictAdapter
State dict adapter for DeepSeek V4.
Initialization
- from_hf(
- hf_state_dict: dict[str, Any],
- device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
- **kwargs,
Convert HF checkpoint to internal format.
Steps:
1. Dequantize FP8 weights (scale suffix is either .scale or _scale_inv).
2. Aggregate per-expert routed weights into stacked tensors.
3. Rename remaining keys using the HF -> internal mapping table.
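A hedged usage sketch of the conversion round trip; constructing the adapter itself requires the model, MoE, and backend configs, which are out of scope here.

```python
from typing import Any

def convert_checkpoint_round_trip(adapter, hf_state_dict: dict[str, Any]) -> dict[str, Any]:
    """Sketch: HF -> internal (dequantize, aggregate, rename), then back to HF.

    `adapter` is an already-built DeepSeekV4StateDictAdapter instance.
    """
    internal_sd = adapter.from_hf(hf_state_dict, device_mesh=None)
    # ... load `internal_sd` into the model, train, checkpoint, etc. ...
    return adapter.to_hf(internal_sd, quantization=False)
```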
- _dequantize(
- state_dict: dict[str, Any],
Dequantize FP8 weights. Handles both .scale and _scale_inv suffixes.
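A minimal sketch of block-wise FP8 dequantization, assuming the V3-style 128x128 block-scale layout; the adapter's actual kernel and block size may differ, and it additionally resolves whether the scale arrived under a .scale or weight_scale_inv key.

```python
import torch

def dequantize_fp8_blockwise(
    weight_fp8: torch.Tensor,        # [out, in], torch.float8_e4m3fn
    scale: torch.Tensor,             # [ceil(out/B), ceil(in/B)], float32
    block: int = 128,                # assumed block size (V3-style checkpoints)
    dtype: torch.dtype = torch.bfloat16,
) -> torch.Tensor:
    """Multiply each BxB block of the FP8 weight by its per-block scale."""
    w = weight_fp8.to(torch.float32)
    out, in_ = w.shape
    # Expand per-block scales to elementwise scales, then crop to the weight shape.
    s = scale.repeat_interleave(block, dim=0)[:out]
    s = s.repeat_interleave(block, dim=1)[:, :in_]
    return (w * s).to(dtype)
```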
- _aggregate_experts(
- state_dict: dict[str, Any],
- device_mesh: torch.distributed.device_mesh.DeviceMesh | None,
Aggregate per-expert weights (w1/w2/w3) into stacked gate_and_up/down tensors.
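A sketch of the stacking idea, assuming gate (w1) and up (w3) are concatenated along the output dimension before stacking over experts; the adapter's actual memory layout and DTensor handling are internal details.

```python
import torch

def aggregate_layer_experts(
    per_expert: dict[str, torch.Tensor],   # e.g. {"0.w1": ..., "0.w2": ..., "0.w3": ...}
    n_experts: int,
) -> dict[str, torch.Tensor]:
    """Illustrative aggregation of per-expert weights into stacked tensors."""
    gate_and_up, down = [], []
    for e in range(n_experts):
        w1 = per_expert[f"{e}.w1"]          # gate projection  [inter, hidden]
        w3 = per_expert[f"{e}.w3"]          # up projection    [inter, hidden]
        w2 = per_expert[f"{e}.w2"]          # down projection  [hidden, inter]
        gate_and_up.append(torch.cat([w1, w3], dim=0))
        down.append(w2)
    return {
        "gate_and_up_projs": torch.stack(gate_and_up),  # [n_experts, 2*inter, hidden]
        "down_projs": torch.stack(down),                # [n_experts, hidden, inter]
    }
```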
- _rename_all(
- state_dict: dict[str, Any],
Apply the HF->internal rename table to every key.
- to_hf(
- state_dict: dict[str, Any],
- exclude_key_regex: str | None = None,
- quantization: bool = False,
- **kwargs,
Convert internal state dict to HF V4 format.
Splits stacked expert weights back to per-expert w1/w2/w3 tensors, applies key renaming in reverse, and optionally quantizes to FP8.
- _checkpoint_num_hash_layers() int#
Read num_hash_layers directly from the checkpoint's config.json.
We cannot rely on self.config.num_hash_layers alone: a YAML can legitimately override the model's hash-layer count to 0 (e.g. to disable hash routing in the forward path), but the on-disk checkpoint still has its original value and therefore still omits gate.bias for the first num_hash_layers layers. To decide what to drop at load time, we must know the checkpoint's own value.
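A sketch of the lookup, assuming the checkpoint directory is known and holds a standard config.json; how the adapter resolves that path is an internal detail.

```python
import json
from pathlib import Path

def checkpoint_num_hash_layers(checkpoint_dir: str) -> int:
    """Read num_hash_layers from the checkpoint's own config.json.

    Illustrative only: a missing field is treated as "no hash layers" here.
    """
    cfg = json.loads((Path(checkpoint_dir) / "config.json").read_text())
    return int(cfg.get("num_hash_layers", 0))
```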
- _drop_hash_layer_gate_bias(
- state_dict: dict[str, Any],
- scope: nemo_automodel.components.models.deepseek_v4.state_dict_adapter._HashBiasScope,
The first num_hash_layers layers use hash-clustering routing and their HF checkpoint has no ffn.gate.bias / e_score_correction_bias tensor. The model side, however, creates the bias parameter uniformly for every layer (Automodel's generic Gate always materializes it when gate_bias_update_factor > 0). Drop those bias keys before load so DCP does not raise "Missing key in checkpoint state_dict" for them.
scope selects which key format to match: the pre-rename internal form (model.layers.{i}.mlp.gate.e_score_correction_bias) or the post-rename HF form (layers.{i}.ffn.gate.bias).
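A sketch of the filtering, using a boolean flag in place of the _HashBiasScope enum; the pattern strings follow the two key forms named above.

```python
import re
from typing import Any

def drop_hash_layer_gate_bias(
    state_dict: dict[str, Any],
    num_hash_layers: int,
    hf_format: bool,
) -> dict[str, Any]:
    """Remove gate-bias keys for the first `num_hash_layers` layers."""
    pattern = (
        re.compile(r"^layers\.(\d+)\.ffn\.gate\.bias$")
        if hf_format
        else re.compile(r"^model\.layers\.(\d+)\.mlp\.gate\.e_score_correction_bias$")
    )

    def keep(key: str) -> bool:
        m = pattern.match(key)
        return m is None or int(m.group(1)) >= num_hash_layers

    return {k: v for k, v in state_dict.items() if keep(k)}
```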
- _INTERNAL_TO_HF_RENAMES: list[tuple[re.Pattern, str]]#
[(), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), ()]
- _internal_key_to_hf(key: str) str#
- convert_single_tensor_to_hf(
- fqn: str,
- tensor: Any,
- **kwargs,
- static _build_fp4_expert_placeholders(
- value: Any,
Return (int8 packed weight, float8_e8m0fnu scale) placeholders whose shapes / dtypes match the on-disk V4 Flash routed-expert layout.
The current value is the dequantized bf16 tensor with shape [out, in]; the checkpoint tensor is int8 [out, in // 2] with an e8m0 scale [out, in // 32]. DCP only uses these placeholders for shape/dtype validation and as the destination buffer; contents are overwritten on load, so we build empty tensors instead of re-packing real data.
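A sketch of the placeholder construction under the shapes stated above; it assumes a PyTorch build that exposes torch.float8_e8m0fnu.

```python
import torch

FP4_COL_BLOCK = 32  # columns per e8m0 scale, matching the module constant above

def fp4_expert_placeholders(dequantized: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Build shape/dtype placeholders matching the on-disk FP4 routed-expert layout.

    `dequantized` is the bf16 tensor of shape [out, in]. Contents are never read
    by DCP, so empty buffers suffice.
    """
    out, in_ = dequantized.shape[-2], dequantized.shape[-1]
    packed = torch.empty(out, in_ // 2, dtype=torch.int8)
    scale = torch.empty(out, in_ // FP4_COL_BLOCK, dtype=torch.float8_e8m0fnu)
    return packed, scale
```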
- static _build_fp8_expert_placeholders(
- value: Any,
Return placeholders for the DeepSeek V4 Base routed-expert FP8 layout.
- static _build_fp8_global_scale_placeholder(value: Any) torch.Tensor#
- static _build_fp8_dtensor_scale_placeholder(value: Any) Any#
- static _empty_or_cast_fp8(value: torch.Tensor) torch.Tensor#
- _NON_QUANTIZED_PATTERNS#
[‘attn_norm.weight’, ‘ffn_norm.weight’, ‘norm.weight’, ‘head.weight’, ‘embed.weight’, ‘ffn.gate.weig…
- _is_non_quantized(hf_key: str) bool#
- static _is_expert_weight_key(key: str) bool#
- _scale_shape(weight: torch.Tensor) tuple[int, int]#
- static _scale_shape_from_shape(
- shape: torch.Size | tuple[int, ...],
- _expert_scale_shape(weight: torch.Tensor) tuple[int, int]#
Scale shape for an FP4 routed-expert weight tensor.
The weight argument should be the unpacked tensor (in the model-side state dict, experts are already materialized at full dtype), so its last dim is the true in dim and the scale has in // 32 columns.
- _dequantize_expert_weight(
- key: str,
- weight: torch.Tensor,
- scale: torch.Tensor,
- _expert_quant_layout_from_tensors(
- weight: torch.Tensor,
- scale: torch.Tensor,
- _checkpoint_expert_quant_layout() nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout#
- _detect_checkpoint_expert_quant_layout() nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout#
- static _dequantize_expert_fp4(
- weight: torch.Tensor,
- scale: torch.Tensor,
- dtype: torch.dtype,
Unpack FP4 e2m1 packed-int8 weight and apply the per-row / 32-col e8m0 scale.
Packed layout: weight.int8 holds two FP4 values per byte, the low nibble at the even column index and the high nibble at the following odd column, so the logical shape is [out, weight.size(-1) * 2].
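A sketch of the unpacking described above, using the standard E2M1 value table in place of the module's _FP4_E2M1_TABLE; the adapter's own implementation may vectorize differently.

```python
import torch

# Standard FP4 E2M1 lookup for nibble values 0..15 (sign bit in the nibble's MSB).
E2M1_TABLE = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)

def dequantize_fp4_e2m1(
    packed: torch.Tensor,   # int8 [out, in // 2], two FP4 values per byte
    scale: torch.Tensor,    # e8m0 scale [out, in // 32]
    dtype: torch.dtype = torch.bfloat16,
) -> torch.Tensor:
    """Unpack low/high nibbles to even/odd columns and apply the 32-column scales."""
    b = packed.view(torch.uint8).to(torch.int64)
    lo, hi = b & 0x0F, (b >> 4) & 0x0F
    out, half = packed.shape
    vals = torch.empty(out, half * 2)
    vals[:, 0::2] = E2M1_TABLE[lo]       # low nibble -> even column
    vals[:, 1::2] = E2M1_TABLE[hi]       # high nibble -> odd column
    s = scale.to(torch.float32).repeat_interleave(32, dim=1)[:, : half * 2]
    return (vals * s).to(dtype)
```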
- _split_merged_expert(
- fqn: str,
- tensor: Any,
Inverse of expert aggregation: split gate_and_up/down stacks into per-expert keys.
Handles DTensor inputs (EP-sharded) by working on the local shard only, emitting keys only for the experts owned by the current rank.
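A sketch of the splitting for the gate_and_up stack only, with the expert-id offset for EP-sharded DTensors passed in explicitly; the adapter derives that offset from the device mesh and also emits the down-projection keys.

```python
import torch

def split_merged_expert(
    fqn: str,
    gate_and_up: torch.Tensor,   # local shard, [n_local_experts, 2 * inter, hidden]
    first_expert_id: int = 0,    # global id of this rank's first local expert
) -> dict[str, torch.Tensor]:
    """Illustrative inverse of the aggregation sketch above."""
    n_local, two_inter, _ = gate_and_up.shape
    inter = two_inter // 2
    # e.g. fqn = "model.layers.3.mlp.experts.gate_and_up_projs" -> layer "3"
    layer = fqn.split(".")[2] if fqn.startswith("model.layers.") else "?"
    out: dict[str, torch.Tensor] = {}
    for local_e in range(n_local):
        e = first_expert_id + local_e
        out[f"layers.{layer}.ffn.experts.{e}.w1.weight"] = gate_and_up[local_e, :inter]
        out[f"layers.{layer}.ffn.experts.{e}.w3.weight"] = gate_and_up[local_e, inter:]
    return out
```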