nemo_automodel.components.models.deepseek_v4.state_dict_adapter

State dict adapter for DeepSeek V4.

HF V4 uses different key names compared to V3/V3.2. This adapter performs the necessary renaming on top of the standard FP8 dequantization and per-expert weight aggregation.

Key mapping (HF -> internal):

HF key	Internal key
embed.weight	model.embed_tokens.weight
norm.weight	model.norm.weight
head.weight	lm_head.weight
layers.{i}.attn_norm.weight	model.layers.{i}.input_layernorm.weight
layers.{i}.ffn_norm.weight	model.layers.{i}.post_attention_layernorm.weight
layers.{i}.attn.*	model.layers.{i}.self_attn.*
layers.{i}.ffn.gate.weight	model.layers.{i}.mlp.gate.weight
layers.{i}.ffn.gate.bias	model.layers.{i}.mlp.gate.e_score_correction_bias
layers.{i}.ffn.gate.tid2eid	model.layers.{i}.mlp.gate.tid2eid (hash layers only)
layers.{i}.ffn.shared_experts.w1.*	model.layers.{i}.mlp.shared_experts.gate_proj.*
layers.{i}.ffn.shared_experts.w3.*	model.layers.{i}.mlp.shared_experts.up_proj.*
layers.{i}.ffn.shared_experts.w2.*	model.layers.{i}.mlp.shared_experts.down_proj.*
layers.{i}.ffn.experts.{j}.w1.weight	aggregated into model.layers.{i}.mlp.experts.gate_and_up_projs
layers.{i}.ffn.experts.{j}.w3.weight	aggregated into model.layers.{i}.mlp.experts.gate_and_up_projs
layers.{i}.ffn.experts.{j}.w2.weight	aggregated into model.layers.{i}.mlp.experts.down_projs
layers.{i}.hc_attn_base/fn/scale	model.layers.{i}.hc_attn_base/fn/scale
layers.{i}.hc_ffn_base/fn/scale	model.layers.{i}.hc_ffn_base/fn/scale
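The key mapping can be pictured as an ordered list of (compiled pattern, replacement) pairs, which is the shape suggested by the `list[tuple[Pattern, str]]` annotation on `_HF_TO_INTERNAL_RENAMES`. The sketch below is only an illustration under that assumption: `RENAMES` and `rename_hf_key` are hypothetical names, only the first rule is visible in the module's truncated data listing, and the remaining rules are reconstructed from a few rows of the mapping rather than copied from the source.

```python
import re

# Hypothetical subset of the rename rules (not the module's real table).
# First match wins, so rule order matters.
RENAMES: list[tuple[re.Pattern, str]] = [
    (re.compile(r"^embed\.(.+)$"), r"model.embed_tokens.\1"),
    (re.compile(r"^head\.(.+)$"), r"lm_head.\1"),
    (re.compile(r"^layers\.(\d+)\.attn_norm\.(.+)$"),
     r"model.layers.\1.input_layernorm.\2"),
    (re.compile(r"^layers\.(\d+)\.ffn\.gate\.bias$"),
     r"model.layers.\1.mlp.gate.e_score_correction_bias"),
    (re.compile(r"^layers\.(\d+)\.ffn\.shared_experts\.w1\.(.+)$"),
     r"model.layers.\1.mlp.shared_experts.gate_proj.\2"),
    (re.compile(r"^layers\.(\d+)\.attn\.(.+)$"),
     r"model.layers.\1.self_attn.\2"),
]


def rename_hf_key(key: str) -> str:
    """Apply the first matching rule; return the key unchanged if none match."""
    for pattern, replacement in RENAMES:
        if pattern.match(key):
            return pattern.sub(replacement, key)
    return key
```

Keys that no rule covers, such as the `hc_attn_*` entries in a fuller table, would need their own rules; in this subset they simply pass through unchanged.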

FP8 note: HF V4 stores scale as <key>.scale (not <key>.weight_scale_inv like V3). Both suffixes are handled by the dequantization step.

Module Contents

Classes

Name	Description
`DeepSeekV4StateDictAdapter`	State dict adapter for DeepSeek V4.
`_ExpertQuantLayout`	On-disk routed-expert quantization layout for DeepSeek V4 checkpoints.
`_HashBiasScope`	Key-format scope for `DeepSeekV4StateDictAdapter._drop_hash_layer_gate_bias`.

Functions

Name	Description
`_rename_hf_key`	Apply simple rename rules; returns the key unchanged if no rule matches.

Data

FP4_COL_BLOCK

_EXPERT_PATTERN

_FP4_E2M1_TABLE

_HF_TO_INTERNAL_RENAMES

API

class nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.float32
)

Bases: StateDictAdapter

State dict adapter for DeepSeek V4.

_INTERNAL_TO_HF_RENAMES

list[tuple[Pattern, str]]

_NON_QUANTIZED_PATTERNS

_checkpoint_expert_quant_layout_cache

_ExpertQuantLayout | None = None

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._aggregate_experts(
    state_dict: dict[str, typing.Any],
    device_mesh: torch.distributed.device_mesh.DeviceMesh | None
) -> dict[str, typing.Any]

Aggregate per-expert weights (w1/w2/w3) into stacked gate_and_up/down tensors.
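Before any stacking can happen, the per-expert keys have to be grouped by layer and projection and ordered by expert index. The sketch below shows only that grouping step, using the same regular expression as `_EXPERT_PATTERN`; `group_expert_keys` is a hypothetical helper, and how the grouped tensors are then stacked or concatenated (axis order, transposition) is not described on this page, so it is left out.

```python
import re
from collections import defaultdict

# Same expression as the module-level _EXPERT_PATTERN.
EXPERT_PATTERN = re.compile(
    r"^layers\.(\d+)\.ffn\.experts\.(\d+)\.(w1|w2|w3)\.weight$"
)


def group_expert_keys(keys: list[str]) -> dict[tuple[int, str], list[str]]:
    """Map (layer, projection) to that slot's keys, sorted by expert index."""
    groups: dict[tuple[int, str], dict[int, str]] = defaultdict(dict)
    for key in keys:
        match = EXPERT_PATTERN.match(key)
        if match is None:
            continue  # not a routed-expert weight
        layer, expert, proj = int(match.group(1)), int(match.group(2)), match.group(3)
        groups[(layer, proj)][expert] = key
    # Sort numerically so expert 10 comes after expert 2.
    return {
        slot: [by_expert[e] for e in sorted(by_expert)]
        for slot, by_expert in groups.items()
    }
```

Per the key mapping, the `w1` and `w3` groups of a layer would feed `gate_and_up_projs` and the `w2` group would feed `down_projs`.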

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._build_fp4_expert_placeholders(
    value: typing.Any
) -> tuple[typing.Any, typing.Any]

staticmethod

Return (int8 packed weight, float8_e8m0fnu scale) placeholders whose shapes / dtypes match the on-disk V4 Flash routed-expert layout.

The current value is the dequantized bf16 tensor with shape [out, in]; the checkpoint tensor is int8 [out, in // 2] with an e8m0 scale [out, in // 32]. DCP only uses these placeholders for shape/dtype validation and as the destination buffer — contents are overwritten on load, so we build empty tensors instead of re-packing real data.
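The shape relationship described above can be written down directly. This is a minimal sketch of the arithmetic only: `fp4_placeholder_shapes` is a hypothetical helper, and rejecting an `in` dimension that is not a multiple of 32 is an assumption made here, not documented behaviour.

```python
FP4_COL_BLOCK = 32  # same value as the module-level constant


def fp4_placeholder_shapes(
    out_features: int, in_features: int
) -> tuple[tuple[int, int], tuple[int, int]]:
    """Return (packed int8 weight shape, e8m0 scale shape) for a [out, in] weight."""
    if in_features % FP4_COL_BLOCK != 0:
        # Assumption: the column count must divide evenly into 32-wide blocks.
        raise ValueError(f"in_features={in_features} is not a multiple of {FP4_COL_BLOCK}")
    packed_shape = (out_features, in_features // 2)  # two FP4 values per byte
    scale_shape = (out_features, in_features // FP4_COL_BLOCK)  # one scale per 32 columns
    return packed_shape, scale_shape
```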

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._build_fp8_dtensor_scale_placeholder(
    value: typing.Any
) -> typing.Any

staticmethod

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._build_fp8_expert_placeholders(
    value: typing.Any
) -> tuple[typing.Any, typing.Any]

staticmethod

Return placeholders for the DeepSeek V4 Base routed-expert FP8 layout.

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._build_fp8_global_scale_placeholder(
    value: typing.Any
) -> torch.Tensor

staticmethod

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._checkpoint_expert_quant_layout() -> nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._checkpoint_num_hash_layers() -> int

Read num_hash_layers directly from the checkpoint’s config.json.

We cannot rely on self.config.num_hash_layers alone: a YAML can legitimately override the model’s hash-layer count to 0 (e.g. to disable hash routing in the forward path), but the on-disk checkpoint still has its original value and therefore still omits gate.bias for the first num_hash_layers layers. To decide what to drop at load time we must know the checkpoint’s own value.
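Reading the value from disk might look like the sketch below. It is an illustration only: `read_checkpoint_num_hash_layers` is a hypothetical function, and both the way the checkpoint directory is located and the fallback to a default when the file or field is missing are assumptions.

```python
import json
from pathlib import Path


def read_checkpoint_num_hash_layers(checkpoint_dir: str, default: int = 0) -> int:
    """Return num_hash_layers from <checkpoint_dir>/config.json, or `default`."""
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return default  # assumption: a missing file means "no hash layers known"
    with config_path.open() as f:
        config = json.load(f)
    return int(config.get("num_hash_layers", default))
```

The point of the separate lookup is that this value describes what is on disk, independently of any override applied to the in-memory model config.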

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._dequantize(
    state_dict: dict[str, typing.Any]
) -> dict[str, typing.Any]

Dequantize FP8 weights. Handles both .scale and _scale_inv suffixes.
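Handling both suffixes amounts to looking up the companion scale tensor under either name. The sketch below shows one way that lookup could work; `find_scale_key` is a hypothetical helper, and the preference order between the two suffixes is an assumption.

```python
from typing import Any, Optional

# V4-style suffix first, then the V3-style suffix.
SCALE_SUFFIXES = (".scale", ".weight_scale_inv")


def find_scale_key(weight_key: str, state_dict: dict[str, Any]) -> Optional[str]:
    """Return the scale key paired with `weight_key`, or None if there is none."""
    if not weight_key.endswith(".weight"):
        return None
    prefix = weight_key[: -len(".weight")]
    for suffix in SCALE_SUFFIXES:
        candidate = prefix + suffix
        if candidate in state_dict:
            return candidate
    return None  # no scale present: the weight is treated as not quantized
```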

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._dequantize_expert_fp4(
    weight: torch.Tensor,
    scale: torch.Tensor,
    dtype: torch.dtype
) -> torch.Tensor

staticmethod

Unpack FP4 e2m1 packed-int8 weight and apply the per-row / 32-col e8m0 scale.

Packed layout: weight.int8 holds two FP4 values per byte — the low nibble at even column index, the high nibble at the following odd column — so the logical shape is [out, weight.size(-1) * 2].
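The packed layout can be illustrated on a single row with plain Python integers. This is a sketch, not the module's tensor code: the function names are hypothetical, the last entries of the lookup table follow the standard E2M1 value set because the listing of `_FP4_E2M1_TABLE` on this page is truncated, and decoding an e8m0 byte as 2 ** (byte - 127) is an assumption based on the usual E8M0 convention.

```python
FP4_COL_BLOCK = 32

# Index = 4-bit code. Values after -2.0 are assumed from the standard E2M1 set.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
               0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def unpack_fp4_row(packed_row: list[int]) -> list[float]:
    """Expand each int8 byte into two FP4 values (low nibble first)."""
    values: list[float] = []
    for byte in packed_row:
        byte &= 0xFF  # reinterpret a signed int8 as an unsigned byte
        values.append(E2M1_VALUES[byte & 0x0F])  # even column
        values.append(E2M1_VALUES[byte >> 4])    # following odd column
    return values


def e8m0_to_float(exponent_byte: int) -> float:
    """Assumed E8M0 decoding: a pure power-of-two scale with bias 127."""
    return 2.0 ** (exponent_byte - 127)


def dequantize_fp4_row(packed_row: list[int], scale_row: list[int]) -> list[float]:
    """Unpack one row and apply one scale per 32 logical columns."""
    values = unpack_fp4_row(packed_row)
    return [
        v * e8m0_to_float(scale_row[col // FP4_COL_BLOCK])
        for col, v in enumerate(values)
    ]
```

A row of 16 packed bytes therefore yields 32 logical columns sharing a single scale entry.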

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._dequantize_expert_weight(
    key: str,
    weight: torch.Tensor,
    scale: torch.Tensor
) -> torch.Tensor

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._detect_checkpoint_expert_quant_layout() -> nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._drop_hash_layer_gate_bias(
    state_dict: dict[str, typing.Any],
    scope: '_HashBiasScope'
) -> dict[str, typing.Any]

The first num_hash_layers layers use hash-clustering routing, and their HF checkpoint has no ffn.gate.bias / e_score_correction_bias tensor. The model side, however, creates the bias parameter uniformly for every layer (Automodel’s generic Gate always materializes it when gate_bias_update_factor > 0). Drop those bias keys before load so that DCP does not raise a "Missing key in checkpoint state_dict" error for them.

scope selects which key format to match — the pre-rename internal form (model.layers.{i}.mlp.gate.e_score_correction_bias) or the post-rename HF form (layers.{i}.ffn.gate.bias).
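The filtering itself is a small loop over the keys. The sketch below reuses the two regular expressions listed under `_HashBiasScope`; the function name and the explicit `num_hash_layers` argument are hypothetical simplifications of the method's real signature.

```python
import re
from typing import Any

# The two key formats listed under _HashBiasScope.
HF_BIAS = re.compile(r"^layers\.(\d+)\.ffn\.gate\.bias$")
INTERNAL_BIAS = re.compile(
    r"^model\.layers\.(\d+)\.mlp\.gate\.e_score_correction_bias$"
)


def drop_hash_layer_gate_bias(
    state_dict: dict[str, Any], pattern: re.Pattern, num_hash_layers: int
) -> dict[str, Any]:
    """Return a copy without gate-bias keys for layers below num_hash_layers."""
    kept: dict[str, Any] = {}
    for key, value in state_dict.items():
        match = pattern.match(key)
        if match is not None and int(match.group(1)) < num_hash_layers:
            continue  # hash layer: the checkpoint has no bias tensor for it
        kept[key] = value
    return kept
```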

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._empty_or_cast_fp8(
    value: torch.Tensor
) -> torch.Tensor

staticmethod

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._expert_quant_layout_from_tensors(
    weight: torch.Tensor,
    scale: torch.Tensor
) -> nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._expert_scale_shape(
    weight: torch.Tensor
) -> tuple[int, int]

Scale shape for an FP4 routed-expert weight tensor.

The weight argument should be the unpacked tensor (in the model-side state dict, experts are already materialized at full dtype), so its last dim is the true in dim and the scale has in // 32 columns.

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._internal_key_to_hf(
    key: str
) -> str

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._is_expert_weight_key(
    key: str
) -> bool

staticmethod

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._is_non_quantized(
    hf_key: str
) -> bool

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._rename_all(
    state_dict: dict[str, typing.Any]
) -> dict[str, typing.Any]

Apply the HF->internal rename table to every key.

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._scale_shape(
    weight: torch.Tensor
) -> tuple[int, int]

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._scale_shape_from_shape(
    shape: torch.Size | tuple[int, ...]
) -> tuple[int, int]

staticmethod

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._split_merged_expert(
    fqn: str,
    tensor: typing.Any
) -> list[tuple[str, typing.Any]]

Inverse of expert aggregation: split gate_and_up/down stacks into per-expert keys.

Handles DTensor inputs (EP-sharded) by working on the local shard only, emitting keys only for the experts owned by the current rank.
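Emitting keys only for locally owned experts requires mapping each local slot to a global expert index. The sketch below assumes experts are split into equal contiguous ranges across expert-parallel ranks, which this page does not confirm; both helper names are hypothetical, and only the `down_projs` to `w2` direction is shown because the internal layout of `gate_and_up_projs` is not documented here.

```python
from typing import Any


def owned_expert_ids(n_experts: int, ep_size: int, ep_rank: int) -> list[int]:
    """Global expert indices held by `ep_rank` (assumes contiguous, even split)."""
    if n_experts % ep_size != 0:
        raise ValueError("n_experts must be divisible by ep_size")
    per_rank = n_experts // ep_size
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))


def per_expert_down_keys(
    layer: int, local_shard: list[Any], n_experts: int, ep_size: int, ep_rank: int
) -> list[tuple[str, Any]]:
    """Pair each locally held expert slice with its HF-style w2 key."""
    expert_ids = owned_expert_ids(n_experts, ep_size, ep_rank)
    if len(local_shard) != len(expert_ids):
        raise ValueError("local shard size does not match the owned expert count")
    return [
        (f"layers.{layer}.ffn.experts.{expert_id}.w2.weight", local_shard[slot])
        for slot, expert_id in enumerate(expert_ids)
    ]
```

With 8 experts over 4 ranks, rank 2 would emit keys for experts 4 and 5 only, so the ranks jointly cover every expert without overlap.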

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
    kwargs = {}
) -> dict[str, typing.Any]

Convert HF checkpoint to internal format.

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: str | None = None,
    quantization: bool = False,
    kwargs = {}
) -> dict[str, typing.Any]

Convert internal state dict to HF V4 format.

Splits stacked expert weights back to per-expert w1/w2/w3 tensors, applies key renaming in reverse, and optionally quantizes to FP8.

class nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout

Bases: enum.Enum

On-disk routed-expert quantization layout for DeepSeek V4 checkpoints.

FP4

= 'fp4'

FP8

= 'fp8'

class nemo_automodel.components.models.deepseek_v4.state_dict_adapter._HashBiasScope

Bases: enum.Enum

Key-format scope for DeepSeekV4StateDictAdapter._drop_hash_layer_gate_bias.

= re.compile('^layers\\.(\\d+)\\.ffn\\.gate\\.bias$')

INTERNAL

= re.compile('^model\\.layers\\.(\\d+)\\.mlp\\.gate\\.e_score_correction_bias$')

nemo_automodel.components.models.deepseek_v4.state_dict_adapter._rename_hf_key(
    key: str
) -> str

Apply simple rename rules; returns the key unchanged if no rule matches.

nemo_automodel.components.models.deepseek_v4.state_dict_adapter.FP4_COL_BLOCK = 32

nemo_automodel.components.models.deepseek_v4.state_dict_adapter._EXPERT_PATTERN = re.compile('^layers\\.(\\d+)\\.ffn\\.experts\\.(\\d+)\\.(w1|w2|w3)\\.weight$')

nemo_automodel.components.models.deepseek_v4.state_dict_adapter._FP4_E2M1_TABLE = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 0.0, -0.5, -1.0, -1.5, -2....

nemo_automodel.components.models.deepseek_v4.state_dict_adapter._HF_TO_INTERNAL_RENAMES: list[tuple[Pattern, str]] = [(re.compile('^embed\\.(.+)$'), 'model.embed_tokens.\\1'), (re.compile('^norm\\....