# nemo_automodel.components.models.deepseek_v4.state_dict_adapter

State dict adapter for DeepSeek V4.

HF V4 uses different key names compared to V3/V3.2.  This adapter performs
the necessary renaming on top of the standard FP8 dequantization and
per-expert weight aggregation.

Key mapping (HF -> internal):

| HF key                                  | Internal key                                                      |
| --------------------------------------- | ----------------------------------------------------------------- |
| `embed.weight`                          | `model.embed_tokens.weight`                                       |
| `norm.weight`                           | `model.norm.weight`                                               |
| `head.weight`                           | `lm_head.weight`                                                  |
| `layers.{i}.attn_norm.weight`           | `model.layers.{i}.input_layernorm.weight`                         |
| `layers.{i}.ffn_norm.weight`            | `model.layers.{i}.post_attention_layernorm.weight`                |
| `layers.{i}.attn.*`                     | `model.layers.{i}.self_attn.*`                                    |
| `layers.{i}.ffn.gate.weight`            | `model.layers.{i}.mlp.gate.weight`                                |
| `layers.{i}.ffn.gate.bias`              | `model.layers.{i}.mlp.gate.e_score_correction_bias`               |
| `layers.{i}.ffn.gate.tid2eid`           | `model.layers.{i}.mlp.gate.tid2eid` (hash layers only)            |
| `layers.{i}.ffn.shared_experts.w1.*`    | `model.layers.{i}.mlp.shared_experts.gate_proj.*`                 |
| `layers.{i}.ffn.shared_experts.w3.*`    | `model.layers.{i}.mlp.shared_experts.up_proj.*`                   |
| `layers.{i}.ffn.shared_experts.w2.*`    | `model.layers.{i}.mlp.shared_experts.down_proj.*`                 |
| `layers.{i}.ffn.experts.{j}.w1.weight`  | aggregated into `model.layers.{i}.mlp.experts.gate_and_up_projs`  |
| `layers.{i}.ffn.experts.{j}.w3.weight`  | aggregated into `model.layers.{i}.mlp.experts.gate_and_up_projs`  |
| `layers.{i}.ffn.experts.{j}.w2.weight`  | aggregated into `model.layers.{i}.mlp.experts.down_projs`         |
| `layers.{i}.hc_attn_base/fn/scale`      | `model.layers.{i}.hc_attn_base/fn/scale`                          |
| `layers.{i}.hc_ffn_base/fn/scale`       | `model.layers.{i}.hc_ffn_base/fn/scale`                           |

FP8 note: HF V4 stores scale as `<key>.scale` (not `<key>.weight_scale_inv` like V3).
Both suffixes are handled by the dequantization step.

## Module Contents

### Classes

| Name                                                                                                                        | Description                                                                         |
| --------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| [`DeepSeekV4StateDictAdapter`](#nemo_automodel-components-models-deepseek_v4-state_dict_adapter-DeepSeekV4StateDictAdapter) | State dict adapter for DeepSeek V4.                                                 |
| [`_ExpertQuantLayout`](#nemo_automodel-components-models-deepseek_v4-state_dict_adapter-_ExpertQuantLayout)                 | On-disk routed-expert quantization layout for DeepSeek V4 checkpoints.              |
| [`_HashBiasScope`](#nemo_automodel-components-models-deepseek_v4-state_dict_adapter-_HashBiasScope)                         | Key-format scope for `DeepSeekV4StateDictAdapter._drop_hash_layer_gate_bias`.       |

### Functions

| Name                                                                                                | Description                                                              |
| --------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| [`_rename_hf_key`](#nemo_automodel-components-models-deepseek_v4-state_dict_adapter-_rename_hf_key) | Apply simple rename rules; returns the key unchanged if no rule matches. |

### Data

[`FP4_COL_BLOCK`](#nemo_automodel-components-models-deepseek_v4-state_dict_adapter-FP4_COL_BLOCK)

[`_EXPERT_PATTERN`](#nemo_automodel-components-models-deepseek_v4-state_dict_adapter-_EXPERT_PATTERN)

[`_FP4_E2M1_TABLE`](#nemo_automodel-components-models-deepseek_v4-state_dict_adapter-_FP4_E2M1_TABLE)

[`_HF_TO_INTERNAL_RENAMES`](#nemo_automodel-components-models-deepseek_v4-state_dict_adapter-_HF_TO_INTERNAL_RENAMES)

### API

```python
class nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.float32
)
```

**Bases:** [StateDictAdapter](/nemo-automodel/nemo_automodel/components/checkpoint/state_dict_adapter#nemo_automodel-components-checkpoint-state_dict_adapter-StateDictAdapter)

State dict adapter for DeepSeek V4.

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._aggregate_experts(
    state_dict: dict[str, typing.Any],
    device_mesh: torch.distributed.device_mesh.DeviceMesh | None
) -> dict[str, typing.Any]
```

Aggregate per-expert weights (w1/w2/w3) into stacked gate\_and\_up/down tensors.
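Before any tensors can be stacked, the per-expert keys have to be collected per layer and projection, in expert order. The following is a minimal sketch of that bookkeeping step only. The regular expression is copied from the documented `_EXPERT_PATTERN` constant; `group_expert_keys` is a hypothetical helper, and the actual stacking (concatenation order, any transposition) is not covered because it is not specified on this page.

```python
import re
from collections import defaultdict

# Same pattern as the documented _EXPERT_PATTERN constant.
EXPERT_PATTERN = re.compile(r"^layers\.(\d+)\.ffn\.experts\.(\d+)\.(w1|w2|w3)\.weight$")


def group_expert_keys(keys):
    """Map (layer, projection) -> list of keys ordered by expert index.

    Hypothetical helper: illustrates the grouping only, not the stacking.
    """
    groups = defaultdict(dict)
    for key in keys:
        match = EXPERT_PATTERN.match(key)
        if match is None:
            continue  # not a routed-expert weight, leave it to other steps
        layer = int(match.group(1))
        expert = int(match.group(2))
        proj = match.group(3)
        groups[(layer, proj)][expert] = key
    # Sort numerically so expert 2 comes before expert 10.
    return {
        group: [by_expert[e] for e in sorted(by_expert)]
        for group, by_expert in groups.items()
    }
```

Sorting on the parsed integer rather than the key string matters once there are ten or more experts, since a lexicographic sort would place `experts.10` before `experts.2`.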

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._build_fp4_expert_placeholders(
    value: typing.Any
) -> tuple[typing.Any, typing.Any]
```

staticmethod

Return (int8 packed weight, float8\_e8m0fnu scale) placeholders whose
shapes / dtypes match the on-disk V4 Flash routed-expert layout.

The current `value` is the dequantized bf16 tensor with shape \[out, in];
the checkpoint tensor is int8 \[out, in // 2] with an e8m0 scale
\[out, in // 32].  DCP only uses these placeholders for shape/dtype
validation and as the destination buffer — contents are overwritten on
load, so we build empty tensors instead of re-packing real data.
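The shape relationship described above can be checked with plain arithmetic. This sketch assumes only what the docstring states (two FP4 values per int8 byte, one scale per 32 input columns); `fp4_placeholder_shapes` is a hypothetical helper, and the divisibility check is an added assumption rather than documented behavior.

```python
FP4_COL_BLOCK = 32  # matches the documented module constant


def fp4_placeholder_shapes(out_features: int, in_features: int):
    """Return (packed_weight_shape, scale_shape) for a logical [out, in] weight."""
    # Assumption of this sketch: the input dim is a whole number of scale blocks.
    if in_features % FP4_COL_BLOCK != 0:
        raise ValueError("in_features must be a multiple of FP4_COL_BLOCK")
    packed = (out_features, in_features // 2)  # two FP4 values per int8 byte
    scale = (out_features, in_features // FP4_COL_BLOCK)  # one scale per 32 columns
    return packed, scale
```

For a logical `[2048, 4096]` weight this gives a packed shape of `(2048, 2048)` and a scale shape of `(2048, 128)`.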

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._build_fp8_dtensor_scale_placeholder(
    value: typing.Any
) -> typing.Any
```

staticmethod

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._build_fp8_expert_placeholders(
    value: typing.Any
) -> tuple[typing.Any, typing.Any]
```

staticmethod

Return placeholders for the DeepSeek V4 Base routed-expert FP8 layout.

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._build_fp8_global_scale_placeholder(
    value: typing.Any
) -> torch.Tensor
```

staticmethod

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._checkpoint_expert_quant_layout() -> nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._checkpoint_num_hash_layers() -> int
```

Read `num_hash_layers` directly from the checkpoint's config.json.

We cannot rely on `self.config.num_hash_layers` alone: a YAML can
legitimately override the model's hash-layer count to 0 (e.g. to
disable hash routing in the forward path), but the on-disk checkpoint
still has its original value and therefore still omits gate.bias for
the first `num_hash_layers` layers.  To decide what to drop at load
time we must know the checkpoint's own value.
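A minimal sketch of reading the value straight from disk, independent of any in-memory config override. The function name `read_checkpoint_num_hash_layers` and the fallback to `0` when the file or field is absent are assumptions of this example, not documented adapter behavior.

```python
import json
from pathlib import Path


def read_checkpoint_num_hash_layers(checkpoint_dir: str) -> int:
    """Read num_hash_layers from <checkpoint_dir>/config.json.

    Returning 0 for a missing file or field is an assumption of this sketch.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return 0
    with config_path.open() as f:
        config = json.load(f)
    return int(config.get("num_hash_layers", 0))
```

Keeping this lookup separate from `self.config` is what lets a YAML override change forward-path behavior without changing which keys are expected in the checkpoint.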

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._dequantize(
    state_dict: dict[str, typing.Any]
) -> dict[str, typing.Any]
```

Dequantize FP8 weights.  Handles both `.scale` and `_scale_inv` suffixes.
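One way to support both suffixes is to probe for each candidate scale key next to a weight. The sketch below assumes that `<key>` in the FP8 note means the weight key without its trailing `.weight`; `find_scale_key` and `SCALE_SUFFIXES` are hypothetical names, and the real method may locate scales differently.

```python
# Candidate scale suffixes: V4-style first, then V3-style.
SCALE_SUFFIXES = (".scale", ".weight_scale_inv")


def find_scale_key(weight_key, state_dict):
    """Return the scale key paired with weight_key, or None if there is none."""
    if not weight_key.endswith(".weight"):
        return None
    prefix = weight_key[: -len(".weight")]  # assumed meaning of <key>
    for suffix in SCALE_SUFFIXES:
        candidate = prefix + suffix
        if candidate in state_dict:
            return candidate
    return None  # no scale found: treat the weight as unquantized
```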

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._dequantize_expert_fp4(
    weight: torch.Tensor,
    scale: torch.Tensor,
    dtype: torch.dtype
) -> torch.Tensor
```

staticmethod

Unpack FP4 e2m1 packed-int8 weight and apply the per-row / 32-col e8m0 scale.

Packed layout: `weight.int8` holds two FP4 values per byte — the low nibble
at even column index, the high nibble at the following odd column — so the
logical shape is `[out, weight.size(-1) * 2]`.
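The nibble layout and block scaling can be illustrated on a single row with plain Python lists. This is a sketch under stated assumptions, not the module's implementation: the lookup table uses the standard FP4 E2M1 value set (assumed to match `_FP4_E2M1_TABLE`, whose value is truncated on this page), the scale is decoded with the usual e8m0 rule `2 ** (e - 127)`, and the NaN encoding `0xFF` is not handled.

```python
# FP4 E2M1 values indexed by 4-bit code (bit 3 is the sign bit).
# Assumption: the tail of the module's truncated table follows the standard set.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
               0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
COL_BLOCK = 32  # logical columns sharing one scale


def dequantize_fp4_row(packed_bytes, scale_exponents):
    """Dequantize one row of packed FP4 data.

    packed_bytes: ints in [0, 255], two FP4 codes per byte.
    scale_exponents: raw e8m0 bytes, one per 32 logical columns.
    """
    row = []
    for byte in packed_bytes:
        row.append(E2M1_VALUES[byte & 0x0F])         # low nibble -> even column
        row.append(E2M1_VALUES[(byte >> 4) & 0x0F])  # high nibble -> odd column
    # e8m0 stores only a biased exponent, so the scale is a power of two.
    return [
        value * 2.0 ** (scale_exponents[col // COL_BLOCK] - 127)
        for col, value in enumerate(row)
    ]
```

With real int8 storage the bytes may be negative, so a tensor implementation would reinterpret them as unsigned (or mask with `0xFF`) before extracting nibbles.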

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._dequantize_expert_weight(
    key: str,
    weight: torch.Tensor,
    scale: torch.Tensor
) -> torch.Tensor
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._detect_checkpoint_expert_quant_layout() -> nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._drop_hash_layer_gate_bias(
    state_dict: dict[str, typing.Any],
    scope: '_HashBiasScope'
) -> dict[str, typing.Any]
```

The first `num_hash_layers` layers use hash-clustering routing and
their HF checkpoint has no `ffn.gate.bias` / `e_score_correction_bias`
tensor.  The model side, however, creates the bias parameter uniformly
for every layer (Automodel's generic Gate always materializes it when
`gate_bias_update_factor > 0`).  Drop those bias keys before load so
DCP does not raise `Missing key in checkpoint state_dict` for them.

`scope` selects which key format to match — the pre-rename internal
form (`model.layers.{i}.mlp.gate.e_score_correction_bias`) or the
post-rename HF form (`layers.{i}.ffn.gate.bias`).
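The filtering itself reduces to matching the layer index in the key against the checkpoint's hash-layer count. In this sketch the two compiled patterns stand in for the `scope` argument and are hypothetical, as is the function signature; only the two key formats come from the description above.

```python
import re

# Hypothetical patterns for the two documented key formats.
INTERNAL_BIAS = re.compile(r"^model\.layers\.(\d+)\.mlp\.gate\.e_score_correction_bias$")
HF_BIAS = re.compile(r"^layers\.(\d+)\.ffn\.gate\.bias$")


def drop_hash_layer_gate_bias(state_dict, num_hash_layers, pattern):
    """Return a copy without gate-bias keys for layers [0, num_hash_layers)."""
    kept = {}
    for key, value in state_dict.items():
        match = pattern.match(key)
        if match is not None and int(match.group(1)) < num_hash_layers:
            continue  # hash layer: the checkpoint has no tensor for this key
        kept[key] = value
    return kept
```

Passing `num_hash_layers=0` leaves the state dict unchanged, which is why the count has to come from the checkpoint rather than from a possibly overridden model config.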

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._empty_or_cast_fp8(
    value: torch.Tensor
) -> torch.Tensor
```

staticmethod

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._expert_quant_layout_from_tensors(
    weight: torch.Tensor,
    scale: torch.Tensor
) -> nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._expert_scale_shape(
    weight: torch.Tensor
) -> tuple[int, int]
```

Scale shape for an FP4 routed-expert weight tensor.

The weight argument should be the *unpacked* tensor (in the model-side
state dict, experts are already materialized at full dtype), so its
last dim is the true `in` dim and the scale has `in // 32` columns.

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._internal_key_to_hf(
    key: str
) -> str
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._is_expert_weight_key(
    key: str
) -> bool
```

staticmethod

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._is_non_quantized(
    hf_key: str
) -> bool
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._rename_all(
    state_dict: dict[str, typing.Any]
) -> dict[str, typing.Any]
```

Apply the HF->internal rename table to every key.

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._scale_shape(
    weight: torch.Tensor
) -> tuple[int, int]
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._scale_shape_from_shape(
    shape: torch.Size | tuple[int, ...]
) -> tuple[int, int]
```

staticmethod

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter._split_merged_expert(
    fqn: str,
    tensor: typing.Any
) -> list[tuple[str, typing.Any]]
```

Inverse of expert aggregation: split gate\_and\_up/down stacks into per-expert keys.

Handles DTensor inputs (EP-sharded) by working on the local shard only,
emitting keys only for the experts owned by the current rank.
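To illustrate what "keys only for the experts owned by the current rank" means, the sketch below computes the per-expert HF key names for one rank. It assumes experts are sharded contiguously and evenly across expert-parallel ranks, which this page does not state; `local_expert_keys` and its parameters are hypothetical.

```python
def local_expert_keys(layer, proj_names, ep_rank, experts_per_rank):
    """List the HF per-expert weight keys owned by one expert-parallel rank.

    Assumption: rank r owns the contiguous global expert range
    [r * experts_per_rank, (r + 1) * experts_per_rank).
    """
    start = ep_rank * experts_per_rank
    return [
        f"layers.{layer}.ffn.experts.{expert}.{proj}.weight"
        for expert in range(start, start + experts_per_rank)
        for proj in proj_names
    ]
```

A stacked `down_projs` tensor would use `proj_names=("w2",)`, while a stacked `gate_and_up_projs` tensor would use `("w1", "w3")`, following the key mapping for routed experts.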

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
    kwargs = {}
) -> dict[str, typing.Any]
```

Convert HF checkpoint to internal format.

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.DeepSeekV4StateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: str | None = None,
    quantization: bool = False,
    kwargs = {}
) -> dict[str, typing.Any]
```

Convert internal state dict to HF V4 format.

Splits stacked expert weights back to per-expert w1/w2/w3 tensors,
applies key renaming in reverse, and optionally quantizes to FP8.

```python
class nemo_automodel.components.models.deepseek_v4.state_dict_adapter._ExpertQuantLayout
```

**Bases:** `enum.Enum`

On-disk routed-expert quantization layout for DeepSeek V4 checkpoints.

```python
class nemo_automodel.components.models.deepseek_v4.state_dict_adapter._HashBiasScope
```

**Bases:** `enum.Enum`

Key-format scope for `DeepSeekV4StateDictAdapter._drop_hash_layer_gate_bias`.

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter._rename_hf_key(
    key: str
) -> str
```

Apply simple rename rules; returns the key unchanged if no rule matches.
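A minimal sketch of regex-based renaming with pass-through for unmatched keys. The rule list is a hypothetical three-entry subset modeled on the documented key mapping, not the module's actual `_HF_TO_INTERNAL_RENAMES` table, and applying only the first matching rule is an assumption of this example.

```python
import re

# Hypothetical subset of rename rules, modeled on the documented key mapping.
EXAMPLE_RENAMES = [
    (re.compile(r"^embed\.(.+)$"), r"model.embed_tokens.\1"),
    (re.compile(r"^head\.(.+)$"), r"lm_head.\1"),
    (re.compile(r"^layers\.(\d+)\.attn_norm\.(.+)$"), r"model.layers.\1.input_layernorm.\2"),
]


def rename_key(key: str) -> str:
    """Apply the first matching rule; return the key unchanged otherwise."""
    for pattern, replacement in EXAMPLE_RENAMES:
        if pattern.match(key):
            return pattern.sub(replacement, key)
    return key
```

Returning unmatched keys as they are lets tensors with no rule in the table pass through without a dedicated entry.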

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter.FP4_COL_BLOCK = 32
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter._EXPERT_PATTERN = re.compile('^layers\\.(\\d+)\\.ffn\\.experts\\.(\\d+)\\.(w1|w2|w3)\\.weight$')
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter._FP4_E2M1_TABLE = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 0.0, -0.5, -1.0, -1.5, -2....
```

```python
nemo_automodel.components.models.deepseek_v4.state_dict_adapter._HF_TO_INTERNAL_RENAMES: list[tuple[Pattern, str]] = [(re.compile('^embed\\.(.+)$'), 'model.embed_tokens.\\1'), (re.compile('^norm\\....
```