# nemo_automodel.components.models.mistral4.state_dict_adapter

## Module Contents

### Classes

| Name                                                                                                                                     | Description                                                                          |
| ---------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| [`Mistral4MultimodalStateDictAdapter`](#nemo_automodel-components-models-mistral4-state_dict_adapter-Mistral4MultimodalStateDictAdapter) | State dict adapter for the full **multimodal** Mistral 4 (ForConditionalGeneration). |
| [`Mistral4StateDictAdapter`](#nemo_automodel-components-models-mistral4-state_dict_adapter-Mistral4StateDictAdapter)                     | State dict adapter for Mistral 4 **text-only** (CausalLM).                           |

### Functions

| Name                                                                                                                       | Description                                                                      |
| -------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
| [`_convert_aggregated_experts`](#nemo_automodel-components-models-mistral4-state_dict_adapter-_convert_aggregated_experts) | Convert aggregated expert weights from HF format to native format.               |
| [`_dequantize_state_dict`](#nemo_automodel-components-models-mistral4-state_dict_adapter-_dequantize_state_dict)           | Dequantize FP8 weights in-place. Handles both per-tensor and block-wise formats. |
| [`_inject_missing_gate_bias`](#nemo_automodel-components-models-mistral4-state_dict_adapter-_inject_missing_gate_bias)     | Inject zero `e_score_correction_bias` for MoE layers that lack it.               |
| [`_should_quantize_key`](#nemo_automodel-components-models-mistral4-state_dict_adapter-_should_quantize_key)               | Check if a key should be quantized based on its name.                            |

### Data

[`_HF_PREFIX`](#nemo_automodel-components-models-mistral4-state_dict_adapter-_HF_PREFIX)

[`_NON_QUANTIZED_PATTERNS`](#nemo_automodel-components-models-mistral4-state_dict_adapter-_NON_QUANTIZED_PATTERNS)

[`logger`](#nemo_automodel-components-models-mistral4-state_dict_adapter-logger)

### API

```python
class nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4MultimodalStateDictAdapter(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.float32
)
```

**Bases:** [StateDictAdapter](/nemo-automodel/nemo_automodel/components/checkpoint/state_dict_adapter#nemo_automodel-components-checkpoint-state_dict_adapter-StateDictAdapter)

State dict adapter for the full **multimodal** Mistral 4 (ForConditionalGeneration).

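Checkpoint key prefixes map to native model key prefixes as follows:

| Checkpoint key prefix      | Native model key prefix         | Component     |
| -------------------------- | ------------------------------- | ------------- |
| `language_model.model.X`   | `model.language_model.X`        | Text backbone |
| `language_model.lm_head.X` | `lm_head.X`                     | LM head       |
| `vision_tower.X`           | `model.vision_tower.X`          | Pixtral       |
| `multi_modal_projector.X`  | `model.multi_modal_projector.X` |               |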

FP8 dequantization is applied only to text-model weights (vision/projector are not quantized).
Expert weights are converted from aggregated 3D format to native format.

```python
nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4MultimodalStateDictAdapter._remap_keys_from_hf(
    state_dict: dict[str, typing.Any]
) -> dict[str, typing.Any]
```

Remap checkpoint keys to native model keys.

```python
nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4MultimodalStateDictAdapter._remap_keys_to_hf(
    key: str
) -> str
```

Remap a single native key back to checkpoint format.
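The two remapping helpers can be read as inverse prefix substitutions over the prefix mapping given in the class description. The following is a minimal sketch under that reading, not the adapter's actual code: the function names, the pass-through behavior for unmatched keys, and the example key `layers.0.self_attn.q_proj.weight` are all assumptions made for illustration.

```python
# Prefix pairs copied from the class description: (checkpoint, native).
PREFIX_PAIRS = [
    ("language_model.model.", "model.language_model."),
    ("language_model.lm_head.", "lm_head."),
    ("vision_tower.", "model.vision_tower."),
    ("multi_modal_projector.", "model.multi_modal_projector."),
]


def _swap_prefix(key: str, pairs: list) -> str:
    """Replace the first matching leading prefix; hypothetical helper."""
    for src, dst in pairs:
        if key.startswith(src):
            return dst + key[len(src):]
    # Assumption: keys matching no known prefix pass through unchanged.
    return key


def remap_key_from_hf(key: str) -> str:
    """Checkpoint key -> native key (illustrative stand-in)."""
    return _swap_prefix(key, PREFIX_PAIRS)


def remap_key_to_hf(key: str) -> str:
    """Native key -> checkpoint key (illustrative stand-in)."""
    return _swap_prefix(key, [(dst, src) for src, dst in PREFIX_PAIRS])


native = remap_key_from_hf("language_model.model.layers.0.self_attn.q_proj.weight")
restored = remap_key_to_hf(native)
```

In this sketch no source prefix is a prefix of another, so the order of the pairs does not affect the result, and applying the two functions in sequence returns the original key.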

```python
nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4MultimodalStateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]
```

```python
nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4MultimodalStateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
    kwargs = {}
) -> dict[str, typing.Any]
```

Convert HF checkpoint to native format.

Pipeline:

1. Remap checkpoint keys to native model keys
2. Dequantize FP8 weights (text model only; vision/projector are not quantized)
3. Convert aggregated expert weights to native format
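The pipeline above is an ordered chain of state-dict transformations, each consuming the output of the previous one. The sketch below shows only that chaining pattern; the stages are hypothetical stand-ins that record their names, and `run_pipeline` and `make_stage` are not part of the module.

```python
from functools import reduce
from typing import Any, Callable

StateDict = dict[str, Any]
Stage = Callable[[StateDict], StateDict]


def run_pipeline(state_dict: StateDict, stages: list[Stage]) -> StateDict:
    """Apply each stage to the result of the previous one, in list order."""
    return reduce(lambda sd, stage: stage(sd), stages, state_dict)


def make_stage(name: str, log: list[str]) -> Stage:
    """Build a stand-in stage that records its name and returns its input."""
    def stage(sd: StateDict) -> StateDict:
        log.append(name)
        return sd
    return stage


log: list[str] = []
stages = [
    make_stage("remap_keys", log),       # step 1
    make_stage("dequantize_fp8", log),   # step 2
    make_stage("convert_experts", log),  # step 3
]
result = run_pipeline({"language_model.lm_head.weight": 1.0}, stages)
```

After the call, `log` holds the stage names in the order they ran, which matches the numbered steps.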

```python
nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4MultimodalStateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: typing.Optional[str] = None,
    quantization: bool = False,
    kwargs = {}
) -> dict[str, typing.Any]
```

```python
class nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4StateDictAdapter(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.float32
)
```

**Bases:** [StateDictAdapter](/nemo-automodel/nemo_automodel/components/checkpoint/state_dict_adapter#nemo_automodel-components-checkpoint-state_dict_adapter-StateDictAdapter)

State dict adapter for Mistral 4 **text-only** (CausalLM).

Handles:

1. Stripping `language_model.` prefix from HF keys
2. FP8 dequantization (per-tensor and block-wise)
3. Aggregated expert weight conversion (3D tensors → native format)
4. Removing activation scale keys

```python
nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4StateDictAdapter._strip_prefix(
    state_dict: dict[str, typing.Any]
) -> dict[str, typing.Any]
```

Strip `language_model.` prefix from all keys.
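A minimal sketch of the prefix stripping, assuming that keys without the prefix are left unchanged (the source does not say how such keys are treated). `strip_prefix` is a hypothetical stand-in; the prefix value is the documented `_HF_PREFIX`.

```python
HF_PREFIX = "language_model."  # documented value of _HF_PREFIX


def strip_prefix(state_dict: dict) -> dict:
    """Return a new dict with the leading HF prefix removed from each key."""
    stripped = {}
    for key, value in state_dict.items():
        # Assumption: keys that lack the prefix are kept as-is.
        new_key = key[len(HF_PREFIX):] if key.startswith(HF_PREFIX) else key
        stripped[new_key] = value
    return stripped


example = strip_prefix({
    "language_model.model.norm.weight": 1.0,
    "already.stripped.weight": 2.0,
})
```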

```python
nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4StateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]
```

```python
nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4StateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
    kwargs = {}
) -> dict[str, typing.Any]
```

```python
nemo_automodel.components.models.mistral4.state_dict_adapter.Mistral4StateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: typing.Optional[str] = None,
    quantization: bool = False,
    kwargs = {}
) -> dict[str, typing.Any]
```

```python
nemo_automodel.components.models.mistral4.state_dict_adapter._convert_aggregated_experts(
    state_dict: dict[str, typing.Any]
) -> dict[str, typing.Any]
```

Convert aggregated expert weights from HF format to native format.

HF format (aggregated 3D tensors):

* `mlp.experts.gate_up_proj`: `[128, 2*moe_inter_dim, hidden_size]`
* `mlp.experts.down_proj`: `[128, hidden_size, moe_inter_dim]`
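The aggregated shapes above can be checked before conversion. The sketch below only does shape bookkeeping and does not reproduce the conversion itself, since the native layout is not described here. The helper names and the example dimensions (`moe_inter_dim=4`, `hidden_size=8`) are hypothetical.

```python
def expected_expert_shapes(n_experts: int, moe_inter_dim: int, hidden_size: int) -> dict:
    """Aggregated HF shapes, keyed by the key suffix they apply to."""
    return {
        "mlp.experts.gate_up_proj": (n_experts, 2 * moe_inter_dim, hidden_size),
        "mlp.experts.down_proj": (n_experts, hidden_size, moe_inter_dim),
    }


def find_shape_mismatches(shapes_by_key: dict, n_experts: int,
                          moe_inter_dim: int, hidden_size: int) -> list:
    """Return expert keys whose shape differs from the aggregated layout."""
    expected = expected_expert_shapes(n_experts, moe_inter_dim, hidden_size)
    mismatched = []
    for key, shape in shapes_by_key.items():
        for suffix, want in expected.items():
            if key.endswith(suffix) and tuple(shape) != want:
                mismatched.append(key)
    return mismatched


# Hypothetical shapes for one layer; the second entry has swapped axes.
shapes = {
    "model.layers.0.mlp.experts.gate_up_proj": (128, 8, 8),
    "model.layers.0.mlp.experts.down_proj": (128, 4, 8),
}
bad = find_shape_mismatches(shapes, n_experts=128, moe_inter_dim=4, hidden_size=8)
```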

```python
nemo_automodel.components.models.mistral4.state_dict_adapter._dequantize_state_dict(
    state_dict: dict[str, typing.Any],
    dtype: torch.dtype
) -> dict[str, typing.Any]
```

Dequantize FP8 weights in-place. Handles both per-tensor and block-wise formats.

The Mistral 4 HF checkpoint uses two FP8 key patterns:

* Standard weights: `*.weight` + `*.weight_scale_inv` (attention, shared experts)
* Expert weights: `mlp.experts.gate_up_proj` + `mlp.experts.gate_up_proj_scale_inv` (no .weight suffix)
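In both patterns the scale key is the weight key with `_scale_inv` appended, so one lookup rule covers them. The sketch below illustrates that pairing with plain nested lists standing in for tensors. It assumes that dequantization multiplies each weight by its inverse scale, that a scalar scale means per-tensor and a 2D scale means block-wise, and that the block size is 2; none of these details are confirmed by this page. Unlike the documented function it returns a new dict rather than working in place, and it handles only 2D weights.

```python
SCALE_SUFFIX = "_scale_inv"


def dequantize_blockwise(weight: list, scale_inv: list, block: int) -> list:
    """Multiply each element by the scale of the block it falls in."""
    return [
        [w * scale_inv[i // block][j // block] for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]


def dequantize(state_dict: dict, block: int = 2) -> dict:
    """Illustrative stand-in: pair each weight with `<key>_scale_inv`."""
    out = {}
    for key, value in state_dict.items():
        if key.endswith(SCALE_SUFFIX):
            continue  # scales are consumed, not copied to the output
        scale = state_dict.get(key + SCALE_SUFFIX)
        if scale is None:
            out[key] = value  # no scale present: treated as not quantized
        elif isinstance(scale, (int, float)):
            out[key] = [[w * scale for w in row] for row in value]  # per-tensor
        else:
            out[key] = dequantize_blockwise(value, scale, block)  # block-wise
    return out


result = dequantize({
    "attn.q_proj.weight": [[1.0, 2.0], [3.0, 4.0]],
    "attn.q_proj.weight_scale_inv": 0.5,
    "mlp.experts.gate_up_proj": [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]],
    "mlp.experts.gate_up_proj_scale_inv": [[2.0, 3.0]],
    "norm.weight": [1.0],
})
```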

```python
nemo_automodel.components.models.mistral4.state_dict_adapter._inject_missing_gate_bias(
    state_dict: dict[str, typing.Any],
    n_routed_experts: int
) -> dict[str, typing.Any]
```

Inject zero `e_score_correction_bias` for MoE layers that lack it.

Some checkpoints (e.g. vv4) do not include the gate bias, which starts at zero
and is learned during training. The model always expects the key, so we
inject `torch.zeros(n_routed_experts)` for any layer that has a gate weight
but no bias.
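The rule above can be sketched as a scan over gate weight keys. This is an illustration only: the key layout (`...mlp.gate.weight` with the bias as a sibling `...mlp.gate.e_score_correction_bias`) is an assumption, and a list of zeros stands in for `torch.zeros(n_routed_experts)` so that the example runs without PyTorch.

```python
GATE_WEIGHT_SUFFIX = ".gate.weight"   # assumed key layout, not confirmed
BIAS_NAME = "e_score_correction_bias"


def inject_missing_gate_bias(state_dict: dict, n_routed_experts: int) -> dict:
    """Add a zero bias next to every gate weight that has none."""
    for key in list(state_dict):  # copy keys: the dict is mutated below
        if not key.endswith(GATE_WEIGHT_SUFFIX):
            continue
        bias_key = key[: -len("weight")] + BIAS_NAME
        if bias_key not in state_dict:
            # Stand-in for torch.zeros(n_routed_experts).
            state_dict[bias_key] = [0.0] * n_routed_experts
    return state_dict


sd = inject_missing_gate_bias(
    {
        "model.layers.0.mlp.gate.weight": "w0",
        "model.layers.1.mlp.gate.weight": "w1",
        "model.layers.1.mlp.gate.e_score_correction_bias": [0.1, 0.2, 0.3, 0.4],
    },
    n_routed_experts=4,
)
```

Layer 0 gains a zero bias, while the existing bias on layer 1 is left untouched.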

```python
nemo_automodel.components.models.mistral4.state_dict_adapter._should_quantize_key(
    key: str
) -> bool
```

Check if a key should be quantized based on its name.

Handles both standard keys (`*.weight`) and Mistral4 aggregated expert keys
(`*.gate_up_proj`, `*.down_proj`), which don't have a `.weight` suffix.
Only text model weights are FP8; vision tower, projector, and `lm_head` are not.
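A name-based check along these lines might look like the sketch below. It is a guess at the shape of the logic, not the module's implementation: the pattern list contains only the first three entries of `_NON_QUANTIZED_PATTERNS` that are visible on this page (the rest is truncated), and the non-text markers and substring matching are assumptions.

```python
# First three documented entries only; the real list continues.
NON_QUANTIZED_PATTERNS = [
    "input_layernorm.weight",
    "post_attention_layernorm.weight",
    "norm.weight",
]
# Assumed markers for components the description says are not FP8.
NON_TEXT_MARKERS = ("vision_tower", "multi_modal_projector", "lm_head")
# Standard weights plus aggregated expert keys without a .weight suffix.
QUANTIZED_SUFFIXES = (".weight", ".gate_up_proj", ".down_proj")


def should_quantize_key(key: str) -> bool:
    """Illustrative stand-in for the name-based quantization check."""
    if not key.endswith(QUANTIZED_SUFFIXES):
        return False
    if any(marker in key for marker in NON_TEXT_MARKERS):
        return False
    return not any(pattern in key for pattern in NON_QUANTIZED_PATTERNS)


checks = {
    key: should_quantize_key(key)
    for key in [
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.mlp.experts.gate_up_proj",
        "model.layers.0.input_layernorm.weight",
        "lm_head.weight",
        "model.vision_tower.patch_conv.weight",
    ]
}
```

Because the documented pattern list is truncated, this sketch may accept keys that the real function rejects.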

```python
nemo_automodel.components.models.mistral4.state_dict_adapter._HF_PREFIX = 'language_model.'
```

```python
nemo_automodel.components.models.mistral4.state_dict_adapter._NON_QUANTIZED_PATTERNS = ['input_layernorm.weight', 'post_attention_layernorm.weight', 'norm.weight', 'lm...
```

```python
nemo_automodel.components.models.mistral4.state_dict_adapter.logger = logging.getLogger(__name__)
```