> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.models.mistral3_vlm.state_dict_adapter

State-dict adapter for the Mistral-3.5 128B (dawn-ridge) FP8 VLM.

Plugs into the standard nemo\_automodel checkpoint flow
(nemo\_automodel/components/checkpoint/checkpointing.py \~lines 510, 556) and
handles **FP8 dequantization** during load/save:

* The checkpoint's language\_model Linear weights are stored as per-tensor
  FP8 with a scalar `weight_scale_inv` sibling (and an unused
  `activation_scale` sibling). The adapter pairs each weight with its
  scale on load, dequantizes to bf16 (`w_bf16 = w_fp8.to(bf16) * scale`),
  and drops the scale keys. Vision tower + multi\_modal\_projector + lm\_head
  are BF16 on disk and pass through unchanged.

The live HF VLM module keeps the body under `model.*` while the checkpoint
stores text weights under `language_model.model.*` and top-level VLM
components as `vision_tower.*` / `multi_modal_projector.*`. The LM head is
also nested on disk as `language_model.lm_head.weight` while the runtime
module exposes it as `lm_head.weight`.

Structurally modelled after
`nemo_automodel/components/models/deepseek_v3/state_dict_adapter.py`.

## Module Contents

### Classes

| Name                                                                                                                           | Description                                                  |
| ------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------ |
| [`Mistral3FP8StateDictAdapter`](#nemo_automodel-components-models-mistral3_vlm-state_dict_adapter-Mistral3FP8StateDictAdapter) | FP8 dequant adapter for the Mistral-3.5 128B dawn-ridge VLM. |

### Functions

| Name                                                                                                                 | Description                                                       |
| -------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
| [`_dequantize_from_fp8`](#nemo_automodel-components-models-mistral3_vlm-state_dict_adapter-_dequantize_from_fp8)     | Dequantize a single FP8 weight using its per-tensor scalar scale. |
| [`_identity`](#nemo_automodel-components-models-mistral3_vlm-state_dict_adapter-_identity)                           | -                                                                 |
| [`_is_fp8_weight_key`](#nemo_automodel-components-models-mistral3_vlm-state_dict_adapter-_is_fp8_weight_key)         | Return True iff `model_key` names an FP8 Linear weight.           |
| [`_vlm_full_hf_to_native`](#nemo_automodel-components-models-mistral3_vlm-state_dict_adapter-_vlm_full_hf_to_native) | Map checkpoint VLM names back to runtime parameter names.         |
| [`_vlm_full_native_to_hf`](#nemo_automodel-components-models-mistral3_vlm-state_dict_adapter-_vlm_full_native_to_hf) | Map runtime VLM parameter names to checkpoint names.              |

### Data

[`_HF_LM_HEAD_KEY`](#nemo_automodel-components-models-mistral3_vlm-state_dict_adapter-_HF_LM_HEAD_KEY)

[`_MODEL_LM_HEAD_KEY`](#nemo_automodel-components-models-mistral3_vlm-state_dict_adapter-_MODEL_LM_HEAD_KEY)

[`_NON_QUANTIZED_SUFFIXES`](#nemo_automodel-components-models-mistral3_vlm-state_dict_adapter-_NON_QUANTIZED_SUFFIXES)

[`logger`](#nemo_automodel-components-models-mistral3_vlm-state_dict_adapter-logger)

### API

```python
class nemo_automodel.components.models.mistral3_vlm.state_dict_adapter.Mistral3FP8StateDictAdapter(
    native_to_hf: typing.Callable[[str], str] = _identity,
    hf_to_native: typing.Callable[[str], str] = _identity,
    layout_name: str = 'vlm_full',
    not_fp8_prefixes: tuple[str, ...] = ()
)
```

**Bases:** [StateDictAdapter](/nemo-automodel/nemo_automodel/components/checkpoint/state_dict_adapter#nemo_automodel-components-checkpoint-state_dict_adapter-StateDictAdapter)

FP8 dequant adapter for the Mistral-3.5 128B dawn-ridge VLM.

Keys round-trip identity (HF state\_dict and on-disk keys match for the
full VLM). Only language\_model layer weights are FP8; vision\_tower,
multi\_modal\_projector, and lm\_head are BF16 and pass through unchanged
via the `not_fp8_prefixes` / `_NON_QUANTIZED_SUFFIXES` filters.

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter.Mistral3FP8StateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]
```

Per-tensor model → HF used by `Checkpointer.save_model`.

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter.Mistral3FP8StateDictAdapter.for_vlm_full() -> 'Mistral3FP8StateDictAdapter'
```

classmethod

Full-VLM path for Mistral3ForConditionalGeneration checkpoints.

The runtime module keeps VLM body modules under `model.*` but the
checkpoint stores text weights under `language_model.model.*` and
non-text component names at top level. The **LM head** has one extra
quirk: the model exposes it at the top level (`lm_head.weight`) while
the checkpoint nests it (`language_model.lm_head.weight`).
Tied checkpoints (Ministral-3) never serialize the head, so the head
translation is a harmless no-op there; untied checkpoints (Devstral-24B)
rely on it to find the head during the DCP load.

Only the language\_model layer weights are FP8; vision / mm\_projector /
lm\_head are BF16 on disk and must be passed through without a scale\_inv
placeholder — otherwise DCP would fail trying to fetch a non-existent
`_scale_inv` key.

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter.Mistral3FP8StateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: typing.Optional['DeviceMesh'] = None,
    kwargs = {}
) -> dict[str, typing.Any]
```

Convert an HF-format (possibly FP8) state dict to model-native format.

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter.Mistral3FP8StateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: typing.Optional[str] = None,
    quantization: bool = False,
    kwargs = {}
) -> dict[str, typing.Any]
```

Convert a model-native state dict to HF (on-disk) layout.

When `quantization=True` the weight placeholder is also cast to
`torch.float8_e4m3fn` so the DCP storage reader fetches FP8 bytes
verbatim from safetensors (a bf16 target would silently cast-on-read
and lose the scale multiply — see deepseek\_v3/state\_dict\_adapter.py:220).
A scalar `_scale_inv` placeholder is also emitted so DCP pulls it
alongside the weight.

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter._dequantize_from_fp8(
    weight_fp8: torch.Tensor,
    scale_inv: torch.Tensor,
    target_dtype: torch.dtype = torch.bfloat16
) -> torch.Tensor
```

Dequantize a single FP8 weight using its per-tensor scalar scale.

The dawn-ridge 128B checkpoint uses per-tensor quantization
(`weight_block_size=None`), so `scale_inv` is a 0-d scalar and
dequant collapses to a simple multiply. The per-block formula
(`transformers.integrations.finegrained_fp8.Fp8Dequantize.convert`,
finegrained\_fp8.py:867-906) is not needed here.

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter._identity(
    k: str
) -> str
```

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter._is_fp8_weight_key(
    model_key: str,
    not_fp8_prefixes: tuple[str, ...] = ()
) -> bool
```

Return True iff `model_key` names an FP8 Linear weight.

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter._vlm_full_hf_to_native(
    hf_key: str
) -> str
```

Map checkpoint VLM names back to runtime parameter names.

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter._vlm_full_native_to_hf(
    model_key: str
) -> str
```

Map runtime VLM parameter names to checkpoint names.

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter._HF_LM_HEAD_KEY = 'language_model.lm_head.weight'
```

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter._MODEL_LM_HEAD_KEY = 'lm_head.weight'
```

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter._NON_QUANTIZED_SUFFIXES = ('embed_tokens.weight', 'lm_head.weight', 'input_layernorm.weight', 'post_attent...
```

```python
nemo_automodel.components.models.mistral3_vlm.state_dict_adapter.logger = logging.getLogger(__name__)
```