
# nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter

State-dict adapter for the MiniMax M3 (text) backbone.

Converts between the released HF checkpoint layout and the native AutoModel
layout:

* `block_sparse_moe.{gate,e_score_correction_bias}` -> `mlp.gate.*`
* `block_sparse_moe.experts.{e}.{w1,w3,w2}` -> grouped `mlp.experts.*`
  (gate/up/down) via `MoESplitExpertsStateDictMixin`
* `block_sparse_moe.shared_experts.*` -> `shared_experts.*` (a sibling of
  `mlp` on the decoder block)
* dense (non-MoE) layers keep `mlp.{gate,up,down}_proj.*` unchanged
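
The non-expert renames in the list above can be sketched as a small key-rewriting function. This is an illustration only, not the adapter's code: `rename_hf_key` and `_LAYER_PREFIX` are hypothetical names, the exact HF key spellings (for example `block_sparse_moe.gate.weight`) are assumed from the bullets, and routed-expert keys are deliberately left alone because the list says `MoESplitExpertsStateDictMixin` groups them.

```python
import re

# Hypothetical helper: illustrates the renames listed above for non-expert tensors.
_LAYER_PREFIX = re.compile(r"^(?P<prefix>.*\.layers\.\d+\.)(?P<rest>.+)$")


def rename_hf_key(key: str) -> str:
    """Map one HF-layout key to the native layout (routed experts excluded)."""
    match = _LAYER_PREFIX.match(key)
    if match is None:
        return key  # keys outside decoder layers: assumed unchanged
    prefix, rest = match["prefix"], match["rest"]
    if rest.startswith("block_sparse_moe.shared_experts."):
        # shared experts become a sibling of `mlp` on the decoder block
        return prefix + rest[len("block_sparse_moe."):]
    if rest.startswith("block_sparse_moe.gate."):
        return prefix + "mlp.gate." + rest[len("block_sparse_moe.gate."):]
    if rest == "block_sparse_moe.e_score_correction_bias":
        return prefix + "mlp.gate.e_score_correction_bias"
    # dense mlp.*, attention, and routed-expert keys are not handled here
    return key


print(rename_hf_key("model.layers.3.block_sparse_moe.gate.weight"))
```

Dense-layer keys such as `model.layers.0.mlp.gate_proj.weight` come back unchanged, matching the last bullet.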

MXFP8 weights (FP8 e4m3 values plus `*_scale_inv` block scales stored as
e8m0/uint8, with block shape `[1, 32]` along the input dim) are dequantized to
`dtype` on load, because training runs in BF16. Stage 1 drops the
sparse-attention index branch (`self_attn.index_*`) and MTP (`mtp.*`) tensors;
those are wired in Stages 2 and 4.

## Module Contents

### Classes

| Name                                                                                                                            | Description                                                                    |
| ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| [`MiniMaxM3StateDictAdapter`](#nemo_automodel-components-models-minimax_m3_vl-state_dict_adapter-MiniMaxM3StateDictAdapter)     | Convert MiniMax M3 HF checkpoints to/from the native grouped-expert format.    |
| [`MiniMaxM3VLStateDictAdapter`](#nemo_automodel-components-models-minimax_m3_vl-state_dict_adapter-MiniMaxM3VLStateDictAdapter) | VLM adapter: splits the M3 VL checkpoint into text / vision / projector parts. |

### Functions

| Name                                                                                                                                  | Description                                                                          |
| ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| [`_dequantize_mxfp8_local`](#nemo_automodel-components-models-minimax_m3_vl-state_dict_adapter-_dequantize_mxfp8_local)               | Dequantize a local weight shard against its matching local scale slice.              |
| [`_should_quantize_mxfp8_key`](#nemo_automodel-components-models-minimax_m3_vl-state_dict_adapter-_should_quantize_mxfp8_key)         | True for HF-format weight keys stored as MXFP8 in the checkpoint.                    |
| [`_slice_mxfp8_scale_for_dtensor`](#nemo_automodel-components-models-minimax_m3_vl-state_dict_adapter-_slice_mxfp8_scale_for_dtensor) | Slice a global scale\_inv to a DTensor weight's local shard.                         |
| [`create_mxfp8_scale_inv`](#nemo_automodel-components-models-minimax_m3_vl-state_dict_adapter-create_mxfp8_scale_inv)                 | Load-time placeholder scale\_inv (e8m0/uint8, GLOBAL shape `[out, ceil(in/block)]`). |
| [`dequantize_mxfp8`](#nemo_automodel-components-models-minimax_m3_vl-state_dict_adapter-dequantize_mxfp8)                             | Dequantize an MXFP8 weight (FP8 e4m3 + e8m0/uint8 block scales) to `dtype`.          |

### Data

[`MXFP8_BLOCK_SIZE`](#nemo_automodel-components-models-minimax_m3_vl-state_dict_adapter-MXFP8_BLOCK_SIZE)

[`_MXFP8_QUANT_KEY_RE`](#nemo_automodel-components-models-minimax_m3_vl-state_dict_adapter-_MXFP8_QUANT_KEY_RE)

[`_MXFP8_SCALE_INV_IDENTITY`](#nemo_automodel-components-models-minimax_m3_vl-state_dict_adapter-_MXFP8_SCALE_INV_IDENTITY)

### API

```python
class nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3StateDictAdapter(
    config: typing.Any,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.bfloat16
)
```

**Bases:** [MoESplitExpertsStateDictMixin](/nemo-automodel/nemo_automodel/components/moe/state_dict_mixin#nemo_automodel-components-moe-state_dict_mixin-MoESplitExpertsStateDictMixin), [StateDictAdapter](/nemo-automodel/nemo_automodel/components/checkpoint/state_dict_adapter#nemo_automodel-components-checkpoint-state_dict_adapter-StateDictAdapter)

Convert MiniMax M3 HF checkpoints to/from the native grouped-expert format.

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3StateDictAdapter._dequantize(
    state_dict: dict[str, typing.Any]
) -> dict[str, typing.Any]
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3StateDictAdapter._hf_key_to_native(
    key: str
) -> str
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3StateDictAdapter._mtp_from_hf(
    mtp_keys: dict[str, typing.Any],
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
) -> dict[str, typing.Any]
```

Convert MTP tensors. The `transformer_layer` reuses the full text `from_hf`
path (treated as a fake 1-layer model, so expert merging, index handling, and
dequantization all apply); the `enorm`/`hnorm`/`eh_proj`/`final_layernorm`
fusion tensors pass through (`eh_proj` is FP8).

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3StateDictAdapter._mtp_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3StateDictAdapter._native_key_to_hf(
    key: str
) -> str
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3StateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3StateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
    kwargs = {}
) -> dict[str, typing.Any]
```

Convert an HF checkpoint to native format (operates in-place to limit peak memory).
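
The in-place pattern mentioned above can be illustrated with a generic key-renaming loop. This is a sketch under assumptions, not the body of `from_hf`: `rename_keys_in_place` is a hypothetical helper, and the real method also dequantizes weights and merges experts, which is omitted here.

```python
from typing import Any, Callable


def rename_keys_in_place(
    state_dict: dict[str, Any], rename: Callable[[str], str]
) -> dict[str, Any]:
    """Rename keys by popping each entry, so no second full dict is built."""
    for key in list(state_dict.keys()):  # snapshot: the dict is mutated below
        new_key = rename(key)
        if new_key != key:
            # pop + reinsert moves the value without copying it
            state_dict[new_key] = state_dict.pop(key)
    return state_dict


weights = {"a.x": 1, "b": 2}
result = rename_keys_in_place(weights, lambda k: k.replace("a.", "c."))
print(result is weights, sorted(result))
```

Because entries are moved one at a time, the caller's dictionary is modified; anything that still needs the original HF keys should copy the dictionary first.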

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3StateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: typing.Optional[str] = None,
    quantization: bool = False,
    kwargs = {}
) -> dict[str, typing.Any]
```

```python
class nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3VLStateDictAdapter(
    config: typing.Any,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.bfloat16
)
```

**Bases:** [StateDictAdapter](/nemo-automodel/nemo_automodel/components/checkpoint/state_dict_adapter#nemo_automodel-components-checkpoint-state_dict_adapter-StateDictAdapter)

VLM adapter: splits the M3 VL checkpoint into text / vision / projector parts.

The released checkpoint stores the language backbone under
`language_model.model.*` / `language_model.lm_head` and the vision side
under `vision_tower.vision_model.*` with the projector / patch-merger at
top level (`multi_modal_projector.*` / `patch_merge_mlp.*`). The native
VLM keeps the text model at `model.*` / `lm_head` and nests the projector
/ merger under `vision_tower.*`. Text tensors are delegated to
`MiniMaxM3StateDictAdapter` (`block_sparse_moe` -> `mlp`, index branch,
MXFP8 dequant, grouped experts); vision tensors are BF16 and pass through.
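
The text / vision / projector split described above can be sketched as a key router. The function name `route_vl_key` is hypothetical, and two details are assumptions rather than documented behavior: that `vision_tower.*` keys keep their names, and that the projector and merger are nested directly as `vision_tower.multi_modal_projector.*` and `vision_tower.patch_merge_mlp.*`.

```python
from typing import Optional


def route_vl_key(key: str) -> tuple[str, Optional[str]]:
    """Return (part, native_key) for one HF key; native_key is None if unknown."""
    if key.startswith("language_model."):
        # language_model.model.* -> model.*, language_model.lm_head.* -> lm_head.*
        # (the text adapter would still convert the stripped key further)
        return "text", key[len("language_model."):]
    if key.startswith("vision_tower."):
        # assumption: vision tower keys pass through unchanged
        return "vision", key
    for top in ("multi_modal_projector.", "patch_merge_mlp."):
        if key.startswith(top):
            # assumption: nested directly under vision_tower.
            return "projector", "vision_tower." + key
    return "unknown", None


print(route_vl_key("language_model.lm_head.weight"))
```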

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3VLStateDictAdapter._map_non_text_from_hf(
    key: str
) -> str | None
```

staticmethod

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3VLStateDictAdapter._map_non_text_to_hf(
    key: str
) -> str
```

staticmethod

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3VLStateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3VLStateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh = None,
    kwargs = {}
) -> dict[str, typing.Any]
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MiniMaxM3VLStateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex = None,
    quantization: bool = False,
    kwargs = {}
)
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter._dequantize_mxfp8_local(
    w_local: torch.Tensor,
    scale_local: torch.Tensor,
    block_size: int,
    dtype
) -> torch.Tensor
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter._should_quantize_mxfp8_key(
    key: str
) -> bool
```

True for HF-format weight keys stored as MXFP8 in the checkpoint.

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter._slice_mxfp8_scale_for_dtensor(
    scale_inv: torch.Tensor,
    weight_dtensor: torch.Tensor,
    weight_local: torch.Tensor,
    block_size: int
) -> torch.Tensor
```

Slice a global scale\_inv to a DTensor weight's local shard.

The MXFP8 block is `[1, block_size]`: dim 0 (out) is full resolution (block size
1, so a row range maps 1:1) and dim 1 (in) is grouped by `block_size`. The custom
MoE always runs with tp=1, so sharding is on dim 0 (FSDP / `ep_shard`); dim 1 is
handled for safety.
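
The index arithmetic behind this slicing can be sketched without tensors. `scale_slice_bounds` is a hypothetical helper, not the library function: it takes the local shard's row and column ranges in the global weight and returns the matching bounds in the global `scale_inv`. It assumes any dim-1 shard boundary is aligned to `block_size`.

```python
def scale_slice_bounds(row_start, row_end, col_start, col_end, block_size=32):
    """Return ((r0, r1), (c0, c1)) half-open bounds into a global scale_inv."""
    # dim 0: block size 1, so weight rows map 1:1 onto scale rows
    rows = (row_start, row_end)
    # dim 1: one scale column per `block_size` input columns;
    # -(-n // d) is ceiling division, covering a ragged final block
    cols = (col_start // block_size, -(-col_end // block_size))
    return rows, cols


# Weight [8, 128] sharded on dim 0 across 2 ranks: rank 1 owns rows 4..8.
print(scale_slice_bounds(4, 8, 0, 128))
```

For the dim-0 case the column bounds always span the full scale width, which is why row sharding alone needs no block alignment.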

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.create_mxfp8_scale_inv(
    weight: torch.Tensor,
    block_size: int = MXFP8_BLOCK_SIZE
) -> torch.Tensor
```

Load-time placeholder scale\_inv (e8m0/uint8, GLOBAL shape `[out, ceil(in/block)]`).

Emitted by `to_hf(quantization=True)` so that the DCP planner requests the
checkpoint's `*_scale_inv` tensors; the values here are overwritten by the
load. It is kept as a regular (non-DTensor) tensor with global shape; the
per-shard slice happens in `dequantize_mxfp8` (mirroring `deepseek_v3`).
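
The placeholder's global shape follows directly from the weight shape and block size. The sketch below uses nested lists in place of a uint8 tensor; `placeholder_scale_shape` and `placeholder_scale_rows` are hypothetical names, and filling with the identity byte 127 is an assumption suggested by `_MXFP8_SCALE_INV_IDENTITY`, not confirmed behavior.

```python
import math

MXFP8_BLOCK_SIZE = 32
SCALE_INV_IDENTITY = 127  # e8m0 byte 127 decodes to a scale of 2**0 == 1


def placeholder_scale_shape(out_features, in_features, block_size=MXFP8_BLOCK_SIZE):
    """Global scale_inv shape for a weight of shape [out_features, in_features]."""
    return (out_features, math.ceil(in_features / block_size))


def placeholder_scale_rows(out_features, in_features, block_size=MXFP8_BLOCK_SIZE):
    """Nested-list stand-in for the uint8 tensor, filled with the identity byte.

    Assumption: the real fill value may differ; it is overwritten on load anyway.
    """
    rows, cols = placeholder_scale_shape(out_features, in_features, block_size)
    return [[SCALE_INV_IDENTITY] * cols for _ in range(rows)]


print(placeholder_scale_shape(4096, 6144))
```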

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.dequantize_mxfp8(
    weight: torch.Tensor,
    scale_inv: torch.Tensor,
    block_size: int = MXFP8_BLOCK_SIZE,
    dtype: torch.dtype = torch.bfloat16
) -> torch.Tensor
```

Dequantize an MXFP8 weight (FP8 e4m3 + e8m0/uint8 block scales) to `dtype`.

`weight` is FP8 `e4m3` with shape `[out, in]`; `scale_inv` holds e8m0 (uint8)
exponents with shape `[out, ceil(in/block_size)]`. The dequantization scale for
input block `b` is `2 ** (scale_inv[:, b] - 127)` (MX e8m0 convention; confirmed
against sglang). DTensor weights are handled: the local shard is dequantized
against the matching slice of a global `scale_inv` and rewrapped with the
weight's placements.
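
The scale formula above can be worked through on a tiny example. This sketch is not the library implementation: plain Python floats stand in for already-decoded FP8 values, nested lists stand in for tensors, and `dequantize_rows` is a hypothetical name. DTensor handling is omitted.

```python
def dequantize_rows(values, scale_exponents, block_size=32):
    """Apply per-block e8m0 scales to rows of decoded weight values.

    values:          rows of floats, shape [out, in]
    scale_exponents: rows of e8m0 bytes, shape [out, ceil(in / block_size)]
    """
    out = []
    for row, exps in zip(values, scale_exponents):
        out.append(
            # column `col` belongs to input block `col // block_size`
            [v * 2.0 ** (exps[col // block_size] - 127) for col, v in enumerate(row)]
        )
    return out


# One output row, three inputs, block_size=2: blocks are [0, 1] and [2].
# Byte 128 scales by 2**1, byte 126 scales by 2**-1.
print(dequantize_rows([[1.0, 2.0, 3.0]], [[128, 126]], block_size=2))
```

A scale byte of 127 leaves values unchanged, which is consistent with the constant name `_MXFP8_SCALE_INV_IDENTITY`.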

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter.MXFP8_BLOCK_SIZE = 32
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter._MXFP8_QUANT_KEY_RE = re.compile('\\.layers\\.\\d+\\.(?:self_attn\\.[qkvo]_proj|mlp\\.(?:gate|up|down)...
```

```python
nemo_automodel.components.models.minimax_m3_vl.state_dict_adapter._MXFP8_SCALE_INV_IDENTITY = 127
```