# nemo_automodel.components.models.qwen3_5_moe.state_dict_adapter

State-dict adapter for Qwen3.5-MoE.

HF Qwen3.5-MoE stores expert weights as **aggregated 3-D tensors**:

- `model.language_model.layers.{L}.mlp.experts.gate_up_proj`, shape `[n_experts, 2*moe_inter, hidden]`
- `model.language_model.layers.{L}.mlp.experts.down_proj`, shape `[n_experts, hidden, moe_inter]`

NeMo uses different key names **and a transposed layout**, so that the forward pass can compute `x @ weight`:

- `model.language_model.layers.{L}.mlp.experts.gate_and_up_projs`, shape `[n_experts, hidden, 2*moe_inter]`
- `model.language_model.layers.{L}.mlp.experts.down_projs`, shape `[n_experts, moe_inter, hidden]`

Both expert tensors require `.transpose(1, 2)` when converting between formats.
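
As a minimal sketch of the layout change, assuming PyTorch is available and using small made-up dimensions (the values of `n_experts`, `hidden`, and `moe_inter` below are illustrative, not taken from any real config), swapping the last two axes maps the HF shapes onto the NeMo shapes and back:

```python
import torch

# Illustrative sizes only; real values come from the model config.
n_experts, hidden, moe_inter = 4, 8, 6

# HF layout: [n_experts, 2*moe_inter, hidden] and [n_experts, hidden, moe_inter].
hf_gate_up = torch.randn(n_experts, 2 * moe_inter, hidden)
hf_down = torch.randn(n_experts, hidden, moe_inter)

# NeMo layout: swap axes 1 and 2 so each expert weight is ready for x @ weight.
# Calling .contiguous() is a choice made in this sketch, not a documented step.
nemo_gate_and_up = hf_gate_up.transpose(1, 2).contiguous()
nemo_down = hf_down.transpose(1, 2).contiguous()

# The same transpose converts back to the HF layout.
roundtrip = nemo_gate_and_up.transpose(1, 2)
```

Because the operation is its own inverse, one transpose call can serve both conversion directions.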

Additionally, the shared-expert key segment is singular in HF (`shared_expert`) and plural in NeMo (`shared_experts`):

- HF: `.mlp.shared_expert.{gate,up,down}_proj.weight`
- NeMo: `.mlp.shared_experts.{gate,up,down}_proj.weight`
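
The rename can be sketched with plain string replacement. The two helper functions below are hypothetical names introduced for illustration and are not part of the adapter's API; only the key segments come from the description above.

```python
def hf_to_native_shared_expert(key: str) -> str:
    """Hypothetical helper: HF singular segment -> NeMo plural segment."""
    # The trailing dot keeps the pattern from matching already-plural keys.
    return key.replace(".mlp.shared_expert.", ".mlp.shared_experts.")


def native_to_hf_shared_expert(key: str) -> str:
    """Hypothetical helper: NeMo plural segment -> HF singular segment."""
    return key.replace(".mlp.shared_experts.", ".mlp.shared_expert.")


hf_key = "model.language_model.layers.0.mlp.shared_expert.gate_proj.weight"
native_key = hf_to_native_shared_expert(hf_key)
```

Matching the full dotted segment, rather than the bare word, means a key in which `shared_expert` is only the prefix of a longer segment name is left untouched.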

All other keys (attention, linear\_attn/GatedDeltaNet, norms, embeddings, vision
encoder) pass through unchanged. In addition, the language model head lives at a
different location in each format: the HF VLM checkpoint stores it as
`model.lm_head`, while Automodel registers it on the outer model as `lm_head`.
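
A minimal sketch of the head relocation, assuming head keys look like `model.lm_head.weight` (the `.weight` suffix and the helper name are assumptions made for illustration):

```python
def hf_to_native_lm_head(key: str) -> str:
    """Hypothetical helper: move the head from `model.lm_head` to `lm_head`."""
    prefix = "model.lm_head."
    if key.startswith(prefix):
        return "lm_head." + key[len(prefix):]
    # Every other key is returned unchanged.
    return key


head_key = hf_to_native_lm_head("model.lm_head.weight")
other_key = hf_to_native_lm_head("model.language_model.norm.weight")
```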

## Module Contents

### Classes

| Name                                                                                                                        | Description                                                             |
| --------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| [`Qwen3_5MoeStateDictAdapter`](#nemo_automodel-components-models-qwen3_5_moe-state_dict_adapter-Qwen3_5MoeStateDictAdapter) | Converts between HF Qwen3.5-MoE checkpoints and the NeMo native format. |

### Functions

| Name                                                                                                        | Description                                                           |
| ----------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| [`_route_fp32_params`](#nemo_automodel-components-models-qwen3_5_moe-state_dict_adapter-_route_fp32_params) | Route bare GDN fp32 params into the holder used by the native module. |
| [`_strip_fp32_params`](#nemo_automodel-components-models-qwen3_5_moe-state_dict_adapter-_strip_fp32_params) | Strip the fp32 holder segment from GDN state-dict keys.               |

### API

```python
class nemo_automodel.components.models.qwen3_5_moe.state_dict_adapter.Qwen3_5MoeStateDictAdapter(
    config: typing.Any,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.float32
)
```

**Bases:** [StateDictAdapter](/nemo-automodel/nemo_automodel/components/checkpoint/state_dict_adapter#nemo_automodel-components-checkpoint-state_dict_adapter-StateDictAdapter)

Converts between HF Qwen3.5-MoE checkpoints and the NeMo native format.

```python
nemo_automodel.components.models.qwen3_5_moe.state_dict_adapter.Qwen3_5MoeStateDictAdapter._apply_key_mapping(
    state_dict: dict[str, typing.Any],
    mapping: dict[str, str]
) -> dict[str, typing.Any]
```

Apply key substring mappings to state dict keys.
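
One way such a substring mapping could work is sketched below. This is a standalone illustration under stated assumptions, not the method's actual body: the function name is hypothetical, and whether the real method applies every matching entry or stops at the first match is not stated here.

```python
from typing import Any


def apply_key_mapping_sketch(
    state_dict: dict[str, Any], mapping: dict[str, str]
) -> dict[str, Any]:
    """Return a new dict whose keys have each mapped substring replaced."""
    renamed: dict[str, Any] = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in mapping.items():
            # Assumption: every matching entry is applied, in mapping order.
            new_key = new_key.replace(old, new)
        renamed[new_key] = value
    return renamed


mapping = {".mlp.shared_expert.": ".mlp.shared_experts."}
state = {
    "layers.0.mlp.shared_expert.up_proj.weight": "w_up",  # placeholder values
    "layers.0.self_attn.q_proj.weight": "w_q",
}
result = apply_key_mapping_sketch(state, mapping)
```

Values are carried over untouched; only the keys change.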

```python
nemo_automodel.components.models.qwen3_5_moe.state_dict_adapter.Qwen3_5MoeStateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]
```

Rename a single native key to HF format and transpose expert tensors.
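
The per-tensor conversion for the two expert keys can be sketched as follows, assuming PyTorch. The helper name and the `EXPERT_RENAMES` table are hypothetical; only the key names and the transpose come from the class description. The return value mirrors the `list[tuple[str, Any]]` shape of the signature above.

```python
import torch

# Native suffix -> HF suffix for the two aggregated expert tensors.
EXPERT_RENAMES = {
    ".mlp.experts.gate_and_up_projs": ".mlp.experts.gate_up_proj",
    ".mlp.experts.down_projs": ".mlp.experts.down_proj",
}


def native_expert_tensor_to_hf(fqn: str, tensor: torch.Tensor):
    """Hypothetical helper: rename one native key and transpose expert weights."""
    for native_suffix, hf_suffix in EXPERT_RENAMES.items():
        if fqn.endswith(native_suffix):
            hf_fqn = fqn[: -len(native_suffix)] + hf_suffix
            return [(hf_fqn, tensor.transpose(1, 2))]
    # Non-expert tensors pass through unchanged in this sketch.
    return [(fqn, tensor)]


w = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)
pairs = native_expert_tensor_to_hf(
    "model.language_model.layers.0.mlp.experts.down_projs", w
)
```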

```python
nemo_automodel.components.models.qwen3_5_moe.state_dict_adapter.Qwen3_5MoeStateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
    kwargs = {}
) -> dict[str, typing.Any]
```

Rename HF keys to native keys and transpose expert tensors.

- **DTensors (DCP, distributed checkpoint, path):** rename and transpose only, with no slicing, because DCP handles sharding.
- **Plain tensors (init path):** slice to the local EP (expert-parallel) shard, transpose, then create a DTensor.
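
The slicing step on the plain-tensor path can be pictured with the index arithmetic below. It is a sketch that assumes experts are split into equal contiguous blocks across expert-parallel ranks; the function name and that partitioning scheme are assumptions, not details confirmed by this page.

```python
def local_expert_range(n_experts: int, ep_size: int, ep_rank: int) -> tuple[int, int]:
    """Hypothetical helper: half-open range of experts owned by one EP rank."""
    if n_experts % ep_size != 0:
        raise ValueError("this sketch assumes n_experts is divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank must be in [0, ep_size)")
    per_rank = n_experts // ep_size
    start = ep_rank * per_rank
    return start, start + per_rank


# With 8 experts over 4 ranks, rank 1 owns experts 2 and 3.
start, end = local_expert_range(n_experts=8, ep_size=4, ep_rank=1)
```

For an HF-layout tensor `w`, the local shard in this scheme would be `w[start:end].transpose(1, 2)`, which is then wrapped as a DTensor.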

```python
nemo_automodel.components.models.qwen3_5_moe.state_dict_adapter.Qwen3_5MoeStateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: typing.Optional[str] = None,
    quantization: bool = False,
    kwargs = {}
) -> dict[str, typing.Any]
```

Rename native keys to HF keys and transpose expert tensors. No communication between ranks is required.

```python
nemo_automodel.components.models.qwen3_5_moe.state_dict_adapter._route_fp32_params(
    key: str
) -> str
```

Route bare GDN fp32 params into the holder used by the native module.

```python
nemo_automodel.components.models.qwen3_5_moe.state_dict_adapter._strip_fp32_params(
    key: str
) -> str
```

Strip the fp32 holder segment from GDN state-dict keys.
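
The two helpers read as an inverse pair: one inserts a holder segment into a key, the other removes it. The sketch below illustrates that relationship only. The holder segment name `fp32_holder`, the parameter names `A_log` and `dt_bias`, and both function names are hypothetical placeholders, since the actual names are not given on this page.

```python
HOLDER = "fp32_holder"  # hypothetical holder segment name
FP32_PARAMS = ("A_log", "dt_bias")  # hypothetical fp32 parameter names


def route_fp32_sketch(key: str) -> str:
    """Insert the holder segment before a bare fp32 param in a linear_attn key."""
    parent, _, leaf = key.rpartition(".")
    if not parent or ".linear_attn" not in parent:
        return key
    if leaf in FP32_PARAMS and not parent.endswith("." + HOLDER):
        return f"{parent}.{HOLDER}.{leaf}"
    return key


def strip_fp32_sketch(key: str) -> str:
    """Remove the holder segment, restoring the bare key."""
    return key.replace(f".{HOLDER}.", ".")


bare = "model.language_model.layers.0.linear_attn.A_log"
routed = route_fp32_sketch(bare)
restored = strip_fp32_sketch(routed)
```

Routing an already-routed key leaves it unchanged in this sketch, so applying the helper twice is harmless.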