# nemo_automodel.components.models.step3p5.state_dict_adapter

State dict adapter for Step3p5 model.

Step3p5 uses grouped MoELinear weights with shape \[n\_exp, out, in], different from
the standard per-expert format. This adapter handles conversion between:

HF format (Step3p5):

* `model.layers.{L}.moe.gate_proj.weight`: `[n_exp, inter, dim]`
* `model.layers.{L}.moe.up_proj.weight`: `[n_exp, inter, dim]`
* `model.layers.{L}.moe.down_proj.weight`: `[n_exp, dim, inter]`
* `model.layers.{L}.moe.gate.weight`: `[n_exp, dim]` (router)
* `model.layers.{L}.moe.router_bias`: `[n_exp]` (post-sigmoid router correction bias, optional)
* `model.layers.{L}.share_expert.*.weight`: shared expert

Native format (Automodel):

* `model.layers.{L}.moe.experts.gate_and_up_projs`: `[n_exp, dim, 2*inter]`
* `model.layers.{L}.moe.experts.down_projs`: `[n_exp, inter, dim]`
* `model.layers.{L}.moe.gate.weight`: `[n_exp, dim]`
* `model.layers.{L}.moe.gate.e_score_correction_bias`: `[n_exp]`
* `model.layers.{L}.share_expert.*.weight`

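Note: Router gate weights and shared expert weights pass through with the same key names.
Only the expert MLP weights (`gate_proj`, `up_proj`, `down_proj`) need their tensors transformed.
Judging by the two key listings, the optional `router_bias` corresponds to `gate.e_score_correction_bias` in the native format, with the same `[n_exp]` shape.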
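The key and shape correspondence described above can be sketched as a small lookup. This is an illustration only: `native_key_and_shape` is a hypothetical helper, not part of `nemo_automodel`, and the `router_bias` to `gate.e_score_correction_bias` rename is an assumption read off the two key listings rather than confirmed behavior.

```python
import re

# Hypothetical helper, not part of nemo_automodel. It maps one HF key and
# shape to the native key and shape given in the module docstring.
_EXPERT_KEY = re.compile(
    r"^(model\.layers\.\d+\.moe)\.(gate_proj|up_proj|down_proj)\.weight$"
)
_BIAS_KEY = re.compile(r"^(model\.layers\.\d+\.moe)\.router_bias$")


def native_key_and_shape(hf_key, hf_shape):
    """Return (native_key, native_shape) for a single HF entry."""
    m = _EXPERT_KEY.match(hf_key)
    if m:
        prefix, proj = m.groups()
        n_exp, out_features, in_features = hf_shape
        if proj == "down_proj":
            # [n_exp, dim, inter] -> [n_exp, inter, dim]
            return f"{prefix}.experts.down_projs", (n_exp, in_features, out_features)
        # gate_proj and up_proj both land in the fused tensor:
        # two [n_exp, inter, dim] tensors -> one [n_exp, dim, 2*inter]
        return (
            f"{prefix}.experts.gate_and_up_projs",
            (n_exp, in_features, 2 * out_features),
        )
    m = _BIAS_KEY.match(hf_key)
    if m:
        # Assumed rename, inferred from the key listings.
        return f"{m.group(1)}.gate.e_score_correction_bias", tuple(hf_shape)
    # Router gate and shared-expert weights keep their names and shapes.
    return hf_key, tuple(hf_shape)
```

For instance, with `n_exp=8`, `inter=16`, and `dim=4`, an HF `gate_proj` of shape `(8, 16, 4)` maps to a native `gate_and_up_projs` of shape `(8, 4, 32)`.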

## Module Contents

### Classes

| Name                                                                                                              | Description                                                                          |
| ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| [`Step3p5StateDictAdapter`](#nemo_automodel-components-models-step3p5-state_dict_adapter-Step3p5StateDictAdapter) | Converts between HF Step3p5 checkpoints and Automodel grouped-experts native format. |

### Functions

| Name                                                                                                                                              | Description                                                               |
| ------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| [`_create_dtensor_from_local_or_reference`](#nemo_automodel-components-models-step3p5-state_dict_adapter-_create_dtensor_from_local_or_reference) | Create a DTensor from a local tensor.                                     |
| [`_swap_shard_placements_1_2`](#nemo_automodel-components-models-step3p5-state_dict_adapter-_swap_shard_placements_1_2)                           | Swap Shard dim 1 and dim 2 in DTensor placements after a transpose(1, 2). |

### Data

[`logger`](#nemo_automodel-components-models-step3p5-state_dict_adapter-logger)

### API

```python
class nemo_automodel.components.models.step3p5.state_dict_adapter.Step3p5StateDictAdapter(
    config: typing.Any,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.float32
)
```

**Bases:** [StateDictAdapter](/nemo-automodel/nemo_automodel/components/checkpoint/state_dict_adapter#nemo_automodel-components-checkpoint-state_dict_adapter-StateDictAdapter)

Converts between HF Step3p5 checkpoints and Automodel grouped-experts native format.

Step3p5 HF uses grouped MoELinear with shape `[n_experts, out_features, in_features]`:

* `model.layers.{L}.moe.gate_proj.weight`: `[n_exp, inter, dim]`
* `model.layers.{L}.moe.up_proj.weight`: `[n_exp, inter, dim]`
* `model.layers.{L}.moe.down_proj.weight`: `[n_exp, dim, inter]`

Prefix for HuggingFace format keys.

```python
nemo_automodel.components.models.step3p5.state_dict_adapter.Step3p5StateDictAdapter._convert_native_to_hf(
    fqn: str,
    tensor: torch.Tensor
) -> list[tuple[str, torch.Tensor]] | None
```

Convert native format expert tensors to HF Step3p5 format.

* Native `gate_and_up_projs` `[n_exp, dim, 2*inter]` -> HF `gate_proj`, `up_proj` `[n_exp, inter, dim]`
* Native `down_projs` `[n_exp, inter, dim]` -> HF `down_proj` `[n_exp, dim, inter]`

Preserves DTensor structure when input is a DTensor.
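The native-to-HF direction can be sketched without any dependencies by using nested lists in place of tensors. This is a sketch under stated assumptions, not the adapter's code: `native_expert_to_hf` and its helpers are hypothetical names, the gate half is assumed to occupy the first `inter` columns of the fused tensor, and returning `None` for non-expert keys is an assumption suggested by the `| None` return annotation. Real code would operate on `torch.Tensor` or DTensor inputs.

```python
def transpose_last_two(t):
    """Swap dims 1 and 2 of a 3-D nested list, like tensor.transpose(1, 2)."""
    return [[list(col) for col in zip(*mat)] for mat in t]


def split_gate_up(gate_and_up_projs):
    """[n_exp, dim, 2*inter] -> (gate_proj, up_proj), each [n_exp, inter, dim].

    Assumes the gate columns come first and the up columns second.
    """
    inter = len(gate_and_up_projs[0][0]) // 2
    gate_t = [[row[:inter] for row in mat] for mat in gate_and_up_projs]
    up_t = [[row[inter:] for row in mat] for mat in gate_and_up_projs]
    return transpose_last_two(gate_t), transpose_last_two(up_t)


def native_expert_to_hf(fqn, tensor):
    """Return (hf_key, tensor) pairs, or None if fqn is not an expert weight."""
    fused_suffix = ".experts.gate_and_up_projs"
    down_suffix = ".experts.down_projs"
    if fqn.endswith(fused_suffix):
        prefix = fqn[: -len(fused_suffix)]
        gate, up = split_gate_up(tensor)
        return [
            (f"{prefix}.gate_proj.weight", gate),
            (f"{prefix}.up_proj.weight", up),
        ]
    if fqn.endswith(down_suffix):
        prefix = fqn[: -len(down_suffix)]
        return [(f"{prefix}.down_proj.weight", transpose_last_two(tensor))]
    return None


# Toy fused tensor: n_exp=1, dim=2, 2*inter=4.
fused = [[[1, 2, 3, 4], [5, 6, 7, 8]]]
hf_pairs = native_expert_to_hf("model.layers.0.moe.experts.gate_and_up_projs", fused)
```

One native tensor can expand into more than one HF entry, which is why the result is a list of pairs rather than a single pair.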

```python
nemo_automodel.components.models.step3p5.state_dict_adapter.Step3p5StateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]
```

Convert a single tensor from native format to HuggingFace format.

**Parameters:**

* `fqn`: Fully qualified name of the tensor in native format
* `tensor`: The tensor to convert
* `kwargs`: Additional arguments for conversion

**Returns:** `list[tuple[str, Any]]`

List of (fqn, tensor) tuples in HuggingFace format

```python
nemo_automodel.components.models.step3p5.state_dict_adapter.Step3p5StateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
    kwargs = {}
) -> dict[str, typing.Any]
```

Convert HF checkpoint to native format.

Handles Step3p5's grouped MoELinear format:

* \[n\_exp, inter, dim] gate\_proj/up\_proj -> \[n\_exp, dim, 2\*inter] gate\_and\_up\_projs
* \[n\_exp, dim, inter] down\_proj -> \[n\_exp, inter, dim] down\_projs
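The two shape transformations in this list amount to a transpose of the last two dimensions, plus a concatenation for the fused gate/up tensor. The following dependency-free sketch uses nested lists in place of tensors; the helper names are hypothetical, and placing gate before up along the last dimension is an assumption. With torch the same steps would typically be written with `transpose(1, 2)` and `torch.cat(..., dim=-1)`.

```python
def transpose_last_two(t):
    """Swap dims 1 and 2 of a 3-D nested list, like tensor.transpose(1, 2)."""
    return [[list(col) for col in zip(*mat)] for mat in t]


def fuse_gate_up(gate_proj, up_proj):
    """Two [n_exp, inter, dim] inputs -> one [n_exp, dim, 2*inter] output.

    Assumes gate columns come first and up columns second.
    """
    gate_t = transpose_last_two(gate_proj)  # [n_exp, dim, inter]
    up_t = transpose_last_two(up_proj)      # [n_exp, dim, inter]
    return [
        [g_row + u_row for g_row, u_row in zip(g_mat, u_mat)]
        for g_mat, u_mat in zip(gate_t, up_t)
    ]


def shape3(t):
    """Shape of a rectangular 3-D nested list."""
    return (len(t), len(t[0]), len(t[0][0]))


# Toy sizes: n_exp=2, inter=3, dim=2.
N_EXP, INTER, DIM = 2, 3, 2
gate_proj = [[[100 * e + 10 * i + d for d in range(DIM)] for i in range(INTER)]
             for e in range(N_EXP)]
up_proj = [[[-(100 * e + 10 * i + d) - 1 for d in range(DIM)] for i in range(INTER)]
           for e in range(N_EXP)]
down_proj = [[[100 * e + 10 * d + i for i in range(INTER)] for d in range(DIM)]
             for e in range(N_EXP)]  # HF layout [n_exp, dim, inter]

gate_and_up_projs = fuse_gate_up(gate_proj, up_proj)  # [n_exp, dim, 2*inter]
down_projs = transpose_last_two(down_proj)            # [n_exp, inter, dim]
```

After the conversion, element `[e][d][i]` of `gate_and_up_projs` holds `gate_proj[e][i][d]` for `i < inter`, and element `[e][d][inter + i]` holds `up_proj[e][i][d]`.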

```python
nemo_automodel.components.models.step3p5.state_dict_adapter.Step3p5StateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: typing.Optional[str] = None,
    quantization: bool = False,
    kwargs = {}
) -> dict[str, typing.Any]
```

Convert from native model state dict to HuggingFace format.

```python
nemo_automodel.components.models.step3p5.state_dict_adapter._create_dtensor_from_local_or_reference(
    local_tensor: torch.Tensor,
    reference_dtensor: typing.Optional[torch.Tensor],
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
    rank: typing.Optional[int] = None,
    placements_override: typing.Optional[tuple] = None
) -> torch.Tensor
```

Create a DTensor from a local tensor.

Prefers using reference\_dtensor's mesh/placements if available (for preserving
DTensor structure from DCP-loaded tensors). Falls back to creating a new DTensor
using device\_mesh if reference is not a DTensor.

**Parameters:**

* `local_tensor`: Local portion of the tensor after transformation
* `reference_dtensor`: Optional DTensor to copy mesh/placements from
* `device_mesh`: Device mesh for EP (used if reference is not DTensor)
* `rank`: Current rank for device placement
* `placements_override`: If provided, use these placements instead of the
  reference DTensor's placements. Useful after transposing the local
  tensor, where shard dimensions need to be swapped.

**Returns:** `torch.Tensor`

DTensor if mesh is available, otherwise local\_tensor

```python
nemo_automodel.components.models.step3p5.state_dict_adapter._swap_shard_placements_1_2(
    placements: tuple
) -> tuple
```

Swap Shard dim 1 and dim 2 in DTensor placements after a transpose(1, 2).

When we transpose a 3-D tensor's dims 1 and 2, any Shard placement on those
dims must be swapped so that `DTensor.from_local` infers the correct global
shape.  Without this, the shard multiplier is applied to the wrong axis.
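The shape arithmetic behind this can be illustrated without torch. In the sketch below, `Shard` and `Replicate` are minimal stand-ins for the placement types in `torch.distributed.tensor`, and `swap_shard_dims_1_2` and `infer_global_shape` are hypothetical helpers. The global-shape rule modeled here (multiply each sharded dimension by its mesh-axis size) assumes even sharding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Stand-in for torch.distributed.tensor.Shard (only the `dim` field)."""
    dim: int


@dataclass(frozen=True)
class Replicate:
    """Stand-in for torch.distributed.tensor.Replicate."""


def swap_shard_dims_1_2(placements):
    """Map Shard(1) <-> Shard(2); leave every other placement unchanged."""
    swap = {1: 2, 2: 1}
    return tuple(
        Shard(swap.get(p.dim, p.dim)) if isinstance(p, Shard) else p
        for p in placements
    )


def infer_global_shape(local_shape, placements, mesh_shape):
    """Multiply each sharded dim by its mesh-axis size (even sharding assumed)."""
    shape = list(local_shape)
    for placement, size in zip(placements, mesh_shape):
        if isinstance(placement, Shard):
            shape[placement.dim] *= size
    return tuple(shape)


# HF down_proj is [n_exp=8, dim=6, inter=16], sharded on dim 2 across 4 ranks.
mesh_shape = (4,)
hf_placements = (Shard(2),)
local_hf = (8, 6, 4)

# Transposing dims 1 and 2 of the local shard gives the native local layout.
local_native = (local_hf[0], local_hf[2], local_hf[1])  # (8, 4, 6)

# Reusing the old placements scales the wrong axis.
stale_global = infer_global_shape(local_native, hf_placements, mesh_shape)

# Swapping the shard dims restores the intended [n_exp, inter, dim] shape.
native_placements = swap_shard_dims_1_2(hf_placements)
fixed_global = infer_global_shape(local_native, native_placements, mesh_shape)
```

Here `stale_global` comes out as `(8, 4, 24)`, while `fixed_global` is the intended `(8, 16, 6)`.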

```python
nemo_automodel.components.models.step3p5.state_dict_adapter.logger = logging.getLogger(__name__)
```