# nemo_automodel.components.models.gemma4_moe.state_dict_adapter

State-dict adapter for Gemma4 MoE.

HF Gemma4 MoE (eevee-4 26B-A4B) stores expert weights as 3-D tensors:

* `layers.{L}.moe.gate_up_proj`: `[n_experts, 2*expert_inter_size, hidden_size]`
* `layers.{L}.moe.down_proj`: `[n_experts, hidden_size, expert_inter_size]`
* `layers.{L}.moe.per_expert_scale`: `[n_experts]`

NeMo uses a transposed layout with concatenated gate+up:

* `layers.{L}.moe.experts.gate_and_up_projs`: `[n_experts, hidden_size, 2*expert_inter_size]`
* `layers.{L}.moe.experts.down_projs`: `[n_experts, expert_inter_size, hidden_size]`
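The relationship between the two layouts can be illustrated with plain tensors. This is a minimal sketch that assumes the conversion is a swap of the last two dimensions, as the shapes above suggest; the sizes and variable names are made up for illustration and are not taken from the adapter's source.

```python
import torch

torch.manual_seed(0)
n_experts, hidden_size, expert_inter_size = 4, 8, 16  # illustrative sizes

# HF layout: [n_experts, 2*expert_inter_size, hidden_size]
hf_gate_up = torch.randn(n_experts, 2 * expert_inter_size, hidden_size)
# HF layout: [n_experts, hidden_size, expert_inter_size]
hf_down = torch.randn(n_experts, hidden_size, expert_inter_size)

# Swap the last two dimensions to obtain the NeMo layout.
nemo_gate_and_up = hf_gate_up.transpose(1, 2).contiguous()
nemo_down = hf_down.transpose(1, 2).contiguous()

# In the NeMo layout an expert is applied as x @ W; in the HF layout the
# same projection is x @ W.T, which is what torch.nn.functional.linear does.
x = torch.randn(3, hidden_size)
y_nemo = x @ nemo_gate_and_up[0]
y_hf = torch.nn.functional.linear(x, hf_gate_up[0])
```

Both products describe the same projection, so `y_nemo` and `y_hf` agree up to floating-point rounding.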

Additionally, the Gemma4 router is mapped to the NeMo Gemma4Gate:

* HF: `.router.proj.weight` / `.router.scale`
* NeMo: `.moe.gate.proj.weight` / `.moe.gate.scale`
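The router mapping amounts to a suffix rename in each direction. The sketch below shows one way to express it; `rename_router_key`, the two mapping dictionaries, and the `model.layers.0` prefix are hypothetical names chosen for illustration, not part of the adapter's API.

```python
# Hypothetical suffix tables built from the mapping described above.
HF_TO_NEMO = {
    ".router.proj.weight": ".moe.gate.proj.weight",
    ".router.scale": ".moe.gate.scale",
}
NEMO_TO_HF = {nemo: hf for hf, nemo in HF_TO_NEMO.items()}


def rename_router_key(key: str, mapping: dict[str, str]) -> str:
    """Replace a known router suffix; return other keys unchanged."""
    for old_suffix, new_suffix in mapping.items():
        if key.endswith(old_suffix):
            return key[: -len(old_suffix)] + new_suffix
    return key


# The layer prefix here is an assumption made for the example.
hf_key = "model.layers.0.router.proj.weight"
nemo_key = rename_router_key(hf_key, HF_TO_NEMO)
round_trip = rename_router_key(nemo_key, NEMO_TO_HF)
```

Matching on the suffix keeps the sketch independent of whatever prefix the checkpoint uses for its layers.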

The `per_expert_scale` is absorbed into `down_projs` during `from_hf`. When
saving back to HF, `per_expert_scale` is emitted as ones, because the scale is
already baked into the weights.
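The following sketch shows why folding the scale into the down projection preserves the output. It assumes that the scale multiplies each expert's output after the down projection; that assumption, along with the sizes and variable names, is illustrative and not confirmed by the adapter's source.

```python
import torch

torch.manual_seed(0)
n_experts, hidden_size, expert_inter_size = 4, 8, 16  # illustrative sizes

# NeMo layout: [n_experts, expert_inter_size, hidden_size]
down_projs = torch.randn(n_experts, expert_inter_size, hidden_size)
per_expert_scale = torch.rand(n_experts) + 0.5

# Fold the scale in: every entry of expert e's matrix is multiplied by scale[e].
folded_down_projs = down_projs * per_expert_scale.view(-1, 1, 1)

# Per-expert intermediate activations: [n_experts, tokens, expert_inter_size]
h = torch.randn(n_experts, 3, expert_inter_size)

# Assumed original computation: project, then scale the expert output.
scaled_after = torch.bmm(h, down_projs) * per_expert_scale.view(-1, 1, 1)
# Folded computation: the scale already lives in the weights.
scaled_inside = torch.bmm(h, folded_down_projs)

# When exporting, a scale of ones leaves the folded weights' output unchanged.
exported_scale = torch.ones(n_experts)
```

Because multiplication by a scalar commutes with the matrix product, the two results match up to rounding, which is why exporting ones for the scale is consistent with the folded weights.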

## Module Contents

### Classes

| Name                                                                                                                     | Description                                                     |
| ------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------- |
| [`Gemma4MoEStateDictAdapter`](#nemo_automodel-components-models-gemma4_moe-state_dict_adapter-Gemma4MoEStateDictAdapter) | Converts between HF Gemma4 MoE checkpoints and the NeMo format. |

### API

```python
class nemo_automodel.components.models.gemma4_moe.state_dict_adapter.Gemma4MoEStateDictAdapter(
    config: typing.Any,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.float32
)
```

**Bases:** [StateDictAdapter](/nemo-automodel/nemo_automodel/components/checkpoint/state_dict_adapter#nemo_automodel-components-checkpoint-state_dict_adapter-StateDictAdapter)

Converts between HF Gemma4 MoE checkpoints and the NeMo format.

```python
nemo_automodel.components.models.gemma4_moe.state_dict_adapter.Gemma4MoEStateDictAdapter._gather_expert_tensor(
    tensor: torch.Tensor,
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh],
    n_experts: int
) -> torch.Tensor
```

Gather an expert-parallel (EP) sharded expert tensor across ranks into a full tensor.
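The effect of the gather can be simulated in a single process. This sketch assumes that each rank owns a contiguous block of experts along dimension 0; it does not use a device mesh or any collective call, and `gather_expert_shards` is a hypothetical helper rather than the method documented here.

```python
import torch


def gather_expert_shards(shards: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate per-rank expert shards along the expert dimension (dim 0)."""
    return torch.cat(shards, dim=0)


n_experts, ep_size = 8, 4  # illustrative sizes
full = torch.arange(n_experts * 2 * 3, dtype=torch.float32).reshape(n_experts, 2, 3)

# Simulate EP sharding: each "rank" holds n_experts // ep_size experts.
shards = list(full.chunk(ep_size, dim=0))
gathered = gather_expert_shards(shards)
```

In a real multi-rank run the shards live on different processes, so a collective operation would take the place of the local `torch.cat`.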

```python
nemo_automodel.components.models.gemma4_moe.state_dict_adapter.Gemma4MoEStateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    kwargs = {}
) -> list[tuple[str, typing.Any]]
```

Convert a single native tensor back to HF format.

Handles per-tensor conversion for weight streaming (IPC refit) required in RL training:

* Router keys: `moe.gate.{proj.weight,scale}` -> `router.{proj.weight,scale}`
* Expert `gate_and_up_projs`: transpose `[E, hidden, 2*inter]` -> `[E, 2*inter, hidden]`
  and rename to `experts.gate_up_proj`
* Expert `down_projs`: transpose `[E, inter, hidden]` -> `[E, hidden, inter]`,
  rename to `experts.down_proj`, and emit `router.per_expert_scale` as ones
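The three cases above can be sketched as a standalone function that returns `(name, tensor)` pairs. This is an illustrative reimplementation under stated assumptions, not the adapter's code: `convert_single_tensor_sketch` is a hypothetical name, the key fragments are taken from the list above, and the `layers.0` prefix is made up for the example.

```python
import torch


def convert_single_tensor_sketch(fqn: str, tensor: torch.Tensor) -> list[tuple[str, torch.Tensor]]:
    """Map one native tensor to one or more (hf_name, hf_tensor) pairs."""
    gate_up_suffix = ".moe.experts.gate_and_up_projs"
    down_suffix = ".moe.experts.down_projs"

    if ".moe.gate." in fqn:
        # Router keys are renamed only; the tensor is passed through.
        return [(fqn.replace(".moe.gate.", ".router."), tensor)]
    if fqn.endswith(gate_up_suffix):
        # [E, hidden, 2*inter] -> [E, 2*inter, hidden]
        prefix = fqn[: -len(gate_up_suffix)]
        return [(prefix + ".experts.gate_up_proj", tensor.transpose(1, 2).contiguous())]
    if fqn.endswith(down_suffix):
        # [E, inter, hidden] -> [E, hidden, inter], plus a scale of ones.
        prefix = fqn[: -len(down_suffix)]
        ones = torch.ones(tensor.shape[0], dtype=tensor.dtype, device=tensor.device)
        return [
            (prefix + ".experts.down_proj", tensor.transpose(1, 2).contiguous()),
            (prefix + ".router.per_expert_scale", ones),
        ]
    # Any other tensor is returned unchanged in this sketch.
    return [(fqn, tensor)]


pairs = convert_single_tensor_sketch("layers.0.moe.experts.down_projs", torch.randn(4, 16, 8))
names = [name for name, _ in pairs]
```

Returning a list allows a single native tensor to produce more than one HF entry, as the `down_projs` case does.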

```python
nemo_automodel.components.models.gemma4_moe.state_dict_adapter.Gemma4MoEStateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
    kwargs = {}
) -> dict[str, typing.Any]
```

```python
nemo_automodel.components.models.gemma4_moe.state_dict_adapter.Gemma4MoEStateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: typing.Optional[str] = None,
    quantization: bool = False,
    kwargs = {}
) -> dict[str, typing.Any]
```