# nemo_automodel.components.moe.layers

## Module Contents

### Classes

| Name                                                                         | Description                                                                     |
| ---------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| [`FakeBalancedGate`](#nemo_automodel-components-moe-layers-FakeBalancedGate) | Load balanced gate implementation, spreads tokens uniformly across all experts. |
| [`Gate`](#nemo_automodel-components-moe-layers-Gate)                         | Gating mechanism for routing inputs in a mixture-of-experts (MoE) model.        |
| [`MLP`](#nemo_automodel-components-moe-layers-MLP)                           | Multi-Layer Perceptron (MLP) used as a feed-forward layer.                      |
| [`MoE`](#nemo_automodel-components-moe-layers-MoE)                           | Mixture-of-Experts (MoE) module.                                                |

### Functions

| Name                                                                   | Description |
| ---------------------------------------------------------------------- | ----------- |
| [`_init_weights`](#nemo_automodel-components-moe-layers-_init_weights) | -           |

### API

```python
class nemo_automodel.components.moe.layers.FakeBalancedGate(
    config: nemo_automodel.components.moe.config.MoEConfig,
    skip_first_n_experts: int = 0,
    noise: float = 0.0
)
```

**Bases:** `Module`

Load balanced gate implementation, spreads tokens uniformly across all experts.
The rationale for this class is to do performance experiments to understand
how the load imbalance with real data is impacting end-to-end performance.

When `noise > 0`, random perturbation is added to mimic realistic routing
imbalance. A noise value of 0.0 gives perfectly balanced assignment, while
1.0 gives fully random expert selection and non-uniform weights.
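The effect of `noise` can be pictured with a minimal sketch in plain Python. It is not the class's implementation: the function name `fake_balanced_indices`, the round-robin scheme, and the per-slot randomization are assumptions chosen only to show how 0.0 yields a perfectly even load while 1.0 yields fully random selection.

```python
import random
from collections import Counter


def fake_balanced_indices(num_tokens, num_experts, top_k, noise=0.0, seed=0):
    """Hypothetical helper: pick top_k experts per token in round-robin order.

    Each slot is replaced by a random expert with probability `noise`,
    so noise=0.0 is perfectly balanced and noise=1.0 is fully random.
    This sketch does not deduplicate experts within a token.
    """
    rng = random.Random(seed)
    indices = []
    for t in range(num_tokens):
        row = []
        for k in range(top_k):
            expert = (t * top_k + k) % num_experts  # balanced assignment
            if rng.random() < noise:
                expert = rng.randrange(num_experts)  # random perturbation
            row.append(expert)
        indices.append(row)
    return indices


indices = fake_balanced_indices(num_tokens=8, num_experts=4, top_k=2)
load = Counter(e for row in indices for e in row)
print(sorted(load.items()))  # every expert receives the same number of slots
```

With 8 tokens, 2 experts per token, and 4 experts, the 16 routing slots split evenly, 4 per expert, when `noise` is 0.0.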

```python
nemo_automodel.components.moe.layers.FakeBalancedGate.forward(
    x: torch.Tensor,
    token_mask: torch.Tensor,
    cp_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh]
) -> tuple[torch.Tensor, torch.Tensor, typing.Optional[torch.Tensor]]
```

Forward pass for the gating mechanism.

**Parameters:**

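`x` (`torch.Tensor`): Input tensor.

`token_mask` (`torch.Tensor`): Boolean mask indicating valid tokens.

`cp_mesh` (`typing.Optional[torch.distributed.device_mesh.DeviceMesh]`): Device mesh for context parallel computation.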

**Returns:** `tuple[torch.Tensor, torch.Tensor, typing.Optional[torch.Tensor]]`

A tuple that includes the routing weights for the selected experts.

```python
nemo_automodel.components.moe.layers.FakeBalancedGate.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

```python
nemo_automodel.components.moe.layers.FakeBalancedGate.update_bias() -> None
```

```python
class nemo_automodel.components.moe.layers.Gate(
    config: nemo_automodel.components.moe.config.MoEConfig,
    gate_precision: torch.dtype | None = None
)
```

**Bases:** `Module`

Gating mechanism for routing inputs in a mixture-of-experts (MoE) model.

```python
nemo_automodel.components.moe.layers.Gate._compute_aux_loss(
    original_scores: torch.Tensor,
    expert_load: torch.Tensor,
    token_mask: torch.Tensor,
    cp_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh]
) -> torch.Tensor
```

Computes the auxiliary loss for load balancing.

**Warning**: Assumes batch size = 1. If batch size > 1, the aux\_loss will
be computed across multiple sequences.

**Parameters:**

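`original_scores` (`torch.Tensor`): Original scores from the gating mechanism.
Shape is \[num\_tokens, num\_experts].

`expert_load` (`torch.Tensor`): Load of each expert (number of tokens routed to each expert).
Shape is \[num\_experts].

`token_mask` (`torch.Tensor`): Boolean mask indicating valid tokens.
Shape is \[num\_tokens].

`cp_mesh` (`typing.Optional[torch.distributed.device_mesh.DeviceMesh]`): Device mesh for context parallel computation.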

**Returns:** `torch.Tensor`

Auxiliary loss for load balancing.
Shape is \[] (a scalar).
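This page does not state the formula behind the auxiliary loss. As an illustration only, the sketch below uses one common formulation (the sum over experts of load fraction times mean score) on plain Python lists shaped like the documented arguments. The function name `aux_loss_sketch` and the `top_k` and `coeff` parameters are hypothetical and may not match what `_compute_aux_loss` actually does.

```python
def aux_loss_sketch(original_scores, expert_load, token_mask, top_k, coeff=1.0):
    """Hypothetical load-balancing loss: coeff * sum_i f_i * P_i.

    original_scores: [num_tokens][num_experts] gating scores
    expert_load:     [num_experts] tokens routed to each expert
    token_mask:      [num_tokens] True for valid tokens
    """
    num_experts = len(expert_load)
    valid = [row for row, keep in zip(original_scores, token_mask) if keep]
    num_valid = len(valid)
    if num_valid == 0:
        return 0.0
    # f_i: share of routing slots per expert, scaled so uniform routing gives 1.0
    f = [load * num_experts / (top_k * num_valid) for load in expert_load]
    # P_i: mean gating score per expert over valid tokens
    p = [sum(row[i] for row in valid) / num_valid for i in range(num_experts)]
    return coeff * sum(fi * pi for fi, pi in zip(f, p))


mask = [True, True, True, True]
uniform_scores = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
balanced = aux_loss_sketch(uniform_scores, [1, 1, 1, 1], mask, top_k=1)

skewed_scores = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]
collapsed = aux_loss_sketch(skewed_scores, [4, 0, 0, 0], mask, top_k=1)
print(balanced, collapsed)  # the collapsed routing is penalized more
```

Because the tokens are treated as one flat list, a batch holding several sequences would be balanced jointly rather than per sequence, which is the situation the warning above describes.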

```python
nemo_automodel.components.moe.layers.Gate._compute_expert_load(
    indices: torch.Tensor,
    token_mask: torch.Tensor
) -> torch.Tensor
```

Computes the load of each expert based on the selected indices.

**Parameters:**

`indices` (`torch.Tensor`): Indices of the selected experts.
Shape is \[num\_tokens, num\_activated\_experts].

`token_mask` (`torch.Tensor`): Boolean mask indicating valid tokens.
Shape is \[num\_tokens].

**Returns:** `torch.Tensor`

Load of each expert (number of tokens routed to each expert).
Shape is \[num\_local\_experts].
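The counting described above can be sketched on plain Python lists. This is an illustrative stand-in, not the method's tensor implementation; the function name and the explicit `num_experts` argument are assumptions.

```python
def compute_expert_load(indices, token_mask, num_experts):
    """Count how many valid tokens were routed to each expert.

    indices:    [num_tokens][num_activated_experts] selected expert ids
    token_mask: [num_tokens] True for valid tokens, False for padding
    """
    load = [0] * num_experts
    for row, keep in zip(indices, token_mask):
        if not keep:
            continue  # masked tokens do not contribute to any expert's load
        for expert in row:
            load[expert] += 1
    return load


indices = [[0, 2], [1, 2], [0, 3], [3, 3]]
token_mask = [True, True, True, False]  # the last token is padding
load = compute_expert_load(indices, token_mask, num_experts=4)
print(load)  # [2, 1, 2, 1]
```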

```python
nemo_automodel.components.moe.layers.Gate.forward(
    x: torch.Tensor,
    token_mask: torch.Tensor,
    cp_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh]
) -> tuple[torch.Tensor, torch.Tensor, typing.Optional[torch.Tensor]]
```

Forward pass for the gating mechanism.

**Parameters:**

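`x` (`torch.Tensor`): Input tensor.

`token_mask` (`torch.Tensor`): Boolean mask indicating valid tokens.

`cp_mesh` (`typing.Optional[torch.distributed.device_mesh.DeviceMesh]`): Device mesh for context parallel computation.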

**Returns:** `tuple[torch.Tensor, torch.Tensor, typing.Optional[torch.Tensor]]`

A tuple that includes the routing weights for the selected experts.

```python
nemo_automodel.components.moe.layers.Gate.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

```python
nemo_automodel.components.moe.layers.Gate.update_bias() -> None
```

Updates the correction bias used in the gate based on the popularity of experts.
This function is a no-op if the gate is not trained.

To avoid routing collapse, and to promote better load balance of experts,
DeepSeek-V3 uses a correction mechanism to adjust the scores of experts using
a learned bias parameter. The bias parameter is updated based on the popularity
of experts, i.e., the number of tokens routed to each expert. If an expert is
more popular than the average, its bias term is decreased, and vice versa.
This encourages the model to route tokens to less popular experts, promoting
better load balance.
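The update rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the fixed step size `update_rate` and the sign-based adjustment are hypothetical choices, and the function name is not part of the library.

```python
def update_bias_sketch(bias, expert_load, update_rate=0.001):
    """Nudge each expert's bias toward a balanced load.

    Experts with above-average load get a lower bias, experts with
    below-average load get a higher bias, and experts at the average
    are left unchanged.
    """
    mean_load = sum(expert_load) / len(expert_load)
    new_bias = []
    for b, load in zip(bias, expert_load):
        if load > mean_load:
            b -= update_rate  # popular expert: make it less attractive
        elif load < mean_load:
            b += update_rate  # unpopular expert: make it more attractive
        new_bias.append(b)
    return new_bias


# Expert 0 is overloaded (10 tokens against a mean of 5), the rest are underused.
new_bias = update_bias_sketch([0.0, 0.0, 0.0, 0.0], [10, 2, 4, 4], update_rate=0.5)
print(new_bias)  # [-0.5, 0.5, 0.5, 0.5]
```

A large `update_rate` is used here only to keep the printed values easy to read.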

```python
class nemo_automodel.components.moe.layers.MLP(
    dim: int,
    inter_dim: int,
    backend: str,
    dtype: torch.dtype = torch.bfloat16,
    activation: str = 'swiglu',
    bias: bool = False,
    swiglu_limit: float = 0.0
)
```

**Bases:** `Module`

Multi-Layer Perceptron (MLP) used as a feed-forward layer.

Supports both gated activations (SwiGLU) and simple activations (ReLU²).
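For reference, the two activation families can be written element-wise in plain Python. This sketch assumes the usual definitions (SwiGLU as `silu(gate) * up`, ReLU² as `max(0, x) ** 2`). The names `gate` and `up` are illustrative, and the role of `swiglu_limit` is not covered here because this page does not describe it.

```python
import math


def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate, up):
    """Gated activation: one projection gates the other element-wise."""
    return [silu(g) * u for g, u in zip(gate, up)]


def relu2(x):
    """Simple (ungated) activation: squared ReLU."""
    return [max(0.0, v) ** 2 for v in x]


print(swiglu([0.0, 2.0], [3.0, 1.0]))  # first entry is 0.0 because silu(0) == 0
print(relu2([-1.0, 2.0]))              # [0.0, 4.0]
```

The practical difference is that a gated activation consumes two projections of the input, while a simple activation consumes one.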

```python
nemo_automodel.components.moe.layers.MLP.forward(
    x: torch.Tensor
) -> torch.Tensor
```

Forward pass for the MLP layer.

**Parameters:**

`x` (`torch.Tensor`): Input tensor.

**Returns:** `torch.Tensor`

Output tensor after MLP computation.

```python
nemo_automodel.components.moe.layers.MLP.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

```python
class nemo_automodel.components.moe.layers.MoE(
    config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

Mixture-of-Experts (MoE) module.

```python
nemo_automodel.components.moe.layers.MoE.forward(
    x: torch.Tensor,
    padding_mask: typing.Optional[torch.Tensor] = None,
    cp_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
) -> tuple[torch.Tensor, typing.Optional[torch.Tensor]]
```

Forward pass for the MoE module.

**Parameters:**

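`x` (`torch.Tensor`): Input tensor.

`padding_mask` (`typing.Optional[torch.Tensor]`): Boolean mask indicating padding positions.

`cp_mesh` (`typing.Optional[torch.distributed.device_mesh.DeviceMesh]`): Device mesh for context parallel computation.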

**Returns:** `tuple[torch.Tensor, typing.Optional[torch.Tensor]]`

A tuple that includes the output tensor after expert routing and computation.

```python
nemo_automodel.components.moe.layers.MoE.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

```python
nemo_automodel.components.moe.layers._init_weights(
    module,
    buffer_device: torch.device,
    init_std: float = 0.02
)
```