# nemo_automodel.components.moe.experts

## Module Contents

### Classes

| Name                                                                                          | Description                                                              |
| --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| [`GroupedExperts`](#nemo_automodel-components-moe-experts-GroupedExperts)                     | Sparse MoE implementation using all-gather/reduce-scatter primitives.    |
| [`GroupedExpertsDeepEP`](#nemo_automodel-components-moe-experts-GroupedExpertsDeepEP)         | Sparse MoE implementation using grouped GEMM with DeepEP token dispatch. |
| [`GroupedExpertsTE`](#nemo_automodel-components-moe-experts-GroupedExpertsTE)                 | MoE experts using TE's GroupedLinear module directly.                    |
| [`_AllGatherConcatVarlenFn`](#nemo_automodel-components-moe-experts-_AllGatherConcatVarlenFn) | All-gather with variable local lengths and autograd-safe backward.       |

### Functions

| Name                                                                                                          | Description                                                                   |
| ------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| [`_apply_bias`](#nemo_automodel-components-moe-experts-_apply_bias)                                           | Apply per-expert bias to grouped GEMM output.                                 |
| [`_init_weights`](#nemo_automodel-components-moe-experts-_init_weights)                                       | -                                                                             |
| [`_permute_tokens_for_grouped_mm`](#nemo_automodel-components-moe-experts-_permute_tokens_for_grouped_mm)     | Permute tokens by expert assignment and compute offs for torch.\_grouped\_mm. |
| [`_torch_mm_experts_fwd`](#nemo_automodel-components-moe-experts-_torch_mm_experts_fwd)                       | -                                                                             |
| [`get_expert_activation_for_deepep`](#nemo_automodel-components-moe-experts-get_expert_activation_for_deepep) | Return the DeepEP expert activation function selected by the MoE config.      |
| [`is_gated_activation`](#nemo_automodel-components-moe-experts-is_gated_activation)                           | Check if activation requires gating (gate\_proj + up\_proj).                  |
| [`quick_geglu_deepep`](#nemo_automodel-components-moe-experts-quick_geglu_deepep)                             | Apply DeepEP Quick-GEGLU activation and routing probabilities.                |
| [`relu2_deepep`](#nemo_automodel-components-moe-experts-relu2_deepep)                                         | ReLU² activation for DeepEP: relu(x)^2                                        |
| [`swiglu_clamped_deepep`](#nemo_automodel-components-moe-experts-swiglu_clamped_deepep)                       | Clamped SwiGLU (DeepSeek V4 style) for DeepEP.                                |
| [`swiglu_oai_deepep`](#nemo_automodel-components-moe-experts-swiglu_oai_deepep)                               | SwiGLU-OAI (GPT-OSS / MiniMax-M3) activation for grouped experts.             |

### API

```python
class nemo_automodel.components.moe.experts.GroupedExperts(
    config: nemo_automodel.components.moe.config.MoEConfig,
    backend: typing.Optional[nemo_automodel.components.models.common.utils.BackendConfig] = None
)
```

**Bases:** `Module`

Sparse MoE implementation using all-gather/reduce-scatter primitives.

Supports two compute backends:

* Per-expert loop with gather/scatter (default)
* torch.\_grouped\_mm with argsort-based permutation (backend.experts="torch\_mm")

```python
nemo_automodel.components.moe.experts.GroupedExperts._forward_grouped_mm(
    x,
    token_mask,
    weights,
    indices,
    gate_and_up_projs,
    down_projs,
    gate_up_proj_bias,
    down_proj_bias,
    n_local_experts,
    experts_start_idx
)
```

Grouped GEMM forward path using torch.\_grouped\_mm.

```python
nemo_automodel.components.moe.experts.GroupedExperts._forward_loop(
    x,
    weights,
    indices,
    token_mask,
    gate_and_up_projs,
    down_projs,
    gate_up_proj_bias,
    down_proj_bias,
    n_local_experts,
    experts_start_idx,
    experts_end_idx
)
```

Per-expert loop forward path using gather/scatter.
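As a rough illustration of what a per-expert loop with gather/scatter does, the sketch below uses plain Python lists in place of tensors so it runs without PyTorch. The function name `moe_forward_loop` and the toy `experts` callables are hypothetical and are not part of this module; the real method also handles local expert ranges and biases, which are omitted here.

```python
def moe_forward_loop(x, token_mask, weights, indices, experts):
    """Reference per-expert loop: gather routed rows, run the expert, scatter-add.

    x:          [num_tokens][model_dim] nested list
    token_mask: [num_tokens] booleans; False rows are skipped
    weights:    [num_tokens][top_k] routing weights
    indices:    [num_tokens][top_k] selected expert ids
    experts:    list of callables mapping one row to one row (hypothetical stand-ins)
    """
    num_tokens, model_dim = len(x), len(x[0])
    out = [[0.0] * model_dim for _ in range(num_tokens)]
    for expert_id, expert in enumerate(experts):
        # Gather: every (token, slot) pair that selected this expert.
        routed = [
            (t, k)
            for t in range(num_tokens)
            if token_mask[t]
            for k, idx in enumerate(indices[t])
            if idx == expert_id
        ]
        if not routed:
            continue
        expert_out = [expert(x[t]) for t, _ in routed]
        # Scatter-add: weight each output and accumulate into the token's row.
        for (t, k), y in zip(routed, expert_out):
            for d in range(model_dim):
                out[t][d] += weights[t][k] * y[d]
    return out


def double(row):
    return [2.0 * v for v in row]


def negate(row):
    return [-v for v in row]


out = moe_forward_loop(
    x=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    token_mask=[True, True, False],
    weights=[[0.75, 0.25], [0.5, 0.5], [1.0, 0.0]],
    indices=[[0, 1], [1, 0], [0, 1]],
    experts=[double, negate],
)
print(out)  # [[1.25, 2.5], [1.5, 2.0], [0.0, 0.0]]
```

The masked third token contributes nothing, so its output row stays at zero.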

```python
nemo_automodel.components.moe.experts.GroupedExperts.forward(
    x: torch.Tensor,
    token_mask: torch.Tensor,
    weights: torch.Tensor,
    indices: torch.Tensor
) -> torch.Tensor
```

Forward pass for the grouped experts.

**Parameters:**

* `x`: Input tensor. Shape is \[num\_tokens, model\_dim].
* `token_mask`: Boolean mask indicating valid tokens. Shape is \[num\_tokens].
* `weights`: Routing weights for the selected experts. Shape is \[num\_tokens, num\_activated\_experts].
* `indices`: Indices of the selected experts. Shape is \[num\_tokens, num\_activated\_experts].

**Returns:** `torch.Tensor`

Output tensor after expert computation. Shape is \[num\_tokens, model\_dim].

```python
nemo_automodel.components.moe.experts.GroupedExperts.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

```python
class nemo_automodel.components.moe.experts.GroupedExpertsDeepEP(
    config: nemo_automodel.components.moe.config.MoEConfig,
    backend: typing.Optional[nemo_automodel.components.models.common.utils.BackendConfig] = None,
    dispatcher_backend: str = 'deepep',
    dispatcher_num_sms: int = 20,
    dispatcher_share_token_dispatcher: bool = True,
    dispatcher_async_dispatch: bool = False
)
```

**Bases:** `Module`

Sparse MoE implementation using grouped GEMM with DeepEP token dispatch.

Supports two GEMM backends via BackendConfig.experts:

* grouped\_gemm.ops.gmm (experts="gmm", default)
* torch.\_grouped\_mm (experts="torch\_mm", no external dependency)

Once the experts for a particular token have been identified, this module
is invoked to compute and average the output of the activated experts.

```python
nemo_automodel.components.moe.experts.GroupedExpertsDeepEP._init_deepep_buffer(
    ep_group: torch.distributed.ProcessGroup
) -> None
```

Initialize DeepEP communication buffers before activation checkpointing.

```python
nemo_automodel.components.moe.experts.GroupedExpertsDeepEP.forward(
    x: torch.Tensor,
    token_mask: torch.Tensor,
    weights: torch.Tensor,
    indices: torch.Tensor
) -> torch.Tensor
```

Forward pass for the grouped experts.

**Parameters:**

* `x`: Input tensor. Shape is \[num\_tokens, model\_dim].
* `token_mask`: Boolean mask indicating valid tokens. Shape is \[num\_tokens].
* `weights`: Routing weights for the selected experts. Shape is \[num\_tokens, num\_activated\_experts].
* `indices`: Indices of the selected experts. Shape is \[num\_tokens, num\_activated\_experts].

**Returns:** `torch.Tensor`

Output tensor after expert computation. Shape is \[num\_tokens, model\_dim].

```python
nemo_automodel.components.moe.experts.GroupedExpertsDeepEP.init_token_dispatcher(
    ep_mesh: torch.distributed.device_mesh.DeviceMesh
)
```

```python
nemo_automodel.components.moe.experts.GroupedExpertsDeepEP.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

```python
class nemo_automodel.components.moe.experts.GroupedExpertsTE(
    config: nemo_automodel.components.moe.config.MoEConfig,
    backend: typing.Optional[nemo_automodel.components.models.common.utils.BackendConfig] = None,
    dispatcher_backend: str = 'deepep',
    dispatcher_num_sms: int = 20,
    dispatcher_share_token_dispatcher: bool = True,
    dispatcher_async_dispatch: bool = False
)
```

**Bases:** `Module`

MoE experts using TE's GroupedLinear module directly.

Uses the native GroupedLinear module from Transformer Engine (TE) for computation, which provides optimized grouped GEMM kernels.

For expert parallelism, each rank creates GroupedLinear with
num\_local\_experts = n\_routed\_experts / ep\_size.
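The arithmetic above can be sketched as follows. The helper name `local_expert_range` is hypothetical, and the contiguous assignment of experts to ranks is an assumption made for illustration; this page only states the division itself.

```python
def local_expert_range(n_routed_experts: int, ep_size: int, ep_rank: int):
    """Return (num_local_experts, start_idx, end_idx) for one expert-parallel rank.

    Assumes experts are split into equal contiguous blocks, one block per rank.
    """
    if n_routed_experts % ep_size != 0:
        raise ValueError("n_routed_experts must be divisible by ep_size")
    num_local_experts = n_routed_experts // ep_size
    start_idx = ep_rank * num_local_experts
    return num_local_experts, start_idx, start_idx + num_local_experts


# 64 routed experts over 8 ranks: rank 3 would own experts 24..31 (end exclusive).
print(local_expert_range(64, 8, 3))  # (8, 24, 32)
```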

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE._get_stacked_bias(
    linear: transformer_engine.pytorch.GroupedLinear
) -> typing.Optional[torch.Tensor]
```

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE._get_stacked_weight(
    linear: transformer_engine.pytorch.GroupedLinear,
    transpose: bool = False
) -> torch.Tensor
```

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE._load_from_state_dict(
    state_dict: typing.Dict[str, typing.Any],
    prefix: str,
    local_metadata,
    strict,
    missing_keys,
    unexpected_keys,
    error_msgs
)
```

Load state dict with stacked tensors in DeepEP format.

Converts stacked format to TE GroupedLinear's weight\{i} parameters:

* gate\_and\_up\_projs: \[num\_local\_experts, dim, moe\_inter\_dim \* 2]
* down\_projs: \[num\_local\_experts, moe\_inter\_dim, dim]

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE._normalize_moe_mesh(
    moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh]
) -> typing.Optional[torch.distributed.device_mesh.DeviceMesh]
```

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE._set_stacked_bias(
    linear: transformer_engine.pytorch.GroupedLinear,
    stacked: torch.Tensor
)
```

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE._set_stacked_weight(
    linear: transformer_engine.pytorch.GroupedLinear,
    stacked: torch.Tensor,
    transpose: bool = False
)
```

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE._to_ep_dtensor(
    tensor: torch.Tensor
) -> torch.Tensor
```

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE.forward(
    x: torch.Tensor,
    token_mask: torch.Tensor,
    weights: torch.Tensor,
    indices: torch.Tensor
) -> torch.Tensor
```

Forward pass using TE's GroupedLinear with native FP8 support.

**Parameters:**

* `x`: \[num\_tokens, model\_dim] input tensor
* `token_mask`: \[num\_tokens] boolean mask for valid tokens
* `weights`: \[num\_tokens, num\_activated\_experts] routing weights
* `indices`: \[num\_tokens, num\_activated\_experts] expert indices

**Returns:** `torch.Tensor`

\[num\_tokens, model\_dim] output tensor

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE.init_token_dispatcher(
    ep_mesh: torch.distributed.device_mesh.DeviceMesh,
    moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
)
```

Initialize the token dispatcher for expert parallelism.

Called by the parallelizer after model initialization.

**Parameters:**

* `ep_mesh`: Device mesh for expert parallelism.

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

Initialize weights using reset\_parameters()

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE.set_moe_mesh(
    moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh]
) -> None
```

```python
nemo_automodel.components.moe.experts.GroupedExpertsTE.state_dict(
    args = (),
    destination = None,
    prefix = '',
    keep_vars = False,
    kwargs = {}
) -> typing.Dict[str, typing.Any]
```

Return state dict with stacked tensors in DeepEP format.

Converts TE GroupedLinear's weight\{i} parameters to stacked format:

* gate\_and\_up\_projs: \[num\_local\_experts, dim, moe\_inter\_dim \* 2]
* down\_projs: \[num\_local\_experts, moe\_inter\_dim, dim]

When EP is enabled, returns DTensors sharded on dimension 0.

```python
class nemo_automodel.components.moe.experts._AllGatherConcatVarlenFn()
```

**Bases:** `Function`

All-gather with variable local lengths and autograd-safe backward.

Backward uses all-reduce + local narrow instead of reduce-scatter to avoid
monitoredBarrier deadlocks observed with mixed FSDP/EP backward collective ordering.
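To make the forward and backward data flow concrete, here is a single-process simulation with plain Python lists, where each list element of the outer list plays the role of one rank. It is only a sketch: the function names are hypothetical, padding each rank to `max_len` before gathering is an assumption inferred from the `forward` signature, and no real collectives are issued.

```python
def all_gather_concat_varlen(local_per_rank, gathered_lens, max_len):
    """Simulated forward: pad each rank to max_len, gather, trim, concatenate."""
    padded = [rows + [0.0] * (max_len - len(rows)) for rows in local_per_rank]
    out = []
    for buf, n in zip(padded, gathered_lens):
        out.extend(buf[:n])  # drop the padding again before concatenating
    return out


def backward_allreduce_narrow(grad_output_per_rank, gathered_lens, rank):
    """Simulated backward: all-reduce (sum) the full gradient, then narrow locally."""
    # All-reduce: every rank ends up with the element-wise sum over all ranks.
    total = [sum(col) for col in zip(*grad_output_per_rank)]
    # Narrow: keep only the slice that corresponds to this rank's local rows.
    offset = sum(gathered_lens[:rank])
    return total[offset:offset + gathered_lens[rank]]


lens = [2, 1, 3]
gathered = all_gather_concat_varlen([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]], lens, max_len=3)
print(gathered)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

grads = [
    [1, 2, 3, 4, 5, 6],        # grad_output seen on rank 0
    [10, 20, 30, 40, 50, 60],  # rank 1
    [100, 200, 300, 400, 500, 600],  # rank 2
]
print(backward_allreduce_narrow(grads, lens, rank=2))  # [444, 555, 666]
```

A reduce-scatter would deliver the same local slice with less traffic; the all-reduce plus narrow form trades that bandwidth for a collective whose participation pattern is identical on every rank.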

```python
nemo_automodel.components.moe.experts._AllGatherConcatVarlenFn.backward(
    ctx,
    grad_output: torch.Tensor
)
```

staticmethod

```python
nemo_automodel.components.moe.experts._AllGatherConcatVarlenFn.forward(
    ctx,
    local_tensor: torch.Tensor,
    group: torch.distributed.ProcessGroup,
    gathered_lens: list[int],
    max_len: int
)
```

staticmethod

```python
nemo_automodel.components.moe.experts._apply_bias(
    value,
    bias,
    tokens_per_expert,
    permuted_probs = None
)
```

Apply per-expert bias to grouped GEMM output.

NOTE: torch.\_grouped\_mm accepts a `bias` kwarg in its schema but raises
"RuntimeError: Bias not supported yet" as of PyTorch 2.9.0.
Additionally, down projection bias needs weighting by routing probs
(bias \* permuted\_probs) which native bias support wouldn't handle.

**Parameters:**

* `value`: Output from grouped GEMM, shape \[total\_tokens, features].
* `bias`: Per-expert bias, shape \[num\_experts, features].
* `tokens_per_expert`: Token counts per expert.
* `permuted_probs`: If provided, bias is weighted by routing probs (for down projection).
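The behaviour described above can be sketched with nested lists. This is an illustrative reference, not the module's code: the name `apply_bias_reference` is hypothetical, and it assumes the rows of `value` are already sorted so that each expert's tokens form one contiguous run.

```python
def apply_bias_reference(value, bias, tokens_per_expert, permuted_probs=None):
    """Add each expert's bias to its contiguous run of rows in `value`.

    value:             [total_tokens][features], rows grouped by expert (assumed)
    bias:              [num_experts][features]
    tokens_per_expert: [num_experts] row counts per expert
    permuted_probs:    optional [total_tokens] routing probs that scale the bias
    """
    out = []
    row = 0
    for expert_id, count in enumerate(tokens_per_expert):
        for _ in range(count):
            scale = 1.0 if permuted_probs is None else permuted_probs[row]
            out.append([v + scale * b for v, b in zip(value[row], bias[expert_id])])
            row += 1
    return out


value = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
bias = [[10.0, 20.0], [100.0, 200.0]]
print(apply_bias_reference(value, bias, [2, 1]))
# [[11.0, 21.0], [12.0, 22.0], [103.0, 203.0]]
print(apply_bias_reference(value, bias, [2, 1], permuted_probs=[0.5, 1.0, 0.25]))
# [[6.0, 11.0], [12.0, 22.0], [28.0, 53.0]]
```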

```python
nemo_automodel.components.moe.experts._init_weights(
    module,
    buffer_device: torch.device,
    init_std: float = 0.02
)
```

```python
nemo_automodel.components.moe.experts._permute_tokens_for_grouped_mm(
    indices: torch.Tensor,
    weights: torch.Tensor,
    token_mask: torch.Tensor,
    n_local_experts: int,
    experts_start_idx: int
)
```

Permute tokens by expert assignment and compute offs for torch.\_grouped\_mm.

Takes the raw router outputs and produces sorted token IDs, routing weights,
tokens\_per\_expert counts, and cumulative offsets ready for grouped GEMM.

**Returns:**

Token indices sorted by expert assignment.
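A plain-Python sketch of this kind of permutation is shown below. The function name is hypothetical, a stable sort stands in for the argsort, and treating the offsets as cumulative end positions of each expert's group is an assumption about what `torch._grouped_mm` expects rather than something stated on this page.

```python
from itertools import accumulate


def permute_tokens_reference(indices, weights, token_mask, n_local_experts, experts_start_idx):
    """Sort (token, expert) assignments by local expert id and count group sizes."""
    pairs = []  # (local_expert_id, token_id, routing_weight)
    for t, (row_idx, row_w) in enumerate(zip(indices, weights)):
        if not token_mask[t]:
            continue  # padded or invalid token
        for expert_id, w in zip(row_idx, row_w):
            local = expert_id - experts_start_idx
            if 0 <= local < n_local_experts:  # keep only experts owned locally
                pairs.append((local, t, w))
    pairs.sort(key=lambda p: p[0])  # stable, so token order is kept within an expert
    sorted_token_ids = [t for _, t, _ in pairs]
    sorted_weights = [w for _, _, w in pairs]
    tokens_per_expert = [0] * n_local_experts
    for local, _, _ in pairs:
        tokens_per_expert[local] += 1
    offs = list(accumulate(tokens_per_expert))  # cumulative end offset per expert
    return sorted_token_ids, sorted_weights, tokens_per_expert, offs


ids, w, counts, offs = permute_tokens_reference(
    indices=[[3, 2], [1, 2], [2, 0], [2, 3]],
    weights=[[0.6, 0.4], [0.7, 0.3], [0.9, 0.1], [0.5, 0.5]],
    token_mask=[True, True, True, False],
    n_local_experts=2,
    experts_start_idx=2,  # this rank owns global experts 2 and 3
)
print(ids, w, counts, offs)  # [0, 1, 2, 0] [0.4, 0.3, 0.9, 0.6] [3, 1] [3, 4]
```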

```python
nemo_automodel.components.moe.experts._torch_mm_experts_fwd(
    hidden_states,
    gate_and_up_projs,
    down_projs,
    tokens_per_expert,
    permuted_probs,
    activation_fn,
    use_mxfp8 = False
)
```

```python
nemo_automodel.components.moe.experts.get_expert_activation_for_deepep(
    config: nemo_automodel.components.moe.config.MoEConfig
)
```

Return the DeepEP expert activation function selected by the MoE config.

```python
nemo_automodel.components.moe.experts.is_gated_activation(
    activation: str
) -> bool
```

Check if activation requires gating (gate\_proj + up\_proj).

Gated activations (SwiGLU, Quick-GEGLU) use both gate\_proj and up\_proj,
requiring a gate\_and\_up\_projs tensor with shape \[n\_experts, dim, 2\*inter\_dim].

Non-gated activations (ReLU²) only use up\_proj, requiring an up\_projs tensor
with shape \[n\_experts, dim, inter\_dim], which is half the size of the gated tensor.
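The size difference follows directly from the two shapes. A small sketch, with a hypothetical helper name and arbitrary example dimensions:

```python
def up_projection_numel(n_experts: int, dim: int, inter_dim: int, gated: bool) -> int:
    """Element count of the first expert projection for gated vs. non-gated activations."""
    last_dim = 2 * inter_dim if gated else inter_dim
    return n_experts * dim * last_dim


gated = up_projection_numel(n_experts=8, dim=1024, inter_dim=4096, gated=True)
non_gated = up_projection_numel(n_experts=8, dim=1024, inter_dim=4096, gated=False)
print(non_gated / gated)  # 0.5
```

The saving applies to this first projection only; the down projection has the same shape in both cases.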

```python
nemo_automodel.components.moe.experts.quick_geglu_deepep(
    x,
    permuted_probs,
    alpha: float = 1.702,
    limit: float = 7.0,
    linear_offset: float = 1.0
)
```

Apply DeepEP Quick-GEGLU activation and routing probabilities.

```python
nemo_automodel.components.moe.experts.relu2_deepep(
    x,
    permuted_probs
)
```

ReLU² activation for DeepEP: relu(x)^2

For DeepEP with ReLU², x is the output of the up projection (already computed).
x already has shape \[..., inter\_dim] from efficient up\_proj.
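A minimal list-based sketch of the formula follows. Scaling each row by its entry in `permuted_probs` is an assumption based on the function signature and on how the other DeepEP activations on this page are described; the function name is hypothetical.

```python
def relu2_reference(x, permuted_probs):
    """relu(x)^2 per element, with each row scaled by its routing prob (assumed)."""
    return [
        [max(v, 0.0) ** 2 * p for v in row]
        for row, p in zip(x, permuted_probs)
    ]


print(relu2_reference([[-1.0, 2.0], [3.0, -4.0]], [0.5, 1.0]))
# [[0.0, 2.0], [9.0, 0.0]]
```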

```python
nemo_automodel.components.moe.experts.swiglu_clamped_deepep(
    x,
    permuted_probs,
    limit: float
)
```

Clamped SwiGLU (DeepSeek V4 style) for DeepEP.

Gate is clamped at `max=limit` and up at `(-limit, +limit)` in FP32
before `silu(gate) * up`; the result is multiplied by the permuted
routing probs and cast back. Matches the official V4 Expert.forward:

    gate = self.w1(x).float()
    up   = self.w3(x).float()
    if self.swiglu_limit > 0:
        up   = torch.clamp(up,   min=-swiglu_limit, max=swiglu_limit)
        gate = torch.clamp(gate,                     max=swiglu_limit)
    y = F.silu(gate) * up

`x` has shape `[..., 2 * inter_dim]` with gate in the first half
and up in the second half (same layout as `weighted_bias_swiglu_impl`).
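The clamping and layout described above can be checked with a scalar reference in plain Python. This is a sketch for one row, not the module's implementation: the function names are hypothetical, and dtype handling (the FP32 upcast and cast back) is left out because Python floats have a single precision.

```python
import math


def silu(v: float) -> float:
    """Numerically stable SiLU: v * sigmoid(v)."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def swiglu_clamped_reference(row, prob, limit):
    """Clamped SwiGLU for one row laid out as [gate | up], scaled by one routing prob."""
    half = len(row) // 2
    gate, up = row[:half], row[half:]
    if limit > 0:
        gate = [min(g, limit) for g in gate]             # upper clamp only
        up = [max(-limit, min(u, limit)) for u in up]    # symmetric clamp
    return [silu(g) * u * prob for g, u in zip(gate, up)]


# Values beyond the limit behave exactly like values at the limit.
a = swiglu_clamped_reference([0.0, 100.0, 5.0, -50.0], prob=0.5, limit=10.0)
b = swiglu_clamped_reference([0.0, 10.0, 5.0, -10.0], prob=0.5, limit=10.0)
print(a == b)  # True
```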

```python
nemo_automodel.components.moe.experts.swiglu_oai_deepep(
    x,
    permuted_probs,
    alpha: float = 1.702,
    limit: float = 7.0
)
```

SwiGLU-OAI (GPT-OSS / MiniMax-M3) activation for grouped experts.

Computes `gate * sigmoid(alpha * gate) * (up + 1)` in fp32 with gate
clamped `max=limit` and up clamped `+/-limit` (when `limit > 0`).

Unlike `quick_geglu_deepep` (which expects an *interleaved* gate/up
layout, `x[..., ::2]` / `x[..., 1::2]`), this reads the *concatenated*
`[gate | up]` layout produced by `MoESplitExpertsStateDictMixin`
(`torch.cat([gate_t, up_t], dim=-1)`), matching sglang's
`swiglu_no_interleaved_with_alpha_and_limit`.
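The difference between the two layouts, and the formula itself, can be illustrated on a single row with plain Python. The helper names are hypothetical, and weighting by `permuted_probs` is omitted because this page does not describe how the function applies it.

```python
import math


def sigmoid(v: float) -> float:
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def split_concatenated(row):
    """[gate | up]: first half is gate, second half is up."""
    half = len(row) // 2
    return row[:half], row[half:]


def split_interleaved(row):
    """gate at even positions, up at odd positions."""
    return row[::2], row[1::2]


def swiglu_oai_reference(row, alpha=1.702, limit=7.0):
    """gate * sigmoid(alpha * gate) * (up + 1) on a concatenated [gate | up] row."""
    gate, up = split_concatenated(row)
    if limit > 0:
        gate = [min(g, limit) for g in gate]
        up = [max(-limit, min(u, limit)) for u in up]
    return [g * sigmoid(alpha * g) * (u + 1.0) for g, u in zip(gate, up)]


print(split_concatenated([1, 2, 3, 4]))  # ([1, 2], [3, 4])
print(split_interleaved([1, 2, 3, 4]))   # ([1, 3], [2, 4])
```

Feeding a row in the wrong layout would pair each gate with the wrong up value without raising an error, so the layout has to match how the weights were stacked.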