# nemo_automodel.components.moe.megatron.moe_utils

## Module Contents

### Classes

| Name                                                                                                                 | Description                                                                           |
| -------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| [`MoEAuxLossAutoScaler`](#nemo_automodel-components-moe-megatron-moe_utils-MoEAuxLossAutoScaler)                     | An AutoScaler that triggers the backward pass and scales the grad for auxiliary loss. |
| [`WeightedBiasQuickGeGLUFunction`](#nemo_automodel-components-moe-megatron-moe_utils-WeightedBiasQuickGeGLUFunction) | Autograd function for token-wise weighted Quick-GEGLU with bias support.              |
| [`WeightedGEGLUFunction`](#nemo_automodel-components-moe-megatron-moe_utils-WeightedGEGLUFunction)                   | Autograd function for token-wise weighted GEGLU.                                      |
| [`WeightedQuickGeGLUFunction`](#nemo_automodel-components-moe-megatron-moe_utils-WeightedQuickGeGLUFunction)         | Autograd function for token-wise weighted Quick-GEGLU (no bias).                      |
| [`WeightedSwiGLUFunction`](#nemo_automodel-components-moe-megatron-moe_utils-WeightedSwiGLUFunction)                 | Autograd function for token-wise weighted SwiGLU.                                     |

### Functions

| Name                                                                                                                 | Description                                                                       |
| -------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
| [`geglu`](#nemo_automodel-components-moe-megatron-moe_utils-geglu)                                                   | GEGLU activation function.                                                        |
| [`geglu_back`](#nemo_automodel-components-moe-megatron-moe_utils-geglu_back)                                         | Compute the input gradient for tanh-approximated GEGLU activation.                |
| [`permute`](#nemo_automodel-components-moe-megatron-moe_utils-permute)                                               | Permute the tokens and probs based on the mask.                                   |
| [`quick_geglu`](#nemo_automodel-components-moe-megatron-moe_utils-quick_geglu)                                       | Performs Quick-GELU-based GEGLU activation: quick\_gelu(y1) \* (y2 + offset).     |
| [`quick_geglu_back`](#nemo_automodel-components-moe-megatron-moe_utils-quick_geglu_back)                             | Compute the input gradient for Quick-GEGLU activation.                            |
| [`quick_gelu`](#nemo_automodel-components-moe-megatron-moe_utils-quick_gelu)                                         | Sigmoid approximation of GELU.                                                    |
| [`swiglu`](#nemo_automodel-components-moe-megatron-moe_utils-swiglu)                                                 | Apply SwiGLU activation to an interleaved gate/up tensor.                         |
| [`swiglu_back`](#nemo_automodel-components-moe-megatron-moe_utils-swiglu_back)                                       | Compute the input gradient for SwiGLU activation.                                 |
| [`unpermute`](#nemo_automodel-components-moe-megatron-moe_utils-unpermute)                                           | Restore the original order of tokens after permutation, applying probs if given.  |
| [`weighted_bias_geglu_impl`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_bias_geglu_impl)             | Token-wise-weighted bias GEGLU fusion (tanh-approximated GELU gating).            |
| [`weighted_bias_quick_geglu`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_bias_quick_geglu)           | Token-wise weighted Quick-GEGLU activation with bias.                             |
| [`weighted_bias_quick_geglu_back`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_bias_quick_geglu_back) | Backward helper for weighted Quick-GEGLU with bias.                               |
| [`weighted_bias_quick_geglu_impl`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_bias_quick_geglu_impl) | Token-wise-weighted bias quick\_geglu fusion.                                     |
| [`weighted_bias_swiglu_impl`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_bias_swiglu_impl)           | Token-wise-weighted bias swiglu fusion.                                           |
| [`weighted_geglu`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_geglu)                                 | Apply GEGLU activation and token-wise routing weights.                            |
| [`weighted_geglu_back`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_geglu_back)                       | Compute input and weight gradients for weighted GEGLU.                            |
| [`weighted_quick_geglu`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_quick_geglu)                     | Token-wise-weighted Quick-GEGLU activation.                                       |
| [`weighted_quick_geglu_back`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_quick_geglu_back)           | Backward helper for weighted Quick-GEGLU.                                         |
| [`weighted_swiglu`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_swiglu)                               | Apply SwiGLU activation and token-wise routing weights.                           |
| [`weighted_swiglu_back`](#nemo_automodel-components-moe-megatron-moe_utils-weighted_swiglu_back)                     | Compute input and weight gradients for weighted SwiGLU.                           |

### API

```python
class nemo_automodel.components.moe.megatron.moe_utils.MoEAuxLossAutoScaler()
```

**Bases:** `Function`

An AutoScaler that triggers the backward pass and scales the grad for auxiliary loss.

```python
nemo_automodel.components.moe.megatron.moe_utils.MoEAuxLossAutoScaler.backward(
    ctx,
    grad_output: torch.Tensor
)
```

staticmethod

Compute and scale the gradient for the auxiliary loss.

**Parameters:**

* `grad_output`: The gradient of the output.

**Returns:**

Tuple\[torch.Tensor, torch.Tensor]: The gradient of the output and the scaled auxiliary loss gradient.

```python
nemo_automodel.components.moe.megatron.moe_utils.MoEAuxLossAutoScaler.forward(
    ctx,
    output: torch.Tensor,
    aux_loss: torch.Tensor
)
```

staticmethod

Preserve the aux\_loss by storing it in the context to avoid garbage collection.

**Parameters:**

* `output`: The output tensor.
* `aux_loss`: The auxiliary loss tensor.

**Returns:**

torch.Tensor: The output tensor.
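
The forward/backward pair described above can be pictured with a small, self-contained `torch.autograd.Function`. This is only a sketch under stated assumptions, not the library source: the class name `AuxLossScalerSketch` and its `loss_scale` attribute are hypothetical, and how the real class stores and sets its scale is not shown on this page.

```python
import torch


class AuxLossScalerSketch(torch.autograd.Function):
    """Hypothetical stand-in for MoEAuxLossAutoScaler (illustration only)."""

    # Assumed class-level scale applied to the auxiliary-loss gradient.
    loss_scale = torch.tensor(1.0)

    @staticmethod
    def forward(ctx, output, aux_loss):
        # Keep aux_loss attached to the graph; the output passes through unchanged.
        ctx.save_for_backward(aux_loss)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        (aux_loss,) = ctx.saved_tensors
        # The auxiliary loss receives a constant gradient equal to the scale.
        aux_grad = torch.ones_like(aux_loss) * AuxLossScalerSketch.loss_scale
        return grad_output, aux_grad


hidden = torch.randn(4, 8, requires_grad=True)
router_logits = torch.randn(4, 2, requires_grad=True)
aux_loss = router_logits.softmax(dim=-1).pow(2).sum()  # toy auxiliary loss

out = AuxLossScalerSketch.apply(hidden, aux_loss)
out.sum().backward()  # only the main loss is backpropagated explicitly
```

After the backward call, `router_logits.grad` is populated even though `aux_loss` was never added to the main loss, which is the effect the class description refers to as triggering the backward pass for the auxiliary loss.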

```python
class nemo_automodel.components.moe.megatron.moe_utils.WeightedBiasQuickGeGLUFunction()
```

**Bases:** `Function`

Autograd function for token-wise weighted Quick-GEGLU with bias support.

```python
nemo_automodel.components.moe.megatron.moe_utils.WeightedBiasQuickGeGLUFunction.backward(
    ctx,
    grad_output
)
```

staticmethod

```python
nemo_automodel.components.moe.megatron.moe_utils.WeightedBiasQuickGeGLUFunction.forward(
    ctx,
    input: torch.Tensor,
    bias: torch.Tensor,
    weights: torch.Tensor,
    fp8_input_store: bool,
    linear_offset: torch.Tensor
)
```

staticmethod

```python
class nemo_automodel.components.moe.megatron.moe_utils.WeightedGEGLUFunction()
```

**Bases:** `Function`

Autograd function for token-wise weighted GEGLU.

```python
nemo_automodel.components.moe.megatron.moe_utils.WeightedGEGLUFunction.backward(
    ctx,
    grad_output
)
```

staticmethod

```python
nemo_automodel.components.moe.megatron.moe_utils.WeightedGEGLUFunction.forward(
    ctx,
    input,
    weights,
    fp8_input_store
)
```

staticmethod

```python
class nemo_automodel.components.moe.megatron.moe_utils.WeightedQuickGeGLUFunction()
```

**Bases:** `Function`

Autograd function for token-wise weighted Quick-GEGLU (no bias).

```python
nemo_automodel.components.moe.megatron.moe_utils.WeightedQuickGeGLUFunction.backward(
    ctx,
    grad_output
)
```

staticmethod

```python
nemo_automodel.components.moe.megatron.moe_utils.WeightedQuickGeGLUFunction.forward(
    ctx,
    input: torch.Tensor,
    weights: torch.Tensor,
    fp8_input_store: bool,
    linear_offset: torch.Tensor
)
```

staticmethod

```python
class nemo_automodel.components.moe.megatron.moe_utils.WeightedSwiGLUFunction()
```

**Bases:** `Function`

Autograd function for token-wise weighted SwiGLU.

```python
nemo_automodel.components.moe.megatron.moe_utils.WeightedSwiGLUFunction.backward(
    ctx,
    grad_output
)
```

staticmethod

```python
nemo_automodel.components.moe.megatron.moe_utils.WeightedSwiGLUFunction.forward(
    ctx,
    input,
    weights,
    fp8_input_store
)
```

staticmethod

```python
nemo_automodel.components.moe.megatron.moe_utils.geglu(
    y
)
```

GEGLU activation function.
Splits the input in half along the last dimension and applies:
GEGLU(y) = GELU\_tanh(y\_gate) \* y\_up

Used by Gemma4 MoE expert layers (hidden\_activation="gelu\_pytorch\_tanh").
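
As a rough illustration of the formula above, the following sketch applies tanh-approximated GELU to one half of the last dimension and multiplies by the other half. The name `geglu_sketch` is hypothetical, and treating the first half as the gate is an assumption; the page only states that the input is split in half.

```python
import torch
import torch.nn.functional as F


def geglu_sketch(y: torch.Tensor) -> torch.Tensor:
    # Assumption: the first half of the last dimension is the gate, the second is "up".
    y_gate, y_up = torch.chunk(y, 2, dim=-1)
    return F.gelu(y_gate, approximate="tanh") * y_up


y = torch.randn(3, 8)
out = geglu_sketch(y)  # last dimension is halved: shape (3, 4)
```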

```python
nemo_automodel.components.moe.megatron.moe_utils.geglu_back(
    g,
    y
)
```

Compute the input gradient for tanh-approximated GEGLU activation.
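
To make the gradient concrete, here is a hedged sketch that writes out the derivative of tanh-approximated GELU by hand and compares it with autograd. The function name `geglu_back_sketch` and the gate-first layout are assumptions for illustration; the library's own implementation may be organized differently.

```python
import math

import torch
import torch.nn.functional as F


def geglu_back_sketch(g: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Assumption: gate is the first half of the last dimension, "up" the second.
    y_gate, y_up = torch.chunk(y, 2, dim=-1)
    k = math.sqrt(2.0 / math.pi)
    t = torch.tanh(k * (y_gate + 0.044715 * y_gate**3))
    gelu = 0.5 * y_gate * (1.0 + t)
    # d/dx of 0.5 * x * (1 + tanh(k * (x + 0.044715 * x^3)))
    dgelu = 0.5 * (1.0 + t) + 0.5 * y_gate * (1.0 - t * t) * k * (1.0 + 3 * 0.044715 * y_gate**2)
    # Gradient w.r.t. the gate half, then w.r.t. the up half.
    return torch.cat((g * dgelu * y_up, g * gelu), dim=-1)


y = torch.randn(5, 6, dtype=torch.float64, requires_grad=True)
g = torch.randn(5, 3, dtype=torch.float64)

y_gate, y_up = torch.chunk(y, 2, dim=-1)
(F.gelu(y_gate, approximate="tanh") * y_up).backward(g)

manual = geglu_back_sketch(g, y.detach())
matches_autograd = torch.allclose(manual, y.grad)
```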

```python
nemo_automodel.components.moe.megatron.moe_utils.permute(
    tokens,
    routing_map,
    probs: typing.Optional[torch.Tensor] = None,
    num_out_tokens: typing.Optional[int] = None,
    fused: bool = False,
    drop_and_pad: bool = False
)
```

Permute the tokens and probs based on the mask (`routing_map`).
Tokens with the same designated expert are grouped together.
The mask has shape \[num\_tokens, num\_experts] and indicates which experts were selected
by each token.

When drop\_and\_pad=True, the number of non-zeros in each column of routing\_map equals the
expert capacity. This function exploits that property to use ops that support CUDA graphs.

**Parameters:**

* `tokens`: The input token tensor, \[num\_tokens, hidden].
* `routing_map`: The sparse token-to-expert mapping, \[num\_tokens, num\_experts].
* `probs`: The probs tensor, \[num\_tokens, num\_experts].
* `num_out_tokens`: The number of output tokens. If None, it is set to the number of input tokens.
* `fused`: Whether to use the fused permute function.
* `drop_and_pad`: Whether the token dispatcher uses token-drop and pads the number of tokens to the expert capacity. If set to true, routing\_map has a fixed number of non-zeros in each column.

**Returns:** `torch.Tensor`

The permuted token tensor.
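
The grouping by expert can be sketched in plain PyTorch. This is a minimal illustration of the unfused idea only, assuming no token dropping or padding; `permute_sketch` is a hypothetical name and the library function's actual return values and options are not reproduced here.

```python
import torch


def permute_sketch(tokens: torch.Tensor, routing_map: torch.Tensor):
    num_tokens, num_experts = routing_map.shape
    # Work expert-major so that each expert's tokens end up contiguous.
    expert_major = routing_map.bool().T.contiguous()
    token_ids = torch.arange(num_tokens, device=routing_map.device)
    token_ids = token_ids.unsqueeze(0).expand(num_experts, -1)
    # Token index of every (expert, token) pair that is routed.
    sorted_indices = token_ids.masked_select(expert_major)
    return tokens.index_select(0, sorted_indices), sorted_indices


# Token i is the row [i, i]; tokens 0 and 2 go to expert 0, tokens 1, 2, 3 to expert 1.
tokens = torch.arange(4, dtype=torch.float32).unsqueeze(1).repeat(1, 2)
routing_map = torch.tensor([[1, 0], [0, 1], [1, 1], [0, 1]])

permuted, sorted_indices = permute_sketch(tokens, routing_map)
# sorted_indices -> tensor([0, 2, 1, 2, 3])
```

Token 2 appears twice in the permuted output because it was routed to both experts.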

```python
nemo_automodel.components.moe.megatron.moe_utils.quick_geglu(
    y: torch.Tensor,
    linear_offset: float = 0.0
) -> torch.Tensor
```

Performs Quick-GELU-based GEGLU activation: quick\_gelu(y1) \* (y2 + offset).

**Parameters:**

* `y`: Input tensor split into two halves on the last dimension.
* `linear_offset`: Optional linear offset added to the second half before gating.

**Returns:** `torch.Tensor`

Tensor after applying the GEGLU activation.
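
A small sketch of the expression quick\_gelu(y1) \* (y2 + offset) follows. It assumes Quick-GELU takes its common sigmoid form, `y * sigmoid(alpha * y)`, and that `y1` and `y2` are the first and second halves of the last dimension; the names `quick_gelu_sketch` and `quick_geglu_sketch` are hypothetical.

```python
import torch


def quick_gelu_sketch(y: torch.Tensor, alpha: float = 1.702) -> torch.Tensor:
    # Assumed form of Quick-GELU: a sigmoid-based approximation of GELU.
    return y * torch.sigmoid(alpha * y)


def quick_geglu_sketch(y: torch.Tensor, linear_offset: float = 0.0) -> torch.Tensor:
    # Assumption: y1 is the first half of the last dimension, y2 the second.
    y1, y2 = torch.chunk(y, 2, dim=-1)
    return quick_gelu_sketch(y1) * (y2 + linear_offset)


y = torch.randn(3, 8)
out = quick_geglu_sketch(y, linear_offset=1.0)  # shape (3, 4)

# With a zero second half and offset 1, the result reduces to quick_gelu(y1).
y_zero_up = torch.cat((y[:, :4], torch.zeros(3, 4)), dim=-1)
reduced = quick_geglu_sketch(y_zero_up, linear_offset=1.0)
```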

```python
nemo_automodel.components.moe.megatron.moe_utils.quick_geglu_back(
    g,
    y,
    linear_offset: float = 0.0
) -> torch.Tensor
```

Compute the input gradient for Quick-GEGLU activation.

```python
nemo_automodel.components.moe.megatron.moe_utils.quick_gelu(
    y: torch.Tensor,
    alpha: float = 1.702
) -> torch.Tensor
```

Sigmoid approximation of GELU.

```python
nemo_automodel.components.moe.megatron.moe_utils.swiglu(
    y
)
```

Apply SwiGLU activation to an interleaved gate/up tensor.

```python
nemo_automodel.components.moe.megatron.moe_utils.swiglu_back(
    g,
    y
)
```

Compute the input gradient for SwiGLU activation.

```python
nemo_automodel.components.moe.megatron.moe_utils.unpermute(
    permuted_tokens: torch.Tensor,
    sorted_indices: torch.Tensor,
    restore_shape: torch.Size,
    probs: torch.Tensor = None,
    routing_map: torch.Tensor = None,
    fused: bool = False,
    drop_and_pad: bool = False
)
```

Restore the original order of tokens after permutation. If probs are provided, they
are also applied to the tokens before the order is restored.

When drop\_and\_pad=True, the tensors have the following properties:

* In routing\_map, the number of non-zeros in each column equals the expert capacity.
* The size of sorted\_indices equals num\_experts \* capacity, and each split of `capacity`
  contains the indices of tokens routed to an expert.

This function exploits these properties to use ops that support CUDA graphs.

**Parameters:**

* `permuted_tokens`: The permuted token tensor.
* `sorted_indices`: The indices used to sort the tokens.
* `restore_shape`: The shape of the unpermuted tensor.
* `probs`: The unpermuted probs tensor.
* `routing_map`: Token-to-expert mapping, shape \[num\_tokens, num\_experts].
* `fused`: Whether to use the fused unpermute function.
* `drop_and_pad`: Whether the token dispatcher uses token-drop and pads the number of tokens to the expert capacity.

**Returns:**

torch.Tensor: The tokens restored to their original order.
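
The restore step can be illustrated with an unfused sketch that weights each permuted row by its routing prob and sums rows belonging to the same token. It assumes no token dropping or padding and an expert-major ordering of `sorted_indices`; `unpermute_sketch` is a hypothetical name, not the library implementation.

```python
import torch


def unpermute_sketch(permuted_tokens, sorted_indices, restore_shape, probs=None, routing_map=None):
    if probs is not None:
        # Pick each permuted row's prob in the same expert-major order.
        mask = routing_map.bool().T.contiguous()
        permuted_probs = probs.T.contiguous().masked_select(mask)
        permuted_tokens = permuted_tokens * permuted_probs.unsqueeze(-1)
    out = torch.zeros(restore_shape, dtype=permuted_tokens.dtype, device=permuted_tokens.device)
    # Rows that originate from the same token are summed into that token's slot.
    out.index_add_(0, sorted_indices, permuted_tokens)
    return out


routing_map = torch.tensor([[1, 0], [0, 1], [1, 1], [0, 1]])
probs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.25, 0.75], [0.0, 1.0]])
tokens = torch.randn(4, 3)

# Expert-major token order for this routing map: expert 0 gets 0, 2; expert 1 gets 1, 2, 3.
sorted_indices = torch.tensor([0, 2, 1, 2, 3])
permuted = tokens.index_select(0, sorted_indices)

restored = unpermute_sketch(permuted, sorted_indices, tokens.shape, probs, routing_map)
```

Because each token's probs sum to 1 in this toy setup and the "expert" is the identity, `restored` matches `tokens` up to floating-point error.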

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_geglu_impl(
    input,
    weights,
    fp8_input_store = False
)
```

Token-wise-weighted bias GEGLU fusion (tanh-approximated GELU gating).

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu(
    y: torch.Tensor,
    bias: torch.Tensor,
    weights: torch.Tensor,
    linear_offset: float = 0.0
) -> torch.Tensor
```

Token-wise weighted Quick-GEGLU activation with bias.

**Parameters:**

* `y`: Input tensor before bias addition.
* `bias`: Bias tensor broadcastable to `y`.
* `weights`: Weight tensor with shape `[tokens, 1]` broadcasting over feature dim.
* `linear_offset`: Optional linear offset for the second half before gating.

**Returns:** `torch.Tensor`

Activated tensor with same dtype as `y`.
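
The parameter shapes and the dtype guarantee above can be illustrated as follows. This is a sketch under assumptions: the Quick-GELU form `y * sigmoid(alpha * y)` with `alpha = 1.702`, the first-half/second-half split, and the name `weighted_bias_quick_geglu_sketch` are all chosen for illustration and are not taken from the library source.

```python
import torch


def weighted_bias_quick_geglu_sketch(y, bias, weights, linear_offset=0.0, alpha=1.702):
    dtype = y.dtype
    y = y + bias  # bias broadcasts over the token dimension
    y1, y2 = torch.chunk(y, 2, dim=-1)
    act = y1 * torch.sigmoid(alpha * y1) * (y2 + linear_offset)
    # weights has shape [tokens, 1] and broadcasts over the feature dimension;
    # the product may be promoted to float32, so cast back to the input dtype.
    return (act * weights).to(dtype)


num_tokens, hidden = 4, 6
y = torch.randn(num_tokens, 2 * hidden, dtype=torch.bfloat16)
bias = torch.zeros(2 * hidden, dtype=torch.bfloat16)
weights = torch.rand(num_tokens, 1)  # float32 routing weights

out = weighted_bias_quick_geglu_sketch(y, bias, weights)
```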

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu_back(
    g,
    y,
    bias,
    weights,
    linear_offset: float = 0.0
)
```

Backward helper for weighted Quick-GEGLU with bias.

Returns gradients with respect to the input `y`, `bias`, and `weights`.

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu_impl(
    input,
    bias,
    weights,
    fp8_input_store = False,
    linear_offset = 0.0,
    clamp_value = None,
    alpha = 1.702
)
```

Token-wise-weighted bias quick\_geglu fusion.

* input: \[num\_selected\_experts \* seq\_len, hidden\_size \* 2]
* bias: None
* weights: \[num\_selected\_experts \* seq\_len, 1]
* fp8\_input\_store: bool
* linear\_offset: float
* output: \[num\_selected\_experts \* seq\_len, hidden\_size]

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_swiglu_impl(
    input,
    weights,
    fp8_input_store = False
)
```

Token-wise-weighted bias swiglu fusion.

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_geglu(
    y,
    weights
)
```

Apply GEGLU activation and token-wise routing weights.

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_geglu_back(
    g,
    y,
    weights
)
```

Compute input and weight gradients for weighted GEGLU.
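
One way to see what the two gradients look like is to let autograd differentiate a weighted GEGLU and compare the weight gradient with its closed form: the per-token sum of the upstream gradient times the unweighted activation. The sketch below assumes a gate-first split and `[tokens, 1]` weights; `weighted_geglu_sketch` is a hypothetical name.

```python
import torch
import torch.nn.functional as F


def weighted_geglu_sketch(y: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # Assumption: gate is the first half of the last dimension, "up" the second.
    y_gate, y_up = torch.chunk(y, 2, dim=-1)
    return F.gelu(y_gate, approximate="tanh") * y_up * weights


y = torch.randn(5, 8, dtype=torch.float64, requires_grad=True)
weights = torch.rand(5, 1, dtype=torch.float64, requires_grad=True)
g = torch.randn(5, 4, dtype=torch.float64)  # upstream gradient

weighted_geglu_sketch(y, weights).backward(g)

with torch.no_grad():
    y_gate, y_up = torch.chunk(y, 2, dim=-1)
    unweighted = F.gelu(y_gate, approximate="tanh") * y_up
    # Each token's weight gradient sums over the feature dimension.
    expected_weight_grad = (g * unweighted).sum(dim=-1, keepdim=True)
```

Here `weights.grad` has shape `[tokens, 1]` and `y.grad` has the same shape as `y`.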

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_quick_geglu(
    y: torch.Tensor,
    weights: torch.Tensor,
    linear_offset: float = 0.0
) -> torch.Tensor
```

Token-wise-weighted Quick-GEGLU activation.

The weights tensor is expected to have the same first-dimension length as `y` and a trailing
singleton dimension so that it broadcasts over the feature dimension.

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_quick_geglu_back(
    g,
    y,
    weights,
    linear_offset: float = 0.0
)
```

Backward helper for weighted Quick-GEGLU.

Returns gradients with respect to the input `y` and `weights`.

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu(
    y,
    weights
)
```

Apply SwiGLU activation and token-wise routing weights.

```python
nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu_back(
    g,
    y,
    weights
)
```

Compute input and weight gradients for weighted SwiGLU.