nemo_automodel.components.moe.megatron.moe_utils
Module Contents
Classes
Functions
API
Bases: Function
An AutoScaler that triggers the backward pass and scales the grad for auxiliary loss.
Compute and scale the gradient for the auxiliary loss.
Parameters:
The gradient of the output.
Returns:
Tuple[torch.Tensor, torch.Tensor]: The gradient of the output and the scaled auxiliary loss gradient.
Preserve the aux_loss by storing it in the context to avoid garbage collection.
Parameters:
The output tensor.
The auxiliary loss tensor.
Returns:
torch.Tensor: The output tensor.
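The forward/backward pair described above can be sketched as a small custom autograd function. This is an illustrative sketch only: the class name `AuxLossScalerSketch` and the `loss_scale` attribute are hypothetical and are not taken from this module. The idea it shows is that the forward pass returns the output unchanged while keeping the auxiliary loss attached to the graph, and the backward pass hands the auxiliary loss a constant, scaled gradient.

```python
import torch


class AuxLossScalerSketch(torch.autograd.Function):
    """Hypothetical stand-in for the auto-scaler; not the library class."""

    # Assumed class-level scale; how the real class exposes it is not shown here.
    loss_scale = torch.tensor(1.0)

    @staticmethod
    def forward(ctx, output, aux_loss):
        # Store aux_loss in the context so its graph stays reachable in backward.
        ctx.save_for_backward(aux_loss)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        (aux_loss,) = ctx.saved_tensors
        # The auxiliary loss receives a constant gradient equal to the scale.
        scaled_aux_grad = torch.ones_like(aux_loss) * AuxLossScalerSketch.loss_scale
        return grad_output, scaled_aux_grad


router_weight = torch.tensor([0.5, -1.0], requires_grad=True)
hidden = torch.tensor([1.0, 2.0], requires_grad=True)
aux_loss = (router_weight ** 2).sum()  # depends only on router_weight

out = AuxLossScalerSketch.apply(hidden * 3.0, aux_loss)
out.sum().backward()  # the main loss never adds aux_loss explicitly
```

After the backward call, `router_weight.grad` holds the gradient of the auxiliary loss even though the auxiliary loss was never summed into the main loss.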
Bases: Function
Autograd function for token-wise weighted Quick-GEGLU with bias support.
Bases: Function
Autograd function for token-wise weighted GEGLU.
Bases: Function
Autograd function for token-wise weighted Quick-GEGLU (no bias).
Bases: Function
Autograd function for token-wise weighted SwiGLU.
GEGLU activation function. Splits the input in half along the last dimension and applies: GEGLU(y) = GELU_tanh(y_gate) * y_up
Used by Gemma4 MoE expert layers (hidden_activation="gelu_pytorch_tanh").
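A minimal sketch of the formula above, written with standard PyTorch calls. The function name `geglu_sketch` is hypothetical, and the sketch assumes the first half of the last dimension is the gate and the second half is the up projection; the actual ordering in this module is not stated here.

```python
import torch
import torch.nn.functional as F


def geglu_sketch(y):
    # Assumption: first half is the gate, second half is the up projection.
    y_gate, y_up = torch.chunk(y, 2, dim=-1)
    # GELU with the tanh approximation, applied to the gate only.
    return F.gelu(y_gate, approximate="tanh") * y_up


y = torch.tensor([[0.0, 1.0, 2.0, 3.0]])
out = geglu_sketch(y)  # shape [1, 2]: the last dimension is halved
```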
Compute the input gradient for tanh-approximated GEGLU activation.
Permute the tokens and probs based on the mask. Tokens with the same designated expert are grouped together. The mask has shape [tokens, num_experts] and indicates which experts were selected by each token.
When drop_and_pad=True, the number of non-zeros in each column of routing_map equals the expert capacity. This function exploits that property to use ops that support CUDA graphs.
Parameters:
The input token tensor, [num_tokens, hidden].
The sparse token to expert mapping, [num_tokens, num_experts].
The probs tensor, [num_tokens, num_experts].
The number of output tokens. If None, it’s set to the number of input tokens.
Whether to use the fused permute function.
Whether or not the token dispatcher uses token-drop and pads the number of tokens to the expert capacity. If set to true, routing_map has a fixed number of non-zeros in each column.
Returns: torch.Tensor
The permuted token tensor.
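The grouping by expert can be illustrated with a small non-fused sketch. The name `permute_sketch` is hypothetical, and the sketch covers only the plain path: no probs, no fused kernel, no drop_and_pad. It returns the sorted indices alongside the permuted tokens so the ordering can be inspected.

```python
import torch


def permute_sketch(tokens, routing_map):
    num_tokens, num_experts = routing_map.shape
    # Transpose to [num_experts, num_tokens] so a row-major scan visits
    # all tokens of expert 0 first, then expert 1, and so on.
    expert_major = routing_map.bool().T.contiguous()
    token_ids = torch.arange(num_tokens).unsqueeze(0).expand(num_experts, -1)
    sorted_indices = token_ids.masked_select(expert_major)
    # A token routed to several experts appears once per selected expert.
    return tokens.index_select(0, sorted_indices), sorted_indices


tokens = torch.tensor([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
routing_map = torch.tensor([[1, 0], [0, 1], [1, 1]])  # token 2 picks both experts
permuted, sorted_indices = permute_sketch(tokens, routing_map)
```

Here expert 0 receives tokens 0 and 2, and expert 1 receives tokens 1 and 2, so the permuted tensor has four rows.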
Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + offset).
Parameters:
Input tensor split into two halves on the last dimension.
Optional linear offset added to the second half before gating.
Returns: torch.Tensor
Tensor after applying the GEGLU activation.
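As a rough illustration of the expression quick_gelu(y1) * (y2 + offset), the sketch below uses the commonly used Quick-GELU form x * sigmoid(1.702 * x). The constant 1.702 and the function name `quick_geglu_sketch` are assumptions of this sketch, not values confirmed by this page.

```python
import torch


def quick_geglu_sketch(y, linear_offset=0.0):
    # Split the last dimension into the gated half (y1) and the linear half (y2).
    y1, y2 = torch.chunk(y, 2, dim=-1)
    # Assumed Quick-GELU form: x * sigmoid(1.702 * x).
    quick_gelu = y1 * torch.sigmoid(1.702 * y1)
    return quick_gelu * (y2 + linear_offset)


y = torch.tensor([[0.0, 2.0, 1.0, 1.0]])
out = quick_geglu_sketch(y, linear_offset=1.0)
```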
Compute the input gradient for Quick-GEGLU activation.
Sigmoid approximation of GELU.
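To give a sense of how close a sigmoid approximation is, the following standard-library sketch compares it with the exact erf-based GELU on a grid of points. The constant 1.702 is the value commonly used for this approximation and is assumed here rather than taken from this module.

```python
import math


def quick_gelu(x):
    # x * sigmoid(1.702 * x), written out with exp.
    return x / (1.0 + math.exp(-1.702 * x))


def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Largest absolute gap on a grid over [-4, 4] with step 0.1.
max_gap = max(abs(quick_gelu(i / 10) - gelu_exact(i / 10)) for i in range(-40, 41))
```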
Apply SwiGLU activation to an interleaved gate/up tensor.
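The gating itself can be sketched as SiLU(gate) * up. Because the exact way gate and up values are laid out inside the input tensor is not spelled out here, the hypothetical helper below takes them as two already-separated tensors and shows only the activation.

```python
import torch
import torch.nn.functional as F


def swiglu_from_parts(gate, up):
    # SwiGLU gating: SiLU(gate) * up, where SiLU(x) = x * sigmoid(x).
    return F.silu(gate) * up


gate = torch.tensor([0.0, 1.0])
up = torch.tensor([5.0, 2.0])
out = swiglu_from_parts(gate, up)
```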
Compute the input gradient for SwiGLU activation.
Restore the original order of tokens after permutation. If probs are provided, it will also apply them to the tokens before restoring the order.
When drop_and_pad=True, the tensors will have the following properties:
- In routing_map, the number of non-zeros in each column equals the expert capacity.
- The size of sorted_indices equals num_experts * capacity, and each split of capacity contains the indices of tokens routed to an expert.
This function exploits these features to use ops that support CUDA graphs.
Parameters:
The permuted token tensor.
The indices used to sort the tokens.
The shape of the unpermuted tensor.
The unpermuted probs tensor.
Token to expert mapping, shape [num_tokens, num_experts].
Whether to use the fused unpermute function.
Whether or not the token dispatcher uses token-drop and pads the number of tokens to the expert capacity.
Returns:
torch.Tensor: The tokens restored to their original order.
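A non-fused sketch of the restore step, under the assumption that rows belonging to the same original token are summed back into that token's slot. The name `unpermute_sketch` is hypothetical, and the drop_and_pad path is not covered. When probs are given, each permuted row is first scaled by the probability of its (token, expert) pair.

```python
import torch


def unpermute_sketch(permuted_tokens, sorted_indices, restore_shape,
                     probs=None, routing_map=None):
    if probs is not None:
        # Pick out probs in the same expert-major order as the permuted rows.
        expert_major = routing_map.bool().T.contiguous()
        permuted_probs = probs.T.contiguous().masked_select(expert_major)
        permuted_tokens = permuted_tokens * permuted_probs.unsqueeze(-1)
    output = torch.zeros(restore_shape, dtype=permuted_tokens.dtype)
    # Rows that came from the same token are accumulated into its slot.
    output.index_add_(0, sorted_indices, permuted_tokens)
    return output


tokens = torch.tensor([[1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
routing_map = torch.tensor([[1, 0], [0, 1], [1, 1]])
probs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.25, 0.75]])
sorted_indices = torch.tensor([0, 2, 1, 2])  # expert 0: tokens 0, 2; expert 1: tokens 1, 2
permuted = tokens.index_select(0, sorted_indices)

restored = unpermute_sketch(permuted, sorted_indices, tokens.shape, probs, routing_map)
```

Because each token's probs sum to one in this toy input and no expert transforms the rows, the restored tensor equals the original tokens.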
Token-wise-weighted bias GEGLU fusion (tanh-approximated GELU gating).
Token-wise weighted Quick-GEGLU activation with bias.
Parameters:
Input tensor before bias addition.
Bias tensor broadcastable to y.
Weight tensor with shape [tokens, 1] broadcasting over feature dim.
Optional linear offset for the second half before gating.
Returns: torch.Tensor
Activated tensor with same dtype as y.
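The parameter list above suggests the following shape of computation: add the bias, apply Quick-GEGLU, scale each token by its weight, and cast back to the dtype of y. The sketch below follows that reading; the function name, the constant 1.702, and the choice of dtypes in the example are assumptions for illustration.

```python
import torch


def weighted_bias_quick_geglu_sketch(y, bias, weights, linear_offset=0.0):
    dtype = y.dtype
    # Bias is added before the split into gated and linear halves.
    y1, y2 = torch.chunk(y + bias, 2, dim=-1)
    # Assumed Quick-GELU form: x * sigmoid(1.702 * x).
    activated = y1 * torch.sigmoid(1.702 * y1) * (y2 + linear_offset)
    # weights has shape [tokens, 1] and broadcasts over the feature dimension.
    return (activated * weights).to(dtype)


y = torch.zeros(2, 4, dtype=torch.float32)
bias = torch.tensor([1.0, 1.0, 2.0, 2.0])
weights = torch.tensor([[1.0], [0.5]], dtype=torch.float64)  # higher precision than y
out = weighted_bias_quick_geglu_sketch(y, bias, weights)
```

The product with the float64 weights is promoted to float64, and the final cast returns it to the float32 dtype of y.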
Backward helper for weighted Quick-GEGLU with bias.
Returns gradients with respect to the input y, the bias, and the weights.
Token-wise-weighted bias quick_geglu fusion.
input: [num_selected_experts * seq_len, hidden_size * 2]
bias: None
weights: [num_selected_experts * seq_len, 1]
fp8_input_store: bool
linear_offset: float
output: [num_selected_experts * seq_len, hidden_size]
Token-wise-weighted bias swiglu fusion.
Apply GEGLU activation and token-wise routing weights.
Compute input and weight gradients for weighted GEGLU.
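The input and weight gradients can be derived by hand and checked against autograd. The sketch below is such a check, not the module's implementation: the helper name is hypothetical, and it assumes the first half of the last dimension is the gate and that weights has shape [tokens, 1].

```python
import math

import torch
import torch.nn.functional as F


def weighted_geglu_back_sketch(grad_out, y, weights):
    gate, up = torch.chunk(y, 2, dim=-1)
    c = math.sqrt(2.0 / math.pi)
    t = torch.tanh(c * (gate + 0.044715 * gate ** 3))
    gelu = 0.5 * gate * (1.0 + t)
    # Derivative of the tanh-approximated GELU with respect to the gate.
    dgelu = 0.5 * (1.0 + t) + 0.5 * gate * (1.0 - t ** 2) * c * (1.0 + 3 * 0.044715 * gate ** 2)
    g = grad_out * weights
    grad_y = torch.cat([g * dgelu * up, g * gelu], dim=-1)
    # Each token's weight gradient sums over the feature dimension.
    grad_w = (grad_out * gelu * up).sum(dim=-1, keepdim=True)
    return grad_y, grad_w


torch.manual_seed(0)
y = torch.randn(3, 4, dtype=torch.float64, requires_grad=True)
w = torch.rand(3, 1, dtype=torch.float64, requires_grad=True)

gate, up = torch.chunk(y, 2, dim=-1)
out = F.gelu(gate, approximate="tanh") * up * w
grad_out = torch.ones_like(out)
out.backward(grad_out)  # reference gradients from autograd

grad_y, grad_w = weighted_geglu_back_sketch(grad_out, y.detach(), w.detach())
```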
Token-wise-weighted Quick-GEGLU activation.
The weights tensor is expected to have the same first-dimension length as y and a trailing
singleton dimension so that it broadcasts over the feature dimension.
Backward helper for weighted Quick-GEGLU.
Returns gradients with respect to the input y and the weights.
Apply SwiGLU activation and token-wise routing weights.
Compute input and weight gradients for weighted SwiGLU.