nemo_automodel.components.moe.megatron.moe_utils
Module Contents
Classes
| WeightedQuickGeGLUFunction | Autograd function for token-wise weighted Quick-GEGLU (no bias). |
| WeightedBiasQuickGeGLUFunction | Autograd function for token-wise weighted Quick-GEGLU with bias support. |
| MoEAuxLossAutoScaler | An AutoScaler that triggers the backward pass and scales the grad for auxiliary loss. |
Functions
| permute | Permute the tokens and probs based on the mask. Tokens with the same designated expert will be grouped together. The shape of mask is [tokens, num_experts]; it indicates which experts were selected by each token. |
| unpermute | Restore the original order of tokens after permutation. If probs are provided, it will also apply them to the tokens before restoring the order. |
| weighted_bias_swiglu_impl | Token-wise-weighted bias swiglu fusion. |
| quick_gelu | Sigmoid approximation of GELU. |
| quick_geglu | Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + offset). |
| weighted_quick_geglu | Token-wise-weighted Quick-GEGLU activation. |
| weighted_quick_geglu_back | Backward helper for weighted Quick-GEGLU. Returns gradients w.r.t. the input y and weights. |
| weighted_bias_quick_geglu | Token-wise weighted Quick-GEGLU activation with bias. |
| weighted_bias_quick_geglu_back | Backward helper for weighted Quick-GEGLU with bias. |
| weighted_bias_quick_geglu_impl | Token-wise-weighted bias quick_geglu fusion. input: [num_selected_experts * seq_len, hidden_size * 2]; bias: None; weights: [num_selected_experts * seq_len, 1]; fp8_input_store: bool; linear_offset: float; output: [num_selected_experts * seq_len, hidden_size]. |
API
- nemo_automodel.components.moe.megatron.moe_utils.permute(tokens, routing_map, probs: Optional[torch.Tensor] = None, num_out_tokens: Optional[int] = None, fused: bool = False, drop_and_pad: bool = False)
Permute the tokens and probs based on the mask (routing_map). Tokens with the same designated expert are grouped together. The mask has shape [tokens, num_experts] and indicates which experts were selected by each token.
When drop_and_pad=True, the number of non-zeros in each column of routing_map equals the expert capacity. This function exploits that property to use ops that support CUDA graphs.
- Parameters:
tokens (torch.Tensor) – The input token tensor, [num_tokens, hidden].
routing_map (torch.Tensor) – The sparse token-to-expert mapping, [num_tokens, num_experts].
probs (torch.Tensor, optional) – The probs tensor, [num_tokens, num_experts].
num_out_tokens (int, optional) – The number of output tokens. If None, it's set to the number of input tokens.
fused (bool, optional) – Whether to use the fused permute function.
drop_and_pad (bool, optional) – Whether the token dispatcher uses token-drop and pads the number of tokens to the expert capacity. If set to True, routing_map has a fixed number of non-zeros in each column.
- Returns:
permuted_input (torch.Tensor) – The permuted token tensor.
permuted_probs (torch.Tensor, optional) – The permuted probs tensor.
sorted_indices (torch.Tensor) – The mapping table of sorted indices used to unpermute the tokens.
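The grouping described above can be illustrated with a minimal plain-Python sketch. It is not the library's tensor implementation: `permute_reference` is a hypothetical name, nested lists stand in for tensors, and the ordering (experts in index order, tokens in their original order within each expert) is an assumption about the non-fused path with drop_and_pad=False.

```python
def permute_reference(tokens, routing_map):
    """Group token rows by expert (hypothetical reference, lists instead of tensors)."""
    num_tokens = len(tokens)
    num_experts = len(routing_map[0])
    # Walk experts in order; within one expert keep the original token order (assumed).
    sorted_indices = [
        t
        for e in range(num_experts)
        for t in range(num_tokens)
        if routing_map[t][e]
    ]
    permuted = [tokens[t] for t in sorted_indices]
    return permuted, sorted_indices


tokens = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]  # [num_tokens=3, hidden=2]
routing_map = [[1, 0], [1, 1], [0, 1]]         # token 1 is routed to both experts
permuted, sorted_indices = permute_reference(tokens, routing_map)
print(sorted_indices)  # [0, 1, 1, 2]
```

Token 1 appears twice in the permuted output because it was selected by two experts, so the output has one row per non-zero entry of routing_map.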
- nemo_automodel.components.moe.megatron.moe_utils.unpermute(permuted_tokens: torch.Tensor, sorted_indices: torch.Tensor, restore_shape: torch.Size, probs: torch.Tensor = None, routing_map: torch.Tensor = None, fused: bool = False, drop_and_pad: bool = False)
Restore the original order of tokens after permutation. If probs are provided, they are applied to the tokens before the order is restored.
When drop_and_pad=True, the tensors have the following properties:
- In routing_map, the number of non-zeros in each column equals the expert capacity.
- The size of sorted_indices equals num_experts * capacity, and each split of length capacity contains the indices of the tokens routed to one expert.
This function exploits these properties to use ops that support CUDA graphs.
- Parameters:
permuted_tokens (torch.Tensor) – The permuted token tensor.
sorted_indices (torch.Tensor) – The indices used to sort the tokens.
restore_shape (torch.Size) – The shape of the unpermuted tensor.
probs (torch.Tensor, optional) – The unpermuted probs tensor.
routing_map (torch.Tensor, optional) – Token-to-expert mapping, shape [num_tokens, num_experts].
fused (bool, optional) – Whether to use the fused unpermute function.
drop_and_pad (bool, optional) – Whether the token dispatcher uses token-drop and pads the number of tokens to the expert capacity.
- Returns:
The tokens restored to their original order.
- Return type:
torch.Tensor
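As a rough illustration of the restore step, the plain-Python sketch below scatters permuted rows back to their original positions and sums the rows that belong to the same token. `unpermute_reference` is a hypothetical name, lists stand in for tensors, and the summation of multiple expert outputs per token as well as the expert-major row order are assumptions about the non-fused path.

```python
def unpermute_reference(permuted_tokens, sorted_indices, restore_shape,
                        probs=None, routing_map=None):
    """Scatter-add permuted rows back to token order (hypothetical reference)."""
    num_tokens, hidden = restore_shape
    row_probs = None
    if probs is not None:
        num_experts = len(routing_map[0])
        # Per-row probabilities in the same expert-major order as the permuted rows.
        row_probs = [
            probs[t][e]
            for e in range(num_experts)
            for t in range(num_tokens)
            if routing_map[t][e]
        ]
    restored = [[0.0] * hidden for _ in range(num_tokens)]
    for row, token_idx in enumerate(sorted_indices):
        scale = 1.0 if row_probs is None else row_probs[row]
        for h in range(hidden):
            restored[token_idx][h] += scale * permuted_tokens[row][h]
    return restored


permuted_tokens = [[1.0, 2.0], [3.0, 4.0], [3.0, 4.0], [5.0, 6.0]]
sorted_indices = [0, 1, 1, 2]
routing_map = [[1, 0], [1, 1], [0, 1]]
probs = [[1.0, 0.0], [0.25, 0.75], [0.0, 1.0]]

weighted = unpermute_reference(permuted_tokens, sorted_indices, (3, 2), probs, routing_map)
summed = unpermute_reference(permuted_tokens, sorted_indices, (3, 2))
print(weighted)  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(summed)    # [[1.0, 2.0], [6.0, 8.0], [5.0, 6.0]]
```

With probs, token 1 receives 0.25 and 0.75 of its two expert rows; without probs, the two rows are simply added.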
- nemo_automodel.components.moe.megatron.moe_utils.swiglu(y)
- nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu(y, weights)
- nemo_automodel.components.moe.megatron.moe_utils.swiglu_back(g, y)
- nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu_back(g, y, weights)
- class nemo_automodel.components.moe.megatron.moe_utils.WeightedSwiGLUFunction
Bases: torch.autograd.Function
- static forward(ctx, input, weights, fp8_input_store)
- static backward(ctx, grad_output)
- nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_swiglu_impl(input, weights, fp8_input_store=False)
Token-wise-weighted bias swiglu fusion.
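The swiglu entries carry no description here, so the following is only a sketch of the standard SwiGLU definition, silu(y1) * y2, with a per-token weight applied afterwards. It assumes the module follows that definition and splits the last dimension into two contiguous halves; `silu`, `swiglu_reference`, and `weighted_swiglu_reference` are hypothetical names, and lists stand in for tensors.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_reference(row):
    """Standard SwiGLU on one token: silu(first half) * second half (assumed layout)."""
    half = len(row) // 2
    gate, linear = row[:half], row[half:]
    return [silu(a) * b for a, b in zip(gate, linear)]


def weighted_swiglu_reference(rows, weights):
    """Scale each token's SwiGLU output by its weight; weights has shape [tokens, 1]."""
    return [[w[0] * v for v in swiglu_reference(row)] for row, w in zip(rows, weights)]


rows = [[1.0, 2.0], [0.0, 3.0]]   # [tokens=2, hidden * 2 = 2]
weights = [[0.5], [1.0]]          # one routing weight per token
out = weighted_swiglu_reference(rows, weights)
print(out)  # out[1][0] is 0.0 because silu(0) == 0
```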
- nemo_automodel.components.moe.megatron.moe_utils.quick_gelu(y: torch.Tensor, alpha: float = 1.702) -> torch.Tensor
Sigmoid approximation of GELU.
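To make the approximation concrete, the sketch below compares the usual Quick-GELU formula, x * sigmoid(alpha * x), with exact GELU on scalars. It assumes the function implements that formula; `quick_gelu_reference` and `gelu_exact` are hypothetical names, and the grid is arbitrary.

```python
import math


def quick_gelu_reference(x: float, alpha: float = 1.702) -> float:
    """Quick-GELU: x * sigmoid(alpha * x)."""
    return x / (1.0 + math.exp(-alpha * x))


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Largest gap between the two on a grid over [-4, 4].
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(quick_gelu_reference(x) - gelu_exact(x)) for x in grid)
print(f"max |quick_gelu - gelu| on [-4, 4]: {max_gap:.4f}")
```

The sigmoid form avoids evaluating erf, which is the usual motivation for this approximation.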
- nemo_automodel.components.moe.megatron.moe_utils.quick_geglu(y: torch.Tensor, linear_offset: float = 0.0)
Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + offset).
- Parameters:
y – Input tensor split into two halves on the last dimension.
linear_offset – Optional linear offset added to the second half before gating.
- Returns:
Tensor after applying the GEGLU activation.
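The formula quick_gelu(y1) * (y2 + offset) can be sketched on a single token as follows. This assumes y1 and y2 are the first and second contiguous halves of the last dimension; `quick_geglu_reference` and `quick_gelu_reference` are hypothetical names, and a list stands in for the tensor.

```python
import math


def quick_gelu_reference(x: float, alpha: float = 1.702) -> float:
    """Quick-GELU: x * sigmoid(alpha * x)."""
    return x / (1.0 + math.exp(-alpha * x))


def quick_geglu_reference(row, linear_offset: float = 0.0):
    """quick_gelu(y1) * (y2 + linear_offset), halves taken contiguously (assumed)."""
    half = len(row) // 2
    y1, y2 = row[:half], row[half:]
    return [quick_gelu_reference(a) * (b + linear_offset) for a, b in zip(y1, y2)]


row = [0.0, 1.0, 5.0, 7.0]                         # y1 = [0, 1], y2 = [5, 7]
plain = quick_geglu_reference(row)                 # offset 0.0
shifted = quick_geglu_reference(row, linear_offset=1.0)
print(len(plain), plain[0])                        # 2 0.0
```

The output has half the width of the input, and the offset only shifts the linear half.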
- nemo_automodel.components.moe.megatron.moe_utils.weighted_quick_geglu(y: torch.Tensor, weights: torch.Tensor, linear_offset: float = 0.0)
Token-wise-weighted Quick-GEGLU activation.
The weights tensor is expected to have the same first-dimension length as y and a trailing singleton dimension so that it broadcasts over the feature dimension.
- nemo_automodel.components.moe.megatron.moe_utils.quick_geglu_back(g, y, linear_offset: float = 0.0) -> torch.Tensor
- nemo_automodel.components.moe.megatron.moe_utils.weighted_quick_geglu_back(g, y, weights, linear_offset: float = 0.0)
Backward helper for weighted Quick-GEGLU. Returns gradients w.r.t. the input y and weights.
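The gradients such a backward helper has to produce follow from the forward formula w * quick_gelu(y1) * (y2 + offset). The scalar sketch below derives them by hand and checks them against central finite differences. It is not the library code: `forward` and `backward` are hypothetical names, one feature pair of one token is used, and in the tensor version the weight gradient would presumably be summed over the feature dimension.

```python
import math

ALPHA = 1.702  # default alpha of quick_gelu


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(y1, y2, w, offset=0.0):
    """Weighted Quick-GEGLU for one feature pair."""
    return w * (y1 * sigmoid(ALPHA * y1)) * (y2 + offset)


def backward(g, y1, y2, w, offset=0.0):
    """Analytic gradients of forward w.r.t. y1, y2 and w, scaled by upstream grad g."""
    s = sigmoid(ALPHA * y1)
    q = y1 * s                                   # quick_gelu(y1)
    dq = s * (1.0 + ALPHA * y1 * (1.0 - s))      # d quick_gelu / d y1
    grad_y1 = g * w * dq * (y2 + offset)
    grad_y2 = g * w * q
    grad_w = g * q * (y2 + offset)
    return grad_y1, grad_y2, grad_w


y1, y2, w, offset, eps = 0.7, -1.3, 0.4, 1.0, 1e-6
grad_y1, grad_y2, grad_w = backward(1.0, y1, y2, w, offset)

# Central finite differences as an independent check.
num_y1 = (forward(y1 + eps, y2, w, offset) - forward(y1 - eps, y2, w, offset)) / (2 * eps)
num_y2 = (forward(y1, y2 + eps, w, offset) - forward(y1, y2 - eps, w, offset)) / (2 * eps)
num_w = (forward(y1, y2, w + eps, offset) - forward(y1, y2, w - eps, offset)) / (2 * eps)
print(abs(grad_y1 - num_y1), abs(grad_y2 - num_y2), abs(grad_w - num_w))
```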
- nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu(y: torch.Tensor, bias: torch.Tensor, weights: torch.Tensor, linear_offset: float = 0.0)
Token-wise weighted Quick-GEGLU activation with bias.
- Parameters:
y – Input tensor before bias addition.
bias – Bias tensor broadcastable to y.
weights – Weight tensor with shape [tokens, 1], broadcasting over the feature dimension.
linear_offset – Optional linear offset for the second half before gating.
- Returns:
Activated tensor with the same dtype as y.
- nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu_back(g, y, bias, weights, linear_offset: float = 0.0)
Backward helper for weighted Quick-GEGLU with bias.
Returns gradients w.r.t. the input y, bias, and weights.
- class nemo_automodel.components.moe.megatron.moe_utils.WeightedQuickGeGLUFunction
Bases: torch.autograd.Function
Autograd function for token-wise weighted Quick-GEGLU (no bias).
- static forward(ctx, input: torch.Tensor, weights: torch.Tensor, fp8_input_store: bool, linear_offset: torch.Tensor)
- static backward(ctx, grad_output)
- class nemo_automodel.components.moe.megatron.moe_utils.WeightedBiasQuickGeGLUFunction
Bases: torch.autograd.Function
Autograd function for token-wise weighted Quick-GEGLU with bias support.
- static forward(ctx, input: torch.Tensor, bias: torch.Tensor, weights: torch.Tensor, fp8_input_store: bool, linear_offset: torch.Tensor)
- static backward(ctx, grad_output)
- nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu_impl(input, bias, weights, fp8_input_store=False, linear_offset=0.0, clamp_value=None, alpha=1.702)
Token-wise-weighted bias quick_geglu fusion.
input: [num_selected_experts * seq_len, hidden_size * 2]
bias: None
weights: [num_selected_experts * seq_len, 1]
fp8_input_store: bool
linear_offset: float
output: [num_selected_experts * seq_len, hidden_size]
- class nemo_automodel.components.moe.megatron.moe_utils.MoEAuxLossAutoScaler
Bases: torch.autograd.Function
An AutoScaler that triggers the backward pass and scales the grad for auxiliary loss.
- main_loss_backward_scale: torch.Tensor = None
- static forward(ctx, output: torch.Tensor, aux_loss: torch.Tensor)
Preserve the aux_loss by storing it in the context to avoid garbage collection.
- Parameters:
output (torch.Tensor) – The output tensor.
aux_loss (torch.Tensor) – The auxiliary loss tensor.
- Returns:
The output tensor.
- Return type:
torch.Tensor
- static backward(ctx, grad_output: torch.Tensor)
Compute and scale the gradient for the auxiliary loss.
- Parameters:
grad_output (torch.Tensor) – The gradient of the output.
- Returns:
The gradient of the output and the scaled auxiliary loss gradient.
- Return type:
Tuple[torch.Tensor, torch.Tensor]