`nemo_automodel.components.moe.megatron.moe_utils`#

Module Contents#

Classes#

`WeightedSwiGLUFunction`
`WeightedQuickGeGLUFunction`	Autograd function for token-wise weighted Quick-GEGLU (no bias).
`WeightedBiasQuickGeGLUFunction`	Autograd function for token-wise weighted Quick-GEGLU with bias support.
`MoEAuxLossAutoScaler`	An AutoScaler that triggers the backward pass and scales the grad for auxiliary loss.

Functions#

`permute`	Permute the tokens and probs based on the mask, grouping together tokens assigned to the same expert. The mask has shape [tokens, num_experts] and indicates which experts each token selected.
`unpermute`	Restore the original order of tokens after permutation. If probs are provided, it will also apply them to the tokens before restoring the order.
`swiglu`
`weighted_swiglu`
`swiglu_back`
`weighted_swiglu_back`
`weighted_bias_swiglu_impl`	Token-wise-weighted bias swiglu fusion.
`quick_gelu`	Sigmoid approximation of GELU.
`quick_geglu`	Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + offset).
`weighted_quick_geglu`	Token-wise-weighted Quick-GEGLU activation.
`quick_geglu_back`
`weighted_quick_geglu_back`	Backward helper for weighted Quick-GEGLU. Returns gradient w.r.t input `y` and `weights`.
`weighted_bias_quick_geglu`	Token-wise weighted Quick-GEGLU activation with bias.
`weighted_bias_quick_geglu_back`	Backward helper for weighted Quick-GEGLU with bias.
`weighted_bias_quick_geglu_impl`	Token-wise-weighted bias quick_geglu fusion. Shapes and types: input [num_selected_experts * seq_len, hidden_size * 2]; bias None; weights [num_selected_experts * seq_len, 1]; fp8_input_store bool; linear_offset float; output [num_selected_experts * seq_len, hidden_size].

API#

nemo_automodel.components.moe.megatron.moe_utils.permute( tokens, routing_map, probs: Optional[torch.Tensor] = None, num_out_tokens: Optional[int] = None, fused: bool = False, drop_and_pad: bool = False, )#

Permute the tokens and probs based on the mask, grouping together tokens assigned to the same expert. The mask has shape [tokens, num_experts] and indicates which experts each token selected.

When drop_and_pad=True, the number of non-zeros in each column of routing_map equals the expert capacity. This function exploits that property to use ops that support CUDA graphs.

Parameters:

tokens (torch.Tensor) – The input token tensor, [num_tokens, hidden].
routing_map (torch.Tensor) – The sparse token to expert mapping, [num_tokens, num_experts].
probs (torch.Tensor, optional) – The probs tensor, [num_tokens, num_experts].
num_out_tokens (int, optional) – The number of output tokens. If None, it’s set to the number of input tokens.
fused (bool, optional) – Whether to use the fused permute function.
drop_and_pad (bool, optional) – Whether or not the token dispatcher uses token-drop and pads the number of tokens to the expert capacity. If set to true, routing_map has a fixed number of non-zeros in each column.

Returns:

permuted_input (torch.Tensor) – The permuted token tensor.
permuted_probs (torch.Tensor, optional) – The permuted probs tensor.
sorted_indices (torch.Tensor) – The tensor of a mapping table for sorted indices used to unpermute the tokens.
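The grouping described above can be illustrated with a small pure-Python sketch. This is not the library implementation, which operates on torch.Tensor inputs and may use a fused kernel; `permute_sketch` and the list-of-lists inputs are hypothetical stand-ins. The sketch assumes the non-fused path without drop_and_pad, and it assumes rows are ordered by expert first and by original token index within each expert.

```python
def permute_sketch(tokens, routing_map, probs=None):
    """Group token rows by expert (illustrative only, plain lists instead of tensors)."""
    num_tokens = len(tokens)
    num_experts = len(routing_map[0])
    # Visit experts in order; within one expert keep the original token order.
    pairs = [
        (t, e)
        for e in range(num_experts)
        for t in range(num_tokens)
        if routing_map[t][e]
    ]
    sorted_indices = [t for t, _ in pairs]
    permuted_input = [tokens[t] for t in sorted_indices]
    permuted_probs = None if probs is None else [probs[t][e] for t, e in pairs]
    return permuted_input, permuted_probs, sorted_indices


# Three tokens with hidden size 1, two experts; token 1 is routed to both experts.
tokens = [[1.0], [2.0], [3.0]]
routing_map = [[1, 0], [1, 1], [0, 1]]
probs = [[0.9, 0.0], [0.6, 0.4], [0.0, 1.0]]

permuted_input, permuted_probs, sorted_indices = permute_sketch(tokens, routing_map, probs)
print(sorted_indices)
print(permuted_input)
print(permuted_probs)
```

In this example token 1 appears twice in the permuted output, once in each expert's group, so the output has four rows for three input tokens.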

nemo_automodel.components.moe.megatron.moe_utils.unpermute( permuted_tokens: torch.Tensor, sorted_indices: torch.Tensor, restore_shape: torch.Size, probs: torch.Tensor = None, routing_map: torch.Tensor = None, fused: bool = False, drop_and_pad: bool = False, )#

Restore the original order of tokens after permutation. If probs are provided, it will also apply them to the tokens before restoring the order.

When drop_and_pad=True, the tensors have the following properties:

In routing_map, the number of non-zeros in each column equals the expert capacity.
The size of sorted_indices equals num_experts * capacity; each split of capacity contains the indices of tokens routed to an expert.

This function exploits these properties to use ops that support CUDA graphs.

Parameters:

permuted_tokens (torch.Tensor) – The permuted token tensor.
sorted_indices (torch.Tensor) – The indices used to sort the tokens.
restore_shape (torch.Size) – The shape of the unpermuted tensor.
probs (torch.Tensor, optional) – The unpermuted probs tensor.
routing_map (torch.Tensor, optional) – Token to expert mapping, shape [num_tokens, num_experts].
fused (bool, optional) – Whether to use the fused unpermute function.
drop_and_pad (bool, optional) – Whether or not the token dispatcher uses token-drop and pads the number of tokens to the expert capacity.

Returns:

The tokens restored to their original order.

Return type:

torch.Tensor
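A pure-Python sketch of the restore step may help make the role of sorted_indices and probs concrete. It is an illustration under assumptions, not the library code: `unpermute_sketch` is a hypothetical name, plain lists stand in for tensors, and the sketch assumes that rows belonging to the same original token are summed after being scaled by their probability (the non-fused path without drop_and_pad).

```python
def unpermute_sketch(permuted_tokens, sorted_indices, restore_shape,
                     probs=None, routing_map=None):
    """Scatter permuted rows back to their original token positions (illustrative)."""
    num_tokens, hidden = restore_shape
    scales = [1.0] * len(sorted_indices)
    if probs is not None:
        # Reorder the unpermuted probs into the same expert-major order as the rows.
        num_experts = len(routing_map[0])
        scales = [
            probs[t][e]
            for e in range(num_experts)
            for t in range(num_tokens)
            if routing_map[t][e]
        ]
    output = [[0.0] * hidden for _ in range(num_tokens)]
    for row, token_idx in enumerate(sorted_indices):
        for h in range(hidden):
            # Accumulate: a token routed to several experts receives a weighted sum.
            output[token_idx][h] += scales[row] * permuted_tokens[row][h]
    return output


# Rows as they would come back from two identity "experts".
permuted_tokens = [[1.0], [2.0], [2.0], [3.0]]
sorted_indices = [0, 1, 1, 2]
routing_map = [[1, 0], [1, 1], [0, 1]]
probs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

restored = unpermute_sketch(permuted_tokens, sorted_indices, (3, 1), probs, routing_map)
unweighted = unpermute_sketch(permuted_tokens, sorted_indices, (3, 1))
print(restored)
print(unweighted)
```

With the probabilities supplied, token 1 is recovered as the probability-weighted combination of its two expert outputs; without them, the two rows are simply added.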

nemo_automodel.components.moe.megatron.moe_utils.swiglu(y)#

nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu(y, weights)#

nemo_automodel.components.moe.megatron.moe_utils.swiglu_back(g, y)#

nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu_back(g, y, weights)#
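This page lists the four SwiGLU helpers without descriptions. Assuming they follow the usual SwiGLU definition (split the last dimension into halves y1 and y2, return silu(y1) * y2) and that the weighted variant multiplies each token's row by its weight, a plain-Python sketch looks like the following. The `*_sketch` names are hypothetical and the behaviour is an assumption, not something this page confirms.

```python
import math


def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def swiglu_sketch(row):
    """Assumed SwiGLU on one token row: silu(first half) * second half."""
    half = len(row) // 2
    return [silu(a) * b for a, b in zip(row[:half], row[half:])]


def weighted_swiglu_sketch(rows, weights):
    """Assumed weighted variant: weights has one single-element row per token."""
    return [[w[0] * v for v in swiglu_sketch(row)] for row, w in zip(rows, weights)]


plain = swiglu_sketch([1.0, 2.0])
weighted = weighted_swiglu_sketch([[1.0, 2.0]], [[0.5]])
print(plain, weighted)
```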

class nemo_automodel.components.moe.megatron.moe_utils.WeightedSwiGLUFunction#

Bases: torch.autograd.Function

static forward(ctx, input, weights, fp8_input_store)#

static backward(ctx, grad_output)#

nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_swiglu_impl(input, weights, fp8_input_store=False)#

Token-wise-weighted bias swiglu fusion.

nemo_automodel.components.moe.megatron.moe_utils.quick_gelu(y: torch.Tensor, alpha: float = 1.702) → torch.Tensor#

Sigmoid approximation of GELU.
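The sigmoid approximation is commonly written as y * sigmoid(alpha * y). The scalar sketch below assumes that form and compares it with the exact erf-based GELU on a few sample points. `quick_gelu_sketch` and `exact_gelu` are hypothetical helper names; the library function operates on tensors.

```python
import math


def quick_gelu_sketch(y, alpha=1.702):
    # y * sigmoid(alpha * y), written as a single division
    return y / (1.0 + math.exp(-alpha * y))


def exact_gelu(y):
    # Exact GELU: y * Phi(y), with Phi the standard normal CDF
    return 0.5 * y * (1.0 + math.erf(y / math.sqrt(2.0)))


points = [-3.0, -1.0, 0.0, 1.0, 3.0]
max_gap = max(abs(quick_gelu_sketch(p) - exact_gelu(p)) for p in points)
print(f"largest gap on the sample points: {max_gap:.4f}")
```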

nemo_automodel.components.moe.megatron.moe_utils.quick_geglu( y: torch.Tensor, linear_offset: float = 0.0, ) → torch.Tensor#

Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + offset), where y1 and y2 are the two halves of y along the last dimension.

Parameters:

y – Input tensor split into two halves on the last dimension.
linear_offset – Optional linear offset added to the second half before gating.

Returns:

Tensor after applying the GEGLU activation.

nemo_automodel.components.moe.megatron.moe_utils.weighted_quick_geglu( y: torch.Tensor, weights: torch.Tensor, linear_offset: float = 0.0, ) → torch.Tensor#

Token-wise-weighted Quick-GEGLU activation.

The weights tensor is expected to have the same first-dimension length as y and a trailing singleton dimension so that it broadcasts over the feature dimension.

nemo_automodel.components.moe.megatron.moe_utils.quick_geglu_back(g, y, linear_offset: float = 0.0) → torch.Tensor#

nemo_automodel.components.moe.megatron.moe_utils.weighted_quick_geglu_back(g, y, weights, linear_offset: float = 0.0)#

Backward helper for weighted Quick-GEGLU. Returns gradient w.r.t input y and weights.
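The two gradients can be derived by differentiating the forward formula weights * quick_gelu(y1) * (y2 + offset). The sketch below does this for a single token row and checks the result against central finite differences. It assumes quick_gelu(x) = x * sigmoid(1.702 * x); the names `forward_row`, `backward_row`, and `numeric_grad` are hypothetical, and the library helper works on batched tensors rather than lists.

```python
import math

ALPHA = 1.702  # default alpha shown in the quick_gelu signature


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _quick_gelu(x):
    return x * _sigmoid(ALPHA * x)


def _quick_gelu_grad(x):
    # d/dx [x * s(ax)] = s(ax) + a * x * s(ax) * (1 - s(ax))
    s = _sigmoid(ALPHA * x)
    return s + ALPHA * x * s * (1.0 - s)


def forward_row(row, weight, linear_offset=0.0):
    half = len(row) // 2
    return [weight * _quick_gelu(a) * (b + linear_offset)
            for a, b in zip(row[:half], row[half:])]


def backward_row(g, row, weight, linear_offset=0.0):
    """Return (gradient w.r.t. the row, gradient w.r.t. the scalar weight)."""
    half = len(row) // 2
    y1, y2 = row[:half], row[half:]
    grad_y1 = [gi * weight * _quick_gelu_grad(a) * (b + linear_offset)
               for gi, a, b in zip(g, y1, y2)]
    grad_y2 = [gi * weight * _quick_gelu(a) for gi, a in zip(g, y1)]
    # The weight multiplies every feature, so its gradient sums over features.
    grad_weight = sum(gi * _quick_gelu(a) * (b + linear_offset)
                      for gi, a, b in zip(g, y1, y2))
    return grad_y1 + grad_y2, grad_weight


def numeric_grad(fn, x, eps=1e-6):
    # Central finite difference
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


row, weight, offset, g = [0.5, -1.0, 2.0, 0.25], 0.75, 1.0, [1.0, -2.0]
grad_row, grad_weight = backward_row(g, row, weight, offset)


def loss_wrt_weight(w):
    return sum(gi * oi for gi, oi in zip(g, forward_row(row, w, offset)))


def loss_wrt_entry(i):
    def fn(v):
        perturbed = row[:i] + [v] + row[i + 1:]
        return sum(gi * oi for gi, oi in zip(g, forward_row(perturbed, weight, offset)))
    return fn


numeric_row = [numeric_grad(loss_wrt_entry(i), row[i]) for i in range(len(row))]
numeric_weight = numeric_grad(loss_wrt_weight, weight)
print(grad_row, numeric_row)
print(grad_weight, numeric_weight)
```

Here g plays the role of the incoming gradient, so the scalar being differentiated is the dot product of g with the forward output.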

nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu( y: torch.Tensor, bias: torch.Tensor, weights: torch.Tensor, linear_offset: float = 0.0, ) → torch.Tensor#

Token-wise weighted Quick-GEGLU activation with bias.

Parameters:

y – Input tensor before bias addition.
bias – Bias tensor broadcastable to y.
weights – Weight tensor with shape [tokens, 1] broadcasting over feature dim.
linear_offset – Optional linear offset for the second half before gating.

Returns:

Activated tensor with same dtype as y.

nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu_back( g, y, bias, weights, linear_offset: float = 0.0, )#

Backward helper for weighted Quick-GEGLU with bias.

Returns gradients w.r.t input y, bias, and weights.

class nemo_automodel.components.moe.megatron.moe_utils.WeightedQuickGeGLUFunction#

Bases: torch.autograd.Function

Autograd function for token-wise weighted Quick-GEGLU (no bias).

static forward( ctx, input: torch.Tensor, weights: torch.Tensor, fp8_input_store: bool, linear_offset: torch.Tensor, )#

static backward(ctx, grad_output)#

class nemo_automodel.components.moe.megatron.moe_utils.WeightedBiasQuickGeGLUFunction#

Bases: torch.autograd.Function

Autograd function for token-wise weighted Quick-GEGLU with bias support.

static forward( ctx, input: torch.Tensor, bias: torch.Tensor, weights: torch.Tensor, fp8_input_store: bool, linear_offset: torch.Tensor, )#

static backward(ctx, grad_output)#

nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu_impl( input, bias, weights, fp8_input_store=False, linear_offset=0.0, clamp_value=None, alpha=1.702, )#

Token-wise-weighted bias quick_geglu fusion.

Parameters:

input – [num_selected_experts * seq_len, hidden_size * 2]
bias – None
weights – [num_selected_experts * seq_len, 1]
fp8_input_store – bool
linear_offset – float

Returns:

output – [num_selected_experts * seq_len, hidden_size]

class nemo_automodel.components.moe.megatron.moe_utils.MoEAuxLossAutoScaler#

Bases: torch.autograd.Function

An AutoScaler that triggers the backward pass and scales the grad for auxiliary loss.

main_loss_backward_scale: torch.Tensor#

Value: None

static forward(ctx, output: torch.Tensor, aux_loss: torch.Tensor)#

Preserve the aux_loss by storing it in the context to avoid garbage collection.

Parameters:

output (torch.Tensor) – The output tensor.
aux_loss (torch.Tensor) – The auxiliary loss tensor.

Returns:

The output tensor.

Return type:

torch.Tensor

static backward(ctx, grad_output: torch.Tensor)#

Compute and scale the gradient for auxiliary loss.

Parameters:

grad_output (torch.Tensor) – The gradient of the output.

Returns:

The gradient of the output and the scaled auxiliary loss gradient.

Return type:

Tuple[torch.Tensor, torch.Tensor]
