nemo_automodel.components.moe.megatron.moe_utils#

Module Contents#

Classes#

WeightedSwiGLUFunction

WeightedQuickGeGLUFunction

Autograd function for token-wise weighted Quick-GEGLU (no bias).

WeightedBiasQuickGeGLUFunction

Autograd function for token-wise weighted Quick-GEGLU with bias support.

MoEAuxLossAutoScaler

An AutoScaler that triggers the backward pass and scales the grad for auxiliary loss.

Functions#

permute

Permute the tokens and probs based on the mask. Tokens with the same designated expert will be grouped together. The mask has shape [tokens, num_experts] and indicates which experts were selected by each token.

unpermute

Restore the original order of tokens after permutation. If probs are provided, it will also apply them to the tokens before restoring the order.

swiglu

weighted_swiglu

swiglu_back

weighted_swiglu_back

weighted_bias_swiglu_impl

Token-wise-weighted bias swiglu fusion.

quick_gelu

Sigmoid approximation of GELU.

quick_geglu

Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + linear_offset).

weighted_quick_geglu

Token-wise-weighted Quick-GEGLU activation.

quick_geglu_back

weighted_quick_geglu_back

Backward helper for weighted Quick-GEGLU. Returns the gradients w.r.t. the input y and the weights.

weighted_bias_quick_geglu

Token-wise weighted Quick-GEGLU activation with bias.

weighted_bias_quick_geglu_back

Backward helper for weighted Quick-GEGLU with bias.

weighted_bias_quick_geglu_impl

Token-wise-weighted bias quick_geglu fusion. Shapes: input [num_selected_experts * seq_len, hidden_size * 2]; bias None; weights [num_selected_experts * seq_len, 1]; output [num_selected_experts * seq_len, hidden_size]. fp8_input_store is a bool and linear_offset is a float.

API#

nemo_automodel.components.moe.megatron.moe_utils.permute(
tokens,
routing_map,
probs: Optional[torch.Tensor] = None,
num_out_tokens: Optional[int] = None,
fused: bool = False,
drop_and_pad: bool = False,
)#

Permute the tokens and probs based on the mask (routing_map). Tokens with the same designated expert will be grouped together. The mask has shape [num_tokens, num_experts] and indicates which experts were selected by each token.

When drop_and_pad=True, the number of non-zeros in each column of routing_map equals the expert capacity. This function exploits that property to use ops that support CUDA graphs.

Parameters:
  • tokens (torch.Tensor) – The input token tensor, [num_tokens, hidden].

  • routing_map (torch.Tensor) – The sparse token to expert mapping, [num_tokens, num_experts].

  • probs (torch.Tensor, optional) – The probs tensor, [num_tokens, num_experts].

  • num_out_tokens (int, optional) – The number of output tokens. If None, it’s set to the number of input tokens.

  • fused (bool, optional) – Whether to use the fused permute function.

  • drop_and_pad (bool, optional) – Whether or not the token dispatcher uses token-drop and pads the number of tokens to the expert capacity. If set to true, routing_map has a fixed number of non-zeros in each column.

Returns:
  • permuted_input (torch.Tensor) – The permuted token tensor.

  • permuted_probs (torch.Tensor, optional) – The permuted probs tensor.

  • sorted_indices (torch.Tensor) – A mapping table of sorted indices, used to unpermute the tokens.
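
The grouping described above can be illustrated with a small pure-Python reference. This is only a sketch and not the library implementation: `permute_reference` is a hypothetical name, nested lists stand in for tensors, and it assumes rows are emitted in expert-major order (all rows for expert 0, then expert 1, and so on), keeping token order within each expert. The `num_out_tokens`, `fused`, and `drop_and_pad` options are not modelled.

```python
from typing import List, Optional, Tuple


def permute_reference(
    tokens: List[List[float]],
    routing_map: List[List[bool]],
    probs: Optional[List[List[float]]] = None,
) -> Tuple[List[List[float]], Optional[List[float]], List[int]]:
    """Hypothetical list-based reference: group token rows by expert."""
    num_tokens = len(tokens)
    num_experts = len(routing_map[0]) if routing_map else 0
    sorted_indices: List[int] = []
    permuted_probs: Optional[List[float]] = [] if probs is not None else None
    # Walk the mask column by column (assumed expert-major order).
    for expert in range(num_experts):
        for token in range(num_tokens):
            if routing_map[token][expert]:
                sorted_indices.append(token)
                if permuted_probs is not None:
                    permuted_probs.append(probs[token][expert])
    # A token selected by k experts appears k times in the output.
    permuted_input = [list(tokens[i]) for i in sorted_indices]
    return permuted_input, permuted_probs, sorted_indices


tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
routing_map = [[True, False], [True, True], [False, True]]
probs = [[0.9, 0.0], [0.6, 0.4], [0.0, 1.0]]
permuted_input, permuted_probs, sorted_indices = permute_reference(
    tokens, routing_map, probs
)
```

In this example token 1 is routed to both experts, so its row appears once in each expert's group.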

nemo_automodel.components.moe.megatron.moe_utils.unpermute(
permuted_tokens: torch.Tensor,
sorted_indices: torch.Tensor,
restore_shape: torch.Size,
probs: torch.Tensor = None,
routing_map: torch.Tensor = None,
fused: bool = False,
drop_and_pad: bool = False,
)#

Restore the original order of tokens after permutation. If probs are provided, it will also apply them to the tokens before restoring the order.

When drop_and_pad=True, the tensors will have the following properties:

  • In routing_map, the number of non-zeros in each column equals the expert capacity.

  • The size of sorted_indices equals num_experts * capacity; each split of capacity contains the indices of the tokens routed to one expert.

This function exploits these properties to use ops that support CUDA graphs.

Parameters:
  • permuted_tokens (torch.Tensor) – The permuted token tensor.

  • sorted_indices (torch.Tensor) – The indices used to sort the tokens.

  • restore_shape (torch.Size) – The shape of the unpermuted tensor.

  • probs (torch.Tensor, optional) – The unpermuted probs tensor.

  • routing_map (torch.Tensor, optional) – Token to expert mapping, shape [num_tokens, num_experts].

  • fused (bool, optional) – Whether to use the fused unpermute function.

  • drop_and_pad (bool, optional) – Whether or not the token dispatcher uses token-drop and pads the number of tokens to the expert capacity.

Returns:

The tokens restored to their original order.

Return type:

torch.Tensor
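
As a rough illustration of the restore step, the sketch below scatters permuted rows back to their original token positions and sums the contributions of tokens that were routed to more than one expert. It is a hypothetical list-based reference (`unpermute_reference` is not part of the module), it assumes the permuted rows are in expert-major order, and it does not model `fused` or `drop_and_pad`.

```python
from typing import List, Optional, Tuple


def unpermute_reference(
    permuted_tokens: List[List[float]],
    sorted_indices: List[int],
    restore_shape: Tuple[int, int],
    probs: Optional[List[List[float]]] = None,
    routing_map: Optional[List[List[bool]]] = None,
) -> List[List[float]]:
    """Hypothetical list-based reference: undo an expert-major permutation."""
    num_tokens, hidden = restore_shape
    row_weights = [1.0] * len(sorted_indices)
    if probs is not None and routing_map is not None:
        # Gather one prob per permuted row, in the same (assumed)
        # expert-major order that produced sorted_indices.
        row_weights = [
            probs[token][expert]
            for expert in range(len(routing_map[0]))
            for token in range(num_tokens)
            if routing_map[token][expert]
        ]
    output = [[0.0] * hidden for _ in range(num_tokens)]
    for row, token in enumerate(sorted_indices):
        for h in range(hidden):
            # Accumulate: a token sent to k experts receives k contributions.
            output[token][h] += row_weights[row] * permuted_tokens[row][h]
    return output


permuted_tokens = [[1.0, 1.0], [2.0, 2.0], [2.0, 2.0], [3.0, 3.0]]
sorted_indices = [0, 1, 1, 2]
routing_map = [[True, False], [True, True], [False, True]]
probs = [[0.9, 0.0], [0.6, 0.4], [0.0, 1.0]]
restored = unpermute_reference(
    permuted_tokens, sorted_indices, (3, 2), probs, routing_map
)
```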

nemo_automodel.components.moe.megatron.moe_utils.swiglu(y)#
nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu(y, weights)#
nemo_automodel.components.moe.megatron.moe_utils.swiglu_back(g, y)#
nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu_back(g, y, weights)#
class nemo_automodel.components.moe.megatron.moe_utils.WeightedSwiGLUFunction#

Bases: torch.autograd.Function

static forward(ctx, input, weights, fp8_input_store)#
static backward(ctx, grad_output)#
nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_swiglu_impl(input, weights, fp8_input_store=False)#

Token-wise-weighted bias swiglu fusion.

nemo_automodel.components.moe.megatron.moe_utils.quick_gelu(y: torch.Tensor, alpha: float = 1.702) -> torch.Tensor#

Sigmoid approximation of GELU.

nemo_automodel.components.moe.megatron.moe_utils.quick_geglu(
y: torch.Tensor,
linear_offset: float = 0.0,
) -> torch.Tensor#

Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + linear_offset).

Parameters:
  • y – Input tensor split into two halves on the last dimension.

  • linear_offset – Optional linear offset added to the second half before gating.

Returns:

Tensor after applying the GEGLU activation.
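
The formula above can be worked through on a single row with plain Python floats. This is a sketch under stated assumptions rather than the module's code: the helper names are hypothetical, Quick-GELU is taken to be the usual `y * sigmoid(alpha * y)` form, and `y1` and `y2` are taken to be the first and second halves of the last dimension, as the parameter description says.

```python
import math
from typing import List


def quick_gelu_scalar(y: float, alpha: float = 1.702) -> float:
    """Assumed Quick-GELU form: y * sigmoid(alpha * y)."""
    return y / (1.0 + math.exp(-alpha * y))


def quick_geglu_row(row: List[float], linear_offset: float = 0.0) -> List[float]:
    """Apply quick_gelu(y1) * (y2 + linear_offset) to one row of features."""
    half = len(row) // 2
    y1, y2 = row[:half], row[half:]  # split the last dimension into halves
    return [quick_gelu_scalar(a) * (b + linear_offset) for a, b in zip(y1, y2)]


# A row with 4 input features yields 2 output features.
activated = quick_geglu_row([0.0, 1.0, 5.0, 7.0])
shifted = quick_geglu_row([0.0, 1.0, 5.0, 7.0], linear_offset=1.0)
```

Note that the output has half as many features as the input, and that `linear_offset` shifts only the linear half, not the gated half. The sketch is not numerically hardened for very large negative inputs.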

nemo_automodel.components.moe.megatron.moe_utils.weighted_quick_geglu(
y: torch.Tensor,
weights: torch.Tensor,
linear_offset: float = 0.0,
) -> torch.Tensor#

Token-wise-weighted Quick-GEGLU activation.

The weights tensor is expected to have the same first-dimension length as y and a trailing singleton dimension so that it broadcasts over the feature dimension.

nemo_automodel.components.moe.megatron.moe_utils.quick_geglu_back(g, y, linear_offset: float = 0.0) -> torch.Tensor#
nemo_automodel.components.moe.megatron.moe_utils.weighted_quick_geglu_back(g, y, weights, linear_offset: float = 0.0)#

Backward helper for weighted Quick-GEGLU. Returns the gradients w.r.t. the input y and the weights.
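
To make the two returned gradients concrete, here is a hand-derived sketch for a single token row, using plain floats. It is an illustration, not the module's implementation: the function names are hypothetical, the forward pass is assumed to be `weight * quick_gelu(y1) * (y2 + linear_offset)` with `y1` and `y2` the two halves of the row, and any dtype handling in the real helper is ignored.

```python
import math
from typing import List, Tuple

ALPHA = 1.702  # default alpha from the quick_gelu signature


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def weighted_quick_geglu_row(
    row: List[float], weight: float, linear_offset: float = 0.0
) -> List[float]:
    """Assumed forward: weight * quick_gelu(y1) * (y2 + linear_offset)."""
    half = len(row) // 2
    return [
        weight * (a * _sigmoid(ALPHA * a)) * (b + linear_offset)
        for a, b in zip(row[:half], row[half:])
    ]


def weighted_quick_geglu_back_row(
    grad_out: List[float], row: List[float], weight: float, linear_offset: float = 0.0
) -> Tuple[List[float], float]:
    """Return (grad w.r.t. the row, grad w.r.t. the per-token weight)."""
    half = len(row) // 2
    grad_y1, grad_y2, grad_weight = [], [], 0.0
    for g, a, b in zip(grad_out, row[:half], row[half:]):
        s = _sigmoid(ALPHA * a)
        gelu = a * s
        dgelu = s + ALPHA * a * s * (1.0 - s)  # derivative of a * sigmoid(ALPHA * a)
        grad_y1.append(g * weight * dgelu * (b + linear_offset))
        grad_y2.append(g * weight * gelu)
        # The weight is shared by every feature, so its gradient is a sum.
        grad_weight += g * gelu * (b + linear_offset)
    return grad_y1 + grad_y2, grad_weight


row = [0.5, -1.0, 2.0, 0.25]
grad_row, grad_weight = weighted_quick_geglu_back_row([1.0, 1.0], row, 0.7, 1.0)
```

The gradient for the row has the same length as the input (both halves), while the weight gradient collapses to one value per token, matching the `[tokens, 1]` weight shape.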

nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu(
y: torch.Tensor,
bias: torch.Tensor,
weights: torch.Tensor,
linear_offset: float = 0.0,
) -> torch.Tensor#

Token-wise weighted Quick-GEGLU activation with bias.

Parameters:
  • y – Input tensor before bias addition.

  • bias – Bias tensor broadcastable to y.

  • weights – Weight tensor with shape [tokens, 1] broadcasting over feature dim.

  • linear_offset – Optional linear offset for the second half before gating.

Returns:

Activated tensor with same dtype as y.

nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu_back(
g,
y,
bias,
weights,
linear_offset: float = 0.0,
)#

Backward helper for weighted Quick-GEGLU with bias.

Returns the gradients w.r.t. the input y, the bias, and the weights.

class nemo_automodel.components.moe.megatron.moe_utils.WeightedQuickGeGLUFunction#

Bases: torch.autograd.Function

Autograd function for token-wise weighted Quick-GEGLU (no bias).

static forward(
ctx,
input: torch.Tensor,
weights: torch.Tensor,
fp8_input_store: bool,
linear_offset: torch.Tensor,
)#
static backward(ctx, grad_output)#
class nemo_automodel.components.moe.megatron.moe_utils.WeightedBiasQuickGeGLUFunction#

Bases: torch.autograd.Function

Autograd function for token-wise weighted Quick-GEGLU with bias support.

static forward(
ctx,
input: torch.Tensor,
bias: torch.Tensor,
weights: torch.Tensor,
fp8_input_store: bool,
linear_offset: torch.Tensor,
)#
static backward(ctx, grad_output)#
nemo_automodel.components.moe.megatron.moe_utils.weighted_bias_quick_geglu_impl(
input,
bias,
weights,
fp8_input_store=False,
linear_offset=0.0,
clamp_value=None,
alpha=1.702,
)#

Token-wise-weighted bias quick_geglu fusion.

  • input – [num_selected_experts * seq_len, hidden_size * 2]

  • bias – None

  • weights – [num_selected_experts * seq_len, 1]

  • fp8_input_store – bool

  • linear_offset – float

  • output – [num_selected_experts * seq_len, hidden_size]

class nemo_automodel.components.moe.megatron.moe_utils.MoEAuxLossAutoScaler#

Bases: torch.autograd.Function

An AutoScaler that triggers the backward pass and scales the grad for auxiliary loss.

main_loss_backward_scale: torch.Tensor#

None

static forward(ctx, output: torch.Tensor, aux_loss: torch.Tensor)#

Preserve the aux_loss by storing it in the context to avoid garbage collection.

Parameters:
  • output (torch.Tensor) – The output tensor.

  • aux_loss (torch.Tensor) – The auxiliary loss tensor.

Returns:

The output tensor.

Return type:

torch.Tensor

static backward(ctx, grad_output: torch.Tensor)#

Compute and scale the gradient for the auxiliary loss.

Parameters:
  • grad_output (torch.Tensor) – The gradient of the output.

Returns:

The gradient of the output and the scaled auxiliary loss gradient.

Return type:

Tuple[torch.Tensor, torch.Tensor]
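
The forward and backward descriptions above can be summarised as follows. This is a sketch of the intended behaviour, not a statement of the exact implementation: it assumes the auxiliary-loss gradient is a tensor of ones shaped like aux_loss, multiplied by the class-level main_loss_backward_scale (written $s$ here), and that grad_output (written $g$) passes through unchanged.

```latex
\begin{aligned}
\text{forward:}\quad  & f(\text{output},\, L_{\text{aux}}) = \text{output} \\
\text{backward:}\quad & \frac{\partial \mathcal{L}}{\partial\, \text{output}} = g,
\qquad
\frac{\partial \mathcal{L}}{\partial L_{\text{aux}}} = s \cdot \mathbf{1}
\end{aligned}
```

Under these assumptions the auxiliary loss does not change the forward value of output at all; it only injects an extra gradient into the graph that produced aux_loss when the main loss is backpropagated.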