nemo_automodel.components.moe.megatron.moe_utils#

Module Contents#

Classes#

Functions#

permute

Permute the tokens and probs based on the mask. Tokens with the same designated expert will be grouped together. The mask has shape [tokens, num_experts] and indicates which experts were selected by each token.

unpermute

Restore the original order of tokens after permutation. If probs are provided, they are also applied to the tokens before the order is restored.

swiglu

weighted_swiglu

swiglu_back

weighted_swiglu_back

API#

nemo_automodel.components.moe.megatron.moe_utils.permute(
tokens,
routing_map,
probs: Optional[torch.Tensor] = None,
num_out_tokens: Optional[int] = None,
fused: bool = False,
drop_and_pad: bool = False,
)#

Permute the tokens and probs based on the mask. Tokens with the same designated expert will be grouped together. The mask has shape [tokens, num_experts] and indicates which experts were selected by each token.

When drop_and_pad=True, the number of non-zeros in each column of routing_map equals the expert capacity. This function exploits this property to use ops that support CUDA graphs.

Parameters:
  • tokens (torch.Tensor) – The input token tensor, [num_tokens, hidden].

  • routing_map (torch.Tensor) – The sparse token to expert mapping, [num_tokens, num_experts].

  • probs (torch.Tensor, optional) – The probs tensor, [num_tokens, num_experts].

  • num_out_tokens (int, optional) – The number of output tokens. If None, it’s set to the number of input tokens.

  • fused (bool, optional) – Whether to use the fused permute function.

  • drop_and_pad (bool, optional) – Whether the token dispatcher uses token-drop and pads the number of tokens to the expert capacity. If True, routing_map has a fixed number of non-zeros in each column.

Returns:

  • permuted_input (torch.Tensor) – The permuted token tensor.

  • permuted_probs (torch.Tensor, optional) – The permuted probs tensor.

  • sorted_indices (torch.Tensor) – A mapping table of sorted indices used to unpermute the tokens.
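The grouping behavior described above can be sketched with plain PyTorch ops. This is a hypothetical minimal re-implementation (here named `permute_ref`), not the actual `moe_utils` code; it omits the `num_out_tokens`, `fused`, and `drop_and_pad` paths.

```python
import torch


def permute_ref(tokens, routing_map, probs=None):
    """Sketch of the unfused permute path: group tokens by their designated expert."""
    num_tokens, num_experts = routing_map.shape
    # Walk the mask expert-by-expert (transposed view) so all token indices
    # routed to expert 0 come first, then expert 1, and so on.
    token_indices = torch.arange(num_tokens).repeat(num_experts, 1)
    sorted_indices = token_indices.masked_select(routing_map.t().bool())
    permuted_tokens = tokens.index_select(0, sorted_indices)
    permuted_probs = None
    if probs is not None:
        # Keep only the prob of each (token, expert) pair that was selected,
        # in the same expert-major order as the permuted tokens.
        permuted_probs = probs.t().masked_select(routing_map.t().bool())
    return permuted_tokens, permuted_probs, sorted_indices


tokens = torch.arange(8.0).reshape(4, 2)   # [num_tokens=4, hidden=2]
routing_map = torch.tensor([[1, 0],        # token 0 -> expert 0
                            [0, 1],        # token 1 -> expert 1
                            [1, 0],        # token 2 -> expert 0
                            [0, 1]])       # token 3 -> expert 1
out, _, idx = permute_ref(tokens, routing_map)
# Tokens 0 and 2 (expert 0) come first, then tokens 1 and 3 (expert 1).
```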

nemo_automodel.components.moe.megatron.moe_utils.unpermute(
permuted_tokens: torch.Tensor,
sorted_indices: torch.Tensor,
restore_shape: torch.Size,
probs: torch.Tensor = None,
routing_map: torch.Tensor = None,
fused: bool = False,
drop_and_pad: bool = False,
)#

Restore the original order of tokens after permutation. If probs are provided, they are also applied to the tokens before the order is restored.

When drop_and_pad=True, the tensors will have the following properties:

  • In routing_map, the number of non-zeros in each column equals the expert capacity.

  • The size of sorted_indices equals num_experts * capacity; each chunk of capacity entries contains the indices of the tokens routed to one expert.

This function exploits these properties to use ops that support CUDA graphs.

Parameters:
  • permuted_tokens (torch.Tensor) – The permuted token tensor.

  • sorted_indices (torch.Tensor) – The indices used to sort the tokens.

  • restore_shape (torch.Size) – The shape of the unpermuted tensor.

  • probs (torch.Tensor, optional) – The unpermuted probs tensor.

  • routing_map (torch.Tensor, optional) – Token to expert mapping, shape [num_tokens, num_experts].

  • fused (bool, optional) – Whether to use the fused unpermute function.

  • drop_and_pad (bool, optional) – Whether or not the token dispatcher uses token-drop and pads the number of tokens to the expert capacity.

Returns:

The tokens restored to their original order.

Return type:

torch.Tensor
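The restore step is essentially a scatter-add keyed by `sorted_indices`: because a token may be routed to several experts, its expert copies are summed back into one row. The sketch below (a hypothetical `unpermute_ref`, not the library code) shows the unfused path; the optional prob weighting happens before the scatter.

```python
import torch


def unpermute_ref(permuted_tokens, sorted_indices, restore_shape, permuted_probs=None):
    """Sketch of the unfused unpermute path: scatter-add tokens back to original order."""
    if permuted_probs is not None:
        # Weight each expert copy by its routing prob before combining.
        permuted_tokens = permuted_tokens * permuted_probs.unsqueeze(-1)
    output = torch.zeros(restore_shape, dtype=permuted_tokens.dtype)
    # index_add_ sums the contributions of every expert copy of a token.
    output.index_add_(0, sorted_indices, permuted_tokens)
    return output


# Round-trip the toy permutation: tokens 0..3 grouped by expert as [0, 2, 1, 3].
permuted = torch.tensor([[0.0, 1.0], [4.0, 5.0], [2.0, 3.0], [6.0, 7.0]])
idx = torch.tensor([0, 2, 1, 3])
restored = unpermute_ref(permuted, idx, (4, 2))
```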

nemo_automodel.components.moe.megatron.moe_utils.swiglu(y)#
nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu(y, weights)#
nemo_automodel.components.moe.megatron.moe_utils.swiglu_back(g, y)#
nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu_back(g, y, weights)#
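The `swiglu` / `weighted_swiglu` pair is undocumented above; a common convention (used by Megatron-style MoE MLPs, assumed here) is to chunk the last dimension in half and gate the second half with SiLU of the first. The reference functions below are illustrative sketches, not the actual implementations.

```python
import torch
import torch.nn.functional as F


def swiglu_ref(y):
    """Assumed convention: split last dim in half, gate with SiLU.

    swiglu(y) = silu(y1) * y2, where y = concat([y1, y2], dim=-1).
    """
    y1, y2 = torch.chunk(y, 2, dim=-1)
    return F.silu(y1) * y2


def weighted_swiglu_ref(y, weights):
    """Weighted variant: scale the activation by per-token routing weights."""
    return swiglu_ref(y) * weights


y = torch.tensor([[1.0, -2.0, 3.0, 4.0]])   # y1 = [1, -2], y2 = [3, 4]
w = torch.tensor([[0.5]])                    # per-token weight, broadcast over hidden
out = weighted_swiglu_ref(y, w)
```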
class nemo_automodel.components.moe.megatron.moe_utils.WeightedSwiGLUFunction#

Bases: torch.autograd.Function

static forward(ctx, input, weights, fp8_input_store)#
static backward(ctx, grad_output)#
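To illustrate how such a `torch.autograd.Function` pairs a forward with a hand-written backward, here is a hypothetical reference version (the `fp8_input_store` memory optimization is omitted; the chunk-in-half SwiGLU convention is assumed). The backward returns one gradient per forward input.

```python
import torch
import torch.nn.functional as F


class WeightedSwiGLURef(torch.autograd.Function):
    """Sketch of a weighted-SwiGLU autograd pairing; not the library implementation."""

    @staticmethod
    def forward(ctx, y, weights):
        ctx.save_for_backward(y, weights)
        y1, y2 = torch.chunk(y, 2, dim=-1)
        return F.silu(y1) * y2 * weights

    @staticmethod
    def backward(ctx, grad_output):
        y, weights = ctx.saved_tensors
        y1, y2 = torch.chunk(y, 2, dim=-1)
        sig = torch.sigmoid(y1)
        silu = y1 * sig
        dsilu = sig * (1 + y1 * (1 - sig))   # d/dx [x * sigmoid(x)]
        g = grad_output * weights
        # Gradients w.r.t. the two halves of y, concatenated back together.
        grad_y = torch.cat([g * y2 * dsilu, g * silu], dim=-1)
        # weights has shape [tokens, 1], so reduce over the hidden dim.
        grad_w = (grad_output * silu * y2).sum(dim=-1, keepdim=True)
        return grad_y, grad_w
```

`torch.autograd.gradcheck` with double-precision inputs is a convenient way to confirm the analytic backward matches finite differences.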