nemo_automodel.components.moe.megatron.moe_utils#
Module Contents#
Classes#
Functions#
permute – Permute the tokens and probs based on the mask. Tokens with the same designated expert are grouped together. The mask has shape [tokens, num_experts] and indicates which experts were selected by each token.

unpermute – Restore the original order of tokens after permutation. If probs are provided, they are applied to the tokens before the order is restored.
API#
- nemo_automodel.components.moe.megatron.moe_utils.permute(
- tokens,
- routing_map,
- probs: Optional[torch.Tensor] = None,
- num_out_tokens: Optional[int] = None,
- fused: bool = False,
- drop_and_pad: bool = False,
- )
Permute the tokens and probs based on the mask. Tokens with the same designated expert are grouped together. The mask has shape [tokens, num_experts] and indicates which experts were selected by each token.
When drop_and_pad=True, the number of non-zeros in each column of routing_map equals the expert capacity. This function exploits that property to use ops that support CUDA graphs.
- Parameters:
tokens (torch.Tensor) – The input token tensor, [num_tokens, hidden].
routing_map (torch.Tensor) – The sparse token to expert mapping, [num_tokens, num_experts].
probs (torch.Tensor, optional) – The probs tensor, [num_tokens, num_experts].
num_out_tokens (int, optional) – The number of output tokens. If None, it’s set to the number of input tokens.
fused (bool, optional) – Whether to use the fused permute function.
drop_and_pad (bool, optional) – Whether or not the token dispatcher uses token-drop and pads the number of tokens to the expert capacity. If set to true, routing_map has a fixed number of non-zeros in each column.
- Returns:
permuted_input (torch.Tensor): The permuted token tensor.
permuted_probs (torch.Tensor, optional): The permuted probs tensor.
sorted_indices (torch.Tensor): A mapping table of sorted indices, used to unpermute the tokens.
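The grouping behavior described above can be sketched in plain PyTorch. This is a simplified illustration of the unfused path, not the fused or CUDA-graph-capable implementation, and `permute_sketch` is a hypothetical name:

```python
import torch

def permute_sketch(tokens: torch.Tensor, routing_map: torch.Tensor):
    """Group tokens by their designated expert (unfused sketch)."""
    num_tokens, num_experts = routing_map.shape
    # Transpose the map so each row lists the tokens selected by one expert,
    # then read out the token indices expert by expert.
    mask = routing_map.bool().T.contiguous()            # [num_experts, num_tokens]
    token_indices = (
        torch.arange(num_tokens, device=tokens.device)
        .unsqueeze(0)
        .expand(num_experts, -1)
    )
    sorted_indices = token_indices.masked_select(mask)  # flat, expert-major order
    permuted_input = tokens.index_select(0, sorted_indices)
    return permuted_input, sorted_indices

# Three tokens, two experts; token 1 is routed to both experts.
tokens = torch.tensor([[1.0], [2.0], [3.0]])
routing_map = torch.tensor([[1, 0], [1, 1], [0, 1]])
permuted, sorted_indices = permute_sketch(tokens, routing_map)
# Expert 0 receives tokens 0 and 1; expert 1 receives tokens 1 and 2.
```

Note that a token routed to k experts appears k times in the output, which is why `num_out_tokens` can exceed the number of input tokens.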
- nemo_automodel.components.moe.megatron.moe_utils.unpermute(
- permuted_tokens: torch.Tensor,
- sorted_indices: torch.Tensor,
- restore_shape: torch.Size,
- probs: torch.Tensor = None,
- routing_map: torch.Tensor = None,
- fused: bool = False,
- drop_and_pad: bool = False,
- )
Restore the original order of tokens after permutation. If probs are provided, they are applied to the tokens before the order is restored.
When drop_and_pad=True, the tensors have the following properties:
- In routing_map, the number of non-zeros in each column equals the expert capacity.
- The size of sorted_indices equals num_experts * capacity; each split of capacity contains the indices of tokens routed to one expert.
This function exploits these properties to use ops that support CUDA graphs.
- Parameters:
permuted_tokens (torch.Tensor) – The permuted token tensor.
sorted_indices (torch.Tensor) – The indices used to sort the tokens.
restore_shape (torch.Size) – The shape of the unpermuted tensor.
probs (torch.Tensor, optional) – The unpermuted probs tensor, shape [num_tokens, num_experts].
routing_map (torch.Tensor, optional) – Token to expert mapping, shape [num_tokens, num_experts].
fused (bool, optional) – Whether to use the fused unpermute function.
drop_and_pad (bool, optional) – Whether or not the token dispatcher uses token-drop and pads the number of tokens to the expert capacity.
- Returns:
The tokens restored to their original order.
- Return type:
torch.Tensor
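The restore step amounts to a scatter-add back into the original rows, optionally weighting each copy by its routing probability. Again a simplified unfused sketch with an illustrative name, not the library's fused path:

```python
import torch

def unpermute_sketch(permuted_tokens, sorted_indices, restore_shape,
                     probs=None, routing_map=None):
    """Scatter permuted tokens back to their original rows (unfused sketch)."""
    if probs is not None:
        # Select the probs in the same expert-major order as the tokens.
        permuted_probs = probs.T.contiguous().masked_select(
            routing_map.bool().T.contiguous()
        )
        permuted_tokens = permuted_tokens * permuted_probs.unsqueeze(-1)
    output = torch.zeros(restore_shape, dtype=permuted_tokens.dtype)
    # A token that was routed to several experts is summed back into one row.
    output.index_add_(0, sorted_indices, permuted_tokens)
    return output

# Round trip for a three-token example: token 1 was sent to two experts,
# so its two copies are summed on the way back.
permuted = torch.tensor([[1.0], [2.0], [2.0], [3.0]])
sorted_indices = torch.tensor([0, 1, 1, 2])
restored = unpermute_sketch(permuted, sorted_indices, torch.Size([3, 1]))
```

With probs supplied, the sum becomes the probability-weighted combination of expert outputs that MoE layers need.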
- nemo_automodel.components.moe.megatron.moe_utils.swiglu(y)#
- nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu(y, weights)#
- nemo_automodel.components.moe.megatron.moe_utils.swiglu_back(g, y)#
- nemo_automodel.components.moe.megatron.moe_utils.weighted_swiglu_back(g, y, weights)#
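These helpers compute the SwiGLU activation and its manual backward. A minimal sketch of the forward/backward pair, assuming the gate and value halves are split along the last dimension (the usual convention; the `_sketch` names are illustrative):

```python
import torch
import torch.nn.functional as F

def swiglu_sketch(y: torch.Tensor) -> torch.Tensor:
    # Split the last dim into gate (y1) and value (y2) halves.
    y1, y2 = torch.chunk(y, 2, dim=-1)
    return F.silu(y1) * y2

def swiglu_back_sketch(g: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Manual gradient of swiglu w.r.t. y, given upstream grad g.
    y1, y2 = torch.chunk(y, 2, dim=-1)
    sig = torch.sigmoid(y1)
    # d silu(x)/dx = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    dy1 = g * y2 * sig * (1 + y1 * (1 - sig))
    dy2 = g * F.silu(y1)
    return torch.cat([dy1, dy2], dim=-1)

# Cross-check the manual backward against autograd.
y = torch.randn(4, 8, requires_grad=True)
g = torch.randn(4, 4)
swiglu_sketch(y).backward(g)
manual = swiglu_back_sketch(g, y.detach())
```

The weighted variants additionally scale the output (and its gradient) by per-token weights, which lets the routing probabilities be folded into the activation.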