nemo_automodel.components.moe.layers
Module Contents
Classes
Functions
API
Bases: Module
Load-balanced gate implementation that spreads tokens uniformly across all experts. This class exists to support performance experiments that measure how load imbalance on real data affects end-to-end performance.
When noise > 0, random perturbation is added to mimic realistic routing
imbalance. A noise value of 0.0 gives perfectly balanced assignment, while
1.0 gives fully random expert selection and non-uniform weights.
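The two endpoints of the noise setting can be illustrated with a minimal, framework-free sketch. Everything here is an assumption made for illustration: the function name `balanced_assignment`, the round-robin baseline, and the way noise is blended in are hypothetical and are not taken from the library. The sketch only reproduces the documented behavior at noise 0.0 (balanced, uniform weights) and noise 1.0 (random experts, non-uniform weights).

```python
import random


def balanced_assignment(num_tokens, num_experts, top_k, noise=0.0, seed=0):
    """Hypothetical sketch: returns (indices, weights), one row per token."""
    rng = random.Random(seed)
    indices, weights = [], []
    for t in range(num_tokens):
        # Balanced baseline: walk over the experts round-robin.
        row = [(t * top_k + k) % num_experts for k in range(top_k)]
        # Assumed perturbation: with probability `noise`, pick random experts.
        if rng.random() < noise:
            row = rng.sample(range(num_experts), top_k)
        # Blend uniform weights with random ones; each row still sums to 1.
        raw = [rng.random() + 1e-9 for _ in range(top_k)]
        total = sum(raw)
        w = [(1.0 - noise) / top_k + noise * r / total for r in raw]
        indices.append(row)
        weights.append(w)
    return indices, weights


def expert_counts(indices, num_experts):
    """Count how many routing slots each expert received."""
    counts = [0] * num_experts
    for row in indices:
        for e in row:
            counts[e] += 1
    return counts


balanced_idx, balanced_w = balanced_assignment(8, 4, 2, noise=0.0)
noisy_idx, noisy_w = balanced_assignment(8, 4, 2, noise=1.0)
print(expert_counts(balanced_idx, 4))  # [4, 4, 4, 4]
```

Comparing `expert_counts` for the two runs gives a quick view of how much imbalance a given noise level introduces in this toy setup.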
Forward pass for the gating mechanism.
Parameters:
Input tensor.
Boolean mask indicating valid tokens.
Device mesh for context parallel computation.
Returns: torch.Tensor
Routing weights for the selected experts.
Bases: Module
Gating mechanism for routing inputs in a mixture-of-experts (MoE) model.
Computes the auxiliary loss for load balancing.
Warning: Assumes a batch size of 1. If the batch size is greater than 1, the aux_loss is computed across multiple sequences.
Parameters:
Original scores from the gating mechanism. Shape is [num_tokens, num_experts].
Load of each expert (number of tokens routed to each expert). Shape is [num_experts].
Boolean mask indicating valid tokens. Shape is [num_tokens].
Device mesh for context parallel computation.
Returns: torch.Tensor
torch.Tensor: Auxiliary loss for load balancing. Shape is [].
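As a rough illustration of how such a loss can be formed from the documented inputs, the sketch below uses a common formulation in the style of Switch Transformer: the number of experts times the sum over experts of (fraction of routed slots) times (mean gate score). This formulation, the function name `aux_loss_sketch`, and the extra `top_k` argument are assumptions; the library's actual formula may differ, and the device mesh argument is omitted.

```python
def aux_loss_sketch(original_scores, expert_load, token_mask, top_k):
    """Hypothetical load-balancing loss; returns a Python float (scalar)."""
    num_experts = len(expert_load)
    # Keep only the score rows that belong to valid tokens.
    valid_rows = [row for row, ok in zip(original_scores, token_mask) if ok]
    n = len(valid_rows)
    if n == 0:
        return 0.0
    # f_i: fraction of all routing slots that went to expert i.
    f = [load / (n * top_k) for load in expert_load]
    # p_i: mean gate score of expert i over the valid tokens.
    p = [sum(row[i] for row in valid_rows) / n for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


uniform_scores = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
skewed_scores = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]
mask = [True, True, True, True]

balanced_loss = aux_loss_sketch(uniform_scores, [1, 1, 1, 1], mask, top_k=1)
collapsed_loss = aux_loss_sketch(skewed_scores, [4, 0, 0, 0], mask, top_k=1)
print(balanced_loss, collapsed_loss)
```

Under this assumed formulation the loss is smallest when both the load and the scores are uniform, and it grows as routing concentrates on a few experts.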
Computes the load of each expert based on the selected indices.
Parameters:
indices (torch.Tensor): Indices of the selected experts. Shape is [num_tokens, num_activated_experts].
token_mask (torch.Tensor): Boolean mask indicating valid tokens. Shape is [num_tokens].
Returns: torch.Tensor
torch.Tensor: Load of each expert (number of tokens routed to each expert). Shape is [num_local_experts].
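The counting described above can be sketched in plain Python. This is an illustration only: the function name `expert_load_sketch` and the explicit `num_experts` argument are hypothetical, and a tensor implementation would more likely use a vectorized count.

```python
def expert_load_sketch(indices, token_mask, num_experts):
    """Count tokens routed to each expert, skipping masked-out tokens."""
    load = [0] * num_experts
    for row, valid in zip(indices, token_mask):
        if not valid:
            continue  # Invalid tokens contribute nothing to the load.
        for expert in row:
            load[expert] += 1
    return load


indices = [[0, 1], [1, 2], [0, 3], [2, 3]]  # [num_tokens, num_activated_experts]
token_mask = [True, True, False, True]      # third token is not valid
load = expert_load_sketch(indices, token_mask, num_experts=4)
print(load)  # [1, 2, 2, 1]
```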
Forward pass for the gating mechanism.
Parameters:
Input tensor.
Boolean mask indicating valid tokens.
Device mesh for context parallel computation.
Returns: torch.Tensor
Routing weights for the selected experts.
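For readers unfamiliar with gating, a generic top-k router for a single token might look like the sketch below. It assumes softmax scoring and renormalization over the selected experts; the function names are hypothetical, and this class may use a different score function, add a correction bias, or group experts before selection.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token_sketch(logits, top_k):
    """Return (weights, indices) for the top_k highest-scoring experts."""
    scores = softmax(logits)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:top_k]
    # Assumed convention: renormalize so the selected weights sum to 1.
    selected_total = sum(scores[i] for i in top)
    weights = [scores[i] / selected_total for i in top]
    return weights, top


weights, experts = route_token_sketch([2.0, 0.0, 1.0, -1.0], top_k=2)
print(experts)  # [0, 2]
```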
Updates the correction bias used in the gate based on the popularity of experts. This function is a no-op if the gate is not trained.
To avoid routing collapse, and to promote better load balance of experts, DeepSeek-V3 uses a correction mechanism to adjust the scores of experts using a learned bias parameter. The bias parameter is updated based on the popularity of experts, i.e., the number of tokens routed to each expert. If an expert is more popular than the average, its bias term is decreased, and vice versa. This encourages the model to route tokens to less popular experts, promoting better load balance.
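The update rule described above can be sketched as a sign-based step: experts above the average load have their bias lowered, and experts below it have their bias raised. The function name `update_bias_sketch`, the `update_rate` argument, and the fixed step size are assumptions for illustration and may not match the library's implementation.

```python
def update_bias_sketch(bias, expert_load, update_rate=0.001):
    """Return a new bias list nudged toward a more even expert load."""
    mean_load = sum(expert_load) / len(expert_load)
    new_bias = []
    for b, load in zip(bias, expert_load):
        if load > mean_load:
            b -= update_rate  # Popular expert: make it less attractive.
        elif load < mean_load:
            b += update_rate  # Unpopular expert: make it more attractive.
        new_bias.append(b)
    return new_bias


bias = [0.0, 0.0, 0.0, 0.0]
expert_load = [10, 2, 4, 4]  # mean load is 5
new_bias = update_bias_sketch(bias, expert_load, update_rate=0.1)
print(new_bias)  # [-0.1, 0.1, 0.1, 0.1]
```

Because the step depends only on the sign of the difference from the mean, a heavily overloaded expert and a slightly overloaded one receive the same adjustment in this sketch.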
Bases: Module
Multi-Layer Perceptron (MLP) used as a feed-forward layer.
Supports both gated activations (SwiGLU) and simple activations (ReLU²).
Forward pass for the MLP layer.
Parameters:
Input tensor.
Returns: torch.Tensor
torch.Tensor: Output tensor after MLP computation.
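The difference between the two activation styles can be shown with a small list-based sketch. It assumes the usual definitions, SwiGLU as silu(gate) * up and ReLU² as max(x, 0)², and uses hypothetical names (`mlp_forward_sketch`, `w_up`, `w_gate`, `w_down`) that are not taken from the library.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def relu2(x):
    """Squared ReLU activation."""
    return max(x, 0.0) ** 2


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def mlp_forward_sketch(x, w_up, w_down, w_gate=None):
    up = matvec(w_up, x)
    if w_gate is not None:
        # Gated path (SwiGLU): a second projection gates the up projection.
        gate = matvec(w_gate, x)
        hidden = [silu(g) * u for g, u in zip(gate, up)]
    else:
        # Simple path (ReLU²): activation applied to the up projection only.
        hidden = [relu2(u) for u in up]
    return matvec(w_down, hidden)


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, -2.0]

simple_out = mlp_forward_sketch(x, identity, identity)
gated_out = mlp_forward_sketch(x, identity, identity, w_gate=zeros)
print(simple_out)  # [1.0, 0.0]
```

The gated variant needs one extra projection matrix per layer, which is the main structural difference between the two paths in this sketch.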
Bases: Module
Mixture-of-Experts (MoE) module.
Forward pass for the MoE module.
Parameters:
Input tensor.
Boolean mask indicating padding positions.
Returns: torch.Tensor
torch.Tensor: Output tensor after expert routing and computation.
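Conceptually, the module combines the outputs of the selected experts using the routing weights. The sketch below shows that combination for a handful of tokens. It is a simplification under stated assumptions: experts are plain Python callables, routing weights and indices are passed in rather than computed by a gate, and padded positions are assumed to produce zeros. The name `moe_forward_sketch` is hypothetical.

```python
def moe_forward_sketch(tokens, padding_mask, weights, indices, experts):
    """Weighted sum of the selected experts' outputs for each token."""
    outputs = []
    for x, is_pad, w_row, idx_row in zip(tokens, padding_mask, weights, indices):
        out = [0.0] * len(x)
        if not is_pad:  # Assumption: padded positions are left as zeros.
            for w, e in zip(w_row, idx_row):
                y = experts[e](x)
                out = [o + w * yi for o, yi in zip(out, y)]
        outputs.append(out)
    return outputs


# Two toy experts standing in for expert MLPs.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
]
tokens = [[1.0, 2.0], [3.0, 4.0]]
padding_mask = [False, True]          # second position is padding
weights = [[0.5, 0.5], [0.5, 0.5]]
indices = [[0, 1], [0, 1]]

moe_out = moe_forward_sketch(tokens, padding_mask, weights, indices, experts)
print(moe_out)  # [[2.0, 3.5], [0.0, 0.0]]
```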