nemo_automodel.components.moe.layers#
Module Contents#
Classes#
| Class | Description |
| --- | --- |
| MLP | Multi-Layer Perceptron (MLP) used as a feed-forward layer. |
| GroupedExperts | Sparse MoE implementation using all-gather/reduce-scatter primitives. |
| GroupedExpertsDeepEP | Sparse MoE implementation using DeepEP. |
| FakeBalancedGate | Load-balanced gate that spreads tokens uniformly across all experts, used for performance experiments on how load imbalance with real data impacts end-to-end performance. |
| Gate | Gating mechanism for routing inputs in a mixture-of-experts (MoE) model. |
| MoE | Mixture-of-Experts (MoE) module. |
Functions#
| Function |
| --- |
| swiglu |
| quick_geglu |
| get_expert_activation |
| quick_geglu_deepep |
| get_expert_activation_for_deepep |
| _init_weights |
API#
- class nemo_automodel.components.moe.layers.MoEConfig#
- n_routed_experts: int#
- n_activated_experts: int#
- n_expert_groups: int#
- n_limited_groups: int#
- train_gate: bool#
- gate_bias_update_factor: float#
- aux_loss_coeff: float#
- score_func: str#
- route_scale: float#
- dim: int#
- inter_dim: int#
- moe_inter_dim: int#
- norm_topk_prob: bool#
- router_bias: bool = False#
- expert_bias: bool = False#
- expert_activation: Literal['swiglu', 'quick_geglu'] = 'swiglu'#
- activation_alpha: float = 1.702#
- activation_limit: float = 7.0#
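As a point of reference, a minimal construction sketch, assuming MoEConfig behaves as a standard dataclass; every value below is illustrative rather than a recommended setting:

```python
from nemo_automodel.components.moe.layers import MoEConfig

# All values are illustrative; real models tune these per architecture.
config = MoEConfig(
    n_routed_experts=64,        # total experts
    n_activated_experts=6,      # top-k experts per token
    n_expert_groups=1,
    n_limited_groups=1,
    train_gate=True,
    gate_bias_update_factor=1e-3,
    aux_loss_coeff=1e-2,
    score_func="softmax",       # or "sigmoid"
    route_scale=1.0,
    dim=2048,                   # model hidden size
    inter_dim=8192,             # dense MLP hidden size
    moe_inter_dim=1408,         # per-expert hidden size
    norm_topk_prob=True,
)
```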
- class nemo_automodel.components.moe.layers.MLP(dim: int, inter_dim: int, backend: str)#
Bases:
torch.nn.Module
Multi-Layer Perceptron (MLP) used as a feed-forward layer.
- Attributes:
gate_proj (nn.Module) – Linear layer for input-to-hidden transformation.
down_proj (nn.Module) – Linear layer for hidden-to-output transformation.
up_proj (nn.Module) – Additional linear layer for feature transformation.
Initialization
Initializes the MLP layer.
- Parameters:
dim (int) – Input and output dimensionality.
inter_dim (int) – Hidden layer dimensionality.
backend (str) – Backend used for the projection layers.
- forward(x: torch.Tensor) → torch.Tensor#
Forward pass for the MLP layer.
- Parameters:
x (torch.Tensor) – Input tensor.
- Returns:
Output tensor after MLP computation.
- Return type:
torch.Tensor
- init_weights(buffer_device: torch.device, init_std: float = 0.02)#
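A minimal usage sketch; the "torch" backend string is an assumption, so check which backends your build accepts:

```python
import torch
from nemo_automodel.components.moe.layers import MLP

mlp = MLP(dim=2048, inter_dim=8192, backend="torch")  # backend name is assumed
x = torch.randn(16, 2048)
y = mlp(x)
assert y.shape == x.shape  # MLP preserves the model dimension
```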
- nemo_automodel.components.moe.layers.swiglu(x, *, gate_and_up_proj, down_proj, gate_up_proj_bias=None, down_proj_bias=None)#
- nemo_automodel.components.moe.layers.quick_geglu(x, *, gate_and_up_proj, down_proj, gate_up_proj_bias=None, down_proj_bias=None, alpha: float = 1.702, limit: float | None = 7.0)#
- nemo_automodel.components.moe.layers.get_expert_activation()#
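To make the fused-projection contract concrete, here is a hedged standalone sketch of both activations. The contiguous gate/up split of gate_and_up_proj, and the +1 linear offset in quick_geglu (mirroring quick_geglu_deepep's linear_offset default below), are assumptions about the actual layout:

```python
import torch
import torch.nn.functional as F

def swiglu_reference(x, gate_and_up_proj, down_proj):
    # Assumed layout: the fused projection stacks the gate half and the
    # up half along the output dimension, in that order.
    gate, up = (x @ gate_and_up_proj).chunk(2, dim=-1)
    # SwiGLU: SiLU-gated linear unit followed by the down projection.
    return (F.silu(gate) * up) @ down_proj

def quick_geglu_reference(x, gate_and_up_proj, down_proj,
                          alpha=1.702, limit=7.0, linear_offset=1.0):
    gate, up = (x @ gate_and_up_proj).chunk(2, dim=-1)  # layout assumed
    if limit is not None:
        gate = gate.clamp(max=limit)           # one-sided cap on the gate path
        up = up.clamp(min=-limit, max=limit)   # symmetric cap on the linear path
    # "Quick GELU": v * sigmoid(alpha * v) approximates GELU(v).
    act = gate * torch.sigmoid(alpha * gate)
    # linear_offset mirrors quick_geglu_deepep's default of 1.0 (assumption).
    return (act * (up + linear_offset)) @ down_proj
```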
- class nemo_automodel.components.moe.layers.GroupedExperts(config: nemo_automodel.components.moe.layers.MoEConfig)#
Bases:
torch.nn.Module
Sparse MoE implementation using all-gather/reduce-scatter primitives.
Once the experts for a particular token have been identified, this module is invoked to compute and average the output of the activated experts.
- Attributes:
n_routed_experts (int) – Total number of experts in the model.
gate_projs (nn.Parameter) – Weights for the input-to-gate transformation.
up_projs (nn.Parameter) – Weights for the input-to-hidden transformation.
down_projs (nn.Parameter) – Weights for the hidden-to-output transformation.
Initialization
Initializes the GroupedExperts module.
- Parameters:
config (MoEConfig) – Configuration containing the number of routed experts and the model and intermediate dimension parameters.
- forward(x: torch.Tensor, token_mask: torch.Tensor, weights: torch.Tensor, indices: torch.Tensor) → torch.Tensor#
Forward pass for the grouped experts.
- Parameters:
x (torch.Tensor) – Input tensor. Shape is [num_tokens, model_dim].
token_mask (torch.Tensor) – Boolean mask indicating valid tokens. Shape is [num_tokens].
weights (torch.Tensor) – Routing weights for the selected experts. Shape is [num_tokens, num_activated_experts].
indices (torch.Tensor) – Indices of the selected experts. Shape is [num_tokens, num_activated_experts].
- Returns:
Output tensor after expert computation. Shape is [num_tokens, model_dim].
- Return type:
torch.Tensor
- init_weights(buffer_device: torch.device, init_std: float = 0.02)#
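A hedged end-to-end sketch of how a gate's outputs feed GroupedExperts; shapes follow the docstrings above, and the construction of config is assumed (see the MoEConfig sketch earlier):

```python
import torch
from nemo_automodel.components.moe.layers import Gate, GroupedExperts

gate = Gate(config)               # config: a MoEConfig, assumed built earlier
experts = GroupedExperts(config)

num_tokens = 128
x = torch.randn(num_tokens, config.dim)
token_mask = torch.ones(num_tokens, dtype=torch.bool)

# Route each token to its top-k experts, then compute the expert mixture.
weights, indices, aux_loss = gate(x, token_mask, cp_mesh=None)
out = experts(x, token_mask, weights, indices)  # [num_tokens, config.dim]
```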
- nemo_automodel.components.moe.layers.quick_geglu_deepep(x, permuted_probs, alpha: float = 1.702, limit: float = 7.0, linear_offset: float = 1.0)#
- nemo_automodel.components.moe.layers.get_expert_activation_for_deepep()#
- class nemo_automodel.components.moe.layers.GroupedExpertsDeepEP()#
Bases:
torch.nn.Module
Sparse MoE implementation using DeepEP.
Once the experts for a particular token have been identified, this module is invoked to compute and average the output of the activated experts.
- Attributes:
n_routed_experts (int) – Total number of experts in the model.
gate_and_up_projs (nn.Parameter) – Fused weights whose first half is the input-to-gate transformation (gate_projs) and whose second half is the input-to-hidden transformation (up_projs).
down_projs (nn.Parameter) – Weights for the hidden-to-output transformation.
Initialization
Initializes the GroupedExpertsDeepEP module.
- Parameters:
config (MoEConfig) – Configuration containing the number of routed experts and the model and intermediate dimension parameters.
- static _apply_bias(value, bias, tokens_per_expert, permuted_probs=None)#
- init_token_dispatcher(ep_mesh: torch.distributed.device_mesh.DeviceMesh)#
- forward(x: torch.Tensor, token_mask: torch.Tensor, weights: torch.Tensor, indices: torch.Tensor) → torch.Tensor#
Forward pass for the grouped experts.
- Parameters:
x (torch.Tensor) – Input tensor. Shape is [num_tokens, model_dim].
token_mask (torch.Tensor) – Boolean mask indicating valid tokens. Shape is [num_tokens].
weights (torch.Tensor) – Routing weights for the selected experts. Shape is [num_tokens, num_activated_experts].
indices (torch.Tensor) – Indices of the selected experts. Shape is [num_tokens, num_activated_experts].
- Returns:
Output tensor after expert computation. Shape is [num_tokens, model_dim].
- Return type:
torch.Tensor
- init_weights(buffer_device: torch.device, init_std: float = 0.02)#
- class nemo_automodel.components.moe.layers.FakeBalancedGate()#
Bases:
torch.nn.Module
Load-balanced gate implementation that spreads tokens uniformly across all experts. This class exists for performance experiments that measure how load imbalance with real data impacts end-to-end performance.
Initialization
- forward(x: torch.Tensor, token_mask: torch.Tensor, cp_mesh: Optional[torch.distributed.device_mesh.DeviceMesh])#
Forward pass for the gating mechanism.
- Parameters:
x (torch.Tensor) – Input tensor.
token_mask (torch.Tensor) – Boolean mask indicating valid tokens.
cp_mesh (Optional[DeviceMesh]) – Device mesh for context-parallel computation.
- Returns:
weights (torch.Tensor): Routing weights for the selected experts. indices (torch.Tensor): Indices of the selected experts. aux_loss (Optional[torch.Tensor]): Auxiliary loss for load balancing.
- update_bias() → None#
- init_weights(buffer_device: torch.device, init_std: float = 0.02)#
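The uniform routing can be pictured as a round-robin assignment of token slots to experts; a hedged sketch of the idea, not necessarily the exact implementation:

```python
import torch

def fake_balanced_routing(num_tokens, n_experts, top_k, device="cpu"):
    # Assign token slots to experts in round-robin order so every expert
    # receives (almost) the same number of tokens.
    flat = torch.arange(num_tokens * top_k, device=device)
    indices = (flat % n_experts).view(num_tokens, top_k)
    # Equal weight for each selected expert.
    weights = torch.full((num_tokens, top_k), 1.0 / top_k, device=device)
    return weights, indices
```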
- class nemo_automodel.components.moe.layers.Gate(config: nemo_automodel.components.moe.layers.MoEConfig)#
Bases:
torch.nn.Module
Gating mechanism for routing inputs in a mixture-of-experts (MoE) model.
- Attributes:
dim (int) – Dimensionality of input features.
topk (int) – Number of top experts activated for each input.
n_groups (int) – Number of groups for routing.
topk_groups (int) – Number of groups to route inputs to.
score_func (str) – Scoring function ('softmax' or 'sigmoid').
route_scale (float) – Scaling factor for routing weights.
weight (torch.nn.Parameter) – Learnable weights for the gate.
bias (Optional[torch.nn.Parameter]) – Optional bias term for the gate.
Initialization
Initializes the Gate module.
- Parameters:
config (MoEConfig) – Configuration containing the gating parameters.
- forward(x: torch.Tensor, token_mask: torch.Tensor, cp_mesh: Optional[torch.distributed.device_mesh.DeviceMesh])#
Forward pass for the gating mechanism.
- Parameters:
x (torch.Tensor) – Input tensor.
token_mask (torch.Tensor) – Boolean mask indicating valid tokens.
cp_mesh (Optional[DeviceMesh]) – Device mesh for context-parallel computation.
- Returns:
weights (torch.Tensor): Routing weights for the selected experts. indices (torch.Tensor): Indices of the selected experts. aux_loss (Optional[torch.Tensor]): Auxiliary loss for load balancing.
- update_bias() → None#
Updates the correction bias used in the gate based on the popularity of experts. This function is a no-op if the gate is not trained.
To avoid routing collapse, and to promote better load balance of experts, DeepSeek-V3 uses a correction mechanism to adjust the scores of experts using a learned bias parameter. The bias parameter is updated based on the popularity of experts, i.e., the number of tokens routed to each expert. If an expert is more popular than the average, its bias term is decreased, and vice versa. This encourages the model to route tokens to less popular experts, promoting better load balance.
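A hedged sketch of that correction rule; the sign-based form follows the description above, while the real update reads its step size from gate_bias_update_factor and may differ in detail:

```python
import torch

@torch.no_grad()
def update_bias_sketch(bias, expert_load, update_factor):
    # Experts busier than average get their bias pushed down, idle experts
    # get it pushed up, nudging future routing toward under-used experts.
    mean_load = expert_load.float().mean()
    bias -= update_factor * torch.sign(expert_load.float() - mean_load)
    return bias
```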
- _compute_expert_load(indices: torch.Tensor, token_mask: torch.Tensor) → torch.Tensor#
Computes the load of each expert based on the selected indices.
- Parameters:
indices (torch.Tensor) – Indices of the selected experts. Shape is [num_tokens, num_activated_experts].
token_mask (torch.Tensor) – Boolean mask indicating valid tokens. Shape is [num_tokens].
- Returns:
Load of each expert (number of tokens routed to each expert). Shape is [num_local_experts].
- Return type:
torch.Tensor
- _compute_aux_loss(original_scores: torch.Tensor, expert_load: torch.Tensor, token_mask: torch.Tensor, cp_mesh: Optional[torch.distributed.device_mesh.DeviceMesh]) → torch.Tensor#
Computes the auxiliary loss for load balancing.
Warning: assumes batch size = 1; if batch size > 1, the aux_loss is computed across multiple sequences.
- Parameters:
original_scores (torch.Tensor) – Original scores from the gating mechanism. Shape is [num_tokens, num_experts].
expert_load (torch.Tensor) – Load of each expert (number of tokens routed to each expert). Shape is [num_experts].
token_mask (torch.Tensor) – Boolean mask indicating valid tokens. Shape is [num_tokens].
cp_mesh (Optional[DeviceMesh]) – Device mesh for context-parallel computation.
- Returns:
Auxiliary loss for load balancing. Shape is [].
- Return type:
torch.Tensor
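For intuition, a hedged sketch of a switch-style load-balancing loss consistent with the inputs documented above; the exact normalization used by the implementation may differ:

```python
import torch

def aux_loss_sketch(original_scores, expert_load, token_mask, top_k, coeff):
    # f_i: fraction of routed (token, expert) slots landing on expert i,
    # scaled so perfectly uniform routing gives f_i == 1.
    n_tokens = token_mask.sum().clamp(min=1)
    n_experts = original_scores.size(-1)
    f = expert_load.float() * n_experts / (n_tokens * top_k)
    # P_i: mean routing probability assigned to expert i over valid tokens.
    p = (original_scores * token_mask.unsqueeze(-1)).sum(dim=0) / n_tokens
    # The loss is minimized when dispatch fractions track routing
    # probabilities and both are uniform across experts.
    return coeff * (f * p).sum()
```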
- init_weights(buffer_device: torch.device, init_std: float = 0.02)#
- class nemo_automodel.components.moe.layers.MoE(config: nemo_automodel.components.moe.layers.MoEConfig, backend: nemo_automodel.components.moe.utils.BackendConfig)#
Bases:
torch.nn.Module
Mixture-of-Experts (MoE) module.
- Attributes:
dim (int) – Dimensionality of input features.
n_routed_experts (int) – Total number of experts in the model.
n_local_experts (int) – Number of experts handled locally in distributed systems.
n_activated_experts (int) – Number of experts activated for each input.
gate (nn.Module) – Gating mechanism to route inputs to experts.
experts (nn.ModuleList) – List of expert modules.
shared_experts (nn.Module) – Shared experts applied to all inputs.
Initialization
Initializes the MoE module.
- Parameters:
config (MoEConfig) – Configuration containing the MoE parameters.
backend (BackendConfig) – Backend configuration selecting the expert implementation.
- forward(x: torch.Tensor, padding_mask: Optional[torch.Tensor] = None, cp_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None)#
Forward pass for the MoE module.
- Parameters:
x (torch.Tensor) – Input tensor.
padding_mask (Optional[torch.Tensor]) – Boolean mask indicating padding positions.
cp_mesh (Optional[DeviceMesh]) – Device mesh for context-parallel computation.
- Returns:
torch.Tensor: Output tensor after expert routing and computation. Optional[torch.Tensor]: Auxiliary loss for load balancing (if applicable).
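A minimal usage sketch; constructing config and backend is assumed to happen elsewhere (see MoEConfig above and BackendConfig in nemo_automodel.components.moe.utils):

```python
import torch
from nemo_automodel.components.moe.layers import MoE

# config: MoEConfig and backend: BackendConfig are assumed built elsewhere.
moe = MoE(config, backend)
x = torch.randn(1, 512, config.dim)  # [batch, seq_len, dim]
# Per the Returns note above, an auxiliary load-balancing loss may also be
# produced when the gate is trained.
out = moe(x, padding_mask=None)
```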
- init_weights(buffer_device: torch.device, init_std: float = 0.02)#
- nemo_automodel.components.moe.layers._init_weights(module, buffer_device: torch.device, init_std: float = 0.02)#