`nemo_automodel.components.moe.layers`#

Module Contents#

Classes#

`MoEConfig`
`MLP`	Multi-Layer Perceptron (MLP) used as a feed-forward layer.
`GroupedExperts`	Sparse MoE implementation using all-gather/reduce-scatter primitives.
`GroupedExpertsDeepEP`	Sparse MoE implementation using DeepEP.
`FakeBalancedGate`	Load-balanced gate that spreads tokens uniformly across all experts. It is intended for performance experiments that measure how much the load imbalance seen with real data affects end-to-end performance.
`Gate`	Gating mechanism for routing inputs in a mixture-of-experts (MoE) model.
`MoE`	Mixture-of-Experts (MoE) module.

Functions#

`swiglu`
`quick_geglu`
`get_expert_activation`
`quick_geglu_deepep`
`get_expert_activation_for_deepep`
`_init_weights`

Data#

`_shared_experts_stream`

API#

nemo_automodel.components.moe.layers._shared_experts_stream: Optional[torch.cuda.Stream]#: None

class nemo_automodel.components.moe.layers.MoEConfig#

n_routed_experts: int#: None

n_shared_experts: int#: None

n_activated_experts: int#: None

n_expert_groups: int#: None

n_limited_groups: int#: None

train_gate: bool#: None

gate_bias_update_factor: float#: None

aux_loss_coeff: float#: None

score_func: str#: None

route_scale: float#: None

dim: int#: None

inter_dim: int#: None

moe_inter_dim: int#: None

norm_topk_prob: bool#: None

router_bias: bool#: False

expert_bias: bool#: False

expert_activation: Literal['swiglu', 'quick_geglu']#: 'swiglu'

activation_alpha: float#: 1.702

activation_limit: float#: 7.0

softmax_before_topk: bool#: False

dtype: str | torch.dtype#: None

shared_expert_gate: bool#: False

shared_expert_inter_dim: int | None#: None

__post_init__()#

class nemo_automodel.components.moe.layers.MLP( dim: int, inter_dim: int, backend: str, dtype: torch.dtype = torch.bfloat16, )#

Bases: torch.nn.Module

Multi-Layer Perceptron (MLP) used as a feed-forward layer.

.. attribute:: gate_proj

Linear layer for input-to-hidden transformation.

Type:: nn.Module

.. attribute:: down_proj

Linear layer for hidden-to-output transformation.

Type:: nn.Module

.. attribute:: up_proj

Additional linear layer for feature transformation.

Type:: nn.Module

Initialization

Initializes the MLP layer.

Parameters:

dim (int) – Input and output dimensionality.
inter_dim (int) – Hidden layer dimensionality.

forward(x: torch.Tensor) → torch.Tensor#

Forward pass for the MLP layer.

Parameters:: x (torch.Tensor) – Input tensor.
Returns:: Output tensor after MLP computation.
Return type:: torch.Tensor
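
The attribute names above suggest a gated feed-forward layer. The following is a minimal, framework-free sketch, assuming the common form `down_proj(silu(gate_proj(x)) * up_proj(x))` without biases; the source does not state the exact computation, and the helper names (`silu`, `matvec`, `gated_mlp`) are hypothetical, not part of the library.

```python
import math


def silu(v: float) -> float:
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(weight: list[list[float]], x: list[float]) -> list[float]:
    """Multiply a [out, in] weight matrix by a length-`in` vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def gated_mlp(x, gate_w, up_w, down_w):
    """Assumed computation: down_proj(silu(gate_proj(x)) * up_proj(x))."""
    gate = matvec(gate_w, x)   # input-to-hidden (gate branch)
    up = matvec(up_w, x)       # input-to-hidden (feature branch)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(down_w, hidden)  # hidden-to-output


# Toy sizes: dim = 2, inter_dim = 3 (weights are arbitrary illustrative values).
gate_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
up_w = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
down_w = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
mlp_out = gated_mlp([1.0, 2.0], gate_w, up_w, down_w)
```

The output has the same length as the input (`dim`), while the intermediate vector has length `inter_dim`.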

init_weights( buffer_device: torch.device, init_std: float = 0.02, ) → None#

nemo_automodel.components.moe.layers.swiglu( x, *, gate_and_up_proj, down_proj, gate_up_proj_bias=None, down_proj_bias=None, )#

nemo_automodel.components.moe.layers.quick_geglu( x, *, gate_and_up_proj, down_proj, gate_up_proj_bias=None, down_proj_bias=None, alpha: float = 1.702, limit: float | None = 7.0, )#
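
The two activations are documented by signature only. As a rough illustration of what the `alpha` and `limit` parameters could control, here is an element-wise sketch. It assumes SwiGLU is `silu(gate) * up` and that quick-GEGLU clamps both branches, gates with `sigmoid(alpha * gate)`, and adds an offset of 1 to the `up` branch (mirroring the `linear_offset=1.0` default of `quick_geglu_deepep`). These formulas are assumptions, and the function names below are hypothetical scalar stand-ins, not the library's tensor functions.

```python
import math


def sigmoid(v: float) -> float:
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def swiglu_act(gate: float, up: float) -> float:
    """Assumed SwiGLU gating: silu(gate) * up."""
    return gate * sigmoid(gate) * up


def quick_geglu_act(gate: float, up: float, alpha: float = 1.702,
                    limit=7.0) -> float:
    """Assumed quick-GEGLU: clamp, then gate * sigmoid(alpha * gate) * (up + 1)."""
    if limit is not None:
        gate = min(gate, limit)            # clamp the gate branch from above
        up = max(-limit, min(up, limit))   # clamp the up branch to [-limit, limit]
    return gate * sigmoid(alpha * gate) * (up + 1.0)
```

With the default `limit`, inputs beyond 7.0 produce the same result as inputs at exactly 7.0, which bounds the activation for large pre-activations.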

nemo_automodel.components.moe.layers.get_expert_activation( config: nemo_automodel.components.moe.layers.MoEConfig, )#

class nemo_automodel.components.moe.layers.GroupedExperts(config: nemo_automodel.components.moe.layers.MoEConfig)#

Bases: torch.nn.Module

Sparse MoE implementation using all-gather/reduce-scatter primitives.

Once the experts for a particular token have been identified, this module is invoked to compute and average the output of the activated experts.

.. attribute:: n_routed_experts

Total number of experts in the model.

Type:: int

.. attribute:: gate_projs

Linear layer for input-to-gate transformation.

Type:: nn.Parameter

.. attribute:: up_projs

Linear layer for input-to-hidden transformation.

Type:: nn.Parameter

.. attribute:: down_projs

Linear layer for hidden-to-output transformation.

Type:: nn.Parameter

Initialization

Initializes the GroupedExperts module.

Parameters:: config (MoEConfig) – Model configuration containing the number of routed experts and the model and intermediate dimensions.

forward( x: torch.Tensor, token_mask: torch.Tensor, weights: torch.Tensor, indices: torch.Tensor, ) → torch.Tensor#

Forward pass for the grouped experts.

Parameters:

x (torch.Tensor) – Input tensor. Shape is [num_tokens, model_dim].
token_mask (torch.Tensor) – Boolean mask indicating valid tokens. Shape is [num_tokens].
weights (torch.Tensor) – Routing weights for the selected experts. Shape is [num_tokens, num_activated_experts].
indices (torch.Tensor) – Indices of the selected experts. Shape is [num_tokens, num_activated_experts].

Returns:

Output tensor after expert computation. Shape is [num_tokens, model_dim]

Return type:

torch.Tensor
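
To make the shapes above concrete, the following dependency-free sketch combines expert outputs per token using the routing weights. It only illustrates the weighted combination; it does not model the all-gather/reduce-scatter communication, and it assumes masked tokens produce zeros. The `experts` callables and the function name are hypothetical.

```python
def combine_expert_outputs(x, token_mask, weights, indices, experts):
    """Weighted sum of the selected experts' outputs for every valid token.

    x: [num_tokens][model_dim], token_mask: [num_tokens],
    weights / indices: [num_tokens][num_activated_experts],
    experts: list of callables mapping a vector to a vector (hypothetical).
    """
    out = [[0.0] * len(row) for row in x]
    for t, row in enumerate(x):
        if not token_mask[t]:
            continue  # assumption: masked tokens contribute nothing and stay zero
        for w, e in zip(weights[t], indices[t]):
            y = experts[e](row)
            out[t] = [acc + w * v for acc, v in zip(out[t], y)]
    return out


# Three toy experts that scale their input by 1, 2 and 3.
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0)]
x = [[1.0, 1.0], [2.0, 0.0], [5.0, 5.0]]
token_mask = [True, True, False]
weights = [[0.5, 0.5], [0.25, 0.75], [0.5, 0.5]]
indices = [[0, 1], [2, 0], [1, 2]]
combined = combine_expert_outputs(x, token_mask, weights, indices, experts)
```

The result keeps the input shape `[num_tokens, model_dim]`, matching the documented return shape.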

init_weights( buffer_device: torch.device, init_std: float = 0.02, ) → None#

nemo_automodel.components.moe.layers.quick_geglu_deepep( x, permuted_probs, alpha: float = 1.702, limit: float = 7.0, linear_offset: float = 1.0, )#

nemo_automodel.components.moe.layers.get_expert_activation_for_deepep( config: nemo_automodel.components.moe.layers.MoEConfig, )#

class nemo_automodel.components.moe.layers.GroupedExpertsDeepEP( config: nemo_automodel.components.moe.layers.MoEConfig, )#

Bases: torch.nn.Module

Sparse MoE implementation using DeepEP.

Once the experts for a particular token have been identified, this module is invoked to compute and average the output of the activated experts.

.. attribute:: n_routed_experts

Total number of experts in the model.

Type:: int

.. attribute:: gate_and_up_projs part1 / gate_projs

Linear layer for input-to-gate transformation.

Type:: nn.Parameter

.. attribute:: gate_and_up_projs part2 / up_projs

Linear layer for input-to-hidden transformation.

Type:: nn.Parameter

.. attribute:: down_projs

Linear layer for hidden-to-output transformation.

Type:: nn.Parameter

Initialization

Initializes the GroupedExpertsDeepEP module.

Parameters:: config (MoEConfig) – Model configuration containing the number of routed experts and the model and intermediate dimensions.

static _apply_bias(value, bias, tokens_per_expert, permuted_probs=None)#

init_token_dispatcher( ep_mesh: torch.distributed.device_mesh.DeviceMesh, )#

forward( x: torch.Tensor, token_mask: torch.Tensor, weights: torch.Tensor, indices: torch.Tensor, ) → torch.Tensor#

Forward pass for the grouped experts.

Parameters:

x (torch.Tensor) – Input tensor. Shape is [num_tokens, model_dim].
token_mask (torch.Tensor) – Boolean mask indicating valid tokens. Shape is [num_tokens].
weights (torch.Tensor) – Routing weights for the selected experts. Shape is [num_tokens, num_activated_experts].
indices (torch.Tensor) – Indices of the selected experts. Shape is [num_tokens, num_activated_experts].

Returns:

Output tensor after expert computation. Shape is [num_tokens, model_dim]

Return type:

torch.Tensor

init_weights( buffer_device: torch.device, init_std: float = 0.02, ) → None#

class nemo_automodel.components.moe.layers.FakeBalancedGate( config: nemo_automodel.components.moe.layers.MoEConfig, skip_first_n_experts: int = 0, )#

Bases: torch.nn.Module

Load-balanced gate that spreads tokens uniformly across all experts. It is intended for performance experiments that measure how much the load imbalance seen with real data affects end-to-end performance.

Initialization

forward( x: torch.Tensor, token_mask: torch.Tensor, cp_mesh: Optional[torch.distributed.device_mesh.DeviceMesh], ) → tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]#

Forward pass for the gating mechanism.

Parameters:

x (torch.Tensor) – Input tensor.
token_mask (torch.Tensor) – Boolean mask indicating valid tokens.
cp_mesh (Optional[DeviceMesh]) – Device mesh for context parallel computation.

Returns:

weights (torch.Tensor) – Routing weights for the selected experts.
indices (torch.Tensor) – Indices of the selected experts.
aux_loss (Optional[torch.Tensor]) – Auxiliary loss for load balancing.

Return type:

tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
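
One way to spread tokens uniformly is a round-robin assignment with equal weights. The sketch below assumes that scheme and assumes `skip_first_n_experts` simply excludes the lowest expert indices; the source does not describe the actual assignment, so treat the function as a hypothetical illustration.

```python
def fake_balanced_routing(num_tokens, n_routed_experts, n_activated_experts,
                          skip_first_n_experts=0):
    """Round-robin expert assignment with equal weights (an assumed scheme)."""
    usable = n_routed_experts - skip_first_n_experts
    weights = [[1.0 / n_activated_experts] * n_activated_experts
               for _ in range(num_tokens)]
    indices = [
        [skip_first_n_experts + (t * n_activated_experts + k) % usable
         for k in range(n_activated_experts)]
        for t in range(num_tokens)
    ]
    return weights, indices


# 4 tokens, 8 experts, 2 experts per token: every expert receives one token.
fb_weights, fb_indices = fake_balanced_routing(4, 8, 2)
```

Because the assignment ignores the input values, every expert sees the same number of tokens whenever `num_tokens * n_activated_experts` is a multiple of the number of usable experts.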

update_bias() → None#

init_weights( buffer_device: torch.device, init_std: float = 0.02, ) → None#

class nemo_automodel.components.moe.layers.Gate( config: nemo_automodel.components.moe.layers.MoEConfig, gate_precision: torch.dtype | None = None, )#

Bases: torch.nn.Module

Gating mechanism for routing inputs in a mixture-of-experts (MoE) model.

.. attribute:: dim

Dimensionality of input features.

Type:: int

.. attribute:: topk

Number of top experts activated for each input.

Type:: int

.. attribute:: n_groups

Number of groups for routing.

Type:: int

.. attribute:: topk_groups

Number of groups to route inputs to.

Type:: int

.. attribute:: score_func

Scoring function (‘softmax’ or ‘sigmoid’).

Type:: str

.. attribute:: route_scale

Scaling factor for routing weights.

Type:: float

.. attribute:: weight

Learnable weights for the gate.

Type:: torch.nn.Parameter

.. attribute:: bias

Optional bias term for the gate.

Type:: Optional[torch.nn.Parameter]

Initialization

Initializes the Gate module.

Parameters:

config (MoEConfig) – Model configuration containing gating parameters.
gate_precision (torch.dtype | None) – Precision for gate computations (linear, softmax/sigmoid).

forward( x: torch.Tensor, token_mask: torch.Tensor, cp_mesh: Optional[torch.distributed.device_mesh.DeviceMesh], ) → tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]#

Forward pass for the gating mechanism.

Parameters:

x (torch.Tensor) – Input tensor.
token_mask (torch.Tensor) – Boolean mask indicating valid tokens.
cp_mesh (Optional[DeviceMesh]) – Device mesh for context parallel computation.

Returns:

weights (torch.Tensor) – Routing weights for the selected experts.
indices (torch.Tensor) – Indices of the selected experts.
aux_loss (Optional[torch.Tensor]) – Auxiliary loss for load balancing.

Return type:

tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
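
The attributes `topk`, `score_func`, and `route_scale` correspond to a standard top-k routing step. The following per-token sketch assumes scores are computed with softmax or sigmoid, the `topk` highest are kept, optionally renormalized (as `norm_topk_prob` suggests), and then scaled by `route_scale`. Group-limited routing (`n_groups`, `topk_groups`) and the correction bias are omitted, and `route_token` is a hypothetical helper, not a library function.

```python
import math


def route_token(logits, topk, score_func="softmax", route_scale=1.0,
                norm_topk_prob=False):
    """Return (weights, indices) of the top-k experts for one token."""
    if score_func == "softmax":
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        scores = [e / total for e in exps]
    else:  # "sigmoid"
        scores = [1.0 / (1.0 + math.exp(-v)) for v in logits]
    # Expert indices sorted by descending score, truncated to the top k.
    indices = sorted(range(len(scores)), key=lambda i: scores[i],
                     reverse=True)[:topk]
    weights = [scores[i] for i in indices]
    if norm_topk_prob:
        denom = sum(weights)
        weights = [w / denom for w in weights]
    return [w * route_scale for w in weights], indices


top_weights, top_indices = route_token([0.1, 2.0, -1.0, 1.0], topk=2,
                                       norm_topk_prob=True)
```

With `norm_topk_prob=True` and `route_scale=1.0`, the returned weights sum to 1 for each token.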

update_bias() → None#

Updates the correction bias used in the gate based on the popularity of experts. This function is a no-op if the gate is not trained.

To avoid routing collapse, and to promote better load balance of experts, DeepSeek-V3 uses a correction mechanism to adjust the scores of experts using a learned bias parameter. The bias parameter is updated based on the popularity of experts, i.e., the number of tokens routed to each expert. If an expert is more popular than the average, its bias term is decreased, and vice versa. This encourages the model to route tokens to less popular experts, promoting better load balance.
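
The update rule described above can be sketched as a sign-based step. This assumes each bias moves by a fixed step (here called `update_factor`, standing in for `gate_bias_update_factor`) in the direction of `mean_load - load`; the exact rule used by the library is not given in this page, and the function name is hypothetical.

```python
def update_gate_bias(bias, expert_load, update_factor):
    """Nudge each expert's bias against its deviation from the mean load."""
    mean_load = sum(expert_load) / len(expert_load)
    new_bias = []
    for b, load in zip(bias, expert_load):
        # sign is +1 for under-used experts, -1 for over-used, 0 at the mean
        sign = (mean_load > load) - (mean_load < load)
        new_bias.append(b + update_factor * sign)
    return new_bias


# Expert 0 is over-used, expert 1 is under-used, expert 2 sits at the mean.
new_bias = update_gate_bias([0.0, 0.0, 0.0], [10, 2, 6], update_factor=0.001)
```

Over many steps the popular expert's score is pushed down and the unpopular expert's score is pushed up, which is the balancing effect described above.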

_compute_expert_load( indices: torch.Tensor, token_mask: torch.Tensor, ) → torch.Tensor#

Computes the load of each expert based on the selected indices.

Parameters:

indices (torch.Tensor) – Indices of the selected experts. Shape is [num_tokens, num_activated_experts].
token_mask (torch.Tensor) – Boolean mask indicating valid tokens. Shape is [num_tokens].

Returns:

Load of each expert (number of tokens routed to each expert). Shape is [num_local_experts].

Return type:

torch.Tensor
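
Counting tokens per expert while honoring the mask can be sketched in plain Python as follows. The `num_experts` argument and the function name are illustrative assumptions; the library method operates on tensors.

```python
def compute_expert_load(indices, token_mask, num_experts):
    """Count how many valid tokens were routed to each expert."""
    load = [0] * num_experts
    for selected, valid in zip(indices, token_mask):
        if not valid:
            continue  # skip masked (invalid) tokens
        for e in selected:
            load[e] += 1
    return load


# Three tokens, two experts each; the third token is masked out.
load = compute_expert_load([[0, 1], [1, 2], [0, 3]], [True, True, False], 4)
```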

_compute_aux_loss( original_scores: torch.Tensor, expert_load: torch.Tensor, token_mask: torch.Tensor, cp_mesh: Optional[torch.distributed.device_mesh.DeviceMesh], ) → torch.Tensor#

Computes the auxiliary loss for load balancing.

Warning: Assumes batch size = 1. If batch size > 1, the aux_loss is computed across multiple sequences.

Parameters:

original_scores (torch.Tensor) – Original scores from the gating mechanism. Shape is [num_tokens, num_experts].
expert_load (torch.Tensor) – Load of each expert (number of tokens routed to each expert). Shape is [num_experts].
token_mask (torch.Tensor) – Boolean mask indicating valid tokens. Shape is [num_tokens].
cp_mesh (Optional[DeviceMesh]) – Device mesh for context parallel computation.

Returns:

Auxiliary loss for load balancing. Shape is [].

Return type:

torch.Tensor
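
The exact loss formula is not given here. A common formulation, assumed for this sketch, multiplies each expert's normalized token fraction by its mean routing score over valid tokens, sums over experts, and scales by a coefficient (standing in for `aux_loss_coeff`). Context-parallel reduction over `cp_mesh` is not modeled, and the function name and the `topk` and `coeff` arguments are hypothetical.

```python
def aux_load_balance_loss(scores, expert_load, token_mask, topk, coeff):
    """Assumed load-balancing loss: coeff * sum_e f_e * P_e.

    f_e = expert_load[e] * num_experts / (topk * num_valid_tokens)
    P_e = mean routing score of expert e over valid tokens
    """
    valid = [row for row, keep in zip(scores, token_mask) if keep]
    num_tokens, num_experts = len(valid), len(expert_load)
    loss = 0.0
    for e in range(num_experts):
        frac_tokens = expert_load[e] * num_experts / (topk * num_tokens)
        mean_score = sum(row[e] for row in valid) / num_tokens
        loss += frac_tokens * mean_score
    return coeff * loss


# Perfectly balanced routing: 4 tokens, 4 experts, one expert per token.
uniform = [[0.25] * 4 for _ in range(4)]
balanced = aux_load_balance_loss(uniform, [1, 1, 1, 1], [True] * 4,
                                 topk=1, coeff=0.01)

# Collapsed routing: every token prefers and selects expert 0.
skewed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]
collapsed = aux_load_balance_loss(skewed, [4, 0, 0, 0], [True] * 4,
                                  topk=1, coeff=0.01)
```

Under this formulation the balanced case yields exactly the coefficient, and concentrating tokens on one expert increases the loss.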

init_weights( buffer_device: torch.device, init_std: float = 0.02, ) → None#

class nemo_automodel.components.moe.layers.MoE( config: nemo_automodel.components.moe.layers.MoEConfig, backend: nemo_automodel.components.moe.utils.BackendConfig, )#

Bases: torch.nn.Module

Mixture-of-Experts (MoE) module.

.. attribute:: dim

Dimensionality of input features.

Type:: int

.. attribute:: n_routed_experts

Total number of experts in the model.

Type:: int

.. attribute:: n_local_experts

Number of experts handled locally in distributed systems.

Type:: int

.. attribute:: n_activated_experts

Number of experts activated for each input.

Type:: int

.. attribute:: gate

Gating mechanism to route inputs to experts.

Type:: nn.Module

.. attribute:: experts

List of expert modules.

Type:: nn.ModuleList

.. attribute:: shared_experts

Shared experts applied to all inputs.

Type:: nn.Module

Initialization

Initializes the MoE module.

Parameters:

config (MoEConfig) – Model configuration containing MoE parameters.
backend (BackendConfig) – Backend configuration.

forward( x: torch.Tensor, padding_mask: Optional[torch.Tensor] = None, cp_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None, ) → tuple[torch.Tensor, Optional[torch.Tensor]]#

Forward pass for the MoE module.

Parameters:

x (torch.Tensor) – Input tensor.
padding_mask (Optional[torch.Tensor]) – Boolean mask indicating padding positions.
cp_mesh (Optional[DeviceMesh]) – Device mesh for context parallel computation.

Returns:

Output tensor after expert routing and computation, and the auxiliary loss for load balancing (if applicable).

Return type:

tuple[torch.Tensor, Optional[torch.Tensor]]

init_weights( buffer_device: torch.device, init_std: float = 0.02, ) → None#

nemo_automodel.components.moe.layers._init_weights( module, buffer_device: torch.device, init_std: float = 0.02, )#
