core.transformer.moe.moe_layer#
Module Contents#
Classes#
| Class | Description |
|---|---|
| MoESubmodules | MoE Layer Submodule spec |
| BaseMoELayer | Base class for a mixture of experts layer. |
| MoELayer | Mixture of Experts layer. |
API#
- class core.transformer.moe.moe_layer.MoESubmodules#
MoE Layer Submodule spec
- experts: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- shared_experts: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
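A minimal construction sketch, assuming the GroupedMLP expert implementation from megatron.core.transformer.moe.experts; the concrete expert module (and any shared-expert spec) depends on the layer spec in use.

```python
# Hedged sketch: wiring routed experts into a MoESubmodules spec.
# GroupedMLP is one possible expert implementation; swap in whichever module
# your layer spec calls for.
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.moe.experts import GroupedMLP
from megatron.core.transformer.moe.moe_layer import MoESubmodules

moe_submodules = MoESubmodules(
    experts=ModuleSpec(module=GroupedMLP),  # spec for the routed experts
)
```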
- class core.transformer.moe.moe_layer.BaseMoELayer(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- layer_number: Optional[int] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
)#
Bases: megatron.core.transformer.module.MegatronModule, abc.ABC
Base class for a mixture of experts layer.
- Parameters:
config (TransformerConfig) – Configuration object for the transformer model.
Initialization
- abstractmethod forward(hidden_states)#
Forward method for the MoE layer.
- set_layer_number(layer_number: int)#
Set the layer number for the MoE layer.
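Because forward is abstract, any concrete MoE layer must implement it. A purely illustrative, hypothetical subclass sketch:

```python
# Hypothetical subclass, for illustration only; the concrete MoELayer below is
# what Megatron-Core actually ships.
import torch
from megatron.core.transformer.moe.moe_layer import BaseMoELayer

class PassthroughMoELayer(BaseMoELayer):
    def forward(self, hidden_states: torch.Tensor):
        # A real implementation routes tokens to experts here; this stub just
        # mirrors the (output, mlp_bias) return contract.
        return hidden_states, None
```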
- class core.transformer.moe.moe_layer.MoELayer(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- submodules: Optional[core.transformer.moe.moe_layer.MoESubmodules] = None,
- layer_number: Optional[int] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
)#
Bases: core.transformer.moe.moe_layer.BaseMoELayer
Mixture of Experts layer.
This layer implements a Mixture of Experts model, where each token is routed to a subset of experts. This implementation supports different token dispatching strategies such as All-to-All and All-Gather.
Initialization
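A hedged construction sketch, reusing the moe_submodules spec from the MoESubmodules example above; it assumes model parallelism has already been initialized (e.g. via megatron.core.parallel_state.initialize_model_parallel) and uses illustrative config values.

```python
# Illustrative values only; the MoE-specific knobs (num_moe_experts,
# moe_router_topk, moe_token_dispatcher_type) live on TransformerConfig.
# Actually constructing the layer requires an initialized distributed /
# model-parallel environment.
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.transformer.moe.moe_layer import MoELayer

config = TransformerConfig(
    num_layers=2,
    hidden_size=1024,
    num_attention_heads=16,
    num_moe_experts=8,                     # total routed experts
    moe_router_topk=2,                     # experts selected per token
    moe_token_dispatcher_type="alltoall",  # or "allgather"
)
moe_layer = MoELayer(config=config, submodules=moe_submodules, layer_number=1)
```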
- router_and_preprocess(hidden_states: torch.Tensor)#
Compute and preprocess token routing for dispatch.
This method uses the router to determine which experts to send each token to, producing routing probabilities and a mapping. It then preprocesses the hidden states and probabilities for the token dispatcher. The original hidden states are returned as a residual connection.
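The routing decision can be pictured with a generic top-k gating example in plain PyTorch; this is an illustration, not Megatron's router implementation.

```python
# Generic top-k gating illustration: each token gets a probability over the
# experts plus the indices of its top-k experts (the routing map).
import torch

num_tokens, hidden_size, num_experts, topk = 8, 16, 4, 2
hidden_states = torch.randn(num_tokens, hidden_size)
gate_weight = torch.randn(num_experts, hidden_size)   # hypothetical router weight

logits = hidden_states @ gate_weight.t()              # [tokens, experts]
probs = torch.softmax(logits, dim=-1)
topk_probs, topk_ids = probs.topk(topk, dim=-1)       # routing probabilities and mapping
```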
- dispatch(hidden_states: torch.Tensor, probs: torch.Tensor)#
Dispatches tokens to assigned expert ranks via communication. This method performs the actual communication (e.g., All-to-All) to distribute tokens and their associated probabilities to the devices hosting their assigned experts.
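The communication itself can be sketched with the underlying torch.distributed primitive; this assumes an already-initialized expert-parallel process group (here called ep_group) and equal-sized splits, whereas the real token dispatcher also handles uneven token counts and the associated probabilities.

```python
# Bare-bones All-to-All sketch using torch.distributed; not the Megatron token
# dispatcher, which additionally exchanges routing probabilities and metadata.
import torch
import torch.distributed as dist

def all_to_all_tokens(local_tokens: torch.Tensor, ep_group) -> torch.Tensor:
    # Assumes equal-sized splits across the expert-parallel group `ep_group`.
    received = torch.empty_like(local_tokens)
    dist.all_to_all_single(received, local_tokens, group=ep_group)
    return received
```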
- shared_experts_compute(hidden_states: torch.Tensor)#
Computes the output of the shared experts.
If a shared expert is configured and not overlapped with communication, it is computed here.
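Whether a shared expert exists, and whether it is overlapped with communication, is driven by TransformerConfig; a hedged sketch with illustrative values (only the moe_shared_expert_* fields are specific to the shared expert path):

```python
# Illustrative config sketch: a non-None moe_shared_expert_intermediate_size
# enables the shared expert; moe_shared_expert_overlap=False keeps its compute
# in this step instead of overlapping it with dispatch/combine communication.
from megatron.core.transformer.transformer_config import TransformerConfig

config_with_shared_expert = TransformerConfig(
    num_layers=2,
    hidden_size=1024,
    num_attention_heads=16,
    num_moe_experts=8,
    moe_router_topk=2,
    moe_shared_expert_intermediate_size=4096,
    moe_shared_expert_overlap=False,
)
```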
- routed_experts_compute(
- hidden_states: torch.Tensor,
- probs: torch.Tensor,
- residual: torch.Tensor,
)#
Computes the output of the routed experts on the dispatched tokens.
This method first post-processes the dispatched input to get permuted tokens for each expert. It then passes the tokens through the local experts. The output from the experts is preprocessed for the combine step.
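The permutation that groups dispatched tokens by local expert can be illustrated generically in plain PyTorch (not the exact Megatron permutation utilities):

```python
# Generic illustration: sort dispatched tokens so each local expert sees a
# contiguous block, and count how many tokens each expert receives.
import torch

num_local_experts, num_tokens, hidden_size = 4, 8, 16
expert_ids = torch.randint(0, num_local_experts, (num_tokens,))
tokens = torch.randn(num_tokens, hidden_size)

order = torch.argsort(expert_ids)             # permutation by assigned expert
permuted_tokens = tokens[order]               # contiguous per-expert chunks
tokens_per_expert = torch.bincount(expert_ids, minlength=num_local_experts)
```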
- combine(
- output: torch.Tensor,
- shared_expert_output: Optional[torch.Tensor],
)#
Combines expert outputs via communication and adds shared expert output.
This method uses the token dispatcher to combine the outputs from different experts (e.g., via an All-to-All communication). It then adds the output from the shared expert if it exists.
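Conceptually, after the reverse communication each token's expert outputs are weighted by its routing probabilities and summed, with the shared expert output added on top; a generic illustration (not the token dispatcher's actual code path):

```python
# Generic combine illustration: weight each token's top-k expert outputs by
# its routing probabilities, sum them, and add the shared expert output.
import torch

num_tokens, hidden_size, topk = 8, 16, 2
expert_outputs = torch.randn(num_tokens, topk, hidden_size)  # per-token top-k outputs
topk_probs = torch.softmax(torch.randn(num_tokens, topk), dim=-1)

combined = (expert_outputs * topk_probs.unsqueeze(-1)).sum(dim=1)
shared_expert_output = torch.randn(num_tokens, hidden_size)  # omitted if no shared expert
combined = combined + shared_expert_output
```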
- forward(hidden_states: torch.Tensor)#
Forward pass for the MoE layer.
The forward pass comprises four main steps:
1. Routing & Preprocessing: Route tokens to their assigned experts and prepare for dispatch.
2. Dispatch: Tokens are sent to the expert devices using communication collectives.
3. Expert Computation: Experts process the dispatched tokens.
4. Combine: The outputs from the experts are combined and returned.
- Parameters:
hidden_states (torch.Tensor) – The input tensor to the MoE layer.
- Returns:
A tuple containing the output tensor and the MLP bias, if any.
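A hedged usage sketch, assuming the moe_layer and config built in the construction sketch above, an initialized distributed environment, and Megatron's [sequence, batch, hidden] activation layout:

```python
# The input follows the [s, b, h] convention; the output keeps the same shape.
import torch

seq_len, micro_batch_size = 128, 2
hidden_states = torch.randn(
    seq_len, micro_batch_size, config.hidden_size, device="cuda", dtype=torch.bfloat16
)
output, mlp_bias = moe_layer(hidden_states)  # mlp_bias is None unless a bias is produced
assert output.shape == hidden_states.shape
```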
- backward_dw()#
Compute weight gradients for experts and shared experts.
- set_for_recompute_pre_mlp_layernorm()#
Configure the MoE layer to recompute pre_mlp_layernorm. Only needed for FP8/FP4.