core.transformer.moe.experts#
Module Contents#
Classes#
| Class | Description |
|---|---|
| GroupedMLP | An efficient implementation of the Experts layer using GroupedGEMM. |
| GroupedLinearFc1Interface | Interface for the linear_fc1 module in TEGroupedMLP. |
| GroupedLinearFc1Builder | Protocol describing how to build a linear_fc1 layer in TEGroupedMLP. |
| GroupedLinearFc2Interface | Protocol for the linear_fc2 module in TEGroupedMLP. |
| GroupedLinearFc2Builder | Protocol describing how to build a linear_fc2 layer in TEGroupedMLP. |
| TEGroupedMLPSubmodules | The dataclass for the ModuleSpecs of TEGroupedMLP submodules: linear fc1, the activation function, and linear fc2. |
| TEGroupedMLP | An efficient implementation of the Experts layer using TE’s GroupedLinear. |
| SequentialMLP | An implementation of the Experts layer using a sequence of MLP layers. |
Data#
API#
- core.transformer.moe.experts.logger#
‘getLogger(…)’
- class core.transformer.moe.experts.GroupedMLP(
- num_local_experts: int,
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- pg_collection: Optional[megatron.core.transformer.moe.moe_utils.ProcessGroupCollection] = None,
Bases:
megatron.core.transformer.module.MegatronModule

An efficient implementation of the Experts layer using GroupedGEMM.
Executes multiple experts in parallel to maximize computational efficiency.
Initialization
- forward(
- permuted_local_hidden_states: torch.Tensor,
- tokens_per_expert: torch.Tensor,
- permuted_probs: torch.Tensor,
Forward step of the GroupedMLP.
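The forward inputs follow the usual MoE token-permutation convention: tokens are pre-sorted so that all tokens routed to local expert 0 come first, then expert 1, and so on, with tokens_per_expert giving the block sizes. The following stand-alone sketch (an illustration of that convention, not Megatron's implementation) shows how the permuted input decomposes into per-expert blocks:

```python
# Hypothetical sketch: how permuted_local_hidden_states and tokens_per_expert
# relate. Tokens are pre-sorted by their assigned expert; tokens_per_expert
# holds the contiguous block size for each local expert.

def split_by_expert(permuted_tokens, tokens_per_expert):
    """Slice the permuted token sequence into one contiguous block per expert."""
    blocks, start = [], 0
    for count in tokens_per_expert:
        blocks.append(permuted_tokens[start:start + count])
        start += count
    return blocks

# 5 tokens routed to 3 local experts: 2 to expert 0, none to expert 1, 3 to expert 2.
permuted = ["t0", "t1", "t2", "t3", "t4"]
blocks = split_by_expert(permuted, [2, 0, 3])
print(blocks)  # [['t0', 't1'], [], ['t2', 't3', 't4']]
```

GroupedGEMM then runs one matmul per block against that expert's weights in a single fused call, rather than looping over experts.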
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
Maps local experts to global experts. The sharded_state_dict for the weights is compatible with the SequentialMLP, whereas the optimizer states are not, due to a limitation from weight transposing. That is, in a finetuning scenario the checkpoint is compatible with the SequentialMLP.
When the singleton_local_shards metadata flag is True, experts are broken down into separate tensors and stored under separate global keys. Additionally, similarly to MLP, layers with GLU activations are broken down into separate w and v tensors.
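The local-to-global mapping follows the usual expert-parallel layout, in which each rank holds num_local_experts consecutive experts. A minimal sketch of that indexing (an assumption about the layout, not code from this module):

```python
# Hedged sketch of the local-to-global expert mapping implied by
# sharded_state_dict: assuming each expert-parallel rank holds
# num_local_experts consecutive experts, local expert i on rank r
# corresponds to global expert r * num_local_experts + i.

def local_to_global_expert(ep_rank: int, num_local_experts: int, local_idx: int) -> int:
    return ep_rank * num_local_experts + local_idx

# With 4 local experts per rank, local expert 1 on rank 2 is global expert 9.
print(local_to_global_expert(2, 4, 1))  # 9
```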
- backward_dw()#
Performs the backward pass for weight gradients in the experts. This is an empty implementation, kept for API compatibility with SequentialMLP and TEGroupedMLP.
- class core.transformer.moe.experts.GroupedLinearFc1Interface#
Bases:
typing.Protocol

Interface for the linear_fc1 module in TEGroupedMLP.
- forward(
- permuted_local_hidden_states: torch.Tensor,
- tokens_per_expert: list[int],
- /,
Forward method for linear_fc1 module.
- backward_dw() None#
Backward method for linear_fc1 module.
- class core.transformer.moe.experts.GroupedLinearFc1Builder#
Bases:
typing.Protocol

Protocol describing how to build a linear_fc1 layer in TEGroupedMLP.
- __call__(
- num_local_experts: int,
- input_size: int,
- output_size: int,
- /,
- *,
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- init_method: collections.abc.Callable[[torch.Tensor], None],
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool,
- tp_comm_buffer_name: str | None,
- pg_collection: megatron.core.transformer.moe.moe_utils.ProcessGroupCollection | None,
Builds a linear_fc1 layer for TEGroupedMLP.
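Because these interfaces derive from typing.Protocol, conformance is structural: any class with matching methods satisfies the interface without subclassing it. The toy classes below are stand-ins for illustration only, using a simplified protocol rather than the real GroupedLinearFc1Interface:

```python
# Illustrative only: how typing.Protocol makes these interfaces structural.
# Fc1Like is a simplified stand-in for GroupedLinearFc1Interface, and ToyFc1
# is a hypothetical module, not a real Megatron class.
from typing import Protocol, runtime_checkable

@runtime_checkable
class Fc1Like(Protocol):
    def forward(self, hidden_states, tokens_per_expert, /): ...
    def backward_dw(self) -> None: ...

class ToyFc1:
    def forward(self, hidden_states, tokens_per_expert, /):
        return hidden_states  # identity, for illustration
    def backward_dw(self) -> None:
        pass

# Structural check: ToyFc1 never subclasses Fc1Like, yet it conforms.
print(isinstance(ToyFc1(), Fc1Like))  # True
```

The same pattern applies to the builder protocols: any callable with the matching `__call__` signature can serve as a GroupedLinearFc1Builder.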
- class core.transformer.moe.experts.GroupedLinearFc2Interface#
Bases:
typing.Protocol

Protocol for the linear_fc2 module in TEGroupedMLP.
- forward(
- intermediate_parallel: torch.Tensor,
- tokens_per_expert: list[int],
- /,
Forward method for linear_fc2 module.
- backward_dw() None#
Backward method for linear_fc2 module.
- class core.transformer.moe.experts.GroupedLinearFc2Builder#
Bases:
typing.Protocol

Protocol describing how to build a linear_fc2 layer in TEGroupedMLP.
- __call__(
- num_local_experts: int,
- input_size: int,
- output_size: int,
- /,
- *,
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- init_method: collections.abc.Callable[[torch.Tensor], None],
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool,
- tp_comm_buffer_name: str | None,
- pg_collection: megatron.core.transformer.moe.moe_utils.ProcessGroupCollection | None,
Builds a linear_fc2 layer for TEGroupedMLP.
- class core.transformer.moe.experts.TEGroupedMLPSubmodules#
The dataclass for the ModuleSpecs of TEGroupedMLP submodules: linear fc1, the activation function, and linear fc2.
- linear_fc1: core.transformer.moe.experts.GroupedLinearFc1Builder#
None
- linear_fc2: core.transformer.moe.experts.GroupedLinearFc2Builder#
None
- activation_func: megatron.core.transformer.mlp.TEActivationFunctionBuilder | None#
None
Builder for an activation function module; only used if config.use_te_activation_func is True.
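The shape of this spec can be mirrored with a minimal stand-in dataclass: two required builder callables plus an optional activation builder defaulting to None. ToySubmodules and the lambda builders below are hypothetical, not the real Megatron types:

```python
# A minimal stand-in (not the real TEGroupedMLPSubmodules) showing its shape:
# two required builder callables and an optional activation_func builder.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ToySubmodules:
    linear_fc1: Callable
    linear_fc2: Callable
    activation_func: Optional[Callable] = None

# Builders are callables invoked later to construct the actual layers.
spec = ToySubmodules(
    linear_fc1=lambda *args, **kwargs: "fc1 layer",
    linear_fc2=lambda *args, **kwargs: "fc2 layer",
)
print(spec.activation_func)  # None (only used with use_te_activation_func)
```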
- class core.transformer.moe.experts.TEGroupedMLP(
- num_local_experts: int,
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- submodules: core.transformer.moe.experts.TEGroupedMLPSubmodules,
- pg_collection: Optional[megatron.core.transformer.moe.moe_utils.ProcessGroupCollection] = None,
Bases:
megatron.core.transformer.module.MegatronModule

An efficient implementation of the Experts layer using TE’s GroupedLinear.
Executes multiple experts in parallel to maximize computational efficiency.
Initialization
- static _apply_bias(
- intermediate_parallel,
- bias_parallel,
- tokens_per_expert,
- permuted_probs,
- forward(
- permuted_local_hidden_states: torch.Tensor,
- tokens_per_expert: torch.Tensor,
- permuted_probs: torch.Tensor,
Forward pass of the TEGroupedMLP.
- Parameters:
permuted_local_hidden_states (torch.Tensor) – The permuted input hidden states of the local experts.
tokens_per_expert (torch.Tensor) – The number of tokens per expert.
permuted_probs (torch.Tensor) – The permuted routing probability of each token, produced by the router.
- Returns:
The output of the local experts.
- Return type:
output (torch.Tensor)
- sharded_state_dict(
- prefix: str = '',
- sharded_offsets: tuple = (),
- metadata: Optional[dict] = None,
Maps local experts to global experts. The sharded state dict is interchangeable with SequentialMLP’s.
- backward_dw()#
Performs backward pass for weight gradients in TEGroupedMLP.
This method executes the backward pass for weight gradients by calling backward_dw() on the linear layers in reverse order (fc2 followed by fc1). If an error occurs during execution, it is caught and re-raised with a descriptive message.
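The documented ordering and error handling can be sketched as follows; the Layer class and the error message are hypothetical stand-ins, not Megatron code:

```python
# Sketch of the documented backward_dw behavior: weight-gradient computation
# runs over the linear layers in reverse order (fc2 first, then fc1), and any
# failure is re-raised with a descriptive message.
calls = []

class Layer:  # hypothetical stand-in for a grouped linear layer
    def __init__(self, name):
        self.name = name
    def backward_dw(self):
        calls.append(self.name)

fc1, fc2 = Layer("fc1"), Layer("fc2")
try:
    for layer in (fc2, fc1):  # reverse order: fc2 before fc1
        layer.backward_dw()
except Exception as exc:
    raise RuntimeError("backward_dw failed in TEGroupedMLP") from exc

print(calls)  # ['fc2', 'fc1']
```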
- class core.transformer.moe.experts.SequentialMLP(
- num_local_experts,
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- submodules: megatron.core.transformer.mlp.MLPSubmodules,
- pg_collection: Optional[megatron.core.transformer.moe.moe_utils.ProcessGroupCollection] = None,
Bases:
megatron.core.transformer.module.MegatronModule

An implementation of the Experts layer using a sequence of MLP layers.
This class executes each expert sequentially.
Initialization
- _pad_tensor_for_quantization(hidden, probs)#
Pads the tensor shape to a multiple of 16 or 32.
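Quantized kernels typically require such alignment on the token dimension. The rounding itself reduces to a one-line formula, shown here as a pure-Python sketch rather than the tensor-padding code:

```python
# Sketch of rounding a length up to the next multiple of 16 (or 32), as the
# _pad_tensor_for_quantization docstring describes. Illustration only; the
# real method pads torch tensors, not integers.

def pad_to_multiple(length: int, multiple: int = 16) -> int:
    """Smallest multiple of `multiple` that is >= length."""
    return ((length + multiple - 1) // multiple) * multiple

print(pad_to_multiple(37))      # 48
print(pad_to_multiple(37, 32))  # 64
print(pad_to_multiple(32, 32))  # 32 (already aligned, no padding needed)
```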
- forward(
- permuted_local_hidden_states: torch.Tensor,
- tokens_per_expert: torch.Tensor,
- permuted_probs: torch.Tensor,
Forward step of the SequentialMLP.
- backward_dw()#
Backward pass for weight gradients in SequentialMLP.
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
Maps local experts to global experts.
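In contrast to the grouped implementations above, SequentialMLP visits one expert at a time, each consuming its contiguous slice of the permuted tokens. A hedged, stdlib-only sketch of that loop (the per-expert functions are toys, not real MLP modules):

```python
# Hypothetical sketch of the SequentialMLP idea: iterate over local experts
# and apply each one to its contiguous block of permuted tokens.

def sequential_experts(permuted_tokens, tokens_per_expert, expert_fns):
    outputs, start = [], 0
    for count, expert in zip(tokens_per_expert, expert_fns):
        chunk = permuted_tokens[start:start + count]
        outputs.extend(expert(x) for x in chunk)  # one expert at a time
        start += count
    return outputs

# Expert 0 gets the first 2 tokens, expert 1 the last one.
experts = [lambda x: x + 1, lambda x: x * 10]
print(sequential_experts([1, 2, 3], [2, 1], experts))  # [2, 3, 30]
```

This is the simplest and most portable execution strategy, which is why its checkpoint layout serves as the compatibility baseline for GroupedMLP and TEGroupedMLP.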