# nemo_automodel.components.moe.megatron.token_dispatcher

## Module Contents

### Classes

| Name                                                                                                        | Description                                                                             |
| ----------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| [`MoEFlexTokenDispatcher`](#nemo_automodel-components-moe-megatron-token_dispatcher-MoEFlexTokenDispatcher) | Flex token dispatcher supporting DeepEP, HybridEP, and UCCL-EP backends.                |
| [`TokenDispatcherConfig`](#nemo_automodel-components-moe-megatron-token_dispatcher-TokenDispatcherConfig)   | Configuration for MoE token dispatch and combine backends.                              |
| [`_DeepepManager`](#nemo_automodel-components-moe-megatron-token_dispatcher-_DeepepManager)                 | Manages fused all-to-all communication for MoE models using the DeepEP backend.         |
| [`_DispatchManager`](#nemo_automodel-components-moe-megatron-token_dispatcher-_DispatchManager)             | A manager class to handle dispatch and combine processes for MoE models.                |
| [`_HybridEPManager`](#nemo_automodel-components-moe-megatron-token_dispatcher-_HybridEPManager)             | Manages fused all-to-all communication for MoE models using the HybridEP backend.       |

### API

```python
class nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher(
    num_local_experts: int,
    local_expert_indices: typing.List[int],
    config: nemo_automodel.components.moe.megatron.token_dispatcher.TokenDispatcherConfig,
    ep_group: torch.distributed.ProcessGroup
)
```

Flex token dispatcher supporting DeepEP, HybridEP, and UCCL-EP backends.

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher._initialize_metadata(
    num_local_tokens: int,
    probs: torch.Tensor
) -> torch.Tensor
```

Initialize the routing map and probs to a unified format covering the TPxEP group.
This design decouples the communication group from underlying model parallelism groups,
such that the communication strategy of tokens can be agnostic of TP size and EP size.

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher.combine_all_to_all(
    hidden_states: torch.Tensor,
    async_finish: bool = True,
    allocate_on_comm_stream: bool = True
)
```

Performs all-to-all communication to combine tokens after expert processing.

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher.combine_postprocess(
    hidden_states: torch.Tensor
)
```

Post-processes the combined hidden states after all-to-all communication.

This method reshapes the combined hidden states to match the original input shape.

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher.combine_preprocess(
    hidden_states: torch.Tensor
)
```

Pre-processes the hidden states before combining them after expert processing.

This method restores the hidden states to their original ordering before expert processing
by using the communication manager's restoration function.

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher.dispatch_all_to_all(
    hidden_states: torch.Tensor,
    probs: torch.Tensor = None,
    async_finish: bool = True,
    allocate_on_comm_stream: bool = True
)
```

Performs all-to-all communication to dispatch tokens across expert parallel ranks.

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher.dispatch_postprocess(
    hidden_states: torch.Tensor
)
```

Post-processes the dispatched hidden states after all-to-all communication.

This method retrieves the permuted hidden states by experts, calculates the number of tokens
per expert, and returns the processed data ready for expert processing.
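
As a rough illustration of what "permuted by experts" and "tokens per expert" mean, the sketch below groups token copies by their local expert and counts them. Plain Python lists stand in for tensors, and `group_tokens_by_expert` is a hypothetical helper written for this example; it is not the library's fused kernel.

```python
from typing import List, Sequence, Tuple


def group_tokens_by_expert(
    tokens: Sequence[str],
    expert_ids: Sequence[Sequence[int]],
    num_local_experts: int,
) -> Tuple[List[str], List[int]]:
    """Order token copies by expert and count how many each expert receives.

    tokens[i] is routed to every expert listed in expert_ids[i];
    an id of -1 marks a masked-out slot (assumed convention).
    """
    buckets: List[List[str]] = [[] for _ in range(num_local_experts)]
    for token, ids in zip(tokens, expert_ids):
        for expert in ids:
            if expert >= 0:
                buckets[expert].append(token)
    # Concatenate the buckets so each expert sees a contiguous slice.
    permuted = [tok for bucket in buckets for tok in bucket]
    tokens_per_expert = [len(bucket) for bucket in buckets]
    return permuted, tokens_per_expert


permuted, tokens_per_expert = group_tokens_by_expert(
    tokens=["t0", "t1", "t2"],
    expert_ids=[[0, 1], [1, -1], [0, 1]],
    num_local_experts=2,
)
print(permuted)           # ['t0', 't2', 't0', 't1', 't2']
print(tokens_per_expert)  # [2, 3]
```

The per-expert counts are what let each expert take its contiguous slice of the permuted buffer.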

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher.dispatch_preprocess(
    hidden_states: torch.Tensor,
    num_local_tokens: int,
    probs: torch.Tensor
)
```

Preprocesses the hidden states and routing information before dispatching tokens to experts.

**Parameters:**

* hidden\_states (torch.Tensor): Input hidden states to be processed.
* num\_local\_tokens (int): Number of tokens to be processed.
* probs (torch.Tensor): Routing probabilities for each token-expert pair.

**Returns:**

Tuple containing:

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher.dispatch_preprocess2(
    hidden_states: torch.Tensor,
    num_local_tokens: int,
    token_probs: torch.Tensor,
    token_indices: torch.Tensor
)
```

Preprocesses the hidden states and routing information before dispatching tokens to experts.

* For the DeepEP backend: uses token\_indices and token\_probs directly.
* For the HybridEP backend: converts token\_indices to a routing\_map (multihot format).

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher.token_permutation(
    hidden_states: torch.Tensor,
    num_local_tokens: int,
    probs: torch.Tensor
) -> typing.Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
```

Permutes tokens according to probs and dispatches them to experts.

This method implements the token permutation process in three steps:

1. Preprocess the hidden states
2. Perform all-to-all communication to dispatch tokens
3. Post-process the dispatched tokens for expert processing

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher.token_permutation2(
    hidden_states: torch.Tensor,
    num_local_tokens: int,
    token_probs: torch.Tensor,
    token_indices: torch.Tensor
) -> typing.Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
```

Permutes tokens according to token\_probs and token\_indices and dispatches them to experts.

This method implements the token permutation process in three steps:

1. Preprocess the hidden states
2. Perform all-to-all communication to dispatch tokens
3. Post-process the dispatched tokens for expert processing

```python
nemo_automodel.components.moe.megatron.token_dispatcher.MoEFlexTokenDispatcher.token_unpermutation(
    hidden_states: torch.Tensor
) -> torch.Tensor
```

Reverses the token permutation process to restore the original token order.

This method implements the token unpermutation process in three steps:

1. Pre-process the hidden states to restore their original ordering
2. Perform all-to-all communication to combine tokens
3. Post-process the combined tokens to match the original input shape
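
To make the permute/unpermute round trip concrete, here is a minimal single-rank sketch in plain Python. It skips the all-to-all step entirely and assumes that routing probabilities are applied when token copies are merged back; the helpers `permute` and `unpermute` are hypothetical and do not correspond to the library's kernels.

```python
from typing import List, Sequence, Tuple

Order = List[Tuple[int, int]]  # (token position, top-k slot) per permuted row


def permute(
    tokens: Sequence[float],
    expert_ids: Sequence[Sequence[int]],
    num_experts: int,
) -> Tuple[List[float], Order]:
    """Lay out one token copy per (token, expert) pair, grouped by expert."""
    order: Order = []
    for expert in range(num_experts):
        for pos, ids in enumerate(expert_ids):
            for slot, chosen in enumerate(ids):
                if chosen == expert:
                    order.append((pos, slot))
    return [tokens[pos] for pos, _ in order], order


def unpermute(
    expert_out: Sequence[float],
    order: Order,
    probs: Sequence[Sequence[float]],
    num_tokens: int,
) -> List[float]:
    """Scatter expert outputs back to token order, weighting by routing prob."""
    restored = [0.0] * num_tokens
    for value, (pos, slot) in zip(expert_out, order):
        restored[pos] += probs[pos][slot] * value
    return restored


tokens = [1.0, 2.0, 3.0]
expert_ids = [[0, 1], [1, 0], [0, 1]]            # top-2 experts per token
probs = [[0.75, 0.25], [0.5, 0.5], [0.25, 0.75]]

permuted, order = permute(tokens, expert_ids, num_experts=2)
expert_out = list(permuted)                      # identity "experts" for the demo
restored = unpermute(expert_out, order, probs, num_tokens=len(tokens))
print(restored)  # [1.0, 2.0, 3.0]
```

With identity experts and probabilities that sum to one per token, the restored values equal the inputs, which is a convenient sanity check for any permutation scheme.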

```python
class nemo_automodel.components.moe.megatron.token_dispatcher.TokenDispatcherConfig(
    moe_enable_deepep: bool = True,
    moe_permute_fusion: bool = False,
    moe_expert_capacity_factor: typing.Optional[float] = None,
    moe_router_topk: int = 2,
    moe_router_expert_pad_multiple: typing.Optional[int] = None,
    num_moe_experts: int = 64,
    moe_router_dtype: str = 'fp32',
    moe_flex_dispatcher_backend: typing.Literal['deepep', 'hybridep', 'uccl_ep'] = 'deepep',
    moe_deepep_num_sms: int = 20,
    moe_hybridep_num_sms: int = 24,
    moe_share_token_dispatcher: bool = True,
    moe_deepep_async_dispatch: bool = False
)
```

Dataclass

Configuration for MoE token dispatch and combine backends.

* moe\_deepep\_async\_dispatch (bool): Use asynchronous DeepEP/UCCL-EP dispatch and allocate dispatched tensors on the communication stream.
* moe\_deepep\_num\_sms (int): Number of SMs to use for the DeepEP backend.
* moe\_enable\_deepep (bool): Enable DeepEP for efficient token dispatch and combine in MoE models.
* moe\_expert\_capacity\_factor (float): The capacity factor for each expert. None means no token
  will be dropped. The default is None.
* moe\_flex\_dispatcher\_backend (str): Backend for the flex token dispatcher. Options: 'deepep', 'hybridep', or 'uccl\_ep'.
* moe\_hybridep\_num\_sms (int): Number of SMs to use for the HybridEP dispatch and combine APIs.
* moe\_permute\_fusion (bool): Fuse token rearrangement ops during token dispatching.
* moe\_router\_dtype (str): Data type for routing and expert output weighted averaging. Using fp32 or fp64 can
  improve stability, especially when the number of experts is large (e.g. fine-grained MoE).
  None means no changes for dtype.
* moe\_router\_expert\_pad\_multiple (int): The multiple to which the number of tokens for each expert is padded.
* moe\_router\_topk (int): Number of experts to route to for each token.
* moe\_share\_token\_dispatcher (bool): Share one communication manager instance across MoE layers for the configured backend.
* num\_moe\_experts (int): Number of experts to use for the MoE layer. When set, it replaces the MLP with an MoE layer. Set to None
  for no MoE.
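
The arithmetic behind the padding and capacity options can be sketched as follows. Both helpers are hypothetical. The padding rule follows directly from "pad to a multiple", while the capacity formula is an assumption based on common MoE practice and is not confirmed by this page.

```python
import math
from typing import Optional


def pad_to_multiple(num_tokens: int, multiple: Optional[int]) -> int:
    """Round a per-expert token count up to the next multiple (no-op if unset)."""
    if not multiple:
        return num_tokens
    return ((num_tokens + multiple - 1) // multiple) * multiple


def expert_capacity(
    num_tokens: int,
    topk: int,
    num_experts: int,
    capacity_factor: Optional[float],
) -> Optional[int]:
    """Assumed formula: ceil(tokens * topk / experts * factor).

    Returns None when capacity_factor is None, meaning no token is dropped.
    """
    if capacity_factor is None:
        return None
    return math.ceil(num_tokens * topk / num_experts * capacity_factor)


print(pad_to_multiple(13, 8))               # 16
print(pad_to_multiple(13, None))            # 13
print(expert_capacity(1024, 2, 64, 1.5))    # 48
print(expert_capacity(1024, 2, 64, None))   # None
```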

```python
class nemo_automodel.components.moe.megatron.token_dispatcher._DeepepManager(
    group: torch.distributed.ProcessGroup,
    router_topk: int,
    permute_fusion: bool = False,
    capacity_factor: typing.Optional[float] = None,
    num_experts: typing.Optional[int] = None,
    num_local_experts: typing.Optional[int] = None,
    router_dtype: typing.Optional[str] = None,
    moe_router_expert_pad_multiple: typing.Optional[int] = None,
    _dispatch_fn = None,
    _combine_fn = None
)
```

**Bases:** [\_DispatchManager](#nemo_automodel-components-moe-megatron-token_dispatcher-_DispatchManager)

A manager class to handle fused all-to-all communication processes for MoE models using
DeepEP backend. See [https://github.com/deepseek-ai/deepep](https://github.com/deepseek-ai/deepep) for more details.

The workflow of the DeepEP dispatcher is:

1. setup\_metadata(): Process routing map and probabilities to prepare dispatch metadata.
2. dispatch():
   * Use a fused kernel to permute tokens and perform all-to-all communication in a single step.
3. get\_permuted\_hidden\_states\_by\_experts():
   * Convert routing map and probabilities to multihot format.
   * Permute tokens using a fused kernel.
4. get\_restored\_hidden\_states\_by\_experts():
   * Reverse the permutation using a fused kernel.
5. combine():
   * Reverse the dispatch using a fused kernel to unpermute and perform all-to-all in a single step.

This implementation uses fused communication kernels (fused\_dispatch/fused\_combine) that
combine permutation and communication operations for improved efficiency compared to
separate permute+alltoall steps.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DeepepManager._indices_to_multihot(
    indices,
    probs
)
```

Converts a tensor of indices to a multihot vector.

**Parameters:**

* indices: \[num\_tokens, topk] token indices, where -1 means masked out.
* probs: \[num\_tokens, topk] token probabilities.

**Returns:**

Tuple\[torch.Tensor, torch.Tensor]:

* routing\_map: Multihot vector.
* probs: Multihot probabilities.
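
The conversion from top-k indices to a multihot layout can be illustrated with plain Python lists. This is only a sketch: `indices_to_multihot` is a hypothetical stand-in, and the output width (`num_experts` columns) is an assumption about the layout rather than something this page specifies.

```python
from typing import List, Sequence, Tuple


def indices_to_multihot(
    indices: Sequence[Sequence[int]],
    probs: Sequence[Sequence[float]],
    num_experts: int,
) -> Tuple[List[List[bool]], List[List[float]]]:
    """Expand [num_tokens, topk] indices/probs into [num_tokens, num_experts]."""
    routing_map = [[False] * num_experts for _ in indices]
    multihot_probs = [[0.0] * num_experts for _ in indices]
    for row, (ids, row_probs) in enumerate(zip(indices, probs)):
        for expert, prob in zip(ids, row_probs):
            if expert < 0:  # -1 marks a masked-out slot
                continue
            routing_map[row][expert] = True
            multihot_probs[row][expert] = prob
    return routing_map, multihot_probs


routing_map, multihot_probs = indices_to_multihot(
    indices=[[0, 2], [1, -1]],
    probs=[[0.6, 0.4], [1.0, 0.0]],
    num_experts=4,
)
print(routing_map)     # [[True, False, True, False], [False, True, False, False]]
print(multihot_probs)  # [[0.6, 0.0, 0.4, 0.0], [0.0, 1.0, 0.0, 0.0]]
```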

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DeepepManager.combine(
    hidden_states: torch.Tensor,
    async_finish: bool = False,
    allocate_on_comm_stream: bool = False
) -> torch.Tensor
```

Reverses the dispatch using a fused kernel that unpermutes tokens and performs all-to-all in a single step.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DeepepManager.dispatch(
    hidden_states: torch.Tensor,
    async_finish: bool = False,
    allocate_on_comm_stream: bool = False
) -> torch.Tensor
```

Dispatch the hidden\_states

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DeepepManager.get_dispatched_metadata() -> torch.Tensor
```

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DeepepManager.get_number_of_tokens_per_expert() -> torch.Tensor
```

Get the number of tokens per expert.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DeepepManager.get_permuted_hidden_states_by_experts(
    hidden_states: torch.Tensor
) -> torch.Tensor
```

* Convert routing map and probabilities to multihot format
* Permute tokens using fused kernel

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DeepepManager.get_restored_hidden_states_by_experts(
    hidden_states: torch.Tensor
) -> torch.Tensor
```

Restore the hidden states to their original ordering before expert processing

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DeepepManager.setup_metadata(
    num_local_tokens: int,
    probs: torch.Tensor
)
```

Process routing map and probabilities to prepare dispatch metadata

```python
class nemo_automodel.components.moe.megatron.token_dispatcher._DispatchManager()
```

Abstract

A manager class to handle dispatch and combine processes for MoE models.

\_DispatchManager handles token dispatching according to a routing\_map of shape
\[num\_local\_tokens, world\_size, num\_instances]. The routing\_map is a 3D tensor in which each
element indicates whether a token should be sent to a specific rank.

num\_instances is the maximum number of token instances dispatched to a target rank. It
can be the number of local experts or the size of the sub\_group.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DispatchManager.combine(
    hidden_states: torch.Tensor
) -> torch.Tensor
```

abstract

Combine the hidden\_states after expert processing.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DispatchManager.dispatch(
    hidden_states: torch.Tensor
) -> torch.Tensor
```

abstract

Dispatch the hidden\_states according to the routing\_map.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DispatchManager.get_dispatched_metadata() -> torch.Tensor
```

abstract

Get the metadata of the dispatched hidden\_states.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DispatchManager.get_permuted_hidden_states_by_experts(
    hidden_states: torch.Tensor
) -> torch.Tensor
```

abstract

Get the permuted hidden states by experts.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DispatchManager.get_restored_hidden_states_by_experts(
    hidden_states: torch.Tensor
) -> torch.Tensor
```

abstract

Get the restored hidden states by experts.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._DispatchManager.setup_metadata(
    routing_map: torch.Tensor,
    probs: torch.Tensor
)
```

abstract

Set up metadata of routing\_map and probs.

```python
class nemo_automodel.components.moe.megatron.token_dispatcher._HybridEPManager(
    group: torch.distributed.ProcessGroup,
    num_local_experts: int,
    num_experts: int,
    router_topk: int,
    permute_fusion: bool = False,
    moe_hybridep_num_sms: int = 24
)
```

**Bases:** [\_DispatchManager](#nemo_automodel-components-moe-megatron-token_dispatcher-_DispatchManager)

A manager class to handle fused all-to-all communication processes for MoE models using
HybridEP backend. See [https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep](https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep) for more details.

The workflow of the HybridEP dispatcher is:

1. setup\_metadata(): Process routing map and probabilities to prepare dispatch metadata.
2. dispatch():
   * Permute tokens for communication, perform all-to-all communication,
     and permute tokens for experts in a single step.
3. combine():
   * Unpermute tokens for communication, perform all-to-all communication,
     and unpermute tokens for attention in a single step.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._HybridEPManager._indices_to_multihot(
    indices: torch.Tensor,
    probs: torch.Tensor
)
```

Converts a tensor of indices to a multihot vector.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._HybridEPManager.combine(
    hidden_states: torch.Tensor,
    async_finish: bool = True,
    allocate_on_comm_stream: bool = True
) -> torch.Tensor
```

```python
nemo_automodel.components.moe.megatron.token_dispatcher._HybridEPManager.dispatch(
    hidden_states: torch.Tensor,
    async_finish: bool = True,
    allocate_on_comm_stream: bool = True
) -> torch.Tensor
```

```python
nemo_automodel.components.moe.megatron.token_dispatcher._HybridEPManager.get_dispatched_metadata() -> torch.Tensor
```

```python
nemo_automodel.components.moe.megatron.token_dispatcher._HybridEPManager.get_number_of_tokens_per_expert() -> torch.Tensor
```

```python
nemo_automodel.components.moe.megatron.token_dispatcher._HybridEPManager.get_permuted_hidden_states_by_experts(
    hidden_states: torch.Tensor
) -> torch.Tensor
```

```python
nemo_automodel.components.moe.megatron.token_dispatcher._HybridEPManager.get_restored_hidden_states_by_experts(
    hidden_states: torch.Tensor
) -> torch.Tensor
```

```python
nemo_automodel.components.moe.megatron.token_dispatcher._HybridEPManager.setup_metadata(
    routing_map: torch.Tensor,
    probs: torch.Tensor
)
```

Process routing map and probabilities to prepare dispatch metadata.

```python
nemo_automodel.components.moe.megatron.token_dispatcher._HybridEPManager.setup_metadata_from_indices(
    token_indices: torch.Tensor,
    token_probs: torch.Tensor
)
```

Convert from topk indices format to multihot routing\_map format.