# nemo_automodel.components.moe.megatron.fused_a2a

## Module Contents

### Classes

| Name                                                                                       | Description                                                                               |
| ------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| [`FusedCombine`](#nemo_automodel-components-moe-megatron-fused_a2a-FusedCombine)           | Fused combine operation for MoE output combining computation and communication.           |
| [`FusedDispatch`](#nemo_automodel-components-moe-megatron-fused_a2a-FusedDispatch)         | Fused dispatch operation for MoE routing combining computation and communication.         |
| [`HybridEPCombine`](#nemo_automodel-components-moe-megatron-fused_a2a-HybridEPCombine)     | Fused combine operation for unpermute + combine a2a + unpermute using the HybridEP backend. |
| [`HybridEPDispatch`](#nemo_automodel-components-moe-megatron-fused_a2a-HybridEPDispatch)   | Fused dispatch operation for permute + dispatch a2a + permute using the HybridEP backend. |
| [`UCCLFusedCombine`](#nemo_automodel-components-moe-megatron-fused_a2a-UCCLFusedCombine)   | Fused combine using UCCL-EP instead of DeepEP.                                            |
| [`UCCLFusedDispatch`](#nemo_automodel-components-moe-megatron-fused_a2a-UCCLFusedDispatch) | Fused dispatch using UCCL-EP instead of DeepEP.                                           |

### Functions

| Name                                                                                                 | Description                                                                               |
| ---------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| [`_is_nvshmem_available`](#nemo_automodel-components-moe-megatron-fused_a2a-_is_nvshmem_available)   | Check if DeepEP was compiled with NVSHMEM support.                                        |
| [`free_buffer`](#nemo_automodel-components-moe-megatron-fused_a2a-free_buffer)                       | Destroy the global DeepEP `Buffer` and release its NVSHMEM/cpp runtime.                   |
| [`fused_combine`](#nemo_automodel-components-moe-megatron-fused_a2a-fused_combine)                   | Perform fused combine operation if deep\_ep is available.                                 |
| [`fused_dispatch`](#nemo_automodel-components-moe-megatron-fused_a2a-fused_dispatch)                 | Perform fused dispatch operation if deep\_ep is available.                                |
| [`get_buffer`](#nemo_automodel-components-moe-megatron-fused_a2a-get_buffer)                         | Get or create a buffer for all-to-all communication.                                      |
| [`get_hidden_bytes`](#nemo_automodel-components-moe-megatron-fused_a2a-get_hidden_bytes)             | Calculate the number of hidden bytes for a tensor.                                        |
| [`get_uccl_buffer`](#nemo_automodel-components-moe-megatron-fused_a2a-get_uccl_buffer)               | Get or create a UCCL-EP buffer for all-to-all communication.                              |
| [`hybrid_ep_combine`](#nemo_automodel-components-moe-megatron-fused_a2a-hybrid_ep_combine)           | Perform fused combine for unpermute + combine a2a + unpermute using the HybridEP backend. |
| [`hybrid_ep_dispatch`](#nemo_automodel-components-moe-megatron-fused_a2a-hybrid_ep_dispatch)         | Perform fused dispatch for permute + dispatch a2a + permute using the HybridEP backend.   |
| [`init_hybrid_ep_buffer`](#nemo_automodel-components-moe-megatron-fused_a2a-init_hybrid_ep_buffer)   | Initialize the HybridEP buffer, including buffer allocation and metadata initialization.  |
| [`reset_hybrid_ep_buffer`](#nemo_automodel-components-moe-megatron-fused_a2a-reset_hybrid_ep_buffer) | Reset the HybridEP buffer.                                                                |
| [`set_deepep_num_sms`](#nemo_automodel-components-moe-megatron-fused_a2a-set_deepep_num_sms)         | Sets the number of SMs to use for DeepEP.                                                 |
| [`set_uccl_num_sms`](#nemo_automodel-components-moe-megatron-fused_a2a-set_uccl_num_sms)             | Sets the number of SMs to use for UCCL-EP.                                                |
| [`uccl_fused_combine`](#nemo_automodel-components-moe-megatron-fused_a2a-uccl_fused_combine)         | Perform fused combine using UCCL-EP.                                                      |
| [`uccl_fused_dispatch`](#nemo_automodel-components-moe-megatron-fused_a2a-uccl_fused_dispatch)       | Perform fused dispatch using UCCL-EP.                                                     |

### Data

[`HAVE_DEEP_EP`](#nemo_automodel-components-moe-megatron-fused_a2a-HAVE_DEEP_EP)

[`HAVE_HYBRIDEP`](#nemo_automodel-components-moe-megatron-fused_a2a-HAVE_HYBRIDEP)

[`HAVE_UCCL_EP`](#nemo_automodel-components-moe-megatron-fused_a2a-HAVE_UCCL_EP)

[`_buffer`](#nemo_automodel-components-moe-megatron-fused_a2a-_buffer)

[`_hybrid_ep_buffer`](#nemo_automodel-components-moe-megatron-fused_a2a-_hybrid_ep_buffer)

[`_nvshmem_available`](#nemo_automodel-components-moe-megatron-fused_a2a-_nvshmem_available)

[`_uccl_buffer`](#nemo_automodel-components-moe-megatron-fused_a2a-_uccl_buffer)

### API

```python
class nemo_automodel.components.moe.megatron.fused_a2a.FusedCombine()
```

**Bases:** `Function`

Fused combine operation for MoE output combining computation and communication.

```python
nemo_automodel.components.moe.megatron.fused_a2a.FusedCombine.backward(
    ctx,
    grad_output,
    previous_event = None
)
```

staticmethod

Backward pass of fused combine.

```python
nemo_automodel.components.moe.megatron.fused_a2a.FusedCombine.forward(
    ctx,
    x,
    group,
    handle,
    async_finish = False,
    allocate_on_comm_stream = False
)
```

staticmethod

Forward pass of fused combine.

```python
class nemo_automodel.components.moe.megatron.fused_a2a.FusedDispatch()
```

**Bases:** `Function`

Fused dispatch operation for MoE routing combining computation and communication.

```python
nemo_automodel.components.moe.megatron.fused_a2a.FusedDispatch.backward(
    ctx,
    grad_output,
    grad_token_indices,
    grad_token_probs,
    grad_tokens_per_expert,
    grad_handle
)
```

staticmethod

Backward pass of fused dispatch.

```python
nemo_automodel.components.moe.megatron.fused_a2a.FusedDispatch.forward(
    ctx,
    x,
    token_indices,
    token_probs,
    num_experts,
    group,
    async_finish = False,
    allocate_on_comm_stream = False
)
```

staticmethod

Forward pass of fused dispatch.

```python
class nemo_automodel.components.moe.megatron.fused_a2a.HybridEPCombine()
```

**Bases:** `Function`

Fused combine operation for unpermute + combine a2a + unpermute using the HybridEP backend.

```python
nemo_automodel.components.moe.megatron.fused_a2a.HybridEPCombine.backward(
    ctx,
    grad_x
)
```

staticmethod

Backward pass of fused combine of the HybridEP backend.

```python
nemo_automodel.components.moe.megatron.fused_a2a.HybridEPCombine.forward(
    ctx,
    x,
    handle,
    num_permuted_tokens = None,
    pad_multiple = None
)
```

staticmethod

Forward pass of fused combine of the HybridEP backend.

```python
class nemo_automodel.components.moe.megatron.fused_a2a.HybridEPDispatch()
```

**Bases:** `Function`

Fused dispatch operation for permute + dispatch a2a + permute using the HybridEP backend.

```python
nemo_automodel.components.moe.megatron.fused_a2a.HybridEPDispatch.backward(
    ctx,
    grad_x,
    grad_probs,
    grad_scaling_factor,
    grad_tokens_per_expert,
    grad_handle
)
```

staticmethod

Backward pass of fused dispatch of the HybridEP backend.

```python
nemo_automodel.components.moe.megatron.fused_a2a.HybridEPDispatch.forward(
    ctx,
    x,
    routing_map,
    probs,
    group,
    num_local_experts,
    num_sms_dispatch_api = 24,
    num_sms_combine_api = 24,
    num_permuted_tokens = None,
    pad_multiple = None
)
```

staticmethod

Forward pass of fused dispatch of the HybridEP backend.

```python
class nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedCombine()
```

**Bases:** `Function`

Fused combine using UCCL-EP instead of DeepEP.

```python
nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedCombine.backward(
    ctx,
    grad_output,
    _grad_event = None
)
```

staticmethod

```python
nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedCombine.forward(
    ctx,
    x,
    group,
    handle,
    async_finish = False,
    allocate_on_comm_stream = False
)
```

staticmethod

```python
class nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedDispatch()
```

**Bases:** `Function`

Fused dispatch using UCCL-EP instead of DeepEP.

```python
nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedDispatch.backward(
    ctx,
    grad_output,
    grad_token_indices,
    grad_token_probs,
    grad_tokens_per_expert,
    grad_handle
)
```

staticmethod

```python
nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedDispatch.forward(
    ctx,
    x,
    token_indices,
    token_probs,
    num_experts,
    group,
    async_finish = False,
    allocate_on_comm_stream = False
)
```

staticmethod

```python
nemo_automodel.components.moe.megatron.fused_a2a._is_nvshmem_available() -> bool
```

Check if DeepEP was compiled with NVSHMEM support.

Uses `is_sm90_compiled()` as a proxy: DeepEP's build enforces that
NVSHMEM is disabled when SM90 features are disabled.
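The module-level `_nvshmem_available = None` entry suggests that the check is run lazily and its result cached. A minimal sketch of that pattern follows, assuming a one-time probe; `is_nvshmem_available` and `probe` are hypothetical names, and `probe` stands in for the real SM90 check so the example runs without DeepEP installed.

```python
from typing import Callable, Optional

# Module-level cache; None means "not probed yet".
_nvshmem_available: Optional[bool] = None


def is_nvshmem_available(probe: Callable[[], bool]) -> bool:
    """Run `probe` once and reuse its answer on later calls."""
    global _nvshmem_available
    if _nvshmem_available is None:
        try:
            _nvshmem_available = bool(probe())
        except Exception:
            # Assumption: a failing probe is treated as "not available".
            _nvshmem_available = False
    return _nvshmem_available


calls = []


def fake_probe() -> bool:
    # Hypothetical stand-in for the SM90 build check.
    calls.append(1)
    return True


first = is_nvshmem_available(fake_probe)
second = is_nvshmem_available(fake_probe)  # served from the cache
```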

```python
nemo_automodel.components.moe.megatron.fused_a2a.free_buffer() -> None
```

Destroy the global DeepEP `Buffer` and release its NVSHMEM/cpp runtime.

DeepEP keeps a process-global communication buffer backed by NVSHMEM symmetric memory.
It is normally never torn down (`destroy_process_group` hangs on DeepEP's NCCL
sub-groups, so cleanup is skipped), but that leftover GPU state survives process exit for
the whole Slurm allocation and corrupts subsequent forwards. Destroying the buffer first
frees the runtime and lets a clean `destroy_process_group` follow without hanging.
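The ordering described above (buffer first, process group second) can be sketched as a small helper. This is an illustration only: `shutdown` is a hypothetical name, and the two callables are stand-ins that record the call order so the example runs without GPUs.

```python
from typing import Callable, List


def shutdown(free_buffer: Callable[[], None],
             destroy_process_group: Callable[[], None]) -> None:
    """Free the DeepEP buffer first, then destroy the process group."""
    free_buffer()
    destroy_process_group()


# Stand-ins that only record the order in which they are called.
order: List[str] = []
shutdown(lambda: order.append("free_buffer"),
         lambda: order.append("destroy_process_group"))
```

In a training script, the two arguments would presumably be this module's `free_buffer` and `torch.distributed.destroy_process_group`.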

```python
nemo_automodel.components.moe.megatron.fused_a2a.fused_combine(
    x,
    group,
    handle,
    async_finish = False,
    allocate_on_comm_stream = False
)
```

Perform fused combine operation if deep\_ep is available.

**Parameters:**

`x`: Input tensor

`group`: Process group

`handle`: Communication handle

Previous CUDA event (no matching parameter appears in the signature above)

**Returns:**

Result of `FusedCombine`
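As a mental model only, combine gathers the expert outputs that belong to the same token and reduces them into one row per token. The single-process, pure-Python sketch below is not the fused kernel and ignores communication, handles, and async events; `reference_combine` and its `(token_id, vector)` input layout are hypothetical.

```python
from typing import List, Tuple


def reference_combine(num_tokens: int,
                      expert_outputs: List[Tuple[int, List[float]]]) -> List[List[float]]:
    """Sum every expert output back into its source token's row."""
    hidden = len(expert_outputs[0][1]) if expert_outputs else 0
    combined = [[0.0] * hidden for _ in range(num_tokens)]
    for token_id, vec in expert_outputs:
        for j, value in enumerate(vec):
            combined[token_id][j] += value
    return combined


# Token 0 was processed by two experts, token 1 by one.
outputs = [(0, [1.0, 2.0]), (1, [5.0, 5.0]), (0, [0.5, 0.5])]
result = reference_combine(2, outputs)
```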

```python
nemo_automodel.components.moe.megatron.fused_a2a.fused_dispatch(
    x,
    token_indices,
    token_probs,
    num_experts,
    group,
    async_finish = False,
    allocate_on_comm_stream = False
)
```

Perform fused dispatch operation if deep\_ep is available.

**Parameters:**

`x`: Input tensor \[num\_tokens, hidden\_size]

`token_indices`: Token routing indices \[num\_tokens, topk]

`token_probs`: Token routing probabilities \[num\_tokens, topk]

`num_experts`: Number of experts

`group`: Process group

Previous CUDA event (no matching parameter appears in the signature above)

**Returns:**

Result of `FusedDispatch`
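To make the `[num_tokens, topk]` routing layout concrete, the sketch below groups token ids by the experts they are routed to and counts tokens per expert. It is a single-process illustration of the bookkeeping, not the fused implementation; `reference_dispatch` is a hypothetical helper and plain lists stand in for tensors.

```python
from typing import List, Tuple


def reference_dispatch(token_indices: List[List[int]],
                       num_experts: int) -> Tuple[List[List[int]], List[int]]:
    """Group token ids by expert and count tokens per expert."""
    per_expert: List[List[int]] = [[] for _ in range(num_experts)]
    for token_id, experts in enumerate(token_indices):
        for expert_id in experts:
            per_expert[expert_id].append(token_id)
    tokens_per_expert = [len(ids) for ids in per_expert]
    return per_expert, tokens_per_expert


# 3 tokens, topk = 2, 4 experts.
token_indices = [[0, 2], [2, 3], [0, 3]]
per_expert, tokens_per_expert = reference_dispatch(token_indices, num_experts=4)
```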

```python
nemo_automodel.components.moe.megatron.fused_a2a.get_buffer(
    group: torch.distributed.ProcessGroup,
    hidden_bytes: int
)
```

Get or create a buffer for all-to-all communication.

**Parameters:**

`group`: Process group for communication

`hidden_bytes`: Number of hidden bytes needed

**Returns:**

Communication buffer
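A "get or create" buffer is usually a cached global that is rebuilt only when it no longer fits the request. The sketch below shows that pattern under simplifying assumptions: `FakeBuffer` and `get_or_create_buffer` are hypothetical, a string stands in for the process group, and the real sizing logic (which depends on the backend's own configuration) is not modeled.

```python
from typing import Optional


class FakeBuffer:
    """Hypothetical stand-in for a communication buffer."""

    def __init__(self, group: str, num_bytes: int) -> None:
        self.group = group
        self.num_bytes = num_bytes


_buffer: Optional[FakeBuffer] = None


def get_or_create_buffer(group: str, required_bytes: int) -> FakeBuffer:
    """Reuse the cached buffer unless the group changed or it is too small."""
    global _buffer
    if (_buffer is None
            or _buffer.group != group
            or _buffer.num_bytes < required_bytes):
        _buffer = FakeBuffer(group, required_bytes)
    return _buffer


a = get_or_create_buffer("ep_group", 1024)
b = get_or_create_buffer("ep_group", 512)   # fits, so the cached buffer is reused
c = get_or_create_buffer("ep_group", 4096)  # too small, so a new buffer is created
```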

```python
nemo_automodel.components.moe.megatron.fused_a2a.get_hidden_bytes(
    x: torch.Tensor
) -> int
```

Calculate the number of hidden bytes for a tensor.

**Parameters:**

`x`: Input tensor

**Returns:** `int`

Number of hidden bytes
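The exact formula is not given here. One plausible reading, stated as an assumption, is that hidden bytes is the size of one token row in bytes, with the element size floored at 2 bytes so that 1-byte dtypes still reserve room for a 2-byte representation. The helper below is a hypothetical pure-Python sketch of that arithmetic and takes plain integers rather than a tensor.

```python
def hidden_bytes(hidden_size: int, element_size: int, min_element_size: int = 2) -> int:
    """Bytes per token row; `min_element_size` is an assumed floor, not a documented one."""
    return hidden_size * max(element_size, min_element_size)


bf16_bytes = hidden_bytes(4096, 2)  # 2-byte elements
fp8_bytes = hidden_bytes(4096, 1)   # 1-byte elements, floored to 2 bytes
fp32_bytes = hidden_bytes(4096, 4)  # 4-byte elements
```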

```python
nemo_automodel.components.moe.megatron.fused_a2a.get_uccl_buffer(
    group: torch.distributed.ProcessGroup,
    hidden_bytes: int
)
```

Get or create a UCCL-EP buffer for all-to-all communication.

```python
nemo_automodel.components.moe.megatron.fused_a2a.hybrid_ep_combine(
    x,
    handle,
    num_permuted_tokens = None,
    pad_multiple = None
)
```

Perform fused combine for unpermute + combine a2a + unpermute using the HybridEP backend.

```python
nemo_automodel.components.moe.megatron.fused_a2a.hybrid_ep_dispatch(
    x,
    routing_map,
    probs,
    group,
    num_local_experts,
    num_sms_dispatch_api = 24,
    num_sms_combine_api = 24,
    num_permuted_tokens = None,
    pad_multiple = None
)
```

Perform fused dispatch for permute + dispatch a2a + permute using the HybridEP backend.
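The `pad_multiple` parameter is not documented here. A common use of such a parameter, assumed for this sketch, is to round each expert's token count up to a multiple so downstream kernels see aligned shapes, with `None` meaning no padding. `padded_counts` is a hypothetical helper, not part of this module.

```python
from typing import List, Optional


def padded_counts(tokens_per_expert: List[int],
                  pad_multiple: Optional[int]) -> List[int]:
    """Round each count up to the next multiple of `pad_multiple` (no-op if None)."""
    if pad_multiple is None:
        return list(tokens_per_expert)
    return [(n + pad_multiple - 1) // pad_multiple * pad_multiple
            for n in tokens_per_expert]


aligned = padded_counts([5, 0, 16, 17], pad_multiple=16)
unchanged = padded_counts([5, 0, 16, 17], pad_multiple=None)
```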

```python
nemo_automodel.components.moe.megatron.fused_a2a.init_hybrid_ep_buffer(
    group: torch.distributed.ProcessGroup,
    hidden_dim: int,
    seq_len: int,
    num_local_experts: int,
    num_sms_dispatch_api: int,
    num_sms_combine_api: int,
    fp8_dispatch: bool
) -> None
```

Initialize the HybridEP buffer, including buffer allocation and metadata initialization.

If a dispatch or combine call requires a larger buffer than the one
initialized, the buffer is reallocated at runtime,
incurring extra overhead.

**Parameters:**

`group`: Process group for HybridEP all-to-all communication.

`hidden_dim`: Hidden dimension of the input tensor.

`seq_len`: Maximum sequence length of the input tensor.

`num_local_experts`: Number of local experts.

`num_sms_dispatch_api`: Number of SMs used by the dispatch API.

`num_sms_combine_api`: Number of SMs used by the combine API.

`fp8_dispatch`: Whether to use FP8 communication during the dispatch phase.
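Because an undersized buffer is reallocated at runtime, it helps to initialize with the largest sequence length the job will see. The sketch below illustrates that sizing decision only; `max_tokens_to_reserve` and `needs_realloc` are hypothetical helpers, and the real buffer size also depends on the other arguments.

```python
from typing import Iterable


def max_tokens_to_reserve(batch_token_counts: Iterable[int]) -> int:
    """Pick the largest expected token count as the initial `seq_len`."""
    return max(batch_token_counts, default=0)


def needs_realloc(reserved_tokens: int, runtime_tokens: int) -> bool:
    """A call that exceeds the reserved size would trigger a reallocation."""
    return runtime_tokens > reserved_tokens


reserved = max_tokens_to_reserve([2048, 4096, 1024])
fits = not needs_realloc(reserved, 4096)
grows = needs_realloc(reserved, 8192)
```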

```python
nemo_automodel.components.moe.megatron.fused_a2a.reset_hybrid_ep_buffer()
```

Reset the HybridEP buffer.

```python
nemo_automodel.components.moe.megatron.fused_a2a.set_deepep_num_sms(
    num_sms
)
```

Sets the number of SMs to use for DeepEP.

```python
nemo_automodel.components.moe.megatron.fused_a2a.set_uccl_num_sms(
    num_sms
)
```

Sets the number of SMs to use for UCCL-EP.

```python
nemo_automodel.components.moe.megatron.fused_a2a.uccl_fused_combine(
    x,
    group,
    handle,
    async_finish = False,
    allocate_on_comm_stream = False
)
```

Perform fused combine using UCCL-EP.

```python
nemo_automodel.components.moe.megatron.fused_a2a.uccl_fused_dispatch(
    x,
    token_indices,
    token_probs,
    num_experts,
    group,
    async_finish = False,
    allocate_on_comm_stream = False
)
```

Perform fused dispatch using UCCL-EP.

```python
nemo_automodel.components.moe.megatron.fused_a2a.HAVE_DEEP_EP = True
```

```python
nemo_automodel.components.moe.megatron.fused_a2a.HAVE_HYBRIDEP = True
```

```python
nemo_automodel.components.moe.megatron.fused_a2a.HAVE_UCCL_EP = True
```
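The `True` values shown for the `HAVE_*` flags reflect the environment in which these docs were built; at runtime they depend on which backends are installed. Flags like these are typically set with an import guard, sketched below for the `deep_ep` package. `require_deep_ep` is a hypothetical helper added for illustration.

```python
try:
    import deep_ep  # noqa: F401  (only probing whether the package imports)
    HAVE_DEEP_EP = True
except ImportError:
    HAVE_DEEP_EP = False


def require_deep_ep() -> None:
    """Fail early with a clear message when the optional backend is missing."""
    if not HAVE_DEEP_EP:
        raise ImportError("DeepEP is not installed; fused dispatch/combine is unavailable.")
```

Checking the flag before calling `fused_dispatch` or `fused_combine` lets callers fall back to another backend instead of failing mid-step.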

```python
nemo_automodel.components.moe.megatron.fused_a2a._buffer = None
```

```python
nemo_automodel.components.moe.megatron.fused_a2a._hybrid_ep_buffer = None
```

```python
nemo_automodel.components.moe.megatron.fused_a2a._nvshmem_available = None
```

```python
nemo_automodel.components.moe.megatron.fused_a2a._uccl_buffer = None
```