nemo_automodel.components.moe.megatron.fused_a2a

Module Contents

Classes

Name	Description
`FusedCombine`	Fused combine operation for MoE output, combining computation and communication.
`FusedDispatch`	Fused dispatch operation for MoE routing, combining computation and communication.
`HybridEPCombine`	Fused combine operation for unpermute + combine a2a + unpermute using the HybridEP backend.
`HybridEPDispatch`	Fused dispatch operation for permute + dispatch a2a + permute using the HybridEP backend.
`UCCLFusedCombine`	Fused combine using UCCL-EP instead of DeepEP.
`UCCLFusedDispatch`	Fused dispatch using UCCL-EP instead of DeepEP.

Functions

Name	Description
`_is_nvshmem_available`	Check if DeepEP was compiled with NVSHMEM support.
`free_buffer`	Destroy the global DeepEP `Buffer` and release its NVSHMEM/cpp runtime.
`fused_combine`	Perform fused combine operation if deep_ep is available.
`fused_dispatch`	Perform fused dispatch operation if deep_ep is available.
`get_buffer`	Get or create a buffer for all-to-all communication.
`get_hidden_bytes`	Calculate the number of hidden bytes for a tensor.
`get_uccl_buffer`	Get or create a UCCL-EP buffer for all-to-all communication.
`hybrid_ep_combine`	Perform fused combine for unpermute + combine a2a + unpermute using the HybridEP backend.
`hybrid_ep_dispatch`	Perform fused dispatch for permute + dispatch a2a + permute using the HybridEP backend.
`init_hybrid_ep_buffer`	Initialize the HybridEP buffer, including buffer allocation and metadata initialization.
`reset_hybrid_ep_buffer`	Reset the HybridEP buffer.
`set_deepep_num_sms`	Sets the number of SMs to use for DeepEP.
`set_uccl_num_sms`	Sets the number of SMs to use for UCCL-EP.
`uccl_fused_combine`	Perform fused combine using UCCL-EP.
`uccl_fused_dispatch`	Perform fused dispatch using UCCL-EP.

Data

API

class nemo_automodel.components.moe.megatron.fused_a2a.FusedCombine()

Bases: Function

Fused combine operation for MoE output, combining computation and communication.

nemo_automodel.components.moe.megatron.fused_a2a.FusedCombine.backward(
    ctx,
    grad_output,
    previous_event = None
)

staticmethod

Backward pass of fused combine.

nemo_automodel.components.moe.megatron.fused_a2a.FusedCombine.forward(
    ctx,
    x,
    group,
    handle,
    async_finish = False,
    allocate_on_comm_stream = False
)

staticmethod

Forward pass of fused combine.

class nemo_automodel.components.moe.megatron.fused_a2a.FusedDispatch()

Bases: Function

Fused dispatch operation for MoE routing, combining computation and communication.

nemo_automodel.components.moe.megatron.fused_a2a.FusedDispatch.backward(
    ctx,
    grad_output,
    grad_token_indices,
    grad_token_probs,
    grad_tokens_per_expert,
    grad_handle
)

staticmethod

Backward pass of fused dispatch.

nemo_automodel.components.moe.megatron.fused_a2a.FusedDispatch.forward(
    ctx,
    x,
    token_indices,
    token_probs,
    num_experts,
    group,
    async_finish = False,
    allocate_on_comm_stream = False
)

staticmethod

Forward pass of fused dispatch.

class nemo_automodel.components.moe.megatron.fused_a2a.HybridEPCombine()

Bases: Function

Fused combine operation for unpermute + combine a2a + unpermute using the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.HybridEPCombine.backward(
    ctx,
    grad_x
)

staticmethod

Backward pass of fused combine of the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.HybridEPCombine.forward(
    ctx,
    x,
    handle,
    num_permuted_tokens = None,
    pad_multiple = None
)

staticmethod

Forward pass of fused combine of the HybridEP backend.

class nemo_automodel.components.moe.megatron.fused_a2a.HybridEPDispatch()

Bases: Function

Fused dispatch operation for permute + dispatch a2a + permute using the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.HybridEPDispatch.backward(
    ctx,
    grad_x,
    grad_probs,
    grad_scaling_factor,
    grad_tokens_per_expert,
    grad_handle
)

staticmethod

Backward pass of fused dispatch of the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.HybridEPDispatch.forward(
    ctx,
    x,
    routing_map,
    probs,
    group,
    num_local_experts,
    num_sms_dispatch_api = 24,
    num_sms_combine_api = 24,
    num_permuted_tokens = None,
    pad_multiple = None
)

staticmethod

Forward pass of fused dispatch of the HybridEP backend.

class nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedCombine()

Bases: Function

Fused combine using UCCL-EP instead of DeepEP.

nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedCombine.backward(
    ctx,
    grad_output,
    _grad_event = None
)

staticmethod

nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedCombine.forward(
    ctx,
    x,
    group,
    handle,
    async_finish = False,
    allocate_on_comm_stream = False
)

staticmethod

class nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedDispatch()

Bases: Function

Fused dispatch using UCCL-EP instead of DeepEP.

nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedDispatch.backward(
    ctx,
    grad_output,
    grad_token_indices,
    grad_token_probs,
    grad_tokens_per_expert,
    grad_handle
)

staticmethod

nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedDispatch.forward(
    ctx,
    x,
    token_indices,
    token_probs,
    num_experts,
    group,
    async_finish = False,
    allocate_on_comm_stream = False
)

staticmethod

nemo_automodel.components.moe.megatron.fused_a2a._is_nvshmem_available() -> bool

Check if DeepEP was compiled with NVSHMEM support.

Uses is_sm90_compiled() as a proxy: DeepEP’s build enforces that NVSHMEM is disabled when SM90 features are disabled.

nemo_automodel.components.moe.megatron.fused_a2a.free_buffer() -> None

Destroy the global DeepEP Buffer and release its NVSHMEM/cpp runtime.

DeepEP keeps a process-global communication buffer backed by NVSHMEM symmetric memory. It is normally never torn down: destroy_process_group hangs on DeepEP’s NCCL sub-groups, so cleanup is skipped. The leftover GPU state then survives process exit for the whole Slurm allocation and corrupts subsequent forward passes. Destroying the buffer first frees the runtime and lets a clean destroy_process_group follow without hanging.
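The teardown order described above can be sketched as follows. This is a minimal illustration, not the library's shutdown code: `shutdown_expert_parallel` is a hypothetical helper, and the two steps are passed in as callables so the ordering can be exercised without GPUs. In a real job they would presumably be `free_buffer` from this module and `torch.distributed.destroy_process_group`.

```python
from typing import Callable


def shutdown_expert_parallel(
    free_buffer: Callable[[], None],
    destroy_process_group: Callable[[], None],
) -> None:
    """Hypothetical helper: release the DeepEP buffer, then the process group."""
    # Step 1: destroy the process-global DeepEP buffer first, so its
    # NVSHMEM/cpp runtime is released while the process group still exists.
    free_buffer()
    # Step 2: only afterwards tear down the process group.
    destroy_process_group()
```

Keeping the two calls in one helper makes it harder to destroy the process group while the buffer is still alive, which is the case the description above warns about.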

nemo_automodel.components.moe.megatron.fused_a2a.fused_combine(
    x,
    group,
    handle,
    async_finish = False,
    allocate_on_comm_stream = False
)

Perform fused combine operation if deep_ep is available.

Parameters:

x

Input tensor

group

Process group

handle

Communication handle

previous_event

Previous CUDA event

Returns:

Result of FusedCombine

nemo_automodel.components.moe.megatron.fused_a2a.fused_dispatch(
    x,
    token_indices,
    token_probs,
    num_experts,
    group,
    async_finish = False,
    allocate_on_comm_stream = False
)

Perform fused dispatch operation if deep_ep is available.

Parameters:

x

Input tensor [num_tokens, hidden_size]

token_indices

Token routing indices [num_tokens, topk]

token_probs

Token routing probabilities [num_tokens, topk]

num_experts

Number of experts

group

Process group

previous_event

Previous CUDA event

Returns:

Result of FusedDispatch
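A dispatch is normally paired with a later combine that reuses the communication handle. The sketch below shows that round trip under stated assumptions. `moe_round_trip` and `expert_fn` are hypothetical names, and the dispatch and combine steps are injected as callables so the flow can be run without DeepEP. The return arities (five values from dispatch, two from combine) are inferred from the `backward` signatures of `FusedDispatch` and `FusedCombine` above, not from documented return values.

```python
from typing import Any, Callable, Tuple


def moe_round_trip(
    x: Any,
    token_indices: Any,
    token_probs: Any,
    num_experts: int,
    group: Any,
    expert_fn: Callable[[Any, Any], Any],
    dispatch_fn: Callable[..., Tuple[Any, ...]],
    combine_fn: Callable[..., Tuple[Any, ...]],
) -> Any:
    """Hypothetical helper: dispatch tokens, run experts, combine results."""
    # ASSUMPTION: dispatch returns (tokens, indices, probs,
    # tokens_per_expert, handle), mirroring FusedDispatch.backward's
    # five gradient arguments.
    recv_x, _recv_indices, _recv_probs, tokens_per_expert, handle = dispatch_fn(
        x, token_indices, token_probs, num_experts, group
    )
    # Run the local experts on the tokens this rank received.
    expert_out = expert_fn(recv_x, tokens_per_expert)
    # ASSUMPTION: combine returns (combined tokens, event), mirroring
    # FusedCombine.backward's (grad_output, previous_event) arguments.
    combined, _event = combine_fn(expert_out, group, handle)
    return combined
```

In real use, `dispatch_fn` and `combine_fn` would be `fused_dispatch` and `fused_combine` from this module, and the `handle` returned by dispatch must be passed unchanged to combine.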

nemo_automodel.components.moe.megatron.fused_a2a.get_buffer(
    group: torch.distributed.ProcessGroup,
    hidden_bytes: int
)

Get or create a buffer for all-to-all communication.

Parameters:

group

torch.distributed.ProcessGroup

Process group for communication

hidden_bytes

int

Number of hidden bytes needed

Returns:

Communication buffer
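"Get or create" suggests a cached, process-wide buffer that is reused across calls. The sketch below illustrates that pattern only. It assumes the cached buffer is replaced when the process group changes or when a larger size is requested. `SketchBuffer` and `get_sketch_buffer` are stand-ins invented for this example; they are not the DeepEP `Buffer` class, and the real sizing logic is not shown.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchBuffer:
    """Stand-in for a communication buffer; NOT the DeepEP Buffer class."""

    group: Any
    num_bytes: int


# Module-level cache, analogous in spirit to the module's `_buffer` global.
_sketch_buffer: Optional[SketchBuffer] = None


def get_sketch_buffer(group: Any, hidden_bytes: int) -> SketchBuffer:
    """Return the cached buffer, recreating it only when it no longer fits."""
    global _sketch_buffer
    needs_new = (
        _sketch_buffer is None
        or _sketch_buffer.group is not group  # ASSUMPTION: one buffer per group
        or _sketch_buffer.num_bytes < hidden_bytes  # ASSUMPTION: grow on demand
    )
    if needs_new:
        _sketch_buffer = SketchBuffer(group=group, num_bytes=hidden_bytes)
    return _sketch_buffer
```

Under this policy a smaller request reuses the existing buffer, so allocation cost is paid only when the group or the required size changes.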

nemo_automodel.components.moe.megatron.fused_a2a.get_hidden_bytes(
    x: torch.Tensor
) -> int

Calculate the number of hidden bytes for a tensor.

Parameters:

x

torch.Tensor

Input tensor

Returns: int

Number of hidden bytes
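The description does not define "hidden bytes". One plausible reading is the size in bytes of one token's hidden vector, with a floor of 2 bytes per element so that a buffer sized from 1-byte (FP8) input can still carry 2-byte data. The sketch below shows that arithmetic in plain Python. Both the formula and the name `hidden_bytes_sketch` are assumptions made for illustration, not confirmed behavior of `get_hidden_bytes`.

```python
def hidden_bytes_sketch(hidden_size: int, element_size: int) -> int:
    """Bytes per token row, assuming a 2-byte minimum element size."""
    # ASSUMPTION: elements narrower than 2 bytes are counted as 2 bytes.
    return hidden_size * max(element_size, 2)
```

For a hidden size of 4096, this gives 8192 bytes for both 1-byte and 2-byte elements and 16384 bytes for 4-byte elements.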

nemo_automodel.components.moe.megatron.fused_a2a.get_uccl_buffer(
    group: torch.distributed.ProcessGroup,
    hidden_bytes: int
)

Get or create a UCCL-EP buffer for all-to-all communication.

nemo_automodel.components.moe.megatron.fused_a2a.hybrid_ep_combine(
    x,
    handle,
    num_permuted_tokens = None,
    pad_multiple = None
)

Perform fused combine for unpermute + combine a2a + unpermute using the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.hybrid_ep_dispatch(
    x,
    routing_map,
    probs,
    group,
    num_local_experts,
    num_sms_dispatch_api = 24,
    num_sms_combine_api = 24,
    num_permuted_tokens = None,
    pad_multiple = None
)

Perform fused dispatch for permute + dispatch a2a + permute using the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.init_hybrid_ep_buffer(
    group: torch.distributed.ProcessGroup,
    hidden_dim: int,
    seq_len: int,
    num_local_experts: int,
    num_sms_dispatch_api: int,
    num_sms_combine_api: int,
    fp8_dispatch: bool
) -> None

Initialize the HybridEP buffer, including buffer allocation and metadata initialization.

If a dispatch or combine call requires a larger buffer than the one initialized, the buffer is reallocated at runtime, incurring extra runtime overhead.

Parameters:

group

torch.distributed.ProcessGroup

Process group for HybridEP all-to-all communication.

hidden_dim

int

Hidden dimension of the input tensor.

seq_len

int

Maximum sequence length of the input tensor.

num_local_experts

int

Number of local experts.

num_sms_dispatch_api

int

Number of SMs used by the dispatch API.

num_sms_combine_api

int

Number of SMs used by the combine API.

fp8_dispatch

bool

Whether to use FP8 communication during the dispatch phase.

nemo_automodel.components.moe.megatron.fused_a2a.reset_hybrid_ep_buffer()

Reset the HybridEP buffer.

nemo_automodel.components.moe.megatron.fused_a2a.set_deepep_num_sms(
    num_sms
)

Sets the number of SMs to use for DeepEP.

nemo_automodel.components.moe.megatron.fused_a2a.set_uccl_num_sms(
    num_sms
)

Sets the number of SMs to use for UCCL-EP.

nemo_automodel.components.moe.megatron.fused_a2a.uccl_fused_combine(
    x,
    group,
    handle,
    async_finish = False,
    allocate_on_comm_stream = False
)

Perform fused combine using UCCL-EP.

nemo_automodel.components.moe.megatron.fused_a2a.uccl_fused_dispatch(
    x,
    token_indices,
    token_probs,
    num_experts,
    group,
    async_finish = False,
    allocate_on_comm_stream = False
)

Perform fused dispatch using UCCL-EP.

nemo_automodel.components.moe.megatron.fused_a2a.HAVE_DEEP_EP = True

nemo_automodel.components.moe.megatron.fused_a2a.HAVE_HYBRIDEP = True

nemo_automodel.components.moe.megatron.fused_a2a.HAVE_UCCL_EP = True

nemo_automodel.components.moe.megatron.fused_a2a._buffer = None

nemo_automodel.components.moe.megatron.fused_a2a._hybrid_ep_buffer = None

nemo_automodel.components.moe.megatron.fused_a2a._nvshmem_available = None

nemo_automodel.components.moe.megatron.fused_a2a._uccl_buffer = None