nemo_automodel.components.moe.megatron.fused_a2a

Module Contents

Classes

| Name | Description |
| --- | --- |
| FusedCombine | Fused combine operation for MoE output combining computation and communication. |
| FusedDispatch | Fused dispatch operation for MoE routing combining computation and communication. |
| HybridEPCombine | Fused combine operation for permute + combine a2a + permute using the HybridEP backend. |
| HybridEPDispatch | Fused dispatch operation for permute + dispatch a2a + permute using the HybridEP backend. |
| UCCLFusedCombine | Fused combine using UCCL-EP instead of DeepEP. |
| UCCLFusedDispatch | Fused dispatch using UCCL-EP instead of DeepEP. |

Functions

| Name | Description |
| --- | --- |
| _is_nvshmem_available | Check if DeepEP was compiled with NVSHMEM support. |
| free_buffer | Destroy the global DeepEP Buffer and release its NVSHMEM/cpp runtime. |
| fused_combine | Perform fused combine operation if deep_ep is available. |
| fused_dispatch | Perform fused dispatch operation if deep_ep is available. |
| get_buffer | Get or create a buffer for all-to-all communication. |
| get_hidden_bytes | Calculate the number of hidden bytes for a tensor. |
| get_uccl_buffer | Get or create a UCCL-EP buffer for all-to-all communication. |
| hybrid_ep_combine | Perform fused combine for unpermute + combine a2a + unpermute using the HybridEP backend. |
| hybrid_ep_dispatch | Perform fused dispatch for permute + dispatch a2a + permute using the HybridEP backend. |
| init_hybrid_ep_buffer | Initialize the HybridEP buffer, including buffer allocation and metadata initialization. |
| reset_hybrid_ep_buffer | Reset the HybridEP buffer. |
| set_deepep_num_sms | Sets the number of SMs to use for DeepEP. |
| set_uccl_num_sms | Sets the number of SMs to use for UCCL-EP. |
| uccl_fused_combine | Perform fused combine using UCCL-EP. |
| uccl_fused_dispatch | Perform fused dispatch using UCCL-EP. |

Data

HAVE_DEEP_EP

HAVE_HYBRIDEP

HAVE_UCCL_EP

_buffer

_hybrid_ep_buffer

_nvshmem_available

_uccl_buffer

API

class nemo_automodel.components.moe.megatron.fused_a2a.FusedCombine()

Bases: Function

Fused combine operation for MoE output combining computation and communication.

nemo_automodel.components.moe.megatron.fused_a2a.FusedCombine.backward(
ctx,
grad_output,
previous_event = None
)
staticmethod

Backward pass of fused combine.

nemo_automodel.components.moe.megatron.fused_a2a.FusedCombine.forward(
ctx,
x,
group,
handle,
async_finish = False,
allocate_on_comm_stream = False
)
staticmethod

Forward pass of fused combine.

class nemo_automodel.components.moe.megatron.fused_a2a.FusedDispatch()

Bases: Function

Fused dispatch operation for MoE routing combining computation and communication.

nemo_automodel.components.moe.megatron.fused_a2a.FusedDispatch.backward(
ctx,
grad_output,
grad_token_indices,
grad_token_probs,
grad_tokens_per_expert,
grad_handle
)
staticmethod

Backward pass of fused dispatch.

nemo_automodel.components.moe.megatron.fused_a2a.FusedDispatch.forward(
ctx,
x,
token_indices,
token_probs,
num_experts,
group,
async_finish = False,
allocate_on_comm_stream = False
)
staticmethod

Forward pass of fused dispatch.

class nemo_automodel.components.moe.megatron.fused_a2a.HybridEPCombine()

Bases: Function

Fused combine operation for permute + combine a2a + permute using the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.HybridEPCombine.backward(
ctx,
grad_x
)
staticmethod

Backward pass of fused combine of the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.HybridEPCombine.forward(
ctx,
x,
handle,
num_permuted_tokens = None,
pad_multiple = None
)
staticmethod

Forward pass of fused combine of the HybridEP backend.

class nemo_automodel.components.moe.megatron.fused_a2a.HybridEPDispatch()

Bases: Function

Fused dispatch operation for permute + dispatch a2a + permute using the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.HybridEPDispatch.backward(
ctx,
grad_x,
grad_probs,
grad_scaling_factor,
grad_tokens_per_expert,
grad_handle
)
staticmethod

Backward pass of fused dispatch of the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.HybridEPDispatch.forward(
ctx,
x,
routing_map,
probs,
group,
num_local_experts,
num_sms_dispatch_api = 24,
num_sms_combine_api = 24,
num_permuted_tokens = None,
pad_multiple = None
)
staticmethod

Forward pass of fused dispatch of the HybridEP backend.

class nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedCombine()

Bases: Function

Fused combine using UCCL-EP instead of DeepEP.

nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedCombine.backward(
ctx,
grad_output,
_grad_event = None
)
staticmethod

nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedCombine.forward(
ctx,
x,
group,
handle,
async_finish = False,
allocate_on_comm_stream = False
)
staticmethod

class nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedDispatch()

Bases: Function

Fused dispatch using UCCL-EP instead of DeepEP.

nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedDispatch.backward(
ctx,
grad_output,
grad_token_indices,
grad_token_probs,
grad_tokens_per_expert,
grad_handle
)
staticmethod

nemo_automodel.components.moe.megatron.fused_a2a.UCCLFusedDispatch.forward(
ctx,
x,
token_indices,
token_probs,
num_experts,
group,
async_finish = False,
allocate_on_comm_stream = False
)
staticmethod

nemo_automodel.components.moe.megatron.fused_a2a._is_nvshmem_available() -> bool

Check if DeepEP was compiled with NVSHMEM support.

Uses is_sm90_compiled() as a proxy: DeepEP’s build enforces that NVSHMEM is disabled when SM90 features are disabled.
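The module also lists a `_nvshmem_available` global that starts as None, which suggests the result of this check is cached after the first call. The sketch below shows that probe-once pattern in isolation. The names `is_nvshmem_available_sketch`, `_nvshmem_available_sketch`, and `fake_probe` are hypothetical, and the caching behavior is an assumption inferred from the global, not something this page confirms.

```python
# Module-level cache; None means "not probed yet" (assumed to mirror _nvshmem_available).
_nvshmem_available_sketch = None


def is_nvshmem_available_sketch(probe):
    """Run `probe` at most once and reuse its boolean result afterwards."""
    global _nvshmem_available_sketch
    if _nvshmem_available_sketch is None:
        _nvshmem_available_sketch = bool(probe())
    return _nvshmem_available_sketch


# Stand-in for the real capability check; it records how often it is invoked.
probe_calls = []


def fake_probe():
    probe_calls.append(1)
    return True


first = is_nvshmem_available_sketch(fake_probe)
second = is_nvshmem_available_sketch(fake_probe)  # served from the cache
```

In the real module the probe would be the is_sm90_compiled() check described above rather than `fake_probe`.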

nemo_automodel.components.moe.megatron.fused_a2a.free_buffer() -> None

Destroy the global DeepEP Buffer and release its NVSHMEM/cpp runtime.

DeepEP keeps a process-global communication buffer backed by NVSHMEM symmetric memory. This buffer is normally never torn down: destroy_process_group hangs on DeepEP’s NCCL sub-groups, so cleanup is skipped. The leftover GPU state then survives process exit for the whole Slurm allocation and corrupts subsequent forward passes. Destroying the buffer first frees the runtime and lets a clean destroy_process_group follow without hanging.
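The ordering described here (free the buffer, then destroy the process group) can be sketched as a small shutdown helper. `shutdown_moe_comm` is a hypothetical name, and the two callables are injected so the sketch runs without a GPU or a distributed job; it only demonstrates the call order.

```python
def shutdown_moe_comm(free_buffer, destroy_process_group):
    """Release the communication buffer first, then tear down process groups."""
    free_buffer()
    destroy_process_group()


# Record the call order with stand-in callables instead of the real functions.
calls = []
shutdown_moe_comm(
    lambda: calls.append("free_buffer"),
    lambda: calls.append("destroy_process_group"),
)
```

In a real job, one would presumably pass this module's free_buffer and torch.distributed.destroy_process_group as the two arguments.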

nemo_automodel.components.moe.megatron.fused_a2a.fused_combine(
x,
group,
handle,
async_finish = False,
allocate_on_comm_stream = False
)

Perform fused combine operation if deep_ep is available.

Parameters:

x

Input tensor

group

Process group

handle

Communication handle

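async_finish

If True, the current stream does not wait for the communication kernels to finish. Defaults to False.

allocate_on_comm_stream

If True, tensors allocated by the call are owned by the communication stream. Defaults to False.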

Returns:

Result of FusedCombine

nemo_automodel.components.moe.megatron.fused_a2a.fused_dispatch(
x,
token_indices,
token_probs,
num_experts,
group,
async_finish = False,
allocate_on_comm_stream = False
)

Perform fused dispatch operation if deep_ep is available.

Parameters:

x

Input tensor [num_tokens, hidden_size]

token_indices

Token routing indices [num_tokens, topk]

token_probs

Token routing probabilities [num_tokens, topk]

num_experts

Number of experts

group

Process group

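async_finish

If True, the current stream does not wait for the communication kernels to finish. Defaults to False.

allocate_on_comm_stream

If True, tensors allocated by the call are owned by the communication stream. Defaults to False.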

Returns:

Result of FusedDispatch
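To make the documented shapes concrete, here is a single-process, pure-Python reference of what a dispatch/combine pair does logically: each token row is copied to the experts named in token_indices, and the expert outputs are later summed back per token. This is only a sketch. `dispatch_reference` and `combine_reference` are hypothetical names, no communication takes place, and applying token_probs inside the combine step is an assumption of this example; the real kernels may expect the weighting to be applied elsewhere.

```python
def dispatch_reference(x, token_indices, num_experts):
    """Group token rows by destination expert.

    x: [num_tokens][hidden_size], token_indices: [num_tokens][topk].
    Returns per-expert rows, their (token, slot) origins, and per-expert counts.
    """
    rows = [[] for _ in range(num_experts)]
    origins = [[] for _ in range(num_experts)]
    for token_id, experts in enumerate(token_indices):
        for slot, expert_id in enumerate(experts):
            rows[expert_id].append(list(x[token_id]))
            origins[expert_id].append((token_id, slot))
    tokens_per_expert = [len(r) for r in rows]
    return rows, origins, tokens_per_expert


def combine_reference(rows, origins, token_probs, num_tokens, hidden_size):
    """Sum each token's expert outputs, weighted by its routing probability."""
    out = [[0.0] * hidden_size for _ in range(num_tokens)]
    for expert_rows, expert_origins in zip(rows, origins):
        for row, (token_id, slot) in zip(expert_rows, expert_origins):
            weight = token_probs[token_id][slot]
            for j, value in enumerate(row):
                out[token_id][j] += weight * value
    return out


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]         # [num_tokens=3, hidden_size=2]
token_indices = [[0, 1], [1, 2], [0, 2]]          # [num_tokens, topk=2]
token_probs = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]

rows, origins, tokens_per_expert = dispatch_reference(x, token_indices, num_experts=4)
# Experts are the identity here, so combining should reproduce the input.
roundtrip = combine_reference(rows, origins, token_probs, num_tokens=3, hidden_size=2)
```

With identity experts and per-token probabilities that sum to 1, the round trip returns the original rows, which is a convenient sanity check when comparing a fused backend against a reference path.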

nemo_automodel.components.moe.megatron.fused_a2a.get_buffer(
group: torch.distributed.ProcessGroup,
hidden_bytes: int
)

Get or create a buffer for all-to-all communication.

Parameters:

group
torch.distributed.ProcessGroup

Process group for communication

hidden_bytes
int

Number of hidden bytes needed

Returns:

Communication buffer

nemo_automodel.components.moe.megatron.fused_a2a.get_hidden_bytes(
x: torch.Tensor
) -> int

Calculate the number of hidden bytes for a tensor.

Parameters:

x
torch.Tensor

Input tensor

Returns: int

Number of hidden bytes
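As a rough illustration of what "hidden bytes" could mean, the sketch below multiplies the hidden dimension by the per-element size. The name `hidden_bytes_sketch`, its arguments, and the two-byte floor per element are assumptions made for illustration; the actual helper takes a torch.Tensor and may compute the value differently.

```python
def hidden_bytes_sketch(shape, element_size, min_element_size=2):
    """Bytes needed for one token's hidden vector (illustrative only).

    shape: (num_tokens, hidden_size); element_size: bytes per element.
    The floor of two bytes per element is an assumption of this sketch.
    """
    if len(shape) != 2:
        raise ValueError("expected a [num_tokens, hidden_size] shape")
    _, hidden_size = shape
    return hidden_size * max(element_size, min_element_size)


bf16_bytes = hidden_bytes_sketch((4096, 7168), element_size=2)
fp8_bytes = hidden_bytes_sketch((4096, 7168), element_size=1)  # floor applies
```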

nemo_automodel.components.moe.megatron.fused_a2a.get_uccl_buffer(
group: torch.distributed.ProcessGroup,
hidden_bytes: int
)

Get or create a UCCL-EP buffer for all-to-all communication.

nemo_automodel.components.moe.megatron.fused_a2a.hybrid_ep_combine(
x,
handle,
num_permuted_tokens = None,
pad_multiple = None
)

Perform fused combine for unpermute + combine a2a + unpermute using the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.hybrid_ep_dispatch(
x,
routing_map,
probs,
group,
num_local_experts,
num_sms_dispatch_api = 24,
num_sms_combine_api = 24,
num_permuted_tokens = None,
pad_multiple = None
)

Perform fused dispatch for permute + dispatch a2a + permute using the HybridEP backend.

nemo_automodel.components.moe.megatron.fused_a2a.init_hybrid_ep_buffer(
group: torch.distributed.ProcessGroup,
hidden_dim: int,
seq_len: int,
num_local_experts: int,
num_sms_dispatch_api: int,
num_sms_combine_api: int,
fp8_dispatch: bool
) -> None

Initialize the HybridEP buffer, including buffer allocation and metadata initialization.

If a dispatch or combine call at runtime requires a larger buffer than the one initialized, the buffer is reallocated at runtime, incurring extra runtime overhead.

Parameters:

group
torch.distributed.ProcessGroup

Process group for HybridEP all-to-all communication.

hidden_dim
int

Hidden dimension of the input tensor.

seq_len
int

Maximum sequence length of the input tensor.

num_local_experts
int

Number of local experts.

num_sms_dispatch_api
int

Number of SMs used by the dispatch API.

num_sms_combine_api
int

Number of SMs used by the combine API.

fp8_dispatch
bool

Whether to use FP8 communication during the dispatch phase.
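The reallocation note above is the reason to initialize with the largest seq_len you expect. The sketch below models a grow-on-demand buffer and counts how often it would be reallocated for a sequence of batch sizes. `GrowOnlyBufferSketch` and `count_reallocations` are hypothetical; the real buffer's sizing rules are not described on this page, so capacity is modeled as a plain token count.

```python
class GrowOnlyBufferSketch:
    """Stand-in for a communication buffer that reallocates when too small."""

    def __init__(self, max_tokens):
        self.capacity = max_tokens
        self.reallocations = 0

    def ensure(self, num_tokens):
        """Grow the buffer if num_tokens exceeds the current capacity."""
        if num_tokens > self.capacity:
            self.capacity = num_tokens
            self.reallocations += 1
        return self.capacity


def count_reallocations(initial_seq_len, batch_lengths):
    """Return how many times the buffer had to grow over the given batches."""
    buf = GrowOnlyBufferSketch(initial_seq_len)
    for n in batch_lengths:
        buf.ensure(n)
    return buf.reallocations


lengths = [1024, 4096, 2048, 8192, 8192]
undersized = count_reallocations(512, lengths)     # grows at 1024, 4096, 8192
right_sized = count_reallocations(8192, lengths)   # never grows
```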

nemo_automodel.components.moe.megatron.fused_a2a.reset_hybrid_ep_buffer()

Reset the HybridEP buffer.

nemo_automodel.components.moe.megatron.fused_a2a.set_deepep_num_sms(
num_sms
)

Sets the number of SMs to use for DeepEP.

nemo_automodel.components.moe.megatron.fused_a2a.set_uccl_num_sms(
num_sms
)

Sets the number of SMs to use for UCCL-EP.

nemo_automodel.components.moe.megatron.fused_a2a.uccl_fused_combine(
x,
group,
handle,
async_finish = False,
allocate_on_comm_stream = False
)

Perform fused combine using UCCL-EP.

nemo_automodel.components.moe.megatron.fused_a2a.uccl_fused_dispatch(
x,
token_indices,
token_probs,
num_experts,
group,
async_finish = False,
allocate_on_comm_stream = False
)

Perform fused dispatch using UCCL-EP.

nemo_automodel.components.moe.megatron.fused_a2a.HAVE_DEEP_EP = True
nemo_automodel.components.moe.megatron.fused_a2a.HAVE_HYBRIDEP = True
nemo_automodel.components.moe.megatron.fused_a2a.HAVE_UCCL_EP = True
nemo_automodel.components.moe.megatron.fused_a2a._buffer = None
nemo_automodel.components.moe.megatron.fused_a2a._hybrid_ep_buffer = None
nemo_automodel.components.moe.megatron.fused_a2a._nvshmem_available = None
nemo_automodel.components.moe.megatron.fused_a2a._uccl_buffer = None
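
The HAVE_* flags report whether each optional backend could be imported, so their values depend on the environment; the True values above presumably reflect the environment in which this page was generated. A minimal sketch of probing for an optional package and choosing a backend from such flags follows. `HAVE_DEEP_EP_SKETCH` and `select_backend` are hypothetical, and the preference order is an arbitrary assumption, not the library's policy.

```python
import importlib.util

# Probe for the DeepEP Python package without importing it.
HAVE_DEEP_EP_SKETCH = importlib.util.find_spec("deep_ep") is not None


def select_backend(flags, preference=("HAVE_DEEP_EP", "HAVE_UCCL_EP", "HAVE_HYBRIDEP")):
    """Return the first flag name in `preference` that is truthy, else None."""
    for name in preference:
        if flags.get(name, False):
            return name
    return None


choice = select_backend(
    {"HAVE_DEEP_EP": False, "HAVE_UCCL_EP": True, "HAVE_HYBRIDEP": True}
)
none_available = select_backend({})
```

Checking the relevant flag before calling fused_dispatch, uccl_fused_dispatch, or hybrid_ep_dispatch avoids failing deep inside a forward pass when a backend is missing.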