nemo_automodel.components.moe.megatron.fused_a2a
Module Contents
Classes
Functions
Data
API
Bases: Function
Fused combine operation for MoE output combining computation and communication.
Backward pass of fused combine.
Forward pass of fused combine.
Bases: Function
Fused dispatch operation for MoE routing combining computation and communication.
Backward pass of fused dispatch.
Forward pass of fused dispatch.
Bases: Function
Fused combine operation for permute + combine a2a + permute using the HybridEP backend.
Backward pass of fused combine of the HybridEP backend.
Forward pass of fused combine of the HybridEP backend.
Bases: Function
Fused dispatch operation for permute + dispatch a2a + permute using the HybridEP backend.
Backward pass of fused dispatch of the HybridEP backend.
Forward pass of fused dispatch of the HybridEP backend.
Bases: Function
Fused combine using UCCL-EP instead of DeepEP.
Bases: Function
Fused dispatch using UCCL-EP instead of DeepEP.
Check if DeepEP was compiled with NVSHMEM support.
Uses is_sm90_compiled() as a proxy, because DeepEP’s build enforces that NVSHMEM is disabled whenever SM90 features are disabled.
Destroy the global DeepEP Buffer and release its NVSHMEM/cpp runtime.
DeepEP keeps a process-global communication buffer backed by NVSHMEM symmetric memory.
It is normally never torn down, because destroy_process_group hangs on DeepEP’s NCCL
sub-groups and cleanup is therefore skipped. The leftover GPU state then survives process
exit for the whole Slurm allocation and corrupts subsequent forward passes. Destroying the
buffer first frees the runtime and lets a clean destroy_process_group follow without hanging.
Perform fused combine operation if deep_ep is available.
Parameters:
Input tensor
Process group
Communication handle
Previous CUDA event
Returns:
Result of FusedCombine
Perform fused dispatch operation if deep_ep is available.
Parameters:
Input tensor [num_tokens, hidden_size]
Token routing indices [num_tokens, topk]
Token routing probabilities [num_tokens, topk]
Number of experts
Process group
Previous CUDA event
Returns:
Result of FusedDispatch
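To make the routing inputs above concrete, the following pure-Python sketch counts how many tokens each rank would receive from a [num_tokens, topk] index table. It is an illustration only, not the library's implementation: the function name `tokens_per_rank` is hypothetical, and it assumes that experts are partitioned contiguously and evenly across ranks, that a negative index means "no expert selected", and that a token is sent to a given rank at most once.

```python
from typing import List, Sequence


def tokens_per_rank(
    token_indices: Sequence[Sequence[int]],
    num_experts: int,
    num_ranks: int,
) -> List[int]:
    """Count tokens destined for each rank (hypothetical helper).

    Assumptions: experts are split contiguously and evenly across ranks,
    and negative indices mark unused top-k slots.
    """
    if num_experts % num_ranks != 0:
        raise ValueError("num_experts must be divisible by num_ranks")
    experts_per_rank = num_experts // num_ranks

    counts = [0] * num_ranks
    for row in token_indices:
        # A token is counted once per destination rank, even if several
        # of its top-k experts live on that rank.
        destinations = {e // experts_per_rank for e in row if e >= 0}
        for rank in destinations:
            counts[rank] += 1
    return counts


# Three tokens, topk = 2, eight experts spread over four ranks.
example_counts = tokens_per_rank([[0, 1], [2, 5], [7, 3]], num_experts=8, num_ranks=4)
```

In this example the first token selects two experts on rank 0 and is therefore counted once there, which is why the per-rank totals do not simply sum to num_tokens * topk.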
Get or create a buffer for all-to-all communication.
Parameters:
Process group for communication
Number of hidden bytes needed
Returns:
Communication buffer
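The "get or create" behaviour described above can be sketched as a process-global cache that is reused while it is large enough and replaced otherwise. This is a minimal sketch under stated assumptions, not the module's code: `FakeBuffer`, `get_or_create_buffer`, and the reuse conditions (same group, sufficient size) are hypothetical stand-ins, and the group is modelled as a plain string rather than a real process group.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeBuffer:
    """Hypothetical stand-in for a communication buffer."""
    group: str
    num_bytes: int


# Process-global cache holding at most one buffer.
_buffer: Optional[FakeBuffer] = None


def get_or_create_buffer(group: str, required_bytes: int) -> FakeBuffer:
    """Return the cached buffer, allocating a new one only when needed."""
    global _buffer
    needs_new = (
        _buffer is None
        or _buffer.group != group                # different process group
        or _buffer.num_bytes < required_bytes    # cached buffer too small
    )
    if needs_new:
        _buffer = FakeBuffer(group=group, num_bytes=required_bytes)
    return _buffer


first = get_or_create_buffer("ep_group", 1024)
reused = get_or_create_buffer("ep_group", 512)    # fits in the existing buffer
grown = get_or_create_buffer("ep_group", 4096)    # forces a reallocation
```

Caching in this way avoids paying the allocation cost on every call; the trade-off is that the buffer lives for the lifetime of the process unless it is destroyed explicitly.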
Calculate the number of hidden bytes for a tensor.
Parameters:
Input tensor
Returns: int
Number of hidden bytes
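As a rough illustration of the calculation, the sketch below computes the per-token byte count from a hidden size and an element size. It deliberately avoids tensors so that it runs without a GPU framework. The name `hidden_bytes` and the rule of rounding element sizes below 2 bytes up to 2 (so that a buffer sized for an FP8 dispatch would also fit a 16-bit combine) are assumptions modelled on common DeepEP usage, not confirmed details of this module.

```python
def hidden_bytes(hidden_size: int, element_size: int) -> int:
    """Bytes needed for one token's hidden vector (hypothetical helper).

    Assumption: element sizes smaller than 2 bytes are rounded up to 2.
    """
    if hidden_size <= 0 or element_size <= 0:
        raise ValueError("hidden_size and element_size must be positive")
    return hidden_size * max(element_size, 2)


bf16_bytes = hidden_bytes(4096, 2)   # 16-bit elements
fp8_bytes = hidden_bytes(4096, 1)    # 8-bit elements, rounded up to 2 bytes
fp32_bytes = hidden_bytes(1024, 4)   # 32-bit elements
```

For a real tensor of shape [num_tokens, hidden_size], the two arguments would correspond to the size of the second dimension and the per-element storage size.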
Get or create a UCCL-EP buffer for all-to-all communication.
Perform fused combine for unpermute + combine a2a + unpermute using the HybridEP backend.
Perform fused dispatch for permute + dispatch a2a + permute using the HybridEP backend.
Initialize the HybridEP buffer, including buffer allocation and metadata initialization.
If a dispatch or combine call requires a larger buffer than the one initialized, the buffer is reallocated at runtime, incurring extra runtime overhead.
Parameters:
Process group for HybridEP all-to-all communication.
Hidden dimension of the input tensor.
Maximum sequence length of the input tensor.
Number of local experts.
Number of SMs used by the dispatch API.
Number of SMs used by the combine API.
Whether to use FP8 communication during the dispatch phase.
Reset the HybridEP buffer.
Sets the number of SMs to use for DeepEP.
Sets the number of SMs to use for UCCL-EP.
Perform fused combine using UCCL-EP.
Perform fused dispatch using UCCL-EP.