MoE Expert-Parallel Overlap Skill#
References#
- Stable docs: `docs/training/communication-overlap.md`
- Structured metadata: `skills/perf-techniques/expert-parallel-overlap/card.yaml`
What It Is#
Expert-parallel (EP) overlap hides the cost of token dispatch/combine all-to-all
communication by running it concurrently with expert FFN compute. Optionally,
delayed expert weight-gradient computation (`delay_wgrad_compute`) defers wgrad
so it runs during the next layer's forward pass, yielding additional overlap.
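The overlap idea can be illustrated with a toy timing sketch. Plain Python threads stand in for CUDA streams here; the function names and durations are illustrative assumptions, not Bridge APIs:

```python
import threading
import time

def dispatch_tokens():
    # stand-in for the all-to-all token dispatch (duration is illustrative)
    time.sleep(0.05)

def expert_ffn_compute():
    # stand-in for the expert FFN compute that can run concurrently
    time.sleep(0.05)

def run_serial():
    start = time.perf_counter()
    dispatch_tokens()
    expert_ffn_compute()
    return time.perf_counter() - start

def run_overlapped():
    start = time.perf_counter()
    comm = threading.Thread(target=dispatch_tokens)
    comm.start()           # communication proceeds in the background...
    expert_ffn_compute()   # ...while compute runs on the "other stream"
    comm.join()
    return time.perf_counter() - start

serial_s = run_serial()        # roughly the sum of comm + compute
overlap_s = run_overlapped()   # roughly max(comm, compute): comm is hidden
```

In real training the same effect comes from issuing the all-to-all on a separate stream so the expert GEMMs keep the SMs busy while communication is in flight.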
Bridge supports two dispatcher paths:
| Dispatcher | Backend | When to use |
|---|---|---|
| `alltoall` | Standard MoE all-to-all | Default, broadest compatibility |
| `flex` | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |
Quick Decision#
Use EP overlap when:
- the model is MoE with EP > 1
- expert dispatch/combine communication is a meaningful part of step time
- you have memory headroom and are tuning for throughput

Prefer:
- `alltoall` dispatcher for the first rollout (broader compatibility)
- `flex` + DeepEP/HybridEP when running on supported GPUs and seeking additional gains

Avoid EP overlap when:
- full activation recompute is enabled
- `moe_shared_expert_overlap` is enabled
- the run is still being brought up for correctness
- PyTorch < 2.6.0
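The dispatcher preference above can be condensed into a small helper. This is a hypothetical sketch; `choose_dispatcher` is not a Bridge API:

```python
def choose_dispatcher(first_rollout: bool, gpu_arch: str) -> str:
    """Encode the 'Prefer' guidance: alltoall first, flex on supported GPUs."""
    supported = {"ampere", "hopper", "blackwell"}
    if first_rollout or gpu_arch.lower() not in supported:
        return "alltoall"  # broadest compatibility
    return "flex"          # DeepEP/HybridEP for additional gains
```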
Enablement#
alltoall dispatcher#
```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False
```
flex dispatcher (DeepEP or HybridEP)#
```python
from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")
```
Compatibility And Constraints#
- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- `moe_shared_expert_overlap = False`
- Base precision is BF16 or FP16
- PyTorch >= 2.6.0
- If PP > 1, `virtual_pipeline_model_parallel_size` must be set
- `recompute_granularity != "full"`, `recompute_method = None`, `recompute_num_layers = None`
- `mtp_num_layers` must be `None` or `1`
- `delay_wgrad_compute` requires `overlap_moe_expert_parallel_comm` as a prerequisite
- `delay_wgrad_compute` with `overlap_grad_reduce` requires TE >= 2.7.0
- `delay_wgrad_compute` with `gradient_accumulation_fusion` requires TE >= 2.7.0
- CUDA graph `attn` scope + `delay_wgrad_compute` requires TE >= 2.12.0, `gradient_accumulation_fusion = True`, and no attention bias
- DeepEP: Ampere, Hopper, B200, B300 GPUs only
- HybridEP: Ampere, Hopper, B200, B300, GB200/GB300 with NVL72
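A pre-flight check mirroring a subset of these constraints can be sketched as follows. This is a hypothetical helper operating on a plain dict; the real validation happens during `CommOverlapConfig` finalization:

```python
def check_ep_overlap_config(cfg: dict) -> list:
    """Return a list of constraint violations for EP overlap (sketch)."""
    errors = []
    if cfg.get("expert_model_parallel_size", 1) <= 1:
        errors.append("expert_model_parallel_size must be > 1")
    if cfg.get("num_moe_experts", 1) <= 1:
        errors.append("num_moe_experts must be > 1")
    if cfg.get("moe_token_dispatcher_type") not in ("alltoall", "flex"):
        errors.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if cfg.get("moe_shared_expert_overlap", False):
        errors.append("moe_shared_expert_overlap must be False")
    if not (cfg.get("bf16") or cfg.get("fp16")):
        errors.append("base precision must be BF16 or FP16")
    if (cfg.get("pipeline_model_parallel_size", 1) > 1
            and cfg.get("virtual_pipeline_model_parallel_size") is None):
        errors.append("VPP must be set when PP > 1")
    if cfg.get("recompute_granularity") == "full":
        errors.append("full activation recompute is incompatible")
    if (cfg.get("delay_wgrad_compute")
            and not cfg.get("overlap_moe_expert_parallel_comm")):
        errors.append("delay_wgrad_compute requires overlap_moe_expert_parallel_comm")
    return errors

problems = check_ep_overlap_config({
    "expert_model_parallel_size": 8,
    "num_moe_experts": 64,
    "moe_token_dispatcher_type": "alltoall",
    "bf16": True,
})
# problems is empty for this config
```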
Minimal Working Config#
```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True
```
Minimal Runnable Command#
Performance harness example:
```shell
python scripts/performance/setup_experiment.py \
  --model qwen3-30b-a3b \
  --moe_a2a_overlap \
  --num_nodes 2 \
  --gpus_per_node 8 \
  --max_steps 20
```
Unit test verification:
```shell
uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py -k "moe" \
  tests/unit_tests/training/test_deepep.py -q
```
Verification#
Unit tests#
```shell
uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py -q
```
Log checks#
After a successful run with EP overlap:
- Confirm no assertion errors during `CommOverlapConfig` finalization
- Confirm `overlap_moe_expert_parallel_comm` appears as `True` in the logged config
- If using the flex dispatcher, confirm `moe_token_dispatcher_type = "flex"` and the correct backend in logs
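These checks can be scripted against the run log. A minimal sketch, assuming the resolved config is printed as `key: value` lines (the log format is an assumption, as is the helper name):

```python
import re
from typing import Optional

def logged_value(log_text: str, key: str) -> Optional[str]:
    """Return the value logged for `key`, or None if absent."""
    m = re.search(rf"^\s*{re.escape(key)}\s*[:=]\s*(\S+)",
                  log_text, re.MULTILINE)
    return m.group(1) if m else None

sample_log = """
overlap_moe_expert_parallel_comm: True
moe_token_dispatcher_type: flex
"""
flag = logged_value(sample_log, "overlap_moe_expert_parallel_comm")  # "True"
```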
Success criteria#
- Config validation passes for the selected dispatcher and overlap settings
- Training runs complete without hangs or assertion failures
- Throughput improves or at least does not regress for the target workload
- Loss trajectory matches baseline (overlap should not affect convergence)
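The loss-trajectory criterion can be checked mechanically. A sketch with an illustrative tolerance (the 2% relative threshold is an assumption, not a Bridge default):

```python
def loss_matches_baseline(baseline, overlapped, rtol=0.02):
    """Elementwise relative comparison of two loss curves (sketch)."""
    return len(baseline) == len(overlapped) and all(
        abs(a - b) <= rtol * max(abs(a), 1e-8)
        for a, b in zip(baseline, overlapped)
    )
```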
Code Anchors#
Bridge overlap validation#
```python
if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...
```
Delayed wgrad validation#
```python
if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...
```
Flex-dispatcher activation#
```python
def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False
```
Perf harness override#
```python
def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False
```
Tests#
| File | Coverage |
|---|---|
| `tests/unit_tests/training/test_comm_overlap.py` | EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction |
| `tests/unit_tests/training/test_deepep.py` | DeepEP/HybridEP helper activation and GPU gating |
Failure Diagnosis#
| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| assert on EP size | EP not configured | Check `expert_model_parallel_size` | Set EP > 1 |
| assert on dispatcher | Wrong dispatcher | Check `moe_token_dispatcher_type` | Use `"alltoall"` or `"flex"` |
| assert on BF16/FP16 | Wrong precision | Check `bf16`/`fp16` settings | Set `bf16 = True` (or `fp16 = True`) |
| hang during training | PyTorch < 2.6 | Check PyTorch version | Upgrade to >= 2.6.0 |
| assert on VPP | PP > 1 without VPP | Check PP and VPP config | Set VPP when PP > 1 |
| assert on recompute | Full recompute enabled | Check recompute settings | Disable full recompute |
| assert on delayed wgrad | Delayed wgrad without EP overlap | Check `overlap_moe_expert_parallel_comm` | Enable EP overlap first |
| assert on CUDA graph scope | CUDA graph + delayed wgrad | Check graph scope + wgrad settings | Set `gradient_accumulation_fusion = True` and use TE >= 2.12.0 |
| assert on attention bias | CUDA graph attn + delayed wgrad + bias | Check attention bias settings | Disable attention bias |
| no throughput gain from flex dispatcher | Backend set without activating flex dispatch | Check `moe_token_dispatcher_type` | Call `apply_flex_dispatcher_backend(...)` |
| DeepEP/HybridEP silently skipped | Unsupported GPU | Check warning logs | Run on Ampere/Hopper/Blackwell |
Known Limitations#
- Setting `moe_flex_dispatcher_backend` alone does not activate flex dispatch; you must call `apply_flex_dispatcher_backend(...)`.
- Public recipes are often conservative and leave MoE overlap disabled by default.
- End-to-end throughput gains have not yet been measured in a controlled Bridge experiment. Code validation is stronger than performance evidence.
- MoE overlap and shared-expert overlap are mutually exclusive.
- CUDA graph plus delayed wgrad is a multi-constraint path that requires careful TE version and scope validation.