MoE Communication Overlap#
For the higher-level overview, see:
@docs/training/communication-overlap.md
@skills/nemo-mbridge-perf-moe-comm-overlap/card.yaml
Quick Decision#
Use MoE communication overlap when:
EP > 1
token dispatch or combine time is visible in the profile
the run is already correct and you are now tuning throughput
Avoid turning it on as an early bring-up step. It is easier to validate after the dispatcher, routing mode, and recompute plan are already stable.
Enablement#
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
# Optional: delayed wgrad for additional overlap
cfg.comm_overlap.delay_wgrad_compute = True
# IMPORTANT: disable shared expert overlap when using dispatch overlap
cfg.model.moe_shared_expert_overlap = False
Prerequisites#
expert_model_parallel_size > 1
num_moe_experts > 1
moe_token_dispatcher_type must be "alltoall" or "flex"
Precision: BF16 or FP16
If PP is used, VPP (virtual_pipeline_model_parallel_size) must be set (non-None)
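The prerequisites above can be sketched as a small pre-flight check. This is an illustrative helper, not the validation code in comm_overlap.py. The function name is hypothetical, and the field names bf16, fp16, and pipeline_model_parallel_size are assumptions that do not appear in this page.

```python
from types import SimpleNamespace


def unmet_moe_overlap_prereqs(model):
    """Return the listed prerequisites that `model` does not meet (hypothetical helper)."""
    problems = []
    if model.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if not model.num_moe_experts or model.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if model.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    # Assumed field names: bf16 / fp16 booleans stand in for the precision setting.
    if not (model.bf16 or model.fp16):
        problems.append("precision must be BF16 or FP16")
    # Assumed field name: pipeline_model_parallel_size > 1 means PP is in use.
    uses_pp = model.pipeline_model_parallel_size > 1
    if uses_pp and model.virtual_pipeline_model_parallel_size is None:
        problems.append("virtual_pipeline_model_parallel_size must be set when PP is used")
    return problems


# Illustrative values only; a real run would read these from its own config.
model = SimpleNamespace(
    expert_model_parallel_size=16,
    num_moe_experts=8,
    moe_token_dispatcher_type="alltoall",
    bf16=True,
    fp16=False,
    pipeline_model_parallel_size=1,
    virtual_pipeline_model_parallel_size=None,
)
print(unmet_moe_overlap_prereqs(model))  # prints []
```

Returning a list rather than raising on the first failure makes it easier to see every unmet prerequisite in one pass before launching a job.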
Flex dispatcher activation#
Setting moe_flex_dispatcher_backend alone does not activate flex dispatch.
You must also set moe_token_dispatcher_type = "flex".
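A minimal sketch of the rule above, using a plain namespace in place of the real model config. The helper name flex_dispatch_active is hypothetical; it only encodes the statement that the dispatcher type, not the backend flag, decides whether flex dispatch is in use.

```python
from types import SimpleNamespace


def flex_dispatch_active(model):
    """Flex dispatch is active only when the dispatcher type itself is "flex"."""
    return model.moe_token_dispatcher_type == "flex"


# Backend set, dispatcher type left at "alltoall": flex dispatch is NOT active.
backend_only = SimpleNamespace(
    moe_token_dispatcher_type="alltoall",
    moe_flex_dispatcher_backend="deepep",
)

# Both settings present: flex dispatch is active.
both_set = SimpleNamespace(
    moe_token_dispatcher_type="flex",
    moe_flex_dispatcher_backend="deepep",
)

print(flex_dispatch_active(backend_only))  # prints False
print(flex_dispatch_active(both_set))      # prints True
```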
Recompute And CUDA Graph Interaction#
Full recompute is not a good companion for the overlap path.
delay_wgrad_compute adds further constraints if CUDA-graph scopes include attention or MoE-router work.
In practice, selective recompute is the safer pairing when overlap is enabled.
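As a rough sketch of the recompute guidance, the check below flags full recompute when the overlap path is on. It assumes the recompute mode is exposed as recompute_granularity on the model config with the values "full" and "selective". That field name and its location are assumptions, not something this page states, and the helper name is hypothetical.

```python
from types import SimpleNamespace


def recompute_notes(cfg):
    """Return advisory notes about recompute settings when MoE overlap is enabled."""
    notes = []
    overlap_on = cfg.comm_overlap.overlap_moe_expert_parallel_comm
    # Assumed field: cfg.model.recompute_granularity in {"full", "selective", None}.
    if overlap_on and cfg.model.recompute_granularity == "full":
        notes.append("full recompute with MoE overlap: prefer selective recompute")
    return notes


cfg = SimpleNamespace(
    comm_overlap=SimpleNamespace(overlap_moe_expert_parallel_comm=True),
    model=SimpleNamespace(recompute_granularity="full"),
)
print(recompute_notes(cfg))

cfg.model.recompute_granularity = "selective"
print(recompute_notes(cfg))  # prints []
```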
Measured Short-Run Caveat#
A 2026-05-18 smoke run on current main (H100 x16, Qwen3 30B-A3B mock pretraining)
used EP=16, alltoall, global batch size 1024, CUDA graphs disabled, and
moe_permute_fusion=false, because the PyTorch 25.11 / TE / Triton stack had failed
in Transformer Engine fused permutation during earlier bring-up.
Results were directional rather than release-grade:
no EP overlap: 41.25s steady-state mean over iterations 3-8
EP overlap: 31.31s steady-state mean over iterations 3-8
EP overlap plus delay_wgrad_compute: 31.20s steady-state mean over iterations 3-8
Treat this as evidence that EP overlap can help an inter-node alltoall MoE
shape when communication is exposed. It is not proof that delayed wgrad is a
separate win, and it does not validate the fused permutation path. An earlier
2026-05-16 short smoke on the same shape showed the same pattern.
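To put the three steady-state means in proportion, the arithmetic below uses only the numbers reported above. The variable and function names are illustrative.

```python
# Steady-state means over iterations 3-8, in seconds, as reported above.
no_overlap_s = 41.25
ep_overlap_s = 31.31
ep_overlap_delay_wgrad_s = 31.20


def reduction_pct(before_s, after_s):
    """Percentage reduction in step time going from before_s to after_s."""
    return 100.0 * (before_s - after_s) / before_s


print(f"EP overlap speedup: {no_overlap_s / ep_overlap_s:.2f}x")                       # 1.32x
print(f"EP overlap step-time reduction: {reduction_pct(no_overlap_s, ep_overlap_s):.1f}%")  # 24.1%
print(
    "delay_wgrad_compute on top of EP overlap: "
    f"{reduction_pct(ep_overlap_s, ep_overlap_delay_wgrad_s):.2f}%"                    # 0.35%
)
```

The roughly 0.35% difference from delayed wgrad is small enough that an 8-step run cannot separate it from noise, which is why it is not treated as a separate win.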
Code Anchors#
Overlap validation: src/megatron/bridge/training/comm_overlap.py
Flex dispatcher backend: src/megatron/bridge/training/flex_dispatcher_backend.py
Config: src/megatron/bridge/training/config.py
Unit tests: tests/unit_tests/training/test_comm_overlap.py
DeepEP tests: tests/unit_tests/training/test_deepep.py
Pitfalls#
Shared expert overlap conflict: moe_shared_expert_overlap and overlap_moe_expert_parallel_comm can conflict. Disable shared expert overlap when using the dispatch overlap path.
PP without VPP: MoE overlap requires VPP when pipeline parallelism is active. Without it, the overlap scheduling cannot interleave correctly.
Flex != backend flag: moe_flex_dispatcher_backend="deepep" alone does nothing if moe_token_dispatcher_type is still "alltoall".
Conservative recipe defaults: Most public recipes leave MoE overlap disabled. You need to explicitly enable it via overrides.
Performance gains are workload-dependent: overlap helps most when dispatch communication is already a visible slice of step time. It is not guaranteed to help every small or lightly loaded EP run.
Verification#
Look for overlap-related log messages during initialization. The comm overlap
validation in comm_overlap.py raises if prerequisites are not met, so a clean
startup with overlap_moe_expert_parallel_comm enabled indicates that the
configuration passed validation.
For a short performance-harness smoke, keep the command shape explicit and vary only one overlap knob at a time:
uv run python scripts/performance/run_script.py \
-m qwen \
-mr qwen3_30b_a3b \
--task pretrain \
-g h100 \
-c bf16 \
-ng 16 \
-gn 8 \
--max_steps 8 \
--config_variant v1 \
--cuda_graph_impl none \
--moe_flex_dispatcher_backend None \
--moe_a2a_overlap false \
--tokenizer_type NullTokenizer \
comm_overlap.overlap_moe_expert_parallel_comm=true \
comm_overlap.delay_wgrad_compute=false \
model.moe_shared_expert_overlap=false
If fused MoE permutation fails during bring-up, add
model.moe_permute_fusion=false to separate overlap timing from runtime-stack
validation, then retest with the matched production container.
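One way to keep "vary only one overlap knob at a time" honest is to generate the override strings for each run from a single table. The sketch below covers the three variants from the measured comparison (no EP overlap, EP overlap, EP overlap plus delay_wgrad_compute). The variant names and helper functions are hypothetical; the override keys are the ones used in the command above.

```python
# Overrides held constant across every variant.
SHARED = {"model.moe_shared_expert_overlap": "false"}

# Ordered so that each variant differs from the previous one by a single knob.
VARIANTS = [
    ("no_ep_overlap", {
        "comm_overlap.overlap_moe_expert_parallel_comm": "false",
        "comm_overlap.delay_wgrad_compute": "false",
    }),
    ("ep_overlap", {
        "comm_overlap.overlap_moe_expert_parallel_comm": "true",
        "comm_overlap.delay_wgrad_compute": "false",
    }),
    ("ep_overlap_delay_wgrad", {
        "comm_overlap.overlap_moe_expert_parallel_comm": "true",
        "comm_overlap.delay_wgrad_compute": "true",
    }),
]


def overrides(knobs):
    """Render key=value override strings for one variant, shared overrides last."""
    return [f"{key}={value}" for key, value in {**knobs, **SHARED}.items()]


def changed_knobs(a, b):
    """Return the knob names whose values differ between two variants."""
    return [key for key in a if a[key] != b[key]]


for name, knobs in VARIANTS:
    print(name, " ".join(overrides(knobs)))
```

The rendered strings can be appended to the end of the run_script.py command in place of the three trailing overrides.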