MoE Expert-Parallel Overlap Skill#

Stable docs: docs/training/communication-overlap.md
Card: card.yaml (co-located)

References#

Stable docs: docs/training/communication-overlap.md
Structured metadata: skills/perf-techniques/expert-parallel-overlap/card.yaml

What It Is#

Expert-parallel (EP) overlap hides the cost of the token dispatch/combine all-to-all communication by running it concurrently with expert FFN compute. Optionally, delayed expert weight-gradient computation (delay_wgrad_compute) adds further overlap by deferring wgrad so that it runs alongside the next layer’s forward.
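
To make the benefit concrete, here is a minimal sketch of an idealized timing model. It is not Bridge code: the function name `layer_time_ms`, the `overlap_fraction` parameter, and all numbers are illustrative assumptions, not measurements. It shows that communication can only be hidden behind compute that actually exists, so a communication-bound layer stays communication-bound.

```python
def layer_time_ms(compute_ms: float, a2a_ms: float, overlap_fraction: float) -> float:
    """Idealized per-layer time when part of the all-to-all time is hidden.

    overlap_fraction = 0.0 means fully serialized execution; 1.0 means the
    all-to-all is hidden as far as the available compute allows.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")
    # Communication can only be hidden behind compute that actually exists.
    hidden_ms = min(a2a_ms * overlap_fraction, compute_ms)
    return compute_ms + a2a_ms - hidden_ms


# Illustrative numbers only (assumed, not measured).
serial = layer_time_ms(compute_ms=10.0, a2a_ms=4.0, overlap_fraction=0.0)
ideal = layer_time_ms(compute_ms=10.0, a2a_ms=4.0, overlap_fraction=1.0)
comm_bound = layer_time_ms(compute_ms=3.0, a2a_ms=8.0, overlap_fraction=1.0)
print(serial, ideal, comm_bound)  # 14.0 10.0 8.0
```

In the compute-bound case the all-to-all disappears from the critical path; in the communication-bound case only the compute portion is hidden.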

Bridge supports two dispatcher paths:

Dispatcher	Backend	When to use
`alltoall`	Standard MoE all-to-all	Default, broadest compatibility
`flex`	DeepEP or HybridEP	Higher overlap on Ampere/Hopper/Blackwell

Quick Decision#

Use EP overlap when:

the model is MoE with EP > 1
expert dispatch/combine communication is a meaningful part of step time
you have memory headroom and are tuning for throughput

Prefer:

alltoall dispatcher for the first rollout (broader compatibility)
flex + DeepEP/HybridEP when running on supported GPUs and seeking additional gains

Avoid EP overlap when:

full activation recompute is enabled
moe_shared_expert_overlap is enabled
the run is still being brought up for correctness
the PyTorch version is older than 2.6.0
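
The checklist above can be sketched as a small helper. This is a hypothetical function, not part of Bridge: every parameter name is an assumption introduced for illustration, and `min_a2a_fraction` is an arbitrary stand-in for "a meaningful part of step time".

```python
def recommend_ep_overlap(
    ep_size: int,
    a2a_fraction: float,           # share of step time spent in dispatch/combine
    has_memory_headroom: bool,
    full_recompute: bool,
    shared_expert_overlap: bool,
    in_bringup: bool,              # run is still being validated for correctness
    torch_version: tuple,          # e.g. (2, 6, 0)
    flex_gpu_supported: bool,
    first_rollout: bool = True,
    min_a2a_fraction: float = 0.05,  # assumed threshold, not a Bridge value
) -> str:
    """Return "skip", "alltoall", or "flex" following the checklist above."""
    blocked = (
        full_recompute
        or shared_expert_overlap
        or in_bringup
        or torch_version < (2, 6, 0)
    )
    worthwhile = (
        ep_size > 1
        and a2a_fraction >= min_a2a_fraction
        and has_memory_headroom
    )
    if blocked or not worthwhile:
        return "skip"
    # Start with the broadly compatible dispatcher; move to flex later.
    if first_rollout or not flex_gpu_supported:
        return "alltoall"
    return "flex"


common = dict(
    ep_size=8, a2a_fraction=0.2, has_memory_headroom=True,
    full_recompute=False, shared_expert_overlap=False, in_bringup=False,
    torch_version=(2, 7, 0), flex_gpu_supported=True,
)
print(recommend_ep_overlap(**common))                       # alltoall
print(recommend_ep_overlap(**common, first_rollout=False))  # flex
```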

Enablement#

alltoall dispatcher#

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False

cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False

flex dispatcher (DeepEP or HybridEP)#

from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False

apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")

Compatibility And Constraints#

expert_model_parallel_size > 1
num_moe_experts > 1
moe_token_dispatcher_type must be "alltoall" or "flex"
moe_shared_expert_overlap = False
Base precision is BF16 or FP16
PyTorch >= 2.6.0
If PP > 1, virtual_pipeline_model_parallel_size must be set
recompute_granularity != "full", recompute_method = None, recompute_num_layers = None
mtp_num_layers must be None or 1
delay_wgrad_compute requires overlap_moe_expert_parallel_comm = True
delay_wgrad_compute with overlap_grad_reduce requires TE >= 2.7.0
delay_wgrad_compute with gradient_accumulation_fusion requires TE >= 2.7.0
CUDA graph attn scope + delay_wgrad_compute requires TE >= 2.12.0, gradient_accumulation_fusion = True, and no attention bias
DeepEP: Ampere, Hopper, B200, B300 GPUs only
HybridEP: Ampere, Hopper, B200, B300, GB200/GB300 with NVL72
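
As a pre-flight aid, the model-side constraints above can be mirrored in a standalone checker. This is a sketch only, not Bridge's own validation: `ep_overlap_violations` is a hypothetical helper, the config is a plain `SimpleNamespace` stand-in, and the field name `pipeline_model_parallel_size` is assumed for "PP". It does not cover the delayed-wgrad or GPU-architecture rules.

```python
from types import SimpleNamespace


def ep_overlap_violations(model_cfg, torch_version):
    """Return a message for each constraint above that model_cfg violates."""
    problems = []
    if model_cfg.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if (model_cfg.num_moe_experts or 0) <= 1:
        problems.append("num_moe_experts must be > 1")
    if model_cfg.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if model_cfg.moe_shared_expert_overlap:
        problems.append("moe_shared_expert_overlap must be False")
    if not (model_cfg.bf16 or model_cfg.fp16):
        problems.append("base precision must be BF16 or FP16")
    if torch_version < (2, 6, 0):
        problems.append("PyTorch must be >= 2.6.0")
    if (model_cfg.pipeline_model_parallel_size > 1
            and model_cfg.virtual_pipeline_model_parallel_size is None):
        problems.append("virtual_pipeline_model_parallel_size must be set when PP > 1")
    if model_cfg.recompute_granularity == "full":
        problems.append('recompute_granularity must not be "full"')
    if (model_cfg.recompute_method is not None
            or model_cfg.recompute_num_layers is not None):
        problems.append("recompute_method and recompute_num_layers must be None")
    if model_cfg.mtp_num_layers not in (None, 1):
        problems.append("mtp_num_layers must be None or 1")
    return problems


# Stand-in config using the field names from this page.
good_cfg = SimpleNamespace(
    expert_model_parallel_size=4, num_moe_experts=64,
    moe_token_dispatcher_type="alltoall", moe_shared_expert_overlap=False,
    bf16=True, fp16=False,
    pipeline_model_parallel_size=1, virtual_pipeline_model_parallel_size=None,
    recompute_granularity=None, recompute_method=None, recompute_num_layers=None,
    mtp_num_layers=None,
)
print(ep_overlap_violations(good_cfg, torch_version=(2, 6, 0)))  # []
```

Returning a list rather than asserting on the first failure lets you see every violated constraint in one pass.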

Minimal Working Config#

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True

Minimal Runnable Command#

Performance harness example:

python scripts/performance/setup_experiment.py \
  --model qwen3-30b-a3b \
  --moe_a2a_overlap \
  --num_nodes 2 \
  --gpus_per_node 8 \
  --max_steps 20

Unit test verification (-k "moe" filters the tests collected from both files):

uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py \
  -k "moe" -q

Verification#

Unit tests#

uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py -q

Log checks#

After a successful run with EP overlap:

Confirm no assertion errors during CommOverlapConfig finalization
Confirm overlap_moe_expert_parallel_comm appears as True in the logged config
If using flex dispatcher, confirm moe_token_dispatcher_type = "flex" and the correct backend in logs

Success criteria#

Config validation passes for the selected dispatcher and overlap settings
Training runs complete without hangs or assertion failures
Throughput improves or at least does not regress for the target workload
Loss trajectory matches baseline (overlap should not affect convergence)
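
The last two criteria can be checked with a small comparison script. This is a sketch under assumptions: `compare_runs` is a hypothetical helper, the tolerance defaults are arbitrary choices rather than Bridge recommendations, and the sample values are made up for illustration.

```python
import math
from statistics import median


def compare_runs(baseline_losses, overlap_losses,
                 baseline_step_ms, overlap_step_ms,
                 loss_rel_tol=0.02, max_slowdown=0.0):
    """Compare a baseline run with an EP-overlap run.

    loss_rel_tol and max_slowdown are assumed tolerances; tune them per workload.
    """
    if len(baseline_losses) != len(overlap_losses):
        raise ValueError("loss trajectories must cover the same steps")
    # Step-by-step loss agreement within a relative tolerance.
    loss_matches = all(
        math.isclose(b, o, rel_tol=loss_rel_tol)
        for b, o in zip(baseline_losses, overlap_losses)
    )
    # Medians are less sensitive to warm-up and straggler steps than means.
    base_ms = median(baseline_step_ms)
    over_ms = median(overlap_step_ms)
    return {
        "loss_matches": loss_matches,
        "speedup": base_ms / over_ms,
        "no_regression": over_ms <= base_ms * (1.0 + max_slowdown),
    }


# Made-up sample values, for illustration only.
report = compare_runs(
    baseline_losses=[2.50, 2.31, 2.20],
    overlap_losses=[2.51, 2.30, 2.21],
    baseline_step_ms=[100.0, 102.0, 98.0],
    overlap_step_ms=[90.0, 91.0, 89.0],
)
print(report)
```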

Code Anchors#

Bridge overlap validation#

if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...

Delayed wgrad validation#

if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...
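
The delayed-wgrad rules from Compatibility And Constraints can be sketched as a standalone check. This is not the Bridge implementation: `delayed_wgrad_violations` is a hypothetical helper, and `cuda_graph_attn_scope` and `attention_bias` are assumed boolean flags standing in for the real CUDA graph scope and bias settings.

```python
def delayed_wgrad_violations(
    overlap_moe_expert_parallel_comm: bool,
    overlap_grad_reduce: bool,
    gradient_accumulation_fusion: bool,
    cuda_graph_attn_scope: bool,  # hypothetical flag: attention captured in a CUDA graph
    attention_bias: bool,         # hypothetical flag: any attention bias enabled
    te_version: tuple,            # Transformer Engine version, e.g. (2, 7, 0)
):
    """Return a message for each delayed-wgrad rule that is violated."""
    problems = []
    if not overlap_moe_expert_parallel_comm:
        problems.append("delay_wgrad_compute requires overlap_moe_expert_parallel_comm")
    if (overlap_grad_reduce or gradient_accumulation_fusion) and te_version < (2, 7, 0):
        problems.append("TE >= 2.7.0 required with overlap_grad_reduce "
                        "or gradient_accumulation_fusion")
    if cuda_graph_attn_scope:
        if te_version < (2, 12, 0):
            problems.append("TE >= 2.12.0 required with CUDA graph attn scope")
        if not gradient_accumulation_fusion:
            problems.append("gradient_accumulation_fusion must be True "
                            "with CUDA graph attn scope")
        if attention_bias:
            problems.append("attention bias must be disabled "
                            "with CUDA graph attn scope")
    return problems


print(delayed_wgrad_violations(
    overlap_moe_expert_parallel_comm=True,
    overlap_grad_reduce=True,
    gradient_accumulation_fusion=True,
    cuda_graph_attn_scope=True,
    attention_bias=False,
    te_version=(2, 7, 0),
))  # ['TE >= 2.12.0 required with CUDA graph attn scope']
```

Versions are compared as integer tuples on purpose: as strings, "2.12.0" sorts before "2.7.0", which would invert the check.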

Flex-dispatcher activation#

def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False

Perf harness override#

def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False

Tests#

File	Coverage
`tests/unit_tests/training/test_comm_overlap.py`	EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction
`tests/unit_tests/training/test_deepep.py`	DeepEP/HybridEP helper activation and GPU gating

Failure Diagnosis#

Symptom	Likely Cause	How To Confirm	Fix
assert `expert_model_parallel_size > 1`	EP not configured	Check `expert_model_parallel_size`	Set EP > 1
assert `moe_token_dispatcher_type`	Wrong dispatcher	Check dispatcher type	Use `"alltoall"` or `"flex"`
assert on BF16/FP16	Wrong precision	Check `bf16` and `fp16`	Set `bf16 = True` or `fp16 = True`
hang during training	PyTorch < 2.6	Check PyTorch version	Upgrade to >= 2.6.0
assert `virtual_pipeline_model_parallel_size`	PP > 1 without VPP	Check PP and VPP config	Set VPP when PP > 1
assert `recompute_granularity`	Full recompute enabled	Check recompute settings	Disable full recompute
assert `overlap_moe_expert_parallel_comm required`	Delayed wgrad without EP overlap	Check whether `delay_wgrad_compute` is set while `overlap_moe_expert_parallel_comm` is not	Enable EP overlap first
assert `gradient_accumulation_fusion`	CUDA graph + delayed wgrad	Check graph scope + wgrad settings	Enable `gradient_accumulation_fusion`
assert on attention bias	CUDA graph attn + delayed wgrad + bias	Check `add_bias_linear` / `add_qkv_bias`	Disable attention bias
no throughput gain from flex dispatcher	`apply_flex_dispatcher_backend` not called	Check `moe_token_dispatcher_type` in logs	Call `apply_flex_dispatcher_backend(...)`
DeepEP/HybridEP silently skipped	Unsupported GPU	Check warning logs	Run on Ampere/Hopper/Blackwell

Known Limitations#

Setting moe_flex_dispatcher_backend alone does not activate flex dispatch; you must call apply_flex_dispatcher_backend(...).
Public recipes are often conservative and leave MoE overlap disabled by default.
End-to-end throughput gains have not yet been measured in a controlled Bridge experiment. The configuration checks are covered by unit tests, but the performance benefit is not yet backed by measurements.
MoE overlap and shared-expert overlap are mutually exclusive.
CUDA graph plus delayed wgrad is a multi-constraint path that requires careful TE version and scope validation.
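
Because of the first limitation, it can help to assert that flex dispatch really is active before launching a run. A minimal sketch, assuming a config object that exposes the two fields named on this page; `flex_dispatch_is_active` is a hypothetical helper and `SimpleNamespace` stands in for the real model config.

```python
from types import SimpleNamespace


def flex_dispatch_is_active(model_cfg) -> bool:
    """True only if both the dispatcher type and a known flex backend are set."""
    return (
        getattr(model_cfg, "moe_token_dispatcher_type", None) == "flex"
        and getattr(model_cfg, "moe_flex_dispatcher_backend", None)
        in ("deepep", "hybridep")
    )


# Backend set by hand, dispatcher type left at "alltoall": not active.
backend_only = SimpleNamespace(
    moe_token_dispatcher_type="alltoall",
    moe_flex_dispatcher_backend="deepep",
)
# State after the helper has switched the dispatcher type as well.
fully_applied = SimpleNamespace(
    moe_token_dispatcher_type="flex",
    moe_flex_dispatcher_backend="deepep",
)
print(flex_dispatch_is_active(backend_only))   # False
print(flex_dispatch_is_active(fully_applied))  # True
```

The same check is useful after calling the real helper, since on an unsupported GPU the backend may be skipped and the dispatcher type left unchanged.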