MoE Expert-Parallel Overlap Skill#

Stable docs: docs/training/communication-overlap.md
Card: card.yaml (co-located)

References#

  • Stable docs: docs/training/communication-overlap.md

  • Structured metadata: skills/perf-techniques/expert-parallel-overlap/card.yaml

What It Is#

Expert-parallel (EP) overlap hides the cost of the token dispatch/combine all-to-all communication by running it concurrently with expert FFN compute. Optionally, delay_wgrad_compute defers the expert weight-gradient (wgrad) computation so that it can run alongside the next layer’s forward, which provides additional overlap.
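
As a rough illustration of why overlap helps, the toy model below compares a serialized MoE layer, where dispatch, expert compute, and combine run back to back, with an idealized layer in which all-to-all traffic is fully hidden behind compute. The function names and timings are hypothetical, and the model ignores scheduling and resource contention, so treat it as an upper bound on the benefit rather than a prediction.

```python
def serialized_step_ms(dispatch_ms: float, expert_ms: float, combine_ms: float) -> float:
    """Dispatch, expert FFN compute, and combine run back to back."""
    return dispatch_ms + expert_ms + combine_ms


def ideal_overlapped_step_ms(dispatch_ms: float, expert_ms: float, combine_ms: float) -> float:
    """Best case: communication runs fully concurrently with compute,
    so the slower of the two streams sets the elapsed time."""
    return max(dispatch_ms + combine_ms, expert_ms)


# Hypothetical per-layer timings in milliseconds.
serial = serialized_step_ms(2.0, 5.0, 2.0)
overlapped = ideal_overlapped_step_ms(2.0, 5.0, 2.0)
print(f"serialized={serial} ms, ideal overlapped={overlapped} ms")
```

In this model the time saved is min(communication, compute), so the technique only pays off when dispatch/combine is a noticeable share of the layer time.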

Bridge supports two dispatcher paths:

| Dispatcher | Backend | When to use |
| --- | --- | --- |
| alltoall | Standard MoE all-to-all | Default, broadest compatibility |
| flex | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |

Quick Decision#

Use EP overlap when:

  • the model is MoE with EP > 1

  • expert dispatch/combine communication is a meaningful part of step time

  • you have memory headroom and are tuning for throughput

Prefer:

  • alltoall dispatcher for the first rollout (broader compatibility)

  • flex + DeepEP/HybridEP when running on supported GPUs and seeking additional gains

Avoid EP overlap when:

  • full activation recompute is enabled

  • moe_shared_expert_overlap is enabled

  • the run is still being brought up for correctness

  • the installed PyTorch version is older than 2.6.0
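
The checklist above can be condensed into a small helper. This is only a sketch: `RunProfile` and `recommend` are hypothetical names, not Bridge APIs, and the "meaningful share of step time" and "memory headroom" criteria are left to the operator because they depend on profiling.

```python
from dataclasses import dataclass


@dataclass
class RunProfile:
    """Hypothetical summary of a run; not a Bridge type."""
    expert_model_parallel_size: int
    full_activation_recompute: bool
    moe_shared_expert_overlap: bool
    correctness_bringup: bool
    torch_version: tuple          # e.g. (2, 6, 0)
    first_rollout: bool = True
    flex_capable_gpu: bool = False


def recommend(run: RunProfile) -> str:
    """Return "skip", "alltoall", or "flex" following the checklist."""
    avoid = (
        run.expert_model_parallel_size <= 1
        or run.full_activation_recompute
        or run.moe_shared_expert_overlap
        or run.correctness_bringup
        or run.torch_version < (2, 6, 0)
    )
    if avoid:
        return "skip"
    # flex is only suggested once alltoall has been rolled out successfully.
    if run.flex_capable_gpu and not run.first_rollout:
        return "flex"
    return "alltoall"


print(recommend(RunProfile(8, False, False, False, (2, 6, 0))))
```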

Enablement#

alltoall dispatcher#

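# Enable EP communication overlap; delayed wgrad is optional and requires it.
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
# Shared-expert overlap is mutually exclusive with EP overlap.
cfg.model.moe_shared_expert_overlap = False

# EP overlap needs EP > 1, more than one expert, and BF16 or FP16 base precision.
cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False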

flex dispatcher (DeepEP or HybridEP)#

from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False

apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")
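
Because the helper can skip activation on unsupported GPUs, it can be worth checking the model config after the call. The sketch below inspects only the three fields the helper is documented to set; `flex_dispatch_is_active` is a hypothetical name, and `SimpleNamespace` stands in for the real model config so the example runs without Bridge installed.

```python
from types import SimpleNamespace

VALID_FLEX_BACKENDS = ("deepep", "hybridep")


def flex_dispatch_is_active(model_cfg) -> bool:
    """True only if the dispatcher type, backend, and shared-expert
    setting all match what the flex path expects."""
    return (
        getattr(model_cfg, "moe_token_dispatcher_type", None) == "flex"
        and getattr(model_cfg, "moe_flex_dispatcher_backend", None) in VALID_FLEX_BACKENDS
        and getattr(model_cfg, "moe_shared_expert_overlap", False) is False
    )


# Stand-in config: backend set by hand, but the dispatcher type never switched.
stale = SimpleNamespace(
    moe_token_dispatcher_type="alltoall",
    moe_flex_dispatcher_backend="deepep",
    moe_shared_expert_overlap=False,
)
print(flex_dispatch_is_active(stale))
```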

Compatibility And Constraints#

  • expert_model_parallel_size > 1

  • num_moe_experts > 1

  • moe_token_dispatcher_type must be "alltoall" or "flex"

  • moe_shared_expert_overlap = False

  • Base precision is BF16 or FP16

  • PyTorch >= 2.6.0

  • If PP > 1, virtual_pipeline_model_parallel_size must be set

  • recompute_granularity != "full", recompute_method = None, recompute_num_layers = None

  • mtp_num_layers must be None or 1

  • delay_wgrad_compute requires overlap_moe_expert_parallel_comm as a prerequisite

  • delay_wgrad_compute with overlap_grad_reduce requires TE >= 2.7.0

  • delay_wgrad_compute with gradient_accumulation_fusion requires TE >= 2.7.0

  • CUDA graph attn scope + delay_wgrad_compute requires TE >= 2.12.0, gradient_accumulation_fusion = True, and no attention bias

  • DeepEP: Ampere, Hopper, B200, B300 GPUs only

  • HybridEP: Ampere, Hopper, B200, B300, GB200/GB300 with NVL72
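
The base constraints in the list above (excluding the delayed-wgrad and GPU-architecture items) can be checked before launch with a helper that reports every violation at once instead of stopping at the first assertion. This is a minimal sketch, not Bridge's validation code: `ep_overlap_violations` is a hypothetical function, the attribute name `pipeline_model_parallel_size` is assumed for "PP", and the PyTorch version is passed in as a tuple.

```python
from types import SimpleNamespace


def ep_overlap_violations(model_cfg, torch_version: tuple) -> list:
    """Return human-readable violations of the EP overlap constraints."""
    problems = []
    if model_cfg.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if model_cfg.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if model_cfg.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if model_cfg.moe_shared_expert_overlap:
        problems.append("moe_shared_expert_overlap must be False")
    if not (model_cfg.bf16 or model_cfg.fp16):
        problems.append("base precision must be BF16 or FP16")
    if torch_version < (2, 6, 0):
        problems.append("PyTorch must be >= 2.6.0")
    if (model_cfg.pipeline_model_parallel_size > 1
            and model_cfg.virtual_pipeline_model_parallel_size is None):
        problems.append("virtual_pipeline_model_parallel_size must be set when PP > 1")
    if (model_cfg.recompute_granularity == "full"
            or model_cfg.recompute_method is not None
            or model_cfg.recompute_num_layers is not None):
        problems.append("full recompute settings must be disabled")
    if model_cfg.mtp_num_layers not in (None, 1):
        problems.append("mtp_num_layers must be None or 1")
    return problems


# Stand-in for a model config that satisfies every constraint.
cfg = SimpleNamespace(
    expert_model_parallel_size=4, num_moe_experts=64,
    moe_token_dispatcher_type="alltoall", moe_shared_expert_overlap=False,
    bf16=True, fp16=False, pipeline_model_parallel_size=1,
    virtual_pipeline_model_parallel_size=None, recompute_granularity=None,
    recompute_method=None, recompute_num_layers=None, mtp_num_layers=None,
)
print(ep_overlap_violations(cfg, torch_version=(2, 6, 0)))
```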

Minimal Working Config#

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True

Minimal Runnable Command#

Performance harness example:

python scripts/performance/setup_experiment.py \
  --model qwen3-30b-a3b \
  --moe_a2a_overlap \
  --num_nodes 2 \
  --gpus_per_node 8 \
  --max_steps 20

Unit test verification:

uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py -k "moe" \
  tests/unit_tests/training/test_deepep.py -q

Verification#

Unit tests#

uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py -q

Log checks#

After a successful run with EP overlap:

  1. Confirm no assertion errors during CommOverlapConfig finalization

  2. Confirm overlap_moe_expert_parallel_comm appears as True in the logged config

  3. If using flex dispatcher, confirm moe_token_dispatcher_type = "flex" and the correct backend in logs
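
Checks 2 and 3 can be scripted against a captured log. The sketch below assumes the config is logged as `key: value` or `key = value` lines, which may not match the actual log layout; `logged_value` is a hypothetical helper and the sample text is made up for illustration.

```python
import re


def logged_value(log_text: str, key: str):
    """Return the first value logged for `key`, or None if it is absent.

    Assumes `key: value` or `key = value` entries, optionally quoted.
    """
    pattern = re.compile(rf"\b{re.escape(key)}\b\s*[:=]\s*['\"]?([\w.\-]+)")
    match = pattern.search(log_text)
    return match.group(1) if match else None


# Made-up log excerpt; adapt the parsing to the real log format.
sample_log = """\
overlap_moe_expert_parallel_comm: True
moe_token_dispatcher_type = "flex"
moe_flex_dispatcher_backend: deepep
"""

print(logged_value(sample_log, "overlap_moe_expert_parallel_comm"))
print(logged_value(sample_log, "moe_token_dispatcher_type"))
```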

Success criteria#

  • Config validation passes for the selected dispatcher and overlap settings

  • Training runs complete without hangs or assertion failures

  • Throughput improves or at least does not regress for the target workload

  • Loss trajectory matches baseline (overlap should not affect convergence)
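
The last two criteria can be checked numerically once a baseline run and an overlap run have logged per-step losses and throughput. The following is a sketch under stated assumptions: the function names are hypothetical, the 1% relative tolerance is an arbitrary placeholder, and both runs are assumed to use the same seed and data order so that step-by-step comparison is meaningful.

```python
def max_relative_gap(baseline: list, candidate: list) -> float:
    """Largest per-step relative difference between two loss curves."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("loss curves must be non-empty and equally long")
    return max(abs(c - b) / max(abs(b), 1e-12) for b, c in zip(baseline, candidate))


def overlap_run_ok(baseline_loss, overlap_loss, baseline_tps, overlap_tps,
                   loss_rtol: float = 0.01) -> bool:
    """True if the loss tracks the baseline and throughput did not regress."""
    loss_matches = max_relative_gap(baseline_loss, overlap_loss) <= loss_rtol
    no_regression = overlap_tps >= baseline_tps
    return loss_matches and no_regression


# Hypothetical numbers: losses per step and tokens per second.
print(overlap_run_ok([2.50, 2.40, 2.30], [2.51, 2.40, 2.29], 1000.0, 1100.0))
```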

Code Anchors#

Bridge overlap validation#

if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...

Delayed wgrad validation#

if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...
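
The delayed-wgrad rules from the constraints list combine several flags with Transformer Engine (TE) version thresholds. The sketch below restates them as a standalone function for pre-flight checks; it is not the Bridge implementation, `delay_wgrad_violations` and its parameter names are hypothetical, and the TE version is passed as a tuple rather than read from the installed package.

```python
def delay_wgrad_violations(*, overlap_moe_expert_parallel_comm: bool,
                           overlap_grad_reduce: bool,
                           gradient_accumulation_fusion: bool,
                           cuda_graph_attn_scope: bool,
                           attention_bias: bool,
                           te_version: tuple) -> list:
    """Return violations of the delay_wgrad_compute constraints."""
    problems = []
    if not overlap_moe_expert_parallel_comm:
        problems.append("enable overlap_moe_expert_parallel_comm first")
    if (overlap_grad_reduce or gradient_accumulation_fusion) and te_version < (2, 7, 0):
        problems.append("TE >= 2.7.0 required with overlap_grad_reduce "
                        "or gradient_accumulation_fusion")
    if cuda_graph_attn_scope:
        # The CUDA graph attn path stacks three extra requirements.
        if te_version < (2, 12, 0):
            problems.append("TE >= 2.12.0 required with CUDA graph attn scope")
        if not gradient_accumulation_fusion:
            problems.append("gradient_accumulation_fusion must be True")
        if attention_bias:
            problems.append("attention bias must be disabled")
    return problems


print(delay_wgrad_violations(
    overlap_moe_expert_parallel_comm=True, overlap_grad_reduce=True,
    gradient_accumulation_fusion=True, cuda_graph_attn_scope=True,
    attention_bias=False, te_version=(2, 12, 0),
))
```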

Flex-dispatcher activation#

def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False

Perf harness override#

def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False

Tests#

| File | Coverage |
| --- | --- |
| tests/unit_tests/training/test_comm_overlap.py | EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction |
| tests/unit_tests/training/test_deepep.py | DeepEP/HybridEP helper activation and GPU gating |

Failure Diagnosis#

| Symptom | Likely Cause | How To Confirm | Fix |
| --- | --- | --- | --- |
| assert expert_model_parallel_size > 1 | EP not configured | Check expert_model_parallel_size | Set EP > 1 |
| assert moe_token_dispatcher_type | Wrong dispatcher | Check dispatcher type | Use "alltoall" or "flex" |
| assert on BF16/FP16 | Wrong precision | Check bf16 and fp16 | Set bf16 = True |
| hang during training | PyTorch < 2.6 | Check PyTorch version | Upgrade to >= 2.6.0 |
| assert virtual_pipeline_model_parallel_size | PP > 1 without VPP | Check PP and VPP config | Set VPP when PP > 1 |
| assert recompute_granularity | Full recompute enabled | Check recompute settings | Disable full recompute |
| assert overlap_moe_expert_parallel_comm required | delayed wgrad without EP overlap | Check delay_wgrad_compute without overlap | Enable EP overlap first |
| assert gradient_accumulation_fusion | CUDA graph + delayed wgrad | Check graph scope + wgrad settings | Enable gradient_accumulation_fusion |
| assert on attention bias | CUDA graph attn + delayed wgrad + bias | Check add_bias_linear / add_qkv_bias | Disable attention bias |
| no throughput gain from flex dispatcher | apply_flex_dispatcher_backend not called | Check moe_token_dispatcher_type in logs | Call apply_flex_dispatcher_backend(...) |
| DeepEP/HybridEP silently skipped | Unsupported GPU | Check warning logs | Run on Ampere/Hopper/Blackwell |

Known Limitations#

  • Setting moe_flex_dispatcher_backend alone does not activate flex dispatch — you must call apply_flex_dispatcher_backend(...).

  • Public recipes are often conservative and leave MoE overlap disabled by default.

  • End-to-end throughput gains have not yet been measured in a controlled Bridge experiment, so the evidence that the configuration is validated correctly is currently stronger than the evidence for a performance benefit.

  • MoE overlap and shared-expert overlap are mutually exclusive.

  • CUDA graph plus delayed wgrad is a multi-constraint path that requires careful TE version and scope validation.