# MoE Communication Overlap
For what MoE communication overlap is and when to use it, see
`docs/training/communication-overlap.md` and the co-located `card.yaml`.
## Enablement
```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
# Optional: delayed wgrad compute for additional overlap
cfg.model.moe_delay_wgrad_compute = True
# IMPORTANT: disable shared expert overlap when using dispatch overlap
cfg.model.moe_shared_expert_overlap = False
```
## Prerequisites
- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- Precision: BF16 or FP16
- If PP is used, VPP (`virtual_pipeline_model_parallel_size > 1`) is required
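The prerequisite checks above can be sketched as a small validation function. This is a minimal, hypothetical stand-in (the `MoEOverlapCfg` dataclass and `check_moe_overlap_prereqs` name are illustrative, not Megatron Bridge APIs); the real validation lives in `src/megatron/bridge/training/comm_overlap.py`:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MoEOverlapCfg:
    """Hypothetical, simplified stand-in for the relevant config fields."""
    expert_model_parallel_size: int = 1
    num_moe_experts: int = 1
    moe_token_dispatcher_type: str = "allgather"
    bf16: bool = False
    fp16: bool = False
    pipeline_model_parallel_size: int = 1
    virtual_pipeline_model_parallel_size: Optional[int] = None


def check_moe_overlap_prereqs(cfg: MoEOverlapCfg) -> None:
    """Raise ValueError if the prerequisites listed above are not met."""
    if cfg.expert_model_parallel_size <= 1:
        raise ValueError("MoE overlap requires expert_model_parallel_size > 1")
    if cfg.num_moe_experts <= 1:
        raise ValueError("MoE overlap requires num_moe_experts > 1")
    if cfg.moe_token_dispatcher_type not in ("alltoall", "flex"):
        raise ValueError('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if not (cfg.bf16 or cfg.fp16):
        raise ValueError("MoE overlap requires BF16 or FP16 precision")
    if cfg.pipeline_model_parallel_size > 1 and (
        cfg.virtual_pipeline_model_parallel_size is None
        or cfg.virtual_pipeline_model_parallel_size <= 1
    ):
        raise ValueError(
            "PP requires VPP (virtual_pipeline_model_parallel_size > 1)"
        )
```

A config that passes all checks starts cleanly; any violation fails fast with a descriptive error, which mirrors how the real validation surfaces misconfiguration at startup.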
## Flex dispatcher activation
Setting `moe_flex_dispatcher_backend` alone does not activate flex dispatch;
you must also set `moe_token_dispatcher_type = "flex"`.
## Code Anchors
- Overlap validation: `src/megatron/bridge/training/comm_overlap.py`
- Flex dispatcher backend: `src/megatron/bridge/training/flex_dispatcher_backend.py`
- Config: `src/megatron/bridge/training/config.py`
- Unit tests: `tests/unit_tests/training/test_comm_overlap.py`
- DeepEP tests: `tests/unit_tests/training/test_deepep.py`
## Pitfalls
- **Shared expert overlap conflict:** `moe_shared_expert_overlap` and `overlap_moe_expert_parallel_comm` can conflict. Disable shared expert overlap when using the dispatch overlap path.
- **PP without VPP:** MoE overlap requires VPP when pipeline parallelism is active. Without it, the overlap scheduling cannot interleave correctly.
- **Flex != backend flag:** `moe_flex_dispatcher_backend="deepep"` alone does nothing if `moe_token_dispatcher_type` is still `"alltoall"`.
- **Conservative recipe defaults:** Most public recipes leave MoE overlap disabled. You need to explicitly enable it via overrides.
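The first pitfall can be expressed as a simple guard. This is a hypothetical helper for illustration only (the actual check, if any, would live in `comm_overlap.py`):

```python
def check_shared_expert_overlap(
    overlap_moe_expert_parallel_comm: bool,
    moe_shared_expert_overlap: bool,
) -> None:
    """Sketch: the two overlap flags must not both be enabled."""
    if overlap_moe_expert_parallel_comm and moe_shared_expert_overlap:
        raise ValueError(
            "moe_shared_expert_overlap must be False when "
            "overlap_moe_expert_parallel_comm is enabled"
        )
```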
## Verification
Look for overlap-related log messages during initialization. The comm overlap
validation in `comm_overlap.py` will raise if prerequisites are not met, so a
clean startup confirms the feature is active.