Communication Overlap#

Communication overlap reduces exposed communication cost in distributed training by hiding collectives or point-to-point transfers under useful compute.

This page is the stable guide for what communication overlap is, when it tends to help, and which boundaries are durable across Megatron Bridge. For exact knobs, code anchors, and verification commands, see:

What It Is#

In Bridge, communication overlap is a family of related techniques rather than a single switch:

Mode	What gets hidden	Main gate
DP	gradient reduce-scatter and parameter all-gather	distributed-optimizer overlap path
TP	tensor-parallel collectives under layer compute	`CommOverlapConfig.tp_comm_overlap` plus sequence parallelism
PP	pipeline send/recv work under schedule execution	pipeline schedule and virtual pipeline layout
CP	context-parallel communication inside CP execution paths	CP implementation choice
EP	MoE token dispatch/combine communication under expert compute	`overlap_moe_expert_parallel_comm`

These paths share the same goal, but they do not share the same enablement rules, evidence level, or failure modes.

What Problem It Solves#

Distributed training often becomes communication-bound before it becomes compute-bound. Once TP, DP, PP, CP, or EP traffic is visible on the critical path, adding more GPUs may raise communication time faster than it raises useful compute.

Communication overlap addresses that by moving communication earlier or later in the step so the same transfer can happen while some other part of the model is already doing useful work. It does not change the training objective. It tries to reduce idle time.

Impacted Training Dimensions#

Dimension	Effect	Confidence	Why
`speed`	flat to moderately faster, mode-dependent	medium	The goal is to hide communication time, but gains depend strongly on which overlap mode is active and whether communication is actually exposed. Small-EP MoE runs can still be flat or even slower.
`memory`	usually near-neutral; some modes add modest buffers	low	Overlap itself is usually not a primary memory technique, although some implementations add buffer or scheduling constraints.
`scale`	positive at higher parallelism degrees	medium	Overlap becomes more valuable as communication dominates larger distributed runs.
`convergence`	no change expected	medium	The intent is to preserve the same training math, though schedule changes can alter floating-point accumulation order.
`stability`	adds operational constraints	medium	More overlap usually means tighter requirements around schedule shape, precision, runtime versions, and feature combinations.

When to Use It#

Enable communication overlap when all of the following are mostly true:

the distributed configuration already works correctly without overlap
communication is a meaningful part of step time
you are tuning throughput or utilization, not doing first bring-up
you can benchmark the specific overlap mode you plan to use

As a rule of thumb:

Mode	Good first use case	Recommendation
DP	distributed optimizer on multi-GPU or multi-node training	Usually worth considering early once optimizer sharding is already chosen.
TP	`TP >= 2` with sequence parallelism and TE-enabled path	Benchmark when TP collectives are visible in the profile.
PP	interleaved pipeline schedules where p2p overhead is visible	Treat as schedule tuning, not a blanket PP default.
CP	large-context runs already using CP	Follow the CP-specific guidance rather than treating it as a separate generic knob.
EP	large-scale MoE with many micro-batches and inter-node A2A cost	Most promising at larger EP and with higher-latency dispatcher backends.

Measured repo evidence today is strongest for MoE EP overlap. The pattern is mixed rather than universally positive:

small-EP alltoall runs can be correct but flat or slower
a 2026-05-18 current-main H100 x16 Qwen3 30B-A3B mock-data rerun with EP=16, alltoall, CUDA graphs disabled, and moe_permute_fusion=false reduced steady-state step time from 41.25s to 31.31s when enabling EP overlap
in that same run, adding delay_wgrad_compute was neutral: 31.20s versus 31.31s for overlap-only
delay_wgrad_compute can still help some schedules, but it is not a guaranteed speedup over overlap-only

So, in this repo, EP overlap is better described as correctness-backed and workload-sensitive rather than universally speedup-backed.

When Not to Use It#

Avoid communication overlap when any of these are true:

you are still debugging a new distributed setup
the profile is compute-bound rather than communication-bound
the required companion feature is missing, such as sequence parallelism for TP
another feature already imposes conflicting runtime constraints
you have not benchmarked the exact model and parallelism shape

For MoE EP overlap specifically, avoid treating it as a default when:

EP <= 4 with alltoall on <= 2 nodes
the run has very few pipeline micro-batches
moe_shared_expert_overlap must stay enabled
full recompute or recompute scheduling incompatible with EP overlap is required

Feature Interactions#

The most important interactions are:

DP overlap is tied to distributed-optimizer behavior rather than a fully independent tuning path.
TP overlap depends on sequence parallelism and the supported TE overlap path.
PP and EP overlap interact with virtual pipeline layout when PP > 1.
CP overlap should be reasoned about together with the chosen CP communication type.
EP overlap with DeepEP or HybridEP requires explicitly switching the dispatcher to flex.
EP overlap and moe_shared_expert_overlap are mutually exclusive.
CUDA graphs plus delay_wgrad_compute adds extra TE-version and graph-scope restrictions.
Launch-time environment tuning can conflict across overlap paths, especially TP or CP overlap versus DeepEP or HybridEP tuning.

Bridge Configuration#

Communication overlap is configured through CommOverlapConfig plus mode-specific model settings. There is no single universal toggle — DP, TP, PP, CP, and EP each have different prerequisites and should be enabled based on the actual bottleneck.

For config examples and minimal runnable commands, see:

Expected Metric Changes#

Metric	Expected Change	Conditions	Evidence
`step_time`	down	DP overlap with distributed optimizer on communication-heavy runs	expected
`step_time`	down	TP overlap with `TP >= 2`, sequence parallelism, and supported TE path	expected
`pipeline_idle_time`	down	interleaved PP where p2p cost is visible	expected
`step_time`	flat to mixed	small-EP MoE with `alltoall`	measured
`step_time`	down in repeated short EP16 H100 runs	Qwen3 30B-A3B, `alltoall`, CUDA graphs off, `moe_permute_fusion=false`	measured
`step_time`	neutral in repeated short EP16 H100 runs	same shape, EP overlap plus delayed wgrad versus EP overlap-only	measured

Do not assume one overlap win transfers automatically to another mode. The correct question is always “which communication path is exposed in this run?”

Common Failure Modes#

TP overlap silently disables itself when sequence parallelism is off or TP < 2.
PP overlap expectations are wrong when the schedule is non-interleaved or VPP is missing.
EP overlap asserts when PP > 1 but virtual_pipeline_model_parallel_size is unset.
EP overlap asserts when full recompute, recompute method, or shared-expert overlap stays enabled.
Setting moe_flex_dispatcher_backend alone does not activate DeepEP or HybridEP; the dispatcher must actually switch to flex.
Small-EP alltoall MoE runs can get slower because scheduling overhead is larger than the communication being hidden.
Fused MoE permutation may depend on the exact Transformer Engine and Triton stack. If a bring-up fails inside transformer_engine.pytorch.permutation, disable moe_permute_fusion for the smoke run and retest the fusion in a matched runtime before treating the timing as final.