# Communication Overlap
Communication overlap reduces exposed communication cost in distributed training by overlapping collectives or point-to-point transfers with useful compute. Megatron Bridge supports overlap across several parallelism dimensions, but the available behavior is not identical for every mode.
This page is the stable overview of what communication overlap is, when to use it, and which constraints are durable. For operational setup, code anchors, and verification commands, see the companion per-mode setup pages.
## What It Is

In Bridge, communication overlap spans several related subfeatures:

- data-parallel overlap for gradient reduce-scatter and parameter all-gather
- tensor-parallel overlap for TP communication under GEMM work
- pipeline-parallel overlap for PP send and receive behavior
- context-parallel overlap built into context-parallel execution paths
- MoE expert-parallel overlap for expert token dispatch communication
These are related performance techniques, but they do not share the same gates, defaults, or operational risks.
## When to Use It

Communication overlap is a good fit when:

- the model already needs TP, DP, PP, CP, or EP for scale
- communication is a meaningful part of step time
- correctness is already established and you are tuning for throughput

It is less appropriate when:

- you are still bringing up a new training path and want minimal moving parts
- the feature combination is branch-sensitive or weakly validated
- launch-time environment tuning is likely to conflict with another technique
## Stable Per-Mode Guidance

### Data Parallel
DP overlap is tied to the distributed-optimizer path. It is the natural overlap mechanism for sharded optimizer-state training and should be reasoned about together with distributed optimizer behavior rather than as an isolated knob.
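As a minimal sketch of that coupling, the DP overlap knobs travel together with the sharded-optimizer path. The flag names below follow Megatron-style conventions and are assumptions to verify against your Bridge version, not its exact config API:

```python
# Hypothetical sketch: flag names follow Megatron-style conventions and may
# not match your Bridge version exactly; check your config reference.
dp_overlap = {
    "use_distributed_optimizer": True,  # DP overlap rides on the sharded-optimizer path
    "overlap_grad_reduce": True,        # hide gradient reduce-scatter under backward compute
    "overlap_param_gather": True,       # hide parameter all-gather under forward compute
}

# Reason about DP overlap together with the distributed optimizer, not as an
# isolated knob: the overlap flags presuppose the sharded path.
assert dp_overlap["use_distributed_optimizer"], (
    "overlap_grad_reduce/overlap_param_gather assume the distributed optimizer"
)
```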
### Tensor Parallel
TP overlap is conceptually tied to sequence parallelism. If sequence parallelism is not available or not enabled, TP overlap should not be assumed to remain active.
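That gate can be written down explicitly. The helper below is hypothetical (not a Bridge API) and only encodes the rule stated above:

```python
def tp_overlap_active(tp_comm_overlap: bool, sequence_parallel: bool, tp_size: int) -> bool:
    """Hypothetical predicate for the gate above: TP communication overlap only
    matters when tensor parallelism is actually in use, and it should not be
    assumed to remain active when sequence parallelism is off or unavailable."""
    return tp_comm_overlap and sequence_parallel and tp_size > 1
```

For example, requesting overlap with sequence parallelism disabled (`tp_overlap_active(True, False, 8)`) yields `False`: the request alone does not make overlap effective.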
### Pipeline Parallel
PP overlap is not a blanket property of all pipeline-parallel training. In practice, interleaved pipeline schedules are the most important positive case.
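The intuition is that interleaved (virtual) pipeline schedules give each rank multiple model chunks, so point-to-point transfers for one chunk can proceed under compute on another. A hypothetical predicate (illustrative names, not a Bridge API) capturing the positive case:

```python
def pp_overlap_likely(pipeline_parallel_size: int, virtual_pipeline_size) -> bool:
    """Hypothetical predicate for the positive case named above: PP send/recv
    overlap is most relevant under interleaved (virtual) pipeline schedules,
    where each rank holds multiple model chunks."""
    if pipeline_parallel_size <= 1:
        return False  # no pipeline communication to hide
    interleaved = virtual_pipeline_size is not None and virtual_pipeline_size > 1
    return interleaved
```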
### Context Parallel
CP overlap is part of Bridge's context-parallel execution model rather than a separate standalone technique page. For hierarchical or a2a+p2p CP guidance, see `docs/training/hybrid-context-parallel.md`.
### MoE Expert Parallel
MoE expert-parallel overlap hides the cost of token dispatch/combine all-to-all communication by overlapping it with expert FFN compute. Optionally, delayed expert weight-gradient computation (`moe_delay_wgrad_compute`) provides additional overlap.
MoE overlap should be treated separately from generic TP, DP, and PP overlap.
Its constraints depend on dispatcher choice (alltoall or flex), expert
parallelism degree, precision (BF16/FP16), and runtime support. When pipeline
parallelism is used, virtual pipeline parallelism is required for the overlap
scheduling to interleave correctly.
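Those constraints are easy to encode as a pre-flight check. The function and key names below are illustrative assumptions, not Bridge's exact config fields:

```python
def moe_overlap_problems(cfg: dict) -> list:
    """Hypothetical pre-flight check mirroring the constraints above.
    Key names are illustrative, not Bridge's exact config fields."""
    problems = []
    # Dispatcher choice gates the overlap path.
    if cfg.get("moe_token_dispatcher_type") not in ("alltoall", "flex"):
        problems.append("dispatcher must be 'alltoall' or 'flex'")
    # Precision support is part of the constraint set.
    if cfg.get("precision") not in ("bf16", "fp16"):
        problems.append("BF16/FP16 precision expected")
    # With pipeline parallelism, virtual PP is required for correct interleaving.
    if cfg.get("pipeline_parallel_size", 1) > 1 and not cfg.get("virtual_pipeline_parallel_size"):
        problems.append("virtual pipeline parallelism required when PP > 1")
    return problems
```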
## Stable Constraints and Caveats

The most durable caveats are:

- Not all overlap modes are auto-enabled in the same situations.
- Some overlap-related precision settings are owned by mixed-precision config, not by overlap tuning alone.
- Launch-time environment settings are part of the technique in practice, especially for TP, CP, and MoE overlap paths.
- Recipe defaults are often conservative; feature existence does not imply that every public recipe enables the corresponding overlap path.
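One concrete example of launch-time environment settings being part of the technique: Megatron-style stacks commonly document `CUDA_DEVICE_MAX_CONNECTIONS=1` as a requirement for TP/sequence-parallel communication overlap. The snippet is a hedged illustration, not a complete or version-accurate list:

```python
import os

# Illustrative only: the exact variables and values depend on your stack and
# version. CUDA_DEVICE_MAX_CONNECTIONS=1 is the commonly cited requirement for
# Megatron-style TP/SP communication overlap; set it before process launch.
launch_env = {
    "CUDA_DEVICE_MAX_CONNECTIONS": "1",
}
os.environ.update(launch_env)
```

In practice these variables are set in the launcher (e.g. the job script), not inside training code, so they apply before any CUDA context is created.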
## Recommendation Level

Treat communication overlap as a tuning layer on top of a working distributed configuration, not as the first knob to reach for when basic correctness is still uncertain.

For most teams, the right order is:

1. establish a correct distributed configuration
2. choose the necessary parallelism strategy
3. enable or tune overlap for the specific communication bottleneck
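The ordering above suggests a staged rollout: flip one overlap knob per stage against a validated baseline, measuring step time each time. A sketch with illustrative flag names (not a Bridge API):

```python
# Hypothetical staged rollout: start from a validated baseline, then enable
# one overlap knob per stage so any regression is attributable to that knob.
stages = [
    {"name": "baseline",   "overlap_grad_reduce": False, "tp_comm_overlap": False},
    {"name": "dp-overlap", "overlap_grad_reduce": True,  "tp_comm_overlap": False},
    {"name": "dp+tp",      "overlap_grad_reduce": True,  "tp_comm_overlap": True},
]

# Sanity check: each stage changes at most one knob relative to the previous.
for prev, cur in zip(stages, stages[1:]):
    changed = [k for k in ("overlap_grad_reduce", "tp_comm_overlap") if prev[k] != cur[k]]
    assert len(changed) <= 1
```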