Hybrid / Hierarchical Context Parallel#
This page covers the stable Bridge-facing meaning of hierarchical context
parallelism, especially the a2a+p2p transport path and
hierarchical_context_parallel_sizes.
For operational setup, code anchors, and verification commands, see skills/perf-techniques/hybrid-context-parallel/SKILL.md.
What It Is#
Context parallelism (CP) splits the input sequence across GPUs so each rank processes a chunk. The GPUs must communicate KV data during attention. There are several CP communication backends:
| Backend | Mechanism | Async / Overlap | Constraint |
|---|---|---|---|
| `p2p` | Ring-exchange of KV chunks | Yes | None |
| `all_gather` | All-gather full KV before attention | No | None |
| `a2a` | All-to-all: scatter heads, gather full sequence (Ulysses-style) | N/A | CP <= num_kv_heads |
| `a2a+p2p` | Hierarchical: a2a within inner group, p2p across outer group | Partial (p2p part) | Requires `hierarchical_context_parallel_sizes` |
HCP (a2a+p2p) exists to scale CP beyond the KV head count by combining
a2a (fast, head-parallel) on intra-node links with p2p (async,
sequence-parallel) on inter-node links.
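To make the decomposition concrete, here is a minimal sketch (plain Python, no framework dependencies) of how a CP group of 16 ranks might be factored into inner a2a groups of 8 and outer p2p groups of 2. The grouping logic mirrors the description above; the exact rank ordering used by a real implementation may differ.

```python
def hierarchical_cp_groups(cp_size, a2a_size):
    """Factor a CP group into inner a2a groups and outer p2p groups.

    Inner groups hold a2a_size consecutive ranks (intra-node, fast links);
    outer groups connect ranks at the same inner position across nodes.
    """
    assert cp_size % a2a_size == 0, "a2a group size must divide CP size"
    p2p_size = cp_size // a2a_size
    # Inner a2a groups: consecutive ranks that share a node's fast links.
    a2a_groups = [list(range(i * a2a_size, (i + 1) * a2a_size))
                  for i in range(p2p_size)]
    # Outer p2p groups: one rank drawn from each inner group, strided.
    p2p_groups = [list(range(j, cp_size, a2a_size)) for j in range(a2a_size)]
    return a2a_groups, p2p_groups

# CP=16 on 2 nodes x 8 GPUs: two inner a2a groups of 8 ranks,
# eight outer p2p groups of 2 ranks each.
a2a_groups, p2p_groups = hierarchical_cp_groups(cp_size=16, a2a_size=8)
```

Each rank belongs to exactly one inner group (where the fast all-to-all runs) and one outer group (where the async ring p2p runs), so the two transports compose without overlap.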
It is important to separate this from the upstream boolean
hybrid_context_parallel, which is a different feature for balancing packed or
variable-length workloads. The two concepts should not be treated as
interchangeable.
Why a2a is limited by KV heads#
a2a transposes the parallelism dimension: each rank trades its sequence chunk
for a subset of attention heads. After the all-to-all, every rank has the
full sequence but only `heads / CP` heads. This means:

- `heads / CP` must be a positive integer.
- The bottleneck is KV heads (not Q heads), because in GQA the KV heads are the indivisible unit.
If the model has 8 KV heads, pure a2a supports at most CP=8.
HCP breaks this limit by applying a2a only within a sub-group small enough to fit within the KV head count.
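The limit can be checked with a few lines of arithmetic. This is an illustrative sketch, not a Bridge API; it assumes the per-rank KV head count after tensor parallelism is `num_kv_heads / TP`.

```python
def pure_a2a_fits(num_kv_heads, cp_size, tp_size=1):
    """Pure a2a requires each CP rank to own >= 1 whole KV head."""
    kv_heads_per_tp = num_kv_heads // tp_size
    # Heads per rank must be a positive integer: divisible and non-zero.
    return kv_heads_per_tp >= cp_size and kv_heads_per_tp % cp_size == 0

# 8 KV heads: pure a2a supports at most CP=8.
assert pure_a2a_fits(num_kv_heads=8, cp_size=8)
assert not pure_a2a_fits(num_kv_heads=8, cp_size=16)
```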
When to Use It#
Use HCP when ALL of these are true:
- You need CP larger than `num_kv_heads / TP` (pure a2a won't fit).
- You cannot (or don't want to) increase TP to shrink CP.
- Your cluster has a clear bandwidth hierarchy (e.g., NVLink intra-node >> IB inter-node).
Prefer pure a2a when:
- You can adjust TP so that `CP <= num_kv_heads / TP`. This is simpler, avoids the p2p overhead, and often yields the same throughput with better memory headroom.
Prefer pure p2p when:
- You have very few KV heads or want maximum CP flexibility.
- Your workload can hide the p2p latency behind compute (long sequences help).
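The three rules above can be condensed into a small decision helper. This is a sketch that follows the prose, not a function from Bridge; the thresholds are the ones stated above.

```python
def choose_cp_backend(num_kv_heads, tp_size, cp_size):
    """Pick a CP communication backend following the guidance above."""
    kv_heads_per_tp = num_kv_heads // tp_size
    if kv_heads_per_tp >= cp_size and kv_heads_per_tp % cp_size == 0:
        return "a2a"       # simplest path: heads cover the whole CP group
    if kv_heads_per_tp > 1:
        return "a2a+p2p"   # hierarchical: a2a inner group, p2p outer group
    return "p2p"           # very few KV heads: ring p2p only

print(choose_cp_backend(num_kv_heads=8, tp_size=2, cp_size=4))   # a2a
print(choose_cp_backend(num_kv_heads=8, tp_size=1, cp_size=16))  # a2a+p2p
```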
Decision example#
Model: 8 KV heads. Cluster: 4 nodes x 8 GPUs. Goal: train 128K sequences.
| Option | TP | CP | Comm type | Notes |
|---|---|---|---|---|
| A | 1 | 16 | a2a+p2p | a2a intra-node (8 GPUs), p2p across 2 node-groups |
| B | 2 | 4 | a2a | CP=4 <= num_kv_heads/TP=4. Simpler. Often same throughput. |
| C | 1 | 16 | p2p | Works but no a2a bandwidth benefit intra-node |
In practice, option B is usually preferred: benchmarks showed identical throughput to option A with more memory headroom.
HCP should be treated as an advanced feature rather than a default recommendation.
Stable Bridge Limitation#
The most important Bridge-specific limitation is that hierarchical context parallelism is currently supported only on the MPU initialization path.
In practice, that means:
- `dist.use_decentralized_pg=False` is the supported Bridge path
- the decentralized process-group path should not be assumed to materialize HCP groups
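As a hedged illustration, a Bridge-side configuration for option-A-style HCP might look like the fragment below. Only `dist.use_decentralized_pg` and `hierarchical_context_parallel_sizes` / `context_parallel_size` come from this page; the surrounding structure is a placeholder, not verified Bridge API.

```python
# Hypothetical config fragment; the dict shape is illustrative only.
config = {
    # MPU initialization path: the supported route for HCP in Bridge.
    "dist": {"use_decentralized_pg": False},
    "context_parallel_size": 16,
    # Inner a2a group of 8, outer p2p group of 2: 8 * 2 == 16.
    "hierarchical_context_parallel_sizes": [8, 2],
}
```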
Stable Constraints#
The durable constraints are:
- `hierarchical_context_parallel_sizes` must match `context_parallel_size` multiplicatively
- the usual CP sequence-length divisibility rules still apply
- Transformer Engine version support matters for `a2a+p2p`
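These constraints are cheap to validate before launch. A minimal sketch follows; the specific sequence-length rule used here (divisible by 2 × CP, a common load-balancing requirement for context parallelism) is an assumption, so check the rules for your actual software versions.

```python
import math

def validate_hcp(hcp_sizes, cp_size, seq_len):
    """Check the durable HCP constraints described above."""
    # Multiplicative consistency: product of group sizes == CP size.
    assert math.prod(hcp_sizes) == cp_size, (
        "hierarchical_context_parallel_sizes must multiply to "
        "context_parallel_size")
    # Assumed divisibility rule: sequence split evenly with load balancing.
    assert seq_len % (2 * cp_size) == 0, (
        "sequence length must satisfy CP divisibility rules")

# 128K sequence, CP=16 factored as 8 (a2a) x 2 (p2p): passes.
validate_hcp([8, 2], cp_size=16, seq_len=131072)
```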
Recommendation Level#
Use hierarchical context parallelism in Bridge only when you intentionally want that transport path and are prepared to validate execution-path details. It is not yet the kind of feature that should be presented as universally safe across all Bridge initialization modes.