Hybrid / Hierarchical Context Parallel#

This page covers the stable Bridge-facing meaning of hierarchical context parallelism, especially the a2a+p2p transport path and hierarchical_context_parallel_sizes.

For operational setup, code anchors, and verification commands, see skills/perf-techniques/hybrid-context-parallel/SKILL.md.

What It Is#

Context parallelism (CP) splits the input sequence across GPUs so each rank processes a chunk. The GPUs must communicate KV data during attention. There are several CP communication backends:

`cp_comm_type`	Mechanism	Async / Overlap	Constraint
`"p2p"`	Ring-exchange of KV chunks	Yes	None
`"all_gather"`	All-gather full KV before attention	No	None
`"a2a"`	All-to-all: scatter heads, gather full sequence (Ulysses-style)	N/A	CP <= num_kv_heads
`"a2a+p2p"`	Hierarchical: a2a within inner group, p2p across outer group	Partial (p2p part)	Requires `hierarchical_context_parallel_sizes`

HCP (a2a+p2p) exists to scale CP beyond the KV head count by combining a2a (fast, head-parallel) on intra-node links with p2p (async, sequence-parallel) on inter-node links.

It is important to separate this from the upstream boolean hybrid_context_parallel, which is a different feature for balancing packed or variable-length workloads. The two concepts should not be treated as interchangeable.

Why a2a is limited by KV heads#

a2a transposes the parallelism dimension: each rank trades its sequence chunk for a subset of attention heads. After the all-to-all, every rank has the full sequence but only heads / CP heads. This means:

heads / CP must be a positive integer.
The bottleneck is KV heads (not Q heads), because in GQA the KV heads are the indivisible unit.
If the model has 8 KV heads, pure a2a supports at most CP=8.

HCP breaks this limit by applying a2a only within a sub-group small enough to fit within the KV head count.

When to Use It#

Use HCP when ALL of these are true:

You need CP larger than num_kv_heads / TP (pure a2a won’t fit).
You cannot (or don’t want to) increase TP to shrink CP.
Your cluster has a clear bandwidth hierarchy (e.g., NVLink intra-node >> IB inter-node).

Prefer pure a2a when:

You can adjust TP so that CP <= num_kv_heads / TP. This is simpler, avoids the p2p overhead, and often yields the same throughput with better memory headroom.

Prefer pure p2p when:

You have very few KV heads or want maximum CP flexibility.
Your workload can hide the p2p latency behind compute (long sequences help).

Decision example#

Model: 8 KV heads. Cluster: 4 nodes x 8 GPUs. Goal: train 128K sequences.

Option	TP	CP	`cp_comm_type`	Notes
A	1	16	`a2a+p2p` with `[8,2]`	a2a intra-node (8 GPUs), p2p across 2 node-groups
B	2	4	`a2a`	CP=4 <= 8 KV heads. Simpler. Often same throughput.
C	1	16	`p2p`	Works but no a2a bandwidth benefit intra-node

In practice, option B is usually preferred – benchmarks showed identical throughput to option A with more memory headroom.

It should be treated as an advanced feature rather than a default recommendation.

Stable Bridge Limitation#

The most important Bridge-specific limitation is that hierarchical context parallelism is currently supported only on the MPU initialization path.

In practice, that means:

dist.use_decentralized_pg=False is the supported Bridge path
the decentralized process-group path should not be assumed to materialize HCP groups

Stable Constraints#

The durable constraints are:

hierarchical_context_parallel_sizes must match context_parallel_size multiplicatively
the usual CP sequence-length divisibility rules still apply
Transformer Engine version support matters for a2a+p2p

Recommendation Level#

Use hierarchical context parallelism in Bridge only when you intentionally want that transport path and are prepared to validate execution-path details. It is not yet the kind of feature that should be presented as universally safe across all Bridge initialization modes.