Parallelism Strategy Selection Skill#
For stable background on each parallelism type, see:
docs/parallelisms.md
card.yaml (co-located)
Decision by Model Size#
Dense models#
| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
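The dense-model table can be read as a simple lookup. The sketch below encodes it as data; the function name `suggest_dense_layout` is hypothetical, and the handling of sizes that sit exactly on a row boundary (they go to the larger row) is an assumption, since the table's ranges overlap at their edges.

```python
# Rows copied from the dense-model table: (upper bound in billions of
# parameters, suggested GPU range, suggested starting layout).
DENSE_BUCKETS = [
    (1, "1-8", "DP only"),
    (10, "8-16", "TP=2-4 + DP"),
    (70, "16-64", "TP=4-8 + PP=2-4 + DP"),
    (175, "64-256", "TP=8 + PP=4-8 + DP"),
    (500, "256-1024", "TP=8 + PP=8-16 + CP=2 + DP"),
]


def suggest_dense_layout(params_billion: float) -> tuple[str, str]:
    """Return (gpu_range, layout) for a dense model of the given size."""
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    for upper, gpus, layout in DENSE_BUCKETS:
        # Assumption: a size exactly on a boundary goes to the larger row.
        if params_billion < upper:
            return gpus, layout
    last_upper, gpus, layout = DENSE_BUCKETS[-1]
    if params_billion == last_upper:
        # The final row is treated as inclusive of its upper bound.
        return gpus, layout
    raise ValueError("model size is outside the range covered by the table")


if __name__ == "__main__":
    print(suggest_dense_layout(7))
    print(suggest_dense_layout(70))
```

The result is only a starting point; profiling the first iteration is still needed to confirm memory and communication behavior.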
MoE models#
MoE parallelism differs from dense models. Because only a fraction of parameters are active per token, TP can often stay at 1 or 2 — the active parameter shard already fits on a single GPU. EP is the primary scaling dimension, with PP handling cross-node layer distribution.
| Model (total / active) | TP | PP | EP | Notes |
|---|---|---|---|---|
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |
Key patterns:
TP is sized by active params, not total params. A 671B MoE with 37B active needs far less TP than a 70B dense model.
EP scales with expert count. Common: EP = num_experts or num_experts / experts_per_gpu.
PP handles depth. Large MoE models use PP=8-16 across nodes.
ETP (expert tensor parallelism) is rarely used. Llama 4 is an exception (ETP=4).
These are starting points, not hard rules. Always profile the first iteration to verify memory and communication.
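The EP sizing pattern above (EP = num_experts / experts_per_gpu) is plain integer arithmetic. A minimal sketch follows; the helper name `expert_parallel_size` is hypothetical, and the requirement that experts divide evenly across GPUs is an assumption of this sketch.

```python
def expert_parallel_size(num_experts: int, experts_per_gpu: int = 1) -> int:
    """Return EP such that each EP rank holds `experts_per_gpu` experts."""
    if num_experts <= 0 or experts_per_gpu <= 0:
        raise ValueError("num_experts and experts_per_gpu must be positive")
    if num_experts % experts_per_gpu != 0:
        # Assumption of this sketch: experts are spread evenly over EP ranks.
        raise ValueError("num_experts must be divisible by experts_per_gpu")
    return num_experts // experts_per_gpu


if __name__ == "__main__":
    # One expert per GPU: EP equals the expert count.
    print(expert_parallel_size(8))
    # 256 experts at 4 experts per GPU gives EP=64.
    print(expert_parallel_size(256, experts_per_gpu=4))
```

Raising `experts_per_gpu` lowers EP at the cost of more expert memory per GPU, so the choice interacts with the PP and TP settings in the table.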
Decision by Hardware Topology#
Single node with NVLink:
cfg.model.tensor_model_parallel_size = 8
Multiple nodes with InfiniBand:
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = N
Limited network (Ethernet):
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = M
The stable rule is: keep TP within a single NVLink domain. Use PP or DP for cross-node scaling. TP across nodes is almost always a performance loss.
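The rule can be turned into a pre-launch check. This is a sketch only: `tp_fits_nvlink_domain` is a hypothetical helper, and the default of 8 GPUs per NVLink domain is an assumption that should be replaced with the actual node or NVLink-switch size.

```python
def tp_fits_nvlink_domain(tp: int, gpus_per_nvlink_domain: int = 8) -> bool:
    """Return True if every TP group can be placed inside one NVLink domain."""
    if tp < 1 or gpus_per_nvlink_domain < 1:
        raise ValueError("sizes must be positive")
    # TP must not exceed the domain, and must divide it evenly so that no
    # TP group straddles two domains.
    return tp <= gpus_per_nvlink_domain and gpus_per_nvlink_domain % tp == 0


if __name__ == "__main__":
    for tp in (2, 4, 8, 16):
        print(tp, tp_fits_nvlink_domain(tp))
```

When the check fails, the rule suggests lowering TP and moving the remaining scale into PP or DP.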
Decision by Sequence Length#
| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP ( |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider |
Combined Parallelism Enablement#
3D parallelism (TP + PP + DP):
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True
4D parallelism (TP + PP + CP + DP):
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True
MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):
cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False
MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B on 1024 GPUs; PP=16 with EP=64 needs at least 16 * 64 = 1024 ranks):
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True
DP size is always implicit:
data_parallel_size = world_size / (TP * PP * CP)
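The implicit DP size can be computed, and the divisibility requirement checked, before launching a job. A minimal sketch, with a hypothetical helper name and short argument names standing in for the `*_model_parallel_size` fields:

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Compute DP = world_size / (TP * PP * CP), rejecting layouts that do not divide."""
    model_parallel = tp * pp * cp
    if model_parallel < 1 or world_size < 1:
        raise ValueError("all sizes must be positive")
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # 3D example above on 64 GPUs: TP=4, PP=4 leaves DP=4.
    print(data_parallel_size(64, tp=4, pp=4))
    # DeepSeek-V2 example above on 128 GPUs: TP=1, PP=4 leaves DP=32.
    print(data_parallel_size(128, tp=1, pp=4))
```

EP does not appear in the formula, which matches the expression given above.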
Memory Estimation#
Without parallelism (70B model, FP16):
parameters: 140 GB
gradients: 140 GB
optimizer states: 280 GB (Adam)
activations: 48 GB (batch=1, seq=4K)
total: 608 GB
With TP=4, PP=4, DP=4 (64 GPUs):
parameters: 8.75 GB per GPU
gradients: 8.75 GB per GPU
optimizer states: 17.50 GB per GPU
activations: 3.00 GB per GPU
total: ~38 GB per GPU
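The figures above can be reproduced with a few lines of arithmetic. The sketch below makes the estimate's implied assumptions explicit: gradients take the same space as parameters, Adam state takes twice the parameter space, every component is split evenly across TP * PP, and DP replicates everything. The function name and the `activations_gb` input are hypothetical; the 48 GB activation figure is taken from the estimate above rather than derived.

```python
GB = 1e9  # decimal gigabytes, matching the 70B * 2 bytes = 140 GB figure


def estimate_memory_gb(num_params: float, tp: int = 1, pp: int = 1,
                       activations_gb: float = 48.0, bytes_per_param: int = 2) -> dict:
    """Rough per-GPU memory estimate in GB under the assumptions stated above."""
    shards = tp * pp  # DP replicates the model, so it does not reduce memory here
    parameters = num_params * bytes_per_param / GB / shards
    gradients = parameters          # same size as the parameters
    optimizer = 2 * parameters      # two Adam states, same precision (assumption)
    activations = activations_gb / shards
    total = parameters + gradients + optimizer + activations
    return {
        "parameters": parameters,
        "gradients": gradients,
        "optimizer": optimizer,
        "activations": activations,
        "total": total,
    }


if __name__ == "__main__":
    print(estimate_memory_gb(70e9))              # no parallelism
    print(estimate_memory_gb(70e9, tp=4, pp=4))  # TP=4, PP=4
```

Real footprints depend on optimizer precision, recomputation, and sharding choices that this sketch ignores, so treat the output as a first-pass bound.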
Code Anchors#
Parallelism dimensions set in model provider:
model_config = GPTModelProvider(
    tensor_model_parallel_size=2,
    # ... other model parameters
)
DP size calculation:
data_parallel_size = world_size / (tensor_model_parallel_size × pipeline_model_parallel_size × context_parallel_size)
Bridge initialization wires parallelism into process groups:
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=model_config.tensor_model_parallel_size,
    pipeline_model_parallel_size=model_config.pipeline_model_parallel_size,
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    expert_model_parallel_size=model_config.expert_model_parallel_size,
    ...
)
Pitfalls#
TP across nodes destroys throughput. Always keep TP within a single NVLink domain.
PP without interleaving has large pipeline bubbles. Use virtual_pipeline_model_parallel_size when possible.
SP requires tensor_model_parallel_size > 1. Enabling SP alone without TP is a config error.
CP requires seq_length % (2 * context_parallel_size) == 0.
EP is only for MoE models. Setting expert_model_parallel_size on a dense model is a no-op or error.
The model-size-to-parallelism table above is a starting heuristic. Always profile the first iteration to check memory and communication.
CUDA_DEVICE_MAX_CONNECTIONS and related env vars interact with overlap settings. See skills/perf-techniques/tp-dp-comm-overlap/SKILL.md.
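Several of these pitfalls are mechanical enough to check before a job is submitted. The sketch below is illustrative only: `ParallelConfig` and `find_pitfalls` are hypothetical names, not part of the library, and the `is_moe` and `gpus_per_node` fields are assumptions added so the checks are self-contained.

```python
from dataclasses import dataclass


@dataclass
class ParallelConfig:
    """Hypothetical stand-in for the parallelism fields used in this document."""
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    context_parallel_size: int = 1
    expert_model_parallel_size: int = 1
    sequence_parallel: bool = False
    seq_length: int = 4096
    is_moe: bool = False       # assumption: caller knows whether the model is MoE
    gpus_per_node: int = 8     # assumption: one NVLink domain per 8-GPU node


def find_pitfalls(cfg: ParallelConfig) -> list:
    """Return human-readable warnings for the pitfalls listed above."""
    issues = []
    if cfg.tensor_model_parallel_size > cfg.gpus_per_node:
        issues.append("TP exceeds one NVLink domain and would span nodes")
    if cfg.sequence_parallel and cfg.tensor_model_parallel_size <= 1:
        issues.append("SP requires tensor_model_parallel_size > 1")
    if cfg.context_parallel_size > 1 and cfg.seq_length % (2 * cfg.context_parallel_size) != 0:
        issues.append("CP requires seq_length % (2 * context_parallel_size) == 0")
    if cfg.expert_model_parallel_size > 1 and not cfg.is_moe:
        issues.append("EP is set on a dense model")
    return issues


if __name__ == "__main__":
    bad = ParallelConfig(sequence_parallel=True, context_parallel_size=4, seq_length=4100)
    for issue in find_pitfalls(bad):
        print("WARNING:", issue)
```

An empty list does not mean the layout is efficient; pipeline bubbles and overlap settings still need to be checked by profiling.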
Verification#
As a quick sanity check that combined parallelism initializes correctly, run the smallest available recipe with overridden parallelism settings:
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
scripts/training/run_recipe.py \
--recipe llama32_1b_pretrain_config \
model.tensor_model_parallel_size=2 \
model.pipeline_model_parallel_size=2 \
model.sequence_parallel=True \
train.train_iters=3 train.global_batch_size=8 train.micro_batch_size=1 \
scheduler.lr_warmup_iters=0 \
validation.eval_iters=0 validation.eval_interval=0 \
checkpoint.save_interval=0 \
logger.log_interval=1
Success criteria:
exit code 0
finite loss at iteration 3 (e.g. lm loss: 1.003808E+01)
log shows TP=2 PP=2 DP=1 layout with 4 ranks
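The finite-loss criterion can be checked automatically from captured output. This is a sketch under one assumption: that the loss appears in the log as `lm loss: <number>`, as in the example value above. The surrounding log layout in the usage example is made up for illustration and is not copied from real output.

```python
import math
import re
from typing import Optional

# Captures the token that follows "lm loss:", e.g. "1.003808E+01" or "nan".
_LM_LOSS = re.compile(r"lm loss:\s*(\S+)")


def last_lm_loss(log_text: str) -> Optional[float]:
    """Return the last reported lm loss, or None if none can be parsed."""
    matches = _LM_LOSS.findall(log_text)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except ValueError:
        return None


def loss_is_finite(log_text: str) -> bool:
    """True only if a loss value was found and it is neither NaN nor infinite."""
    loss = last_lm_loss(log_text)
    return loss is not None and math.isfinite(loss)


if __name__ == "__main__":
    sample = "iteration 3 | lm loss: 1.003808E+01 |"  # hypothetical log line
    print(last_lm_loss(sample), loss_is_finite(sample))
```

Combined with the process exit code, this covers the first two success criteria; the TP/PP/DP layout line still needs a visual check or a separate pattern once its exact wording is known.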