# Memory Tuning
Stable docs: @docs/parallelisms.md · Card: @skills/perf-memory-tuning/card.yaml
## What It Is
GPU OOM failures during training often stem from memory fragmentation rather than raw capacity. PyTorch’s default CUDA allocator can leave unusable gaps between allocations. The single most effective fix is:
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
This tells PyTorch to use expandable (non-fixed-size) memory segments, which dramatically reduces fragmentation and often eliminates borderline OOM without any model or parallelism changes.
Beyond fragmentation, actual peak memory is determined by three components (a rough sizing sketch follows the list):

- Parameter + optimizer state memory: controlled by TP, PP, DP sharding (distributed optimizer, FSDP)
- Activation memory: controlled by activation recompute, sequence length, micro-batch size
- Temporary / workspace memory: CUDA kernels, NCCL buffers, CUDA graphs
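For the first component, a back-of-envelope estimate helps decide whether an OOM is plausibly fragmentation or genuine capacity. The sketch below is an assumption-laden approximation (~2 B/param weights, Adam-style ~12 B/param optimizer state); `param_and_optim_gib` is a hypothetical helper, not a library API:

```python
def param_and_optim_gib(n_params_b: float, tp: int, pp: int, dp: int,
                        bytes_per_param: int = 2,
                        optim_bytes_per_param: int = 12,
                        distributed_optimizer: bool = True) -> float:
    """Rough per-GPU parameter + optimizer-state memory in GiB.

    Assumes ~2 B/param weights and ~12 B/param Adam-style state
    (fp32 master weights + two fp32 moments). Illustrative only.
    """
    shard = n_params_b * 1e9 / (tp * pp)  # TP and PP both shard parameters
    optim = shard * optim_bytes_per_param
    if distributed_optimizer:
        optim /= dp                       # optimizer state sharded across DP ranks
    return (shard * bytes_per_param + optim) / 2**30

# Baseline from Measured Results below: Llama3 70B at TP=4, PP=4, DP=2
print(f"{param_and_optim_gib(70, tp=4, pp=4, dp=2):.1f} GiB")  # ~32.6 GiB
```

Activation and workspace memory come on top of this, which is why the baseline below peaks near 59 GB.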
## Quick Decision
When a training run OOMs or is close to the memory limit:
1. Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first. This fixes fragmentation-induced OOM with zero performance cost. Most Slurm launch templates already include it.
2. Add selective activation recompute (`recompute_modules=[core_attn]`) if not already enabled; see the config sketch after this list and @skills/perf-activation-recompute/SKILL.md.
3. Avoid increasing TP as a memory fix: doubling TP dramatically increases NVLink all-reduce volume and often kills throughput (-28% on Llama3 70B).
4. Avoid increasing PP at the cost of DP: halving DP doubles gradient accumulation steps and hurts throughput (~6%).
5. Consider `mlp` recompute if still OOM. It saves ~3 GB but costs ~16% GPU utilization on large dense models (Llama3 70B).
6. CPU offloading is blocked when PP > 1.
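A minimal sketch of step 2, assuming the recipe's config object `cfg` (the same namespace used in the CPU offloading example below); the exact field names are assumptions to verify against @skills/perf-activation-recompute/SKILL.md:

```python
# Assumed field names; confirm against the recompute skill card.
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn"]  # recompute only core attention
```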
## Enablement
### Expandable segments (recommended first step)
Set in the job’s environment before launching:
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
In Slurm scripts this is typically placed alongside other env vars:
```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
No model config changes needed. Zero throughput cost.
### Parallelism resizing
If the model genuinely does not fit (not fragmentation), adjust parallelism:
| Strategy | Memory effect | Throughput cost | Notes |
|---|---|---|---|
| Increase PP (keeping DP) | Fewer layers per stage | Moderate (~6% if DP halved) | Only if GPU count allows |
| Increase TP | Fewer params per GPU | Severe (-28% on 70B) | Last resort |
| Distributed optimizer | Shards optimizer state across DP ranks | ~1-2% | Recommended for large models |
| FSDP | Shards params + grads + optimizer | Varies | See @skills/perf-megatron-fsdp/SKILL.md |
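A hedged sketch of applying these knobs, reusing the `GPTModelProvider` style shown in the Code Anchors section; `optimizer_config` and the exact field names are assumptions to verify against your recipe:

```python
# Sketch only: prefer more PP over more TP, and shard optimizer state.
model_config = GPTModelProvider(
    tensor_model_parallel_size=4,    # keep TP; raising it cost -28% on 70B
    pipeline_model_parallel_size=8,  # more PP: fewer layers per stage
    # ... other model parameters
)
optimizer_config.use_distributed_optimizer = True  # ~1-2% cost, large memory win
```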
### Activation recompute
See @skills/perf-activation-recompute/SKILL.md for full details.
### CPU offloading
```python
cfg.model.cpu_offloading = True
```
Incompatible with PP > 1. Only usable when `pipeline_model_parallel_size = 1`.
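Because of this constraint, a defensive sketch (same assumed `cfg` namespace as above) guards the flag rather than setting it unconditionally:

```python
# Only enable CPU offloading when PP == 1; MCore raises otherwise
# (see the Code Anchors section below).
if cfg.model.pipeline_model_parallel_size == 1:
    cfg.model.cpu_offloading = True
```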
## A Note on VPP
Virtual pipeline parallelism (VPP) is primarily a throughput optimization that reduces pipeline bubble overhead by interleaving smaller model chunks. Its effect on peak memory is minimal — changing VPP does not meaningfully change the total activation, parameter, or optimizer memory on a GPU.
In earlier experiments we incorrectly attributed an OOM fix to VPP tuning (VPP 5→10). The actual fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, which eliminated memory fragmentation. The VPP=10 run actually used slightly more peak memory (60.2 GB vs 58.8 GB) but did not OOM because expandable segments prevented fragmentation.
VPP should be tuned for pipeline bubble reduction (see @docs/parallelisms.md), not as a memory fix.
## Compatibility and Constraints
- `expandable_segments:True` is incompatible with `--use-nccl-ub` (NCCL user-buffer registration). See Megatron-FSDP docs.
- When using CUDA graphs with `expandable_segments:True`, set `NCCL_GRAPH_REGISTER=0` (required on pre-Blackwell GPUs, enforced by `MCoreCudaGraphManager`).
- CPU offloading requires `pipeline_model_parallel_size = 1`.
- Distributed optimizer requires `use_distributed_optimizer = True` in the optimizer config.
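A small, hedged sanity check for the CUDA-graphs constraint; `using_cuda_graphs` is a hypothetical flag standing in for your job's actual setting:

```python
import os

using_cuda_graphs = True  # hypothetical: set from your job config
alloc_conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
if using_cuda_graphs and "expandable_segments:True" in alloc_conf:
    # Required on pre-Blackwell GPUs; enforced by MCoreCudaGraphManager.
    assert os.environ.get("NCCL_GRAPH_REGISTER") == "0", (
        "set NCCL_GRAPH_REGISTER=0 when combining CUDA graphs with "
        "expandable_segments:True"
    )
```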
## Measured Results
Llama3 70B SFT on 32x H100 80GB, FP8 (Current Scaling):

- Baseline: TP=4, PP=4, VPP=5, DP=2, MBS=1, GBS=32, seq_len=4096
- Golden GPU utilization: 709.93 TFLOP/s/GPU
- Regression threshold: 5%
### Strategy comparison: parallelism changes for memory reduction
| Experiment | TP | PP | VPP | DP | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|---|---|---|
| Baseline | 4 | 4 | 5 | 2 | ~704 | -0.8% | 58.8 | OOM (fragmentation) |
| More PP | 4 | 8 | 5 | 1 | 668.0 | -5.9% | 53.2 | Borderline perf |
| More TP | 8 | 4 | 5 | 1 | 508.7 | -28.4% | 50.2 | Severe regression |
| Baseline + expandable_segments | 4 | 4 | 5 | 2 | ~704 | -0.8% | ~59 | Passed |
Key takeaways:

- `expandable_segments:True` is the winner. The baseline OOM was caused by memory fragmentation, not insufficient capacity. Setting this env var eliminated the OOM with zero throughput cost and no parallelism changes.
- PP=8 works for memory but loses DP (2→1), meaning 32 gradient accumulation steps per batch, which hurts throughput by ~6% (see the arithmetic check below).
- TP=8 is catastrophic (-28%) because doubling TP increases all-reduce communication volume proportionally across NVLink, and DP=1 means no micro-batch overlap.
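As a quick arithmetic check on the PP=8 row: gradient-accumulation steps per batch = GBS / (DP × MBS), so the baseline runs 32 / (2 × 1) = 16 steps while DP=1 forces 32 / (1 × 1) = 32 steps, which matches the ~6% throughput cost noted above.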
### CPU offloading: blocked
| Experiment | offload_layers | Result |
|---|---|---|
| Exp 4 | 2 | Incompatible (PP > 1) |
| Exp 5 | 4 | Incompatible (PP > 1) |
| Exp 6 | 6 | Incompatible (PP > 1) |
All three runs fail with `ValueError: Currently there is no support for Pipeline parallelism with CPU offloading`. This approach is blocked for any model using PP > 1.
### Activation recompute: expensive alternative
Selective activation recompute with `mlp` saved ~3 GB peak memory but cost ~16% GPU utilization on this workload. See @skills/perf-activation-recompute/SKILL.md for full results.
## Code Anchors
### CPU offloading PP incompatibility (MCore)
```python
if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )
```
### VPP config and layer divisibility validation (MCore)
```python
if pipeline_parallel_size and self.virtual_pipeline_model_parallel_size is not None:
    num_layers_per_middle_pipeline_rank = num_layers // pipeline_parallel_size
    if (
        not num_layers_per_middle_pipeline_rank
        % self.virtual_pipeline_model_parallel_size
        == 0
    ):
        raise ValueError(
            f"number of layers on each middle pipeline rank:"
            f"{num_layers_per_middle_pipeline_rank} must be divisible by virtual"
            f"pipeline parallel degree {self.virtual_pipeline_model_parallel_size}"
        )
```
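A worked instance of this check, assuming Llama3 70B's 80 transformer layers (an assumption about the measured workload, not stated above):

```python
# 80 layers is an assumption for Llama3 70B; adjust for your model.
num_layers, pp, vpp = 80, 4, 5
layers_per_middle_rank = num_layers // pp  # 20 layers per pipeline rank
assert layers_per_middle_rank % vpp == 0   # 20 % 5 == 0: VPP=5 validates (so would VPP=10)
```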
### Parallelism docs on interleaved pipeline schedule
To minimize the pipeline bubble, the computation on each GPU can be divided into multiple subsets of layers (referred to as model chunks), rather than a single contiguous block. Enable this by setting `virtual_pipeline_model_parallel_size`:
```python
model_config = GPTModelProvider(
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=2,  # 2 model chunks per pipeline stage
    # ... other model parameters
)
```
## Failure Diagnosis
| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| OOM on a single rank despite headroom on others | Memory fragmentation | check if `PYTORCH_CUDA_ALLOC_CONF` omits `expandable_segments:True` | set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| OOM with `expandable_segments:True` already set | Genuine capacity limit | check peak allocated vs. reserved memory (diagnostic sketch below) | increase PP, use distributed optimizer, or add recompute |
| `ValueError` mentioning CPU offloading | using `cpu_offloading` with PP > 1 | check PP config | disable CPU offloading or set PP=1 |
| Error with `--use-nccl-ub` and expandable segments both enabled | NCCL UB incompatible with expandable allocator | check env vars | remove `--use-nccl-ub` or disable expandable segments |
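To separate the first two rows (fragmentation vs. genuine capacity), PyTorch's allocator counters are usually enough; a large gap between reserved and allocated memory at the OOM point suggests fragmentation:

```python
import torch

# Peak values since program start (or since reset_peak_memory_stats()).
allocated_gib = torch.cuda.max_memory_allocated() / 2**30
reserved_gib = torch.cuda.max_memory_reserved() / 2**30
print(f"peak allocated: {allocated_gib:.1f} GiB, peak reserved: {reserved_gib:.1f} GiB")
# reserved >> allocated points to fragmentation: try expandable_segments:True first.
```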
## Known Limitations
- CPU offloading is blocked when PP > 1
- Parallelism resizing (TP/PP) often has significant throughput costs
- No automatic memory profiling to recommend the optimal strategy
## Verification
Quick check that `expandable_segments:True` is active:
```python
import os

assert "expandable_segments:True" in os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
```
For Slurm jobs, verify the env var is exported before the training command in the launch script.