CPU Offloading#
References#
Stable docs: `docs/training/cpu-offloading.md`
Structured metadata: `skills/perf-techniques/cpu-offloading/card.yaml`
What It Is#
Two independent mechanisms to move data from GPU to CPU memory:
| Mechanism | Config namespace | What gets offloaded | PP restriction |
|---|---|---|---|
| Activation offloading | `cfg.model` | Activations (and optionally weights) per transformer layer | PP must be 1 |
| Optimizer offloading | `cfg.optimizer` | Adam optimizer states (momentum + variance) via `HybridDeviceOptimizer` | None |
Quick Decision#
| Situation | Recommendation |
|---|---|
| Large MoE model (30B+), needs PP > 1 | Optimizer offloading — activation offloading is blocked by PP=1 |
| Small/medium model, PP=1 fits, activation memory dominates | Activation offloading |
| Want tunable memory-speed tradeoff | Optimizer offloading with a fractional `optimizer_offload_fraction` |
| Throughput is top priority | Don't enable either — offloading always adds overhead |
| CUDA graphs are needed | Only optimizer offloading — activation offloading is incompatible |
| Memory pressure is moderate | Optimizer offload at 25–50% fraction for best efficiency |
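The decision table can be condensed into a small helper. This is an illustrative sketch only — the function name and signature are ours, not part of any config API:

```python
def choose_offload_mode(pp_size: int, needs_cuda_graphs: bool,
                        throughput_first: bool) -> str:
    """Map the quick-decision table onto a single answer.

    Hypothetical helper, not part of the training config API.
    """
    if throughput_first:
        # Offloading always adds overhead, so skip it entirely.
        return "none"
    if pp_size > 1 or needs_cuda_graphs:
        # Activation offloading requires PP=1 and no CUDA graphs,
        # so optimizer offloading is the only remaining option.
        return "optimizer"
    return "activation"
```

For example, a large MoE run with PP=4 resolves to `"optimizer"`.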
Enablement#
Optimizer CPU offloading (recommended for large models)#
```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 1.0
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True
```

CLI overrides:

```shell
optimizer.optimizer_cpu_offload=True \
optimizer.optimizer_offload_fraction=0.5 \
optimizer.overlap_cpu_optimizer_d2h_h2d=True
```

Activation CPU offloading (small/medium models only)#

```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False
cfg.model.pipeline_model_parallel_size = 1  # required
cfg.model.recompute_granularity = None      # required
cfg.model.cuda_graph_impl = "none"          # required
```
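Sizing `cpu_offloading_num_layers` is a tradeoff between freed memory and reload traffic. A hypothetical sizing helper (ours, not part of the config API; the per-layer activation footprint must be measured for your model, e.g. from `torch.cuda.memory_allocated` deltas):

```python
import math

def layers_to_offload(gb_to_free: float, gb_per_layer: float,
                      num_layers: int) -> int:
    """Pick cpu_offloading_num_layers for a desired memory saving.

    gb_per_layer is an assumed, measured activation footprint per
    transformer layer. The only hard constraint enforced here is the
    MCore range check: 0 <= result <= num_layers - 1.
    """
    wanted = math.ceil(gb_to_free / gb_per_layer)
    return max(0, min(wanted, num_layers - 1))
```

For instance, freeing 8 GB at 0.5 GB/layer on a 48-layer model suggests offloading 16 layers, matching the example config above.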
Config Parameter Reference#
Optimizer offloading#
| Parameter | Default | Description |
|---|---|---|
| `optimizer_cpu_offload` | `False` | Master switch |
| `optimizer_offload_fraction` | `0.0` | Fraction of optimizer states on CPU (0.0–1.0) |
| `overlap_cpu_optimizer_d2h_h2d` | `False` | Overlap GPU↔CPU transfers with compute |
| `use_torch_optimizer_for_cpu_offload` | `False` | Use a torch-native optimizer for the CPU-resident states |
Activation offloading#
| Parameter | Default | Description |
|---|---|---|
| `cpu_offloading` | `False` | Master switch |
| `cpu_offloading_num_layers` | `0` | Number of transformer layers to offload (0 to num_layers-1) |
| `cpu_offloading_activations` | `True` | Offload activations |
| `cpu_offloading_weights` | `False` | Offload weights |
| `cpu_offloading_double_buffering` | `False` | Double-buffer across layers while reloading |
Compatibility And Constraints#
Activation offloading#
- `pipeline_model_parallel_size` must be 1
- `recompute_granularity` must be `None`
- Cannot combine with `fine_grained_activation_offloading`
- Cannot combine with CUDA graphs
- `cpu_offloading_num_layers` must be in `[0, num_layers - 1]`
Optimizer offloading#
- Requires `use_distributed_optimizer = True` (the default in most recipes)
- No PP, recompute, or CUDA graph restrictions
- `optimizer_offload_fraction` must be in `[0.0, 1.0]`
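These constraints can be checked before launching a job. A minimal pre-flight sketch — our own helper, mirroring the MCore checks quoted under Code Anchors below, not an existing API:

```python
def preflight_check(cpu_offloading: bool,
                    optimizer_cpu_offload: bool,
                    pipeline_model_parallel_size: int,
                    recompute_granularity,
                    cpu_offloading_num_layers: int,
                    num_layers: int,
                    optimizer_offload_fraction: float) -> None:
    """Raise early with the same conditions MCore enforces at init."""
    if cpu_offloading:
        if pipeline_model_parallel_size > 1:
            raise ValueError("activation offloading requires PP=1")
        if recompute_granularity is not None:
            raise ValueError("activation offloading is incompatible with recompute")
        if not 0 <= cpu_offloading_num_layers < num_layers:
            raise ValueError("cpu_offloading_num_layers out of range")
    if optimizer_cpu_offload and not 0.0 <= optimizer_offload_fraction <= 1.0:
        raise ValueError("optimizer_offload_fraction must be in [0.0, 1.0]")
```

Running this against a candidate config fails fast on the driver instead of deep inside model construction.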
Practical: large MoE models#
Activation offloading is blocked for Qwen3-30B-A3B and similar large MoE models. The PP=1 constraint means each GPU holds all 48 layers; model weights plus optimizer states alone (~70 GB) leave almost none of an H100's 80 GB for activations, so training OOMs before activation offloading can help.
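The back-of-envelope arithmetic behind that claim can be sketched as follows. The ~70 GB figure comes from the text above; the data-parallel size of 32 is an assumption chosen for illustration, not a measured setting:

```python
params = 30e9                   # approximate Qwen3-30B-A3B parameter count
weights_gb = params * 2 / 1e9   # bf16 weights: 2 bytes/param -> 60 GB per GPU at PP=1
# The distributed optimizer shards fp32 master weights plus Adam m/v
# (12 bytes/param total) across data-parallel ranks; dp=32 is assumed.
dp = 32
optimizer_gb = params * 12 / dp / 1e9
total_gb = weights_gb + optimizer_gb   # roughly 71 GB before any activations
```

With PP=1 every GPU carries the full weight copy, so under 10 GB of an 80 GB H100 remains for gradients and activations.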
Minimal Working Config#
Optimizer offload (50%, balanced)#
```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 0.5
```

Optimizer offload (100% + overlap, max savings)#

```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 1.0
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True
```

Activation offload (small model, PP=1)#

```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```

Weight offload only (small model, PP=1)#

```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 8
cfg.model.cpu_offloading_activations = False
cfg.model.cpu_offloading_weights = True
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```

Both activations and weights (small model, PP=1)#

```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 8
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = True
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```
Weight offloading and activation offloading share the same constraints (PP=1, no recompute, no CUDA graphs). Weight offloading has not been tested in the Qwen3-30B-A3B experiments — the measured results cover optimizer offloading only.
Minimal Runnable Command#
```shell
uv run python scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    optimizer.optimizer_cpu_offload=True \
    optimizer.optimizer_offload_fraction=0.5 \
    train.train_iters=20 \
    train.global_batch_size=8 \
    train.micro_batch_size=1
```
Verification#
Unit tests#
```shell
uv run python -m pytest \
    tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py \
    tests/unit_tests/peft/test_utils.py \
    -k "cpu_offload" -q
```
Success criteria#
- Config validation passes for the selected offloading mode
- Training completes without OOM or NCCL errors
- Loss matches the non-offloaded baseline (max delta < 0.001)
- Memory usage drops proportionally to offload fraction
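The loss criterion can be checked mechanically against logged values. A trivial sketch (the helper name is ours; feed it the per-iteration losses from the baseline and offloaded runs):

```python
def losses_match(baseline, offloaded, tol=1e-3):
    """True if every logged loss differs from the baseline by less than tol."""
    return all(abs(a - b) < tol for a, b in zip(baseline, offloaded))
```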
Code Anchors#
MCore activation offload constraints#
```python
if self.cpu_offloading and (
    self.cpu_offloading_num_layers < 0 or self.cpu_offloading_num_layers >= self.num_layers
):
    raise ValueError(...)

if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )

if self.cpu_offloading and self.recompute_granularity is not None:
    raise ValueError(
        "CPU offloading does not work when activation recomputation is enabled"
    )
```

MCore CUDA graph incompatibility#

```python
if self.cpu_offloading:
    raise ValueError("CUDA graphs not supported with CPU offloading.")
```

MCore fine-grained offloading mutual exclusion#

```python
if self.fine_grained_activation_offloading:
    assert (
        not self.cpu_offloading
    ), "fine_grained_activation_offloading cannot be enabled with cpu_offloading."
```

MCore HybridDeviceOptimizer instantiation#

```python
if config.optimizer_cpu_offload:
    # ... setup cpu/gpu optimizer classes ...
    optimizer = HybridDeviceOptimizer(
        param_groups,
        offload_fraction=config.optimizer_offload_fraction,
        cpu_optimizer_cls=cpu_optimizer_cls,
        gpu_optimizer_cls=gpu_optimizer_cls,
        overlap_cpu_optimizer_d2h_h2d=config.overlap_cpu_optimizer_d2h_h2d,
        pin_cpu_grads=config.pin_cpu_grads,
        pin_cpu_params=config.pin_cpu_params,
    )
```

Bridge CUDA graph guard#

```python
assert not config.cpu_offloading and config.recompute_granularity is None, "Cudagraphs not supported"
```

Bridge activation offloading in PEFT#

```python
if self.config.cpu_offloading and self.config.cpu_offloading_activations:
    x.activation_offloading = True
x, _ = self.linear_in(x)
x = self.activation(x)
if self.config.cpu_offloading and self.config.cpu_offloading_activations:
    x.activation_offloading = True
x, _ = self.linear_out(x)
```

MCore model_parallel_config fields#

```python
cpu_offloading: bool = False
cpu_offloading_num_layers: int = 0
cpu_offloading_activations: bool = True
cpu_offloading_weights: bool = False
cpu_offloading_double_buffering: bool = False
cpu_offloading_retain_pinned_cpu_buffers: bool = False
```

MCore optimizer offload config#

```python
optimizer_cpu_offload: bool = False
optimizer_offload_fraction: float = 0.0
use_torch_optimizer_for_cpu_offload: bool = False
overlap_cpu_optimizer_d2h_h2d: bool = False
```
Failure Diagnosis#
| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| `ValueError: ... no support for Pipeline parallelism with CPU offloading` | Activation offload + PP > 1 | Check `pipeline_model_parallel_size` | Set PP=1 or use optimizer offloading |
| `ValueError: CPU offloading does not work when activation recomputation is enabled` | Activation offload + recompute | Check `recompute_granularity` | Set `recompute_granularity = None` |
| `AssertionError: fine_grained_activation_offloading cannot be enabled with cpu_offloading` | Both offloading modes enabled | Check both flags | Use one or the other |
| `ValueError: CUDA graphs not supported with CPU offloading` | CUDA graphs + activation offload | Check `cuda_graph_impl` | Set `cuda_graph_impl = "none"` |
| OOM with activation offloading | Model too large for PP=1 | Check allocated memory vs 80 GB | Use optimizer offloading with PP > 1 |
| Extreme slowdown (>4x) | 100% optimizer offload, CPU Adam bottleneck | Compare iter time at different fractions | Reduce fraction or enable `overlap_cpu_optimizer_d2h_h2d` |
| OOM at partial optimizer offload | Insufficient offload for this config | Check memory at different fractions | Increase fraction or add PP |
Known Limitations#
- Activation offloading requires PP=1, making it impractical for large models (30B+ MoE) that need pipeline parallelism.
- The optimizer offloading throughput penalty scales roughly linearly with the offload fraction (~1.9x at 25%, ~4.2x at 100% for Qwen3-30B-A3B).
- D2H/H2D overlap provides only ~7% speedup because CPU Adam compute is the dominant bottleneck.
- `fine_grained_activation_offloading` is a separate module-level approach that works with PP > 1 but cannot be combined with layer-level `cpu_offloading`.
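Given the roughly linear scaling, the expected slowdown at intermediate fractions can be interpolated from the two measured Qwen3-30B-A3B points. This is a rough estimate, not a measurement; the 1.0x point at fraction 0 is an assumed baseline:

```python
def estimated_slowdown(fraction: float) -> float:
    """Piecewise-linear estimate of the iteration-time multiplier.

    Anchored at the measured points (1.9x at 0.25, 4.2x at 1.0) plus
    an assumed no-offload baseline of 1.0x at fraction 0.
    """
    pts = [(0.0, 1.0), (0.25, 1.9), (1.0, 4.2)]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= fraction <= x1:
            return y0 + (y1 - y0) * (fraction - x0) / (x1 - x0)
    raise ValueError("fraction must be in [0.0, 1.0]")
```

For example, a 50% offload lands around 2.7x, which is why the 25–50% range is recommended when memory pressure is only moderate.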