CPU Offloading#

CPU offloading reduces per-GPU memory by moving data to host (CPU) memory during training, trading throughput for the ability to train models or configurations that would otherwise not fit in GPU memory.

For operational setup, code anchors, and verification commands, see skills/nemo-mbridge-perf-cpu-offloading/SKILL.md.

What It Is#

Megatron Bridge supports two independent CPU offloading mechanisms:

Mechanism	What gets offloaded	Implementation
Activation offloading	Activations (and optionally weights) per transformer layer	MCore `cpu_offloading_context` in transformer block
Optimizer offloading	Optimizer states (Adam momentum + variance)	MCore `HybridDeviceOptimizer` with configurable GPU/CPU split

Activation offloading moves layer activations to CPU during forward and reloads them during backward. Optimizer offloading keeps a configurable fraction of Adam optimizer states on CPU and runs the optimizer step there.

These are independent features addressing different memory pools. They can be used separately but not always together due to constraint conflicts.

What Problem It Solves#

Large models, especially MoE architectures, can exhaust GPU memory even with standard parallelism techniques (TP, PP, EP). The two offloading mechanisms target different bottlenecks:

Activation offloading helps when activation memory dominates — common with long sequences, large batch sizes, or when recomputation is disabled.
Optimizer offloading helps when optimizer state memory dominates — Adam keeps two state tensors (momentum + variance) per parameter, doubling the parameter memory footprint. For a 30B MoE model this can be 15+ GB per GPU.

Impacted Training Dimensions#

Dimension	Effect	Confidence	Rationale
Speed	1.9x–4.2x slower step time (scales linearly with offload fraction)	high	CPU Adam compute and D2H/H2D transfers add latency. Measured on Qwen3-30B-A3B TP2 PP2 EP4. D2H/H2D overlap reduces 100% penalty from 4.2x to 3.9x.
Memory	3.8 GB saved per 25% of optimizer offload fraction (up to 15.3 GB / 32% at 100%)	high	Measured on Qwen3-30B-A3B (47.2 GB baseline). Activation offload saves proportional to layers offloaded.
Scale	enables otherwise-OOM configurations	medium	Can free memory for larger batch sizes or additional parallelism.
Convergence	no change (loss delta < 0.001 across all fractions)	high	All optimizer offload fractions (25–100%) produce identical loss across 20 iterations.
Stability	no issues observed	high	No errors, hangs, or NCCL issues across 120 total iterations tested (6 configurations).

D2H (device-to-host) and H2D (host-to-device) refer to data transfers between GPU and CPU memory. Each optimizer step copies gradients to CPU (D2H), runs Adam on CPU, then copies updated parameters back (H2D). The overlap_cpu_optimizer_d2h_h2d flag overlaps these transfers with compute. On Qwen3-30B-A3B MoE this provided only ~7% speedup because CPU-side Adam compute — not the transfers — was the dominant bottleneck. Other models with different parameter counts or optimizer configurations may see different transfer-to-compute ratios.

When to Use It#

GPU memory is tight and throughput regression is acceptable
The model requires PP > 1 to fit — use optimizer offloading (activation offloading requires PP=1)
You want a tunable memory-speed tradeoff via optimizer_offload_fraction
Activation memory is the bottleneck and the model fits with PP=1 and no recompute — use activation offloading

When Not to Use It#

Throughput is the primary concern — offloading always adds overhead
The model already fits comfortably in GPU memory
CUDA graphs are enabled — activation offloading is incompatible
The model is large (30B+ MoE) and requires PP > 1 — activation offloading is blocked by the PP=1 constraint
Alternative memory techniques (FSDP, activation recomputation) provide sufficient savings without the throughput penalty

Feature Interactions#

Feature	Interaction	Details
Pipeline parallelism (PP > 1)	Blocks activation offloading	Hard MCore constraint. Use optimizer offloading instead.
Activation recomputation	Blocks activation offloading	Hard MCore constraint. Cannot combine.
CUDA graphs	Blocks activation offloading	Hard MCore constraint. Optimizer offloading is unaffected.
Fine-grained activation offloading	Mutual exclusion with layer-level activation offloading	Use one or the other. Fine-grained works with PP > 1.
Distributed optimizer	Required for optimizer offloading	`use_distributed_optimizer=True` (default in most recipes).
Megatron FSDP	Alternative	Shards parameters across DP ranks. Different tradeoff profile.
Expert parallelism	Compatible	Both offloading mechanisms work with EP.

Bridge Configuration#

CPU offloading is configured through two independent config namespaces:

Optimizer offloading: optimizer.optimizer_cpu_offload, optimizer.optimizer_offload_fraction, and optimizer.overlap_cpu_optimizer_d2h_h2d
Activation offloading: model.cpu_offloading, model.cpu_offloading_num_layers, and related model.cpu_offloading_* fields

For config examples, parameter tables, and runnable commands, see skills/nemo-mbridge-perf-cpu-offloading/SKILL.md.

Common Failure Modes#

Symptom	Cause	Fix
`Currently there is no support for Pipeline parallelism with CPU offloading`	Activation offload with PP > 1	Set PP=1 or switch to optimizer offloading
`CPU offloading does not work when activation recomputation is enabled`	Activation offload with recompute enabled	Set `recompute_granularity=null`
`CUDA graphs not supported with CPU offloading`	Activation offload with CUDA graphs	Set `cuda_graph_impl="none"`
`fine_grained_activation_offloading cannot be enabled with cpu_offloading`	Both offloading types enabled	Use one or the other
OOM with activation offloading on large model	Model too large for PP=1	Switch to optimizer offloading (works with PP > 1)
>4x throughput regression	100% optimizer offload, CPU Adam bottleneck	Reduce fraction or enable `overlap_cpu_optimizer_d2h_h2d`