Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
CPU Offloading
Overview
CPU Offloading in NeMo is a feature that reduces the peak memory usage of the GPU by offloading activations and inactive weights to CPU storage. NeMo supports offloading at the transformer layer level, allowing users to specify the number of transformer layers in their language model that require CPU offloading. During the forward pass, NeMo offloads activations at the optimal time and reloads them as needed during the backward pass.
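The offload-on-forward, reload-on-backward idea can be illustrated with PyTorch's built-in `torch.autograd.graph.save_on_cpu` hook. This is only a minimal sketch of the pack/unpack mechanism, not NeMo's implementation, which adds per-layer scheduling and compute/transfer overlap:

```python
import torch

# Sketch of activation offloading using PyTorch's save_on_cpu saved-tensor
# hook (illustrative only; NeMo uses its own offloading machinery).
x = torch.randn(8, 16, requires_grad=True)
w = torch.randn(16, 16, requires_grad=True)

with torch.autograd.graph.save_on_cpu():
    # Tensors saved for the backward pass are packed off to CPU here.
    y = (x @ w).relu().sum()

# During backward, the saved tensors are copied back to the compute
# device on demand.
y.backward()
```

On a GPU, `save_on_cpu(pin_memory=True)` uses pinned host buffers so the copies can overlap with computation.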
Features
Supports training models with long sequence lengths by managing activation memory efficiently.
Enables high batch sizes per GPU by offloading activation memory.
Overlaps computation with host-to-device and device-to-host data transfers during offloading and reloading.
Usage
Set `cpu_offloading` to `True` to enable CPU offloading.
Set `cpu_offloading_num_layers` to a value between 0 and one less than the total number of transformer layers in the model.
Set `cpu_offloading_activations` and `cpu_offloading_weights` to offload activations only, weights only, or both, as needed.
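As a sketch of how these options fit together, the following hypothetical dataclass mirrors the offloading fields described above and checks the constraint on `cpu_offloading_num_layers`. The class itself is illustrative; in NeMo these fields live on the model's transformer configuration:

```python
from dataclasses import dataclass

@dataclass
class CPUOffloadConfig:
    """Hypothetical mirror of the CPU-offloading options described above."""
    num_layers: int                          # total transformer layers in the model
    cpu_offloading: bool = False             # master switch for CPU offloading
    cpu_offloading_num_layers: int = 0       # layers to offload: 0 .. num_layers - 1
    cpu_offloading_activations: bool = True  # offload activations
    cpu_offloading_weights: bool = True      # offload inactive weights

    def validate(self) -> None:
        if not self.cpu_offloading:
            return
        if not 0 <= self.cpu_offloading_num_layers < self.num_layers:
            raise ValueError(
                "cpu_offloading_num_layers must be between 0 and "
                f"{self.num_layers - 1}, got {self.cpu_offloading_num_layers}"
            )
        if not (self.cpu_offloading_activations or self.cpu_offloading_weights):
            raise ValueError(
                "enable at least one of cpu_offloading_activations "
                "or cpu_offloading_weights"
            )

# Example: offload activations (but not weights) for 4 of 12 layers.
cfg = CPUOffloadConfig(
    num_layers=12,
    cpu_offloading=True,
    cpu_offloading_num_layers=4,
    cpu_offloading_activations=True,
    cpu_offloading_weights=False,
)
cfg.validate()
```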