Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
Activation Recomputation#
The input activations of network layers are stored in device memory and are used to compute gradients during back-propagation. When training an LLM with a long sequence length or a large micro-batch size, these input activations can quickly saturate device memory. Checkpointing a few activations and recomputing the rest is a common technique to reduce device memory usage.
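The trade-off can be illustrated with a toy example. The sketch below is not NeMo code: it assumes a chain of scalar tanh layers, and the function names (layer_forward, layer_backward, grad_store_all, grad_checkpointed) are hypothetical. It compares keeping every layer input with keeping one input per segment and recomputing the rest during the backward pass.

```python
import math


def layer_forward(w, x):
    # One toy "layer": y = tanh(w * x).
    return math.tanh(w * x)


def layer_backward(w, x, grad_y):
    # Gradient with respect to the layer input; needs the layer input x.
    y = math.tanh(w * x)
    return grad_y * (1.0 - y * y) * w


def grad_store_all(weights, x0):
    # Baseline: keep the input of every layer until the backward pass.
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer_forward(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad = layer_backward(w, x_in, grad)
    return grad, len(inputs)


def grad_checkpointed(weights, x0, segment):
    # Keep only the input of the first layer in each segment.
    if segment < 1:
        raise ValueError("segment must be at least 1")
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer_forward(w, x)

    grad = 1.0
    recomputed_layers = 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        # Re-run the forward pass of this segment to rebuild its layer inputs.
        seg_inputs = []
        x = checkpoints[start]
        for i in range(start, end):
            seg_inputs.append(x)
            x = layer_forward(weights[i], x)
            recomputed_layers += 1
        # Back-propagate through the segment using the rebuilt inputs.
        for i in range(end - 1, start - 1, -1):
            grad = layer_backward(weights[i], seg_inputs[i - start], grad)
    return grad, len(checkpoints), recomputed_layers
```

With eight layers and a segment length of four, both functions return the same gradient. The checkpointed version keeps two inputs across the forward pass instead of eight, at the cost of eight extra layer forward evaluations.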
Transformer Layer Recomputation#
NeMo supports transformer layer recomputation, which checkpoints the input of each transformer layer and recomputes the activations of the layers inside it. This technique significantly reduces activation memory usage. However, it increases the computation cost per transformer layer by 30% because the entire layer's forward computation is executed again. NeMo also supports partial transformer layer recomputation, which is beneficial when recomputing only a few transformer layers frees enough GPU memory for the model to fit. This approach avoids the need to recompute the rest of the layers.
Transformer layer recomputation is enabled by setting activations_checkpoint_granularity=full.
The number of transformer layers to recompute can be set using activations_checkpoint_num_layers along with activations_checkpoint_method=block.
If you set activations_checkpoint_num_layers to the total number of layers, the input of every transformer layer is checkpointed and every layer is recomputed.
When training with pipeline parallelism, activations_checkpoint_num_layers indicates the number of layers per pipeline stage.
When using virtual pipelining, activations_checkpoint_num_layers specifies the number of layers per virtual pipeline stage.
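Because the setting is counted per stage, its largest useful value depends on how the layers are split. The following is a minimal sketch of that arithmetic, assuming the layers divide evenly across stages. The helper layers_per_stage and its parameter names are hypothetical and are not NeMo options.

```python
def layers_per_stage(total_layers, pipeline_stages=1, virtual_stages_per_rank=1):
    """Number of transformer layers in one (virtual) pipeline stage."""
    chunks = pipeline_stages * virtual_stages_per_rank
    if total_layers % chunks != 0:
        raise ValueError("layers must divide evenly across pipeline stages")
    return total_layers // chunks


# Hypothetical 32-layer model.
print(layers_per_stage(32))                      # 32: no pipeline parallelism
print(layers_per_stage(32, pipeline_stages=4))   # 8: four pipeline stages
print(layers_per_stage(32, pipeline_stages=4, virtual_stages_per_rank=2))  # 4
```

Under these assumptions, with four pipeline stages a value of 8 for activations_checkpoint_num_layers would already cover every layer of a stage.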
NeMo also supports checkpointing the input to a block of multiple consecutive transformer layers, meaning that a block of transformer layers becomes the recomputation granularity. This approach can save activation memory but increases the recomputation buffer memory. Thus, it only saves memory when the model has many transformer layers or when the intermediate layers of a transformer layer hold relatively small activations.
This recomputation mode can be enabled by setting activations_checkpoint_method=uniform, with the number of transformer layers per recomputation block set using activations_checkpoint_num_layers.
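The difference between the two methods can be sketched as a small planning function. This is an illustrative model rather than NeMo's implementation: recompute_plan is a hypothetical helper, and it assumes that block applies to the first layers of a stage, which the text above does not specify.

```python
def recompute_plan(layers_in_stage, method, num_layers):
    """Group the layers of one pipeline stage as (first, last, recomputed).

    Each recomputed group stores only the input of its first layer.
    """
    if num_layers < 1 or num_layers > layers_in_stage:
        raise ValueError("num_layers must be between 1 and layers_in_stage")
    if method == "block":
        # Checkpoint num_layers layers one by one (assumed: the first ones)
        # and keep all activations of the remaining layers.
        plan = [(i, i, True) for i in range(num_layers)]
        if num_layers < layers_in_stage:
            plan.append((num_layers, layers_in_stage - 1, False))
        return plan
    if method == "uniform":
        # Checkpoint the input of every group of num_layers consecutive layers.
        plan = []
        for first in range(0, layers_in_stage, num_layers):
            last = min(first + num_layers, layers_in_stage) - 1
            plan.append((first, last, True))
        return plan
    raise ValueError("method must be 'block' or 'uniform'")


print(recompute_plan(8, "block", 2))
# [(0, 0, True), (1, 1, True), (2, 7, False)]
print(recompute_plan(8, "uniform", 4))
# [(0, 3, True), (4, 7, True)]
```

In this model, block with a value of 2 recomputes two layers and leaves six untouched, while uniform with a value of 4 recomputes all eight layers but stores only two checkpointed inputs.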
Self-attention Recomputation#
NeMo supports self-attention recomputation, which checkpoints the inputs of each self-attention block and recomputes the intermediate activations. This cost-efficient method achieves high memory savings with minimal recomputation cost. The intermediate layers of the self-attention block account for the majority of the activation memory because the inputs of the softmax, dropout, and QKV dot-product attention layers grow with the square of the sequence length. However, their recomputation cost is small relative to that of the linear projection layers, whose cost grows with the square of the hidden size.
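The scaling argument can be checked with rough arithmetic. The sketch below is not taken from NeMo. It assumes commonly used per-layer estimates for 16-bit activations without tensor or sequence parallelism: about 34·s·b·h bytes outside the attention core plus 5·a·s²·b bytes inside it, and 24·b·s·h² forward FLOPs for the linear layers versus 4·b·s²·h for the attention core. Here s is the sequence length, b the micro-batch size, h the hidden size, and a the number of attention heads. The function name is hypothetical.

```python
def selective_recompute_estimate(seq_len, hidden, heads, micro_batch=1):
    """Rough per-layer share of memory and forward FLOPs in the attention core."""
    s, h, a, b = seq_len, hidden, heads, micro_batch

    # Activation bytes per transformer layer (assumed 16-bit estimate).
    other_bytes = 34 * s * b * h          # grows linearly with sequence length
    attention_bytes = 5 * a * s * s * b   # grows with sequence length squared

    # Forward FLOPs per transformer layer.
    linear_flops = 24 * b * s * h * h     # QKV, output projection, and MLP
    attention_flops = 4 * b * s * s * h   # QK^T and attention-times-V

    return {
        "attention_memory_fraction":
            attention_bytes / (other_bytes + attention_bytes),
        "attention_forward_flops_fraction":
            attention_flops / (linear_flops + attention_flops),
    }


estimate = selective_recompute_estimate(seq_len=2048, hidden=12288, heads=96)
for name, value in estimate.items():
    print(f"{name}: {value:.3f}")
```

For a model with sequence length 2048, hidden size 12288, and 96 heads, this estimate attributes about 70% of the per-layer activation memory to the attention core, while that core accounts for under 3% of the layer's forward FLOPs.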
Self-attention recomputation is always enabled when using FlashAttention, which is supported in Transformer Engine.
You can also use self-attention recomputation without FlashAttention by setting activations_checkpoint_granularity=selective.
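The three modes described in this section can be summarized as sets of overrides. The option names and values below are the ones used in this section. The preset names, the example layer counts, and the as_overrides helper are hypothetical, and where these settings live in a configuration depends on the NeMo version and setup.

```python
# Preset names and layer counts are illustrative only.
RECOMPUTE_PRESETS = {
    # Recompute a fixed number of transformer layers per pipeline stage.
    "full_block": {
        "activations_checkpoint_granularity": "full",
        "activations_checkpoint_method": "block",
        "activations_checkpoint_num_layers": 2,
    },
    # Checkpoint the input of each group of consecutive transformer layers.
    "full_uniform": {
        "activations_checkpoint_granularity": "full",
        "activations_checkpoint_method": "uniform",
        "activations_checkpoint_num_layers": 4,
    },
    # Recompute only the self-attention intermediates.
    "selective": {
        "activations_checkpoint_granularity": "selective",
    },
}


def as_overrides(preset):
    """Render a preset as key=value strings, the form used in this section."""
    return [f"{key}={value}" for key, value in preset.items()]


for name, preset in RECOMPUTE_PRESETS.items():
    print(name, " ".join(as_overrides(preset)))
```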
Scheme of full and selective checkpointing granularity:
Scheme of uniform and block checkpointing method (full checkpointing granularity):