Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.

Context Parallelism

Context Parallelism (CP) is a method for parallelizing the processing of neural network activations across multiple GPUs by partitioning the input tensors along the sequence dimension. Unlike Sequence Parallelism (SP), which partitions the activations of only specific layers, CP divides the activations of all layers.
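
Conceptually, each CP rank keeps only a slice of every layer's activations along the sequence dimension. The snippet below is a toy, single-process sketch written against plain PyTorch (not the NeMo or Megatron Core APIs); the tensor shape, the number of ranks, and the even chunk split are assumptions chosen only to illustrate the partitioning idea.

import torch

# Toy illustration: split an activation tensor of shape
# [sequence, batch, hidden] across cp_size ranks along the sequence dimension.
cp_size = 2
activations = torch.randn(8, 4, 16)   # [sequence=8, batch=4, hidden=16]

# Each CP rank would keep only its own sequence slice of every layer's activations.
chunks = torch.chunk(activations, cp_size, dim=0)
for rank, chunk in enumerate(chunks):
    print(f"CP rank {rank} holds a sequence slice of shape {tuple(chunk.shape)}")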

CP is critical for training long context models, as it allows the model to handle longer sequences by distributing the sequence activations across multiple GPUs. This method reduces the memory footprint and computational cost of processing long sequences.

Enable Context Parallelism

To activate CP in the NeMo Framework, set the context_parallel_size parameter in the model configuration. This parameter specifies the number of GPUs across which the model’s sequence activations are distributed.

Set context_parallel_size to a value greater than 1 to enable sequence-wide model parallelism.

context_parallel_size: 2  # Example of enabling Context Parallelism across 2 GPUs

The configuration can be found and modified here: NeMo Megatron Core Context Config.
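
For NeMo 2.0's Python-based configuration, the context parallel size is passed to the Megatron strategy. The snippet below is a minimal sketch, assuming the nemo.lightning.MegatronStrategy API and its context_parallel_size argument; refer to the linked config for the authoritative parameter names and defaults.

from nemo import lightning as nl

# Minimal sketch (NeMo 2.0 API assumed): distribute sequence activations
# across 2 GPUs by setting context_parallel_size on the Megatron strategy.
strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=1,
    context_parallel_size=2,
)

trainer = nl.Trainer(
    devices=2,
    accelerator="gpu",
    strategy=strategy,
)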

Implement Context Parallelism

The NeMo Framework leverages functionality from both Megatron Core and Transformer Engine to implement CP efficiently. During forward propagation, each GPU handles a segment of the sequence and stores only its own Key and Value (KV) pairs. In the backward pass, the KV pairs are re-gathered across GPUs as needed; the underlying all-gather and reduce-scatter collectives are transformed into point-to-point communications in a ring topology. This approach significantly reduces the memory footprint while maintaining computational efficiency.
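
As a rough illustration of the ring-style exchange, the following is a simplified, single-process sketch (not the Transformer Engine implementation): each CP rank starts with its local KV block and, at every step, passes the block it currently holds to the next rank while receiving one from the previous rank, so after cp_size - 1 steps every rank has seen all KV blocks without a collective call. The names cp_size and kv_blocks are illustrative only.

# Toy single-process simulation of a point-to-point ring exchange of KV blocks.
cp_size = 4
kv_blocks = [f"KV{r}" for r in range(cp_size)]   # block initially owned by each rank

held = list(kv_blocks)                            # each rank starts with its own block
for step in range(cp_size - 1):
    # Every rank sends its currently held block to rank (r + 1) % cp_size
    # and receives the block held by rank (r - 1) % cp_size.
    held = [held[(r - 1) % cp_size] for r in range(cp_size)]
    for r in range(cp_size):
        print(f"step {step + 1}: rank {r} attends to block {held[r]}")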

Visit our source code for more insights into the implementation: