Using NCCL with CUDA Graphs

Starting with NCCL 2.9, NCCL operations can be captured by CUDA Graphs.

CUDA Graphs provide a way to define workflows as graphs rather than single operations. They may reduce overhead by launching multiple GPU operations through a single CPU operation. More details about CUDA Graphs can be found in the CUDA Programming Guide.

NCCL’s collective, P2P and group operations all support CUDA Graph captures. This support requires a minimum CUDA version of 11.3.

The following sample code shows how to capture computational kernels and NCCL operations in a CUDA Graph:

cudaGraph_t graph;
cudaStreamBeginCapture(stream);
kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
ncclAllreduce(..., stream);
kernel_C<<< ..., stream >>>(...);
cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t instance;
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
cudaGraphLaunch(instance, stream);
cudaStreamSynchronize(stream);

Capture modes

By default, CUDA stream capture uses the cudaStreamCaptureModeGlobal mode if no flag is given to the cudaStreamBeginCapture call. This mode is compatible with NCCL except two scenarios:

  1. If you are using NCCL in multi-thread mode, i.e. a process has multiple threads each of which is attached to a different GPU, then you would need to add the cudaStreamCaptureModeThreadLocal flag to the cudaStreamBeginCapture call.
  2. If you are capturing NCCL P2P calls (ncclSend and ncclRecv) without any previous P2P calls to the same peer(s), you would also need to use the cudaStreamCaptureModeThreadLocal mode.

A comparison between cudaStreamCaptureModeGlobal and cudaStreamCaptureModeThreadLocal can be found here.

User buffer registration

Starting with NCCL 2.11, NCCL will register user buffers to improve communication performance if the communication calls are captured by CUDA Graphs. This is enabled by the CUDA Graph feature that the same set of user input and output buffers will be used in all graph replays – however many of them. The buffer registration allows a NCCL rank to directly access the user buffers of a different process, eliminating the need for extra copy, thus improving the communication performance and reducing memory bandwidth consumption.

In 2.11, the CUDA Graph based user buffer registration is effective only when: (i) the CollNet algorithm is being used; (ii) all GPUs within a node have P2P access to each other; (iii) there is at most one GPU per process.

Users will also need to set the environment variable NCCL_GRAPH_REGISTER to 1 to enable this feature. For more details, please see NCCL_GRAPH_REGISTER. If the above conditions are not true, or NCCL_GRAPH_REGISTER is not set to 1, the registration will not occur.

The user buffers will be automatically de-registered when the CUDA Graphs are destroyed.