Using NCCL with CUDA Graphs
Starting with NCCL 2.9, NCCL operations can be captured by CUDA Graphs.
CUDA Graphs provide a way to define workflows as graphs rather than single operations. They may reduce overhead by launching multiple GPU operations through a single CPU operation. More details about CUDA Graphs can be found at https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs.
NCCL’s collective, point-to-point (P2P) and group operations all support CUDA Graph capture. This support requires CUDA 11.3 or later.
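The version requirements above can be checked at run time. The following is a minimal sketch, not an official recipe: it assumes that comparing the NCCL library version and the linked CUDA runtime version against 2.9 and 11.3 is a sufficient guard for your setup (the CUDA version NCCL itself was built against may differ from the runtime your application links to).

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  int ncclVer = 0;
  int cudaVer = 0;

  // Version of the NCCL library actually loaded at run time.
  if (ncclGetVersion(&ncclVer) != ncclSuccess) return 1;
  // Version of the CUDA runtime, encoded as 1000 * major + 10 * minor.
  if (cudaRuntimeGetVersion(&cudaVer) != cudaSuccess) return 1;

  // NCCL_VERSION(2, 9, 0) builds the version code for NCCL 2.9.0;
  // 11030 is the runtime encoding of CUDA 11.3.
  bool graphsSupported =
      ncclVer >= NCCL_VERSION(2, 9, 0) && cudaVer >= 11030;

  printf("NCCL version code %d, CUDA runtime %d: graph capture %s\n",
         ncclVer, cudaVer,
         graphsSupported ? "should be available" : "is not available");
  return graphsSupported ? 0 : 1;
}
```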
The following sample code shows how to capture computational kernels and NCCL operations in a CUDA Graph:
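cudaGraph_t graph;
// Begin capture; the CUDA runtime API requires a capture mode argument.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
ncclAllReduce(..., stream);  // recorded into the graph, not executed yet
kernel_C<<< ..., stream >>>(...);
cudaStreamEndCapture(stream, &graph);
// Instantiate once (CUDA 11 signature; CUDA 12 takes (&instance, graph, flags)).
cudaGraphExec_t instance;
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
// Launch the whole captured sequence with a single call, then wait for it.
cudaGraphLaunch(instance, stream);
cudaStreamSynchronize(stream);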
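For reference, the same pattern can be written out as a complete program. This is a sketch under stated assumptions rather than a definitive implementation: it uses a single process with a single GPU (so the all-reduce is trivial), and the `scale` kernel, the buffer size, the iteration count and the `CUDA_CHECK`/`NCCL_CHECK` macros are all hypothetical helpers introduced for illustration. It captures the sequence once and replays the instantiated graph several times, which is where the reduced launch overhead would be expected to matter.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical error-checking helpers (not part of CUDA or NCCL).
#define CUDA_CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
      fprintf(stderr, "CUDA error: %s (%s:%d)\n", \
              cudaGetErrorString(err_), __FILE__, __LINE__); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)

#define NCCL_CHECK(call) do { \
    ncclResult_t res_ = (call); \
    if (res_ != ncclSuccess) { \
      fprintf(stderr, "NCCL error: %s (%s:%d)\n", \
              ncclGetErrorString(res_), __FILE__, __LINE__); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)

// Hypothetical compute kernel standing in for kernel_A / kernel_B / kernel_C.
__global__ void scale(float* data, float factor, size_t n) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const size_t n = 1 << 20;
  int dev = 0;
  CUDA_CHECK(cudaSetDevice(dev));

  // Single-rank communicator on device 0 (assumption: one GPU, one process).
  ncclComm_t comm;
  NCCL_CHECK(ncclCommInitAll(&comm, 1, &dev));

  float* buf = NULL;
  CUDA_CHECK(cudaMalloc(&buf, n * sizeof(float)));
  CUDA_CHECK(cudaMemset(buf, 0, n * sizeof(float)));

  // Capture needs an explicitly created stream, not the legacy default stream.
  cudaStream_t stream;
  CUDA_CHECK(cudaStreamCreate(&stream));

  // Record a kernel followed by an all-reduce; nothing executes during capture.
  const unsigned blocks = (unsigned)((n + 255) / 256);
  cudaGraph_t graph;
  CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
  scale<<<blocks, 256, 0, stream>>>(buf, 2.0f, n);
  NCCL_CHECK(ncclAllReduce(buf, buf, n, ncclFloat, ncclSum, comm, stream));
  CUDA_CHECK(cudaStreamEndCapture(stream, &graph));

  // Instantiate once; the signature changed between CUDA 11 and CUDA 12.
  cudaGraphExec_t instance;
#if CUDART_VERSION >= 12000
  CUDA_CHECK(cudaGraphInstantiate(&instance, graph, 0));
#else
  CUDA_CHECK(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
#endif

  // Replay the captured work: one CPU-side launch per iteration.
  for (int iter = 0; iter < 10; ++iter) {
    CUDA_CHECK(cudaGraphLaunch(instance, stream));
  }
  CUDA_CHECK(cudaStreamSynchronize(stream));

  // Release resources.
  CUDA_CHECK(cudaGraphExecDestroy(instance));
  CUDA_CHECK(cudaGraphDestroy(graph));
  CUDA_CHECK(cudaStreamDestroy(stream));
  CUDA_CHECK(cudaFree(buf));
  NCCL_CHECK(ncclCommDestroy(comm));
  printf("done\n");
  return 0;
}
```

A program like this would typically be built with something along the lines of `nvcc example.cu -lnccl`; the exact include and library paths depend on the installation.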
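Note: if you call NCCL from multiple threads, consider passing cudaStreamCaptureModeThreadLocal as the capture mode to cudaStreamBeginCapture. In that mode, only the capturing thread is restricted from making potentially unsafe CUDA API calls, so calls made by other threads do not interfere with the capture.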
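A multi-threaded variant might look like the sketch below. It assumes one host thread per visible GPU inside a single process, with communicators created by `ncclCommInitAll`; the `worker` function, the buffer size and the error-checking macros are hypothetical names introduced for illustration. Each thread captures its own all-reduce in thread-local mode, so allocations or other CUDA calls made by sibling threads should not invalidate its capture.

```cuda
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical error-checking helpers (not part of CUDA or NCCL).
#define CUDA_CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
      fprintf(stderr, "CUDA error: %s (%s:%d)\n", \
              cudaGetErrorString(err_), __FILE__, __LINE__); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)

#define NCCL_CHECK(call) do { \
    ncclResult_t res_ = (call); \
    if (res_ != ncclSuccess) { \
      fprintf(stderr, "NCCL error: %s (%s:%d)\n", \
              ncclGetErrorString(res_), __FILE__, __LINE__); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)

// Per-thread work: capture one all-reduce on this thread's GPU, then run it.
static void worker(int dev, ncclComm_t comm, size_t n) {
  CUDA_CHECK(cudaSetDevice(dev));

  cudaStream_t stream;
  CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

  float* buf = NULL;
  CUDA_CHECK(cudaMalloc(&buf, n * sizeof(float)));
  CUDA_CHECK(cudaMemsetAsync(buf, 0, n * sizeof(float), stream));
  CUDA_CHECK(cudaStreamSynchronize(stream));

  // Thread-local mode: only this thread is restricted while it captures.
  cudaGraph_t graph;
  CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal));
  NCCL_CHECK(ncclAllReduce(buf, buf, n, ncclFloat, ncclSum, comm, stream));
  CUDA_CHECK(cudaStreamEndCapture(stream, &graph));

  cudaGraphExec_t instance;
#if CUDART_VERSION >= 12000
  CUDA_CHECK(cudaGraphInstantiate(&instance, graph, 0));
#else
  CUDA_CHECK(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
#endif

  // The collective completes once every rank has launched its graph.
  CUDA_CHECK(cudaGraphLaunch(instance, stream));
  CUDA_CHECK(cudaStreamSynchronize(stream));

  CUDA_CHECK(cudaGraphExecDestroy(instance));
  CUDA_CHECK(cudaGraphDestroy(graph));
  CUDA_CHECK(cudaFree(buf));
  CUDA_CHECK(cudaStreamDestroy(stream));
}

int main() {
  int ndev = 0;
  CUDA_CHECK(cudaGetDeviceCount(&ndev));
  if (ndev == 0) {
    fprintf(stderr, "no CUDA devices found\n");
    return EXIT_FAILURE;
  }

  // A NULL device list selects the first ndev devices, in order.
  std::vector<ncclComm_t> comms(ndev);
  NCCL_CHECK(ncclCommInitAll(comms.data(), ndev, NULL));

  // One host thread per GPU, each driving its own communicator.
  std::vector<std::thread> threads;
  for (int d = 0; d < ndev; ++d) {
    threads.emplace_back(worker, d, comms[d], (size_t)1 << 20);
  }
  for (std::thread& t : threads) t.join();

  for (ncclComm_t c : comms) NCCL_CHECK(ncclCommDestroy(c));
  printf("done\n");
  return 0;
}
```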