Thread Safety
NCCL primitives are generally not thread-safe; however, they are reentrant. In a multi-threaded environment, it is not allowed to issue NCCL operations to a single communicator in parallel from multiple threads; it is safe to issue NCCL operations in parallel to independent communicators located on the same device from multiple threads (see Using multiple NCCL communicators concurrently). If a child communicator shares resources with its parent communicator (i.e., it was created by ncclCommSplit with splitShare set in ncclConfig_t), it is not allowed to issue NCCL operations to the child and parent communicators in parallel.
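For illustration, the following minimal sketch (assuming an existing parent communicator parentComm and application-chosen color and key values, which are not defined here) shows how such a resource-sharing child communicator is created; the resulting pair must then be driven by only one thread at a time:
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.splitShare = 1; // child reuses the parent's resources
ncclComm_t childComm;
ncclCommSplit(parentComm, color, key, &childComm, &config);
// parentComm and childComm now share internal resources; NCCL operations
// must not be issued to them from two threads in parallel.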
It is safe to operate a communicator from multiple threads as long as only one thread operates on the communicator at a time. However, for any grouped NCCL operations, users need to ensure that a single thread issues all of the operations in the group.
For example, the following code shows a simple thread-safe pattern where the threads execute in sequence and only one thread accesses the communicator at a time.
Thread 0:
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.blocking = 0; // nonblocking communicator: calls may return ncclInProgress
cudaSetDevice(0);
ncclCommInitRankConfig(&comm, nranks, id, rank, &config); // comm is shared with Thread 1
ncclGroupStart();
ncclAllReduce(sendbuff0, recvbuff0, count0, datatype, redOp, comm, stream);
ncclAllReduce(sendbuff1, recvbuff1, count1, datatype, redOp, comm, stream);
ncclGroupEnd(); // the whole group is issued by this single thread
thread_exit();
Thread 1:
ncclResult_t state = ncclSuccess;
// wait for the allreduce ops previously issued by Thread 0
do {
  ncclCommGetAsyncError(comm, &state);
} while (state == ncclInProgress);
assert(state == ncclSuccess);
ncclAllReduce(sendbuff2, recvbuff2, count2, datatype, redOp, comm, stream);
// wait for the allreduce issued above by this thread
do {
  ncclCommGetAsyncError(comm, &state);
} while (state == ncclInProgress);
assert(state == ncclSuccess);
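When threads are not naturally serialized as above, the same one-thread-at-a-time guarantee can be obtained with a lock. A minimal sketch, assuming a POSIX environment and a mutex commLock (not part of NCCL) shared by all threads:
pthread_mutex_t commLock = PTHREAD_MUTEX_INITIALIZER;
// any thread:
pthread_mutex_lock(&commLock);
ncclGroupStart();
ncclAllReduce(sendbuff0, recvbuff0, count0, datatype, redOp, comm, stream);
ncclAllReduce(sendbuff1, recvbuff1, count1, datatype, redOp, comm, stream);
ncclGroupEnd(); // the whole group is issued while holding the lock
pthread_mutex_unlock(&commLock);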
It is also valid to issue grouped NCCL operations from one thread and then poll the status of each NCCL communicator from a different thread, as shown in the following code.
Thread 0:
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.blocking = 0;
// a single thread initializes all communicators in one group
ncclGroupStart();
for (int i = 0; i < nGpus; i++) {
  cudaSetDevice(i);
  ncclCommInitRankConfig(&comms[i], nranks, id, ranks[i], &config);
}
ncclGroupEnd();
Thread 0/1/2/3:
ncclResult_t state = ncclSuccess;
// wait for the init ops previously issued by Thread 0
do {
  ncclCommGetAsyncError(comms[thread_id], &state);
} while (state == ncclInProgress);
assert(state == ncclSuccess);
ncclAllReduce(sendbuff, recvbuff, count, datatype, redOp, comms[thread_id], stream);
// wait for this thread's allreduce to be issued
do {
  ncclCommGetAsyncError(comms[thread_id], &state);
} while (state == ncclInProgress);
assert(state == ncclSuccess);
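The same polling pattern applies at teardown. With a nonblocking communicator, ncclCommFinalize may return before completion, so each thread can finalize its own communicator and poll ncclCommGetAsyncError before destroying it. A minimal sketch, assuming the same comms array and per-thread thread_id as above:
ncclCommFinalize(comms[thread_id]);
// wait for the finalization to complete
do {
  ncclCommGetAsyncError(comms[thread_id], &state);
} while (state == ncclInProgress);
assert(state == ncclSuccess);
ncclCommDestroy(comms[thread_id]);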