*************
Thread Safety
*************

NCCL primitives are generally not thread-safe, however, they are reentrant. In a multi-threaded environment, it is not allowed to issue NCCL operations to a single communicator in parallel from multiple threads; it is safe to issue NCCL operations in parallel to independent communicators located on the same device from multiple threads (see :ref:`multi-thread-concurrent-usage`). If a child communicator shares resources with its parent communicator (i.e., it was created with ``splitShare`` enabled in :ref:`ncclconfig`), it is not allowed to issue NCCL operations to the child and parent communicators in parallel; a sketch of this constraint appears at the end of this section.

It is safe to operate a communicator from multiple threads as long as users guarantee that only one thread operates on the communicator at a time. However, for any grouped NCCL operations, users need to ensure that a single thread issues all the operations in the group. For example, the following code shows a simple thread-safe pattern where the threads execute in sequence and only one thread accesses the communicator at a time.

.. code:: C

  Thread 0:
    ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
    config.blocking = 0;
    cudaSetDevice(0);
    ncclCommInitRankConfig(&comm, nranks, id, rank, &config);
    ncclGroupStart();
    ncclAllReduce(sendbuff0, recvbuff0, count0, datatype, redOp, comm, stream);
    ncclAllReduce(sendbuff1, recvbuff1, count1, datatype, redOp, comm, stream);
    ncclGroupEnd();
    thread_exit();

  Thread 1:
    ncclResult_t state = ncclSuccess;
    // wait for the allreduce operations previously issued by Thread 0
    do {
      ncclCommGetAsyncError(comm, &state);
    } while (state == ncclInProgress);
    assert(state == ncclSuccess);
    ncclAllReduce(sendbuff2, recvbuff2, count2, datatype, redOp, comm, stream);
    do {
      ncclCommGetAsyncError(comm, &state);
    } while (state == ncclInProgress);
    assert(state == ncclSuccess);

It is also valid to issue grouped NCCL operations from one thread and poll the status of each NCCL communicator from a separate thread, as shown in the following code.

.. code:: C

  Thread 0:
    ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
    config.blocking = 0;
    ncclGroupStart();
    for (int i = 0; i < nGpus; i++) {
      cudaSetDevice(i);
      ncclCommInitRankConfig(&comms[i], nranks, id, ranks[i], &config);
    }
    ncclGroupEnd();

  Thread 0/1/2/3:
    ncclResult_t state = ncclSuccess;
    // wait for the init operations previously issued by Thread 0
    do {
      ncclCommGetAsyncError(comms[thread_id], &state);
    } while (state == ncclInProgress);
    assert(state == ncclSuccess);
    ncclAllReduce(sendbuff, recvbuff, count, datatype, redOp, comms[thread_id], stream);
    do {
      ncclCommGetAsyncError(comms[thread_id], &state);
    } while (state == ncclInProgress);
    assert(state == ncclSuccess);
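When threads cannot be strictly sequenced as in the examples above, one common way to satisfy the one-thread-at-a-time rule is to serialize access to the communicator with a lock. The following is a minimal sketch, not part of the NCCL API: the ``pthread`` mutex and the function name are assumptions, and the lock is held across the entire group so that a single thread issues all the grouped operations.

.. code:: C

  #include <pthread.h>
  #include <nccl.h>

  // Hypothetical lock guarding one communicator; NCCL itself provides no locking.
  pthread_mutex_t commMutex = PTHREAD_MUTEX_INITIALIZER;

  void lockedGroupedAllReduce(ncclComm_t comm, cudaStream_t stream,
                              const void* sendbuff0, void* recvbuff0, size_t count0,
                              const void* sendbuff1, void* recvbuff1, size_t count1,
                              ncclDataType_t datatype, ncclRedOp_t redOp) {
    // Hold the lock across the whole group so no other thread can interleave
    // operations on this communicator between ncclGroupStart and ncclGroupEnd.
    pthread_mutex_lock(&commMutex);
    ncclGroupStart();
    ncclAllReduce(sendbuff0, recvbuff0, count0, datatype, redOp, comm, stream);
    ncclAllReduce(sendbuff1, recvbuff1, count1, datatype, redOp, comm, stream);
    ncclGroupEnd();
    pthread_mutex_unlock(&commMutex);
  }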
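Finally, a minimal sketch of the parent/child constraint described above, assuming an existing communicator ``parent`` along with placeholder buffer, rank, and stream variables: when a child is created via ``ncclCommSplit`` with ``splitShare`` enabled, operations on the parent and the child must also be serialized with respect to each other, for example by issuing both from the same thread.

.. code:: C

  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.splitShare = 1;  // child will share resources with the parent

  // Every rank of the parent must call ncclCommSplit; color 0 places all
  // ranks in the same child communicator.
  ncclComm_t child;
  ncclCommSplit(parent, /*color=*/0, /*key=*/rank, &child, &config);

  // Because resources are shared, never issue operations on the parent and
  // the child in parallel from different threads; issue them one after another:
  ncclAllReduce(sendbuff, recvbuff, count, datatype, redOp, parent, stream);
  ncclAllReduce(sendbuff, recvbuff, count, datatype, redOp, child, stream);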