Collective Communication Functions
The following NCCL APIs provide some commonly used collective operations.
ncclAllReduce

ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)

Reduces data arrays of length `count` in `sendbuff` using the `op` operation and leaves identical copies of the result in each rank's `recvbuff`.

In-place operation will happen if `sendbuff == recvbuff`.

Related links: AllReduce.
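To make the contract concrete, the following is a host-side sketch of the AllReduce semantics only, with `op` modeled as a sum over `float` data; the function name and the per-rank buffer arrays are illustrative stand-ins for device memory, not part of the NCCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of ncclAllReduce semantics with op = ncclSum.
 * sendbuffs[r] / recvbuffs[r] stand in for rank r's device buffers. */
static void allreduce_sum_model(const float *const sendbuffs[],
                                float *const recvbuffs[],
                                size_t count, int nranks) {
    for (size_t i = 0; i < count; i++) {
        float sum = 0.0f;
        for (int r = 0; r < nranks; r++)
            sum += sendbuffs[r][i];        /* reduce element i across ranks */
        for (int r = 0; r < nranks; r++)
            recvbuffs[r][i] = sum;         /* identical copy on every rank */
    }
}
```

The key property is that after the call every rank holds the same reduced array, unlike ncclReduce where only the root does.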
ncclBroadcast

ncclResult_t ncclBroadcast(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

Copies `count` elements from `sendbuff` on the `root` rank to all ranks' `recvbuff`. `sendbuff` is only used on rank `root` and ignored for other ranks.

In-place operation will happen if `sendbuff == recvbuff`.
ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

Legacy in-place version of `ncclBroadcast`, in a similar fashion to MPI_Bcast. A call to ncclBcast(buff, count, datatype, root, comm, stream) is equivalent to ncclBroadcast(buff, buff, count, datatype, root, comm, stream).

Related links: Broadcast.
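The data movement of both variants can be sketched on the host; this is a model of the semantics only (illustrative names, per-rank buffer arrays standing in for device memory), not an NCCL call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-side model of ncclBroadcast semantics: the root rank's sendbuff
 * is copied into every rank's recvbuff; other ranks' sendbuff is ignored. */
static void broadcast_model(const float *const sendbuffs[],
                            float *const recvbuffs[],
                            size_t count, int root, int nranks) {
    for (int r = 0; r < nranks; r++)
        memcpy(recvbuffs[r], sendbuffs[root], count * sizeof(float));
}
```

Pointing a rank's send and receive entries at the same buffer corresponds to the in-place ncclBcast form, where buff is both source (on the root) and destination.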
ncclReduce

ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream)

Reduces data arrays of length `count` in `sendbuff` into `recvbuff` on the `root` rank using the `op` operation. `recvbuff` is only used on rank `root` and ignored for other ranks.

In-place operation will happen if `sendbuff == recvbuff`.

Related links: Reduce.
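As a host-side sketch of these semantics (again with `op` modeled as a sum over `float` data, and illustrative names that are not part of the NCCL API), only the root's receive buffer is written:

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of ncclReduce semantics with op = ncclSum: the
 * elementwise sum of every rank's sendbuff is written only to the
 * root rank's recvbuff (rootrecv here); other ranks' recvbuff is untouched. */
static void reduce_sum_model(const float *const sendbuffs[],
                             float *rootrecv,
                             size_t count, int nranks) {
    for (size_t i = 0; i < count; i++) {
        float sum = 0.0f;
        for (int r = 0; r < nranks; r++)
            sum += sendbuffs[r][i];
        rootrecv[i] = sum;
    }
}
```

Compared with the AllReduce contract, this is the "reduce" half only; following it with a broadcast from the root would reproduce the AllReduce result.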
ncclAllGather

ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream)

Gathers `sendcount` values from all GPUs into `recvbuff`, receiving data from rank `i` at offset `i*sendcount`.

Note: This assumes the receive count is equal to `nranks*sendcount`, which means that `recvbuff` should have a size of at least `nranks*sendcount` elements.

In-place operation will happen if `sendbuff == recvbuff + rank * sendcount`.

Related links: AllGather.
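The offset rule is the part that is easy to get wrong, so here is a host-side sketch of the resulting layout (illustrative names and per-rank buffer arrays, not the NCCL API itself):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-side model of ncclAllGather layout: rank src's sendcount elements
 * land at offset src*sendcount in every rank's recvbuff, so each recvbuff
 * must hold at least nranks*sendcount elements. */
static void allgather_model(const float *const sendbuffs[],
                            float *const recvbuffs[],
                            size_t sendcount, int nranks) {
    for (int dst = 0; dst < nranks; dst++)
        for (int src = 0; src < nranks; src++)
            memcpy(recvbuffs[dst] + (size_t)src * sendcount,
                   sendbuffs[src], sendcount * sizeof(float));
}
```

Every rank ends up with the same concatenation, ordered by rank, which is why the in-place form requires `sendbuff` to alias the rank's own slot at `recvbuff + rank * sendcount`.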
ncclReduceScatter

ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff, size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)

Reduces data in `sendbuff` from all GPUs using the `op` operation and leaves the reduced result scattered over the devices, so that `recvbuff` on rank `i` will contain the i-th block of the result.

Note: This assumes the send count is equal to `nranks*recvcount`, which means that `sendbuff` should have a size of at least `nranks*recvcount` elements.

Related links: ReduceScatter.
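This blocked layout can likewise be sketched on the host, with `op` modeled as a sum over `float` data and illustrative names that are not part of the NCCL API:

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of ncclReduceScatter semantics with op = ncclSum:
 * each sendbuff is nranks*recvcount elements long; block r of the
 * elementwise sum over all ranks ends up in rank r's recvbuff. */
static void reducescatter_sum_model(const float *const sendbuffs[],
                                    float *const recvbuffs[],
                                    size_t recvcount, int nranks) {
    for (int r = 0; r < nranks; r++)               /* destination rank r */
        for (size_t i = 0; i < recvcount; i++) {
            float sum = 0.0f;
            for (int src = 0; src < nranks; src++)
                sum += sendbuffs[src][(size_t)r * recvcount + i];
            recvbuffs[r][i] = sum;
        }
}
```

In this sense ReduceScatter is the mirror image of AllGather: an AllGather of the scattered blocks would reassemble the full AllReduce result on every rank.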