Collective Communication Functions
The following NCCL APIs provide some commonly used collective operations.
ncclAllReduce

ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)

Reduces data arrays of length count in sendbuff using the op operation and leaves identical copies of the result in each recvbuff.

In-place operation will happen if sendbuff == recvbuff.

Related links: AllReduce.
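As an illustration, the sketch below sums count floats across two GPUs managed by a single process. The device count, buffer size, and lack of error checking are simplifications for this example; the NCCL and CUDA calls used (ncclCommInitAll, ncclGroupStart/ncclGroupEnd, ncclAllReduce) are standard API, but the program is a sketch rather than a complete application.

#include <nccl.h>
#include <cuda_runtime.h>

int main(void) {
    /* Sketch only: assumes two visible GPUs and omits all error checking. */
    const int ndev = 2;
    int devs[2] = {0, 1};
    const size_t count = 1024;

    ncclComm_t comms[2];
    cudaStream_t streams[2];
    float* sendbuff[2];
    float* recvbuff[2];

    /* Allocate per-device buffers and streams. */
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc((void**)&sendbuff[i], count * sizeof(float));
        cudaMalloc((void**)&recvbuff[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* One communicator per device, forming a single clique. */
    ncclCommInitAll(comms, ndev, devs);

    /* Issue the collective for every device inside a group so the calls
       are aggregated into a single operation. */
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    /* The collective is asynchronous; wait on each stream before reading recvbuff. */
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
    }

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(devs[i]);
        cudaFree(sendbuff[i]);
        cudaFree(recvbuff[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}

After the final synchronization, every recvbuff holds the same element-wise sums.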
ncclBroadcast

ncclResult_t ncclBroadcast(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

Copies count elements from sendbuff on the root rank to all ranks’ recvbuff. sendbuff is only used on rank root and is ignored for other ranks.

In-place operation will happen if sendbuff == recvbuff.

ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

Legacy in-place version of ncclBroadcast, in a similar fashion to MPI_Bcast. A call to ncclBcast(buff, count, datatype, root, comm, stream) is equivalent to ncclBroadcast(buff, buff, count, datatype, root, comm, stream).

Related links: Broadcast.
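A minimal sketch of a broadcast from rank 0, assuming a communicator and stream were created elsewhere; the wrapper function name below is illustrative, not part of the NCCL API.

#include <nccl.h>
#include <cuda_runtime.h>

/* Illustrative helper: broadcast count floats from rank 0 to every rank.
   comm and stream are assumed to be initialized elsewhere; error checks omitted. */
static void broadcast_from_root(const float* d_send, float* d_recv, size_t count,
                                ncclComm_t comm, cudaStream_t stream)
{
    const int root = 0;
    /* d_send is only read on the root rank and ignored elsewhere. */
    ncclBroadcast(d_send, d_recv, count, ncclFloat, root, comm, stream);

    /* The legacy in-place equivalent on a single buffer d_buf would be:
       ncclBcast(d_buf, count, ncclFloat, root, comm, stream); */
}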
ncclReduce

ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream)

Reduces data arrays of length count in sendbuff into recvbuff on the root rank using the op operation. recvbuff is only used on rank root and is ignored for other ranks.

In-place operation will happen if sendbuff == recvbuff.

Related links: Reduce.
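A minimal sketch of reducing into the root rank, again assuming comm and stream already exist; the helper name is illustrative.

#include <nccl.h>
#include <cuda_runtime.h>

/* Illustrative helper: max-reduce count floats from every rank into d_recv
   on rank root. comm and stream assumed initialized; error checks omitted. */
static void reduce_max_to_root(const float* d_send, float* d_recv, size_t count,
                               int root, ncclComm_t comm, cudaStream_t stream)
{
    /* d_recv is only written on rank root and is ignored on the other ranks. */
    ncclReduce(d_send, d_recv, count, ncclFloat, ncclMax, root, comm, stream);
}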
ncclAllGather

ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream)

Gathers sendcount values from all GPUs and leaves identical copies of the result in each recvbuff, receiving data from rank i at offset i*sendcount.

Note: This assumes the receive count is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements.

In-place operation will happen if sendbuff == recvbuff + rank * sendcount.

Related links: AllGather, In-place Operations.
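A minimal sketch, assuming comm and stream already exist and that d_recv was allocated with at least nranks*sendcount elements; the helper name is illustrative.

#include <nccl.h>
#include <cuda_runtime.h>

/* Illustrative helper: gather sendcount floats from every rank so that each
   rank ends up with nranks*sendcount floats in d_recv. Error checks omitted. */
static void allgather_floats(const float* d_send, float* d_recv, size_t sendcount,
                             ncclComm_t comm, cudaStream_t stream)
{
    /* Data contributed by rank i lands at offset i*sendcount in d_recv on every rank. */
    ncclAllGather(d_send, d_recv, sendcount, ncclFloat, comm, stream);
}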
ncclReduceScatter

ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff, size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)

Reduces data in sendbuff from all GPUs using the op operation and leaves the reduced result scattered over the devices, so that recvbuff on rank i will contain the i-th block of the result.

Note: This assumes the send count is equal to nranks*recvcount, which means that sendbuff should have a size of at least nranks*recvcount elements.

In-place operation will happen if recvbuff == sendbuff + rank * recvcount.

Related links: ReduceScatter, In-place Operations.
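A minimal sketch, assuming comm and stream already exist and that d_send holds at least nranks*recvcount elements; the helper name is illustrative.

#include <nccl.h>
#include <cuda_runtime.h>

/* Illustrative helper: sum-reduce nranks*recvcount floats from d_send and
   leave block i of the result (recvcount elements) in d_recv on rank i.
   comm and stream assumed initialized elsewhere; error checks omitted. */
static void reducescatter_sum(const float* d_send, float* d_recv, size_t recvcount,
                              ncclComm_t comm, cudaStream_t stream)
{
    ncclReduceScatter(d_send, d_recv, recvcount, ncclFloat, ncclSum, comm, stream);
}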