Collective Communication Functions
The following NCCL APIs provide some commonly used collective operations.
ncclAllReduce

ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)

Reduces data arrays of length count in sendbuff using the op operation and leaves identical copies of the result in each recvbuff.

In-place operation will happen if sendbuff == recvbuff.
Related links: AllReduce.
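The data movement can be sketched with a minimal host-side simulation in plain Python (no GPUs or NCCL involved; buffers are modeled as lists, one per rank, and the function name allreduce_sum is hypothetical). It shows the semantics for op = ncclSum: every rank ends up with the element-wise reduction of all ranks' send buffers.

```python
def allreduce_sum(sendbuffs):
    """Simulate ncclAllReduce with a sum reduction.

    sendbuffs[r] is rank r's sendbuff; the return value holds each
    rank's recvbuff. All recvbuffs receive identical copies of the
    element-wise sum across ranks.
    """
    count = len(sendbuffs[0])
    result = [sum(buf[i] for buf in sendbuffs) for i in range(count)]
    return [list(result) for _ in sendbuffs]
```

For example, with two ranks holding [1, 2] and [3, 4], both ranks end up with [4, 6].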
ncclBroadcast

ncclResult_t ncclBroadcast(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

Copies count elements from sendbuff on the root rank to all ranks' recvbuff. sendbuff is only used on rank root and ignored for other ranks.

In-place operation will happen if sendbuff == recvbuff.

ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

Legacy in-place version of ncclBroadcast, operating in a similar fashion to MPI_Bcast. A call to ncclBcast(buff, count, datatype, root, comm, stream) is equivalent to ncclBroadcast(buff, buff, count, datatype, root, comm, stream).
Related links: Broadcast.
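A minimal host-side sketch of the broadcast semantics in plain Python (an illustration of the data movement only, not actual NCCL usage; the name broadcast is hypothetical): every rank's recvbuff becomes a copy of the root rank's sendbuff, and sendbuff on non-root ranks is ignored.

```python
def broadcast(sendbuffs, root):
    """Simulate ncclBroadcast: copy the root rank's sendbuff into
    every rank's recvbuff. Non-root sendbuffs are never read."""
    return [list(sendbuffs[root]) for _ in sendbuffs]
```

With three ranks and root=1, every rank receives rank 1's data regardless of what the other ranks' send buffers contain.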
ncclReduce

ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream)

Reduces data arrays of length count in sendbuff into recvbuff on the root rank using the op operation. recvbuff is only used on rank root and ignored for other ranks.

In-place operation will happen if sendbuff == recvbuff.
Related links: Reduce.
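The difference from ncclAllReduce is that only the root rank receives the result. A host-side sketch in plain Python for op = ncclSum (illustration only, not actual NCCL usage; the name reduce_sum is hypothetical, and None marks the unused recvbuff on non-root ranks):

```python
def reduce_sum(sendbuffs, root):
    """Simulate ncclReduce with a sum reduction: the element-wise
    sum across ranks lands only in the root rank's recvbuff."""
    count = len(sendbuffs[0])
    result = [sum(buf[i] for buf in sendbuffs) for i in range(count)]
    return [result if rank == root else None
            for rank in range(len(sendbuffs))]
```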
ncclAllGather

ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream)

Gathers sendcount values from all GPUs and leaves identical copies of the result in each recvbuff, receiving data from rank i at offset i*sendcount.

Note: This assumes the receive count is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements.

In-place operation will happen if sendbuff == recvbuff + rank * sendcount.
Related links: AllGather, In-place Operations.
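The offset rule above can be sketched with a host-side simulation in plain Python (illustration only, not actual NCCL usage; the name allgather is hypothetical): rank i's sendcount elements land at offset i*sendcount in every rank's recvbuff, so each recvbuff holds nranks*sendcount elements.

```python
def allgather(sendbuffs):
    """Simulate ncclAllGather: concatenate all ranks' send buffers
    in rank order, and give every rank an identical copy."""
    gathered = [x for buf in sendbuffs for x in buf]  # rank 0 first
    return [list(gathered) for _ in sendbuffs]
```

With two ranks sending [1, 2] and [3, 4] (sendcount = 2), both ranks end up with [1, 2, 3, 4].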
ncclReduceScatter

ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff, size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)

Reduces data in sendbuff from all GPUs using the op operation and leaves the reduced result scattered over the devices, so that the recvbuff on rank i will contain the i-th block of the result.

Note: This assumes the send count is equal to nranks*recvcount, which means that sendbuff should have a size of at least nranks*recvcount elements.

In-place operation will happen if recvbuff == sendbuff + rank * recvcount.
Related links: ReduceScatter, In-place Operations.
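Conceptually this is a reduction followed by a scatter, which a host-side Python sketch makes concrete (illustration only for op = ncclSum, not actual NCCL usage; the name reducescatter_sum is hypothetical): the element-wise sum is taken over the full nranks*recvcount send buffers, then rank i keeps only block i.

```python
def reducescatter_sum(sendbuffs, recvcount):
    """Simulate ncclReduceScatter with a sum reduction: reduce the
    full send buffers element-wise, then scatter the result so rank
    i's recvbuff holds the i-th block of recvcount elements."""
    nranks = len(sendbuffs)
    total = [sum(buf[i] for buf in sendbuffs)
             for i in range(nranks * recvcount)]
    return [total[r * recvcount:(r + 1) * recvcount]
            for r in range(nranks)]
```

With two ranks sending [1, 2, 3, 4] and [10, 20, 30, 40] (recvcount = 2), the reduced result is [11, 22, 33, 44]; rank 0 receives [11, 22] and rank 1 receives [33, 44].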
ncclAlltoAll

ncclResult_t ncclAlltoAll(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream)

Each rank sends count values to all other ranks and receives count values from all other ranks. Data to send to destination rank j is taken from sendbuff+j*count, and data received from source rank i is placed at recvbuff+i*count.

Note: This assumes that both the total send count and the total receive count are equal to nranks*count, which means that sendbuff and recvbuff should each have a size of at least nranks*count elements.
Related links: AlltoAll.
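The indexing rule reads like a transpose of blocks across ranks, which a host-side Python sketch makes explicit (illustration only, not actual NCCL usage; the name alltoall is hypothetical): rank i's recvbuff holds, at offset j*count, the block that rank j addressed to rank i.

```python
def alltoall(sendbuffs, count):
    """Simulate ncclAlltoAll: rank j's block at offset i*count is
    delivered to rank i at offset j*count (a block transpose)."""
    nranks = len(sendbuffs)
    return [
        [x for j in range(nranks)
         for x in sendbuffs[j][i * count:(i + 1) * count]]
        for i in range(nranks)
    ]
```

With two ranks sending [0, 1] and [10, 11] (count = 1), rank 0 receives [0, 10] and rank 1 receives [1, 11].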
ncclGather

ncclResult_t ncclGather(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

Each rank sends count elements from sendbuff to the root rank. On the root rank, data from rank i is placed at recvbuff + i*count. On non-root ranks, recvbuff is not used.

Note: This assumes the receive count is equal to nranks*count, which means that recvbuff should have a size of at least nranks*count elements.

In-place operation will happen if sendbuff == recvbuff + root * count.
Related links: Gather.
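This is ncclAllGather with a single destination, which a host-side Python sketch shows directly (illustration only, not actual NCCL usage; the name gather is hypothetical, and None marks the unused recvbuff on non-root ranks):

```python
def gather(sendbuffs, root):
    """Simulate ncclGather: rank i's count elements land at offset
    i*count in the root rank's recvbuff; other recvbuffs are unused."""
    gathered = [x for buf in sendbuffs for x in buf]  # rank order
    return [gathered if rank == root else None
            for rank in range(len(sendbuffs))]
```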
ncclScatter

ncclResult_t ncclScatter(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

Each rank receives count elements from the root rank. On the root rank, count elements from sendbuff + i*count are sent to rank i. On non-root ranks, sendbuff is not used.

Note: This assumes the send count is equal to nranks*count, which means that sendbuff should have a size of at least nranks*count elements.

In-place operation will happen if recvbuff == sendbuff + root * count.
Related links: Scatter.
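Scatter is the inverse of gather, and the same host-side sketch style applies (illustration only, not actual NCCL usage; the name scatter is hypothetical, and only the root's sendbuff is read, so the others may be None):

```python
def scatter(sendbuffs, count, root):
    """Simulate ncclScatter: the root rank's sendbuff is split into
    nranks blocks of count elements; block i goes to rank i."""
    src = sendbuffs[root]  # non-root send buffers are never read
    return [src[i * count:(i + 1) * count]
            for i in range(len(sendbuffs))]
```

With root 0 holding [1, 2, 3, 4] and count = 2, rank 0 receives [1, 2] and rank 1 receives [3, 4].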