Point-to-point communication

(Since NCCL 2.7) Point-to-point communication can be used to express any communication pattern between ranks. Any point-to-point communication needs two NCCL calls: a call to ncclSend() on one rank and a corresponding ncclRecv() on the other rank, with the same count and data type.
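The matching rule above can be checked on the host before any GPU work is issued. The following is a minimal sketch, not part of NCCL: `p2p_op`, `op_kind` and `plan_is_matched` are hypothetical names, a plain `int` stands in for `ncclDataType_t`, and it assumes that calls between the same pair of ranks are paired in the order they are posted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one planned call; not an NCCL type. */
typedef enum { OP_SEND, OP_RECV } op_kind;

typedef struct {
  int rank;      /* rank issuing the call                      */
  op_kind kind;  /* OP_SEND for ncclSend, OP_RECV for ncclRecv */
  int peer;      /* destination (send) or source (recv)        */
  size_t count;  /* number of elements                         */
  int dtype;     /* stand-in for an ncclDataType_t value       */
} p2p_op;

/* Returns true if every send is paired with a receive posted by its peer
 * with the same count and data type, and no receive is left unpaired.
 * Assumption: the k-th send from a to b pairs with the k-th receive at b from a. */
bool plan_is_matched(const p2p_op *ops, size_t n) {
  assert(ops != NULL || n == 0);
  size_t sends = 0, recvs = 0;
  for (size_t i = 0; i < n; i++) {
    if (ops[i].kind == OP_RECV) { recvs++; continue; }
    sends++;
    /* k = number of earlier sends from the same rank to the same peer */
    size_t k = 0;
    for (size_t j = 0; j < i; j++)
      if (ops[j].kind == OP_SEND && ops[j].rank == ops[i].rank &&
          ops[j].peer == ops[i].peer)
        k++;
    /* locate the k-th receive posted by the peer from this rank */
    const p2p_op *match = NULL;
    for (size_t j = 0, seen = 0; j < n && match == NULL; j++)
      if (ops[j].kind == OP_RECV && ops[j].rank == ops[i].peer &&
          ops[j].peer == ops[i].rank && seen++ == k)
        match = &ops[j];
    if (match == NULL || match->count != ops[i].count ||
        match->dtype != ops[i].dtype)
      return false;
  }
  /* each send claimed a distinct receive, so equal totals mean none is left over */
  return sends == recvs;
}
```

For example, a two-rank exchange where both sides send and receive 4 elements passes, while a send of 4 elements met by a receive of 8 elements is rejected.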

Multiple calls to ncclSend() and ncclRecv() targeting different peers can be fused into a group with ncclGroupStart() and ncclGroupEnd() to form more complex communication patterns such as one-to-all (scatter), all-to-one (gather), all-to-all, or communication with neighbors in an N-dimensional space.

Point-to-point calls within a group block until the whole group of calls completes, but the calls within a group can be seen as progressing independently, so they should never block each other. It is therefore important to merge calls that need to progress concurrently into the same group to avoid deadlocks.
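To illustrate the deadlock argument, the toy model below (plain C, no GPU or NCCL needed) treats each ungrouped call, or each ncclGroupStart()/ncclGroupEnd() group, as one step that a rank cannot leave until every call in it has been matched. It conservatively treats every send as a rendezvous that waits for the peer's receive. `toy_op`, `toy_rank` and `toy_run` are hypothetical names, and real NCCL progress is more involved than this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OPS 8

/* One call in the toy model: which step it belongs to, direction, peer. */
typedef struct { int step; bool is_send; int peer; bool done; } toy_op;

/* A rank runs steps 0..nsteps-1 in order; `cur` is the step it is blocked in.
 * A step is a single ungrouped call or a whole group of calls. */
typedef struct { toy_op ops[MAX_OPS]; int nops; int cur; int nsteps; } toy_rank;

static bool step_done(const toy_rank *r) {
  for (int i = 0; i < r->nops; i++)
    if (r->ops[i].step == r->cur && !r->ops[i].done) return false;
  return true;
}

/* Returns true if every rank finishes, false if the schedule deadlocks. */
bool toy_run(toy_rank *ranks, int nranks) {
  assert(nranks > 0);
  bool progress = true;
  while (progress) {
    progress = false;
    /* Match each posted send with a posted receive from this rank on its peer. */
    for (int a = 0; a < nranks; a++) {
      toy_rank *ra = &ranks[a];
      for (int i = 0; i < ra->nops; i++) {
        toy_op *s = &ra->ops[i];
        if (!s->is_send || s->done || s->step != ra->cur) continue;
        assert(s->peer >= 0 && s->peer < nranks);
        toy_rank *rb = &ranks[s->peer];
        for (int j = 0; j < rb->nops; j++) {
          toy_op *r = &rb->ops[j];
          if (!r->is_send && !r->done && r->step == rb->cur && r->peer == a) {
            s->done = r->done = true;
            progress = true;
            break;
          }
        }
      }
    }
    /* A rank moves on only once every call in its current step completed. */
    for (int a = 0; a < nranks; a++) {
      if (ranks[a].cur < ranks[a].nsteps && step_done(&ranks[a])) {
        ranks[a].cur++;
        progress = true;
      }
    }
  }
  for (int a = 0; a < nranks; a++)
    if (ranks[a].cur < ranks[a].nsteps) return false;
  return true;
}
```

In this model, two ranks that each call ncclSend() before ncclRecv() outside a group never progress, while the same four calls issued as one group per rank complete.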

Below are a few examples of classic point-to-point communication patterns used by parallel applications. NCCL semantics also allow variants of each pattern in which sizes, datatypes, and buffers differ per rank.

Sendrecv

In MPI terms, a sendrecv operation is one in which two ranks exchange data, each sending and receiving at the same time. This can be done by merging the ncclSend() and ncclRecv() calls into one group:

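ncclGroupStart();
// Both calls progress concurrently, so neither can block the other.
// recvcount/recvtype must match the sendcount/sendtype used by peer.
ncclSend(sendbuff, sendcount, sendtype, peer, comm, stream);
ncclRecv(recvbuff, recvcount, recvtype, peer, comm, stream);
ncclGroupEnd();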
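One simple way to choose `peer` is to pair ranks up. The sketch below assumes an even number of ranks and pairs rank 0 with 1, 2 with 3, and so on; `pair_peer` is a hypothetical helper. `emulate_sendrecv` is a CPU-only stand-in showing where the data ends up once the group completes, with each rank's buffer stored as one slice of a host array.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: partner of `rank` when ranks are paired 0-1, 2-3, ...
 * Both ranks of a pair pass each other as `peer`. Assumes nranks is even. */
int pair_peer(int rank, int nranks) {
  assert(nranks % 2 == 0 && rank >= 0 && rank < nranks);
  return rank ^ 1;
}

/* CPU-only emulation: after the group completes, rank r's receive buffer
 * holds its peer's send buffer. Rank r owns send[r*count .. r*count+count-1]
 * and the matching slice of recv. */
void emulate_sendrecv(const float *send, float *recv, int nranks, size_t count) {
  for (int r = 0; r < nranks; r++) {
    int peer = pair_peer(r, nranks);
    memcpy(recv + (size_t)r * count, send + (size_t)peer * count,
           count * sizeof *recv);
  }
}
```

Because the pairing is symmetric (`pair_peer(pair_peer(r)) == r`), every ncclSend() to a peer is met by that peer's ncclRecv() from this rank.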

One-to-all (scatter)

A one-to-all operation from a root rank can be expressed by merging all send and receive operations in a group:

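ncclGroupStart();
if (rank == root) {
  // The root sends one chunk to every rank, including itself.
  for (int r=0; r<nranks; r++)
    ncclSend(sendbuff[r], size, type, r, comm, stream);
}
// Every rank, the root included, receives its chunk from the root.
ncclRecv(recvbuff, size, type, root, comm, stream);
ncclGroupEnd();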
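When the root keeps all chunks in one contiguous buffer rather than an array of pointers, each `sendbuff[r]` is simply an offset into that buffer. The helpers below are hypothetical and host-only: `scatter_chunk` computes the offset, and `emulate_scatter` copies chunks the way the group would, so the data placement can be checked without GPUs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: start of chunk r in a contiguous root buffer,
 * i.e. the pointer the root would pass as sendbuff[r]. */
const float *scatter_chunk(const float *base, int r, size_t count) {
  return base + (size_t)r * count;
}

/* CPU-only emulation of the scatter group: rank r receives chunk r of the
 * root's buffer. `recv` holds nranks consecutive per-rank buffers of `count`. */
void emulate_scatter(const float *root_send, float *recv, int nranks, size_t count) {
  assert(root_send != NULL && recv != NULL);
  for (int r = 0; r < nranks; r++)
    memcpy(recv + (size_t)r * count, scatter_chunk(root_send, r, count),
           count * sizeof *recv);
}
```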

All-to-one (gather)

Similarly, an all-to-one operation to a root rank can be implemented this way:

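ncclGroupStart();
if (rank == root) {
  // The root receives one chunk from every rank, including itself.
  for (int r=0; r<nranks; r++)
    ncclRecv(recvbuff[r], size, type, r, comm, stream);
}
// Every rank, the root included, sends its chunk to the root.
ncclSend(sendbuff, size, type, root, comm, stream);
ncclGroupEnd();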
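Since counts may differ per rank, a root gathering into one contiguous buffer needs the starting offset of each rank's contribution, an exclusive prefix sum of the counts. A minimal sketch, assuming the root knows every rank's count in advance; `counts`, `displs`, `recvbase` and `gather_displs` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Computes displs[r] = counts[0] + ... + counts[r-1], the offset (in elements)
 * of rank r's contribution in the root's contiguous receive buffer, and
 * returns the total number of elements the root receives. In the gather
 * group the root would then post, for each r,
 *   ncclRecv(recvbase + displs[r], counts[r], type, r, comm, stream);
 * where recvbase is a typed pointer to the root's buffer (hypothetical),
 * and rank r would send exactly counts[r] elements. */
size_t gather_displs(const size_t *counts, size_t *displs, int nranks) {
  assert(nranks >= 0);
  size_t total = 0;
  for (int r = 0; r < nranks; r++) {
    displs[r] = total;
    total += counts[r];
  }
  return total;
}
```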

All-to-all

An all-to-all operation is a merged loop of send/recv operations to/from all peers:

ncclGroupStart();
for (int r=0; r<nranks; r++) {
  ncclSend(sendbuff[r], sendcount, sendtype, r, comm, stream);
  ncclRecv(recvbuff[r], recvcount, recvtype, r, comm, stream);
}
ncclGroupEnd();
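The data movement of the all-to-all loop is a transpose: `sendbuff[i]` on rank r ends up in `recvbuff[r]` on rank i. The CPU-only emulation below makes that explicit; `emulate_alltoall` is a hypothetical name, and it assumes each rank's chunks are stored contiguously and all chunks have the same count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* CPU-only emulation of the all-to-all group. Each of the nranks ranks owns
 * nranks chunks of `count` ints; chunk c of rank i starts at index
 * (i * nranks + c) * count, in both `send` and `recv`. */
void emulate_alltoall(const int *send, int *recv, int nranks, size_t count) {
  assert(nranks > 0);
  size_t n = (size_t)nranks;
  for (size_t i = 0; i < n; i++)      /* receiving rank */
    for (size_t r = 0; r < n; r++)    /* sending rank   */
      /* recvbuff[r] on rank i gets sendbuff[i] from rank r */
      memcpy(recv + (i * n + r) * count, send + (r * n + i) * count,
             count * sizeof *recv);
}
```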

Neighbor exchange

Finally, exchanging data with neighbors in an N-dimensional space, where next[d] and prev[d] hold the ranks of the neighbors along dimension d, can be done with:

ncclGroupStart();
for (int d=0; d<ndims; d++) {
  ncclSend(sendbuff[d], sendcount, sendtype, next[d], comm, stream);
  ncclRecv(recvbuff[d], recvcount, recvtype, prev[d], comm, stream);
}
ncclGroupEnd();
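The snippet above assumes `next[d]` and `prev[d]` are already known. A minimal sketch of computing them for a periodic grid, assuming ranks are numbered in row-major order (last dimension varies fastest); `grid_neighbors` and `dims` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical helper: fills prev[d] and next[d] for `rank` in a periodic
 * grid of ndims dimensions with extents dims[0..ndims-1], ranks numbered
 * in row-major order. */
void grid_neighbors(int rank, int ndims, const int *dims, int *prev, int *next) {
  int stride = 1;  /* distance in ranks between neighbors along dimension d */
  for (int d = ndims - 1; d >= 0; d--) {
    assert(dims[d] > 0);
    int coord = (rank / stride) % dims[d];
    int up = (coord + 1) % dims[d];              /* wraps around at the edge */
    int down = (coord - 1 + dims[d]) % dims[d];
    next[d] = rank + (up - coord) * stride;
    prev[d] = rank + (down - coord) * stride;
    stride *= dims[d];
  }
}
```

With periodic wrap-around, the rank listed as `next[d]` of rank x always has x as its `prev[d]`, so every ncclSend() in the loop meets a matching ncclRecv() on the peer.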