NCCL API
The following sections describe the collective communication methods and operations.
Communicator Creation And Management Functions
The following functions are public APIs exposed by the NVIDIA® Collective Communications Library™ (NCCL) to create and manage communicators for collective communication operations.
ncclGetUniqueId
The ncclGetUniqueId function generates an Id to be used in the ncclCommInitRank function. It should be called once, and the Id should be distributed to all ranks in the communicator before calling ncclCommInitRank.
ncclResult_t ncclGetUniqueId(ncclUniqueId* uniqueId);
ncclCommInitRank
ncclResult_t ncclCommInitRank(ncclComm_t* comm, int nranks, ncclUniqueId commId, int rank);
The ncclCommInitRank function implicitly synchronizes with the other ranks, so it must either be called by different threads or processes, or be wrapped within ncclGroupStart and ncclGroupEnd.
Type | Argument Name | Description |
---|---|---|
ncclComm_t* | comm | Returned communicator. |
int | nranks | Number of ranks in the communicator. |
ncclUniqueId | commId | The unique Id returned by ncclGetUniqueId. It must be the same on all ranks. |
int | rank | The rank associated with the current device. The rank must be between 0 and nranks-1 and unique within the communicator clique. |
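A minimal initialization sketch, assuming one process per GPU and MPI as the out-of-band transport for the unique Id (any other mechanism that distributes the same bytes to every rank works equally well):

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  /* Rank 0 generates the Id; all ranks must receive the same bytes. */
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  /* Assumption for this sketch: rank i drives GPU i on its node. */
  cudaSetDevice(rank);
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  /* ... collective operations on comm ... */

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```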
ncclCommInitAll
ncclResult_t ncclCommInitAll(ncclComm_t* comm, int ndev, const int* devlist);
The ncclCommInitAll function returns an array of ndev newly initialized communicators in comm. The comm argument should be pre-allocated with a size of at least ndev*sizeof(ncclComm_t). If devlist is NULL, the first ndev CUDA devices are used. The order of devlist defines the user order of the devices within the communicator.
Type | Argument Name | Description |
---|---|---|
ncclComm_t* | comm | Returned array of communicators. The comm argument should be pre-allocated with a size of at least: ndev*sizeof(ncclComm_t). |
int | ndev | Number of ranks or devices in the communicator. |
const int* | devlist | A list of CUDA devices to associate with each rank. Should be an array of ndev integers. |
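For a single process driving several GPUs, a sketch along these lines (the device count and heap allocation are illustrative, not required by the API):

```c
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  int ndev;
  cudaGetDeviceCount(&ndev);

  /* comm must hold at least ndev communicators. */
  ncclComm_t* comms = (ncclComm_t*)malloc(ndev * sizeof(ncclComm_t));
  ncclCommInitAll(comms, ndev, NULL);  /* NULL: use devices 0..ndev-1 */

  /* ... collective operations, typically inside a group call ... */

  for (int i = 0; i < ndev; i++) ncclCommDestroy(comms[i]);
  free(comms);
  return 0;
}
```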
ncclCommDestroy
The ncclCommDestroy function frees the resources that are allocated to a communicator object.
ncclResult_t ncclCommDestroy(ncclComm_t comm);
ncclCommCount
The ncclCommCount function returns the number of ranks in the communicator.
ncclResult_t ncclCommCount(const ncclComm_t comm, int* count);
ncclCommCuDevice
The ncclCommCuDevice function returns the CUDA device number associated with the communicator.
ncclResult_t ncclCommCuDevice(const ncclComm_t comm, int* device);
ncclCommUserRank
The ncclCommUserRank function returns the rank of the caller within the communicator.
ncclResult_t ncclCommUserRank(const ncclComm_t comm, int* rank);
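The three query functions combine naturally into a small diagnostic helper. A sketch; the name printCommInfo is ours, not part of NCCL, and comm is assumed to be an initialized communicator:

```c
#include <stdio.h>
#include <nccl.h>

/* Hypothetical helper: report what a communicator knows about itself. */
void printCommInfo(ncclComm_t comm) {
  int count, device, rank;
  ncclCommCount(comm, &count);      /* number of ranks in the communicator */
  ncclCommCuDevice(comm, &device);  /* CUDA device used by the local rank */
  ncclCommUserRank(comm, &rank);    /* rank of the caller */
  printf("rank %d of %d on CUDA device %d\n", rank, count, device);
}
```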
Collective Communication Functions
The following NCCL APIs provide some commonly used collective operations.
ncclAllReduce
ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);
The ncclAllReduce function reduces data arrays of length count in sendbuff using the op operation and leaves identical copies of the result in each recvbuff.
Type | Argument Name | Description |
---|---|---|
const void* | sendbuff | Pointer to the data to read from. |
void* | recvbuff | Pointer to the data to write to. |
size_t | count | Number of elements to process. |
ncclDataType_t | datatype | Type of element. |
ncclRedOp_t | op | Operation to perform on each element. |
ncclComm_t | comm | Communicator object. |
cudaStream_t | stream | CUDA stream to run the operation on. |
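A sketch of a blocking sum over floats; comm and stream are assumed to be initialized, and both buffers to be device memory holding at least count elements. Passing the same pointer as sendbuff and recvbuff performs the operation in place:

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Sketch: sum `count` floats across all ranks; every rank receives
 * the full result. */
void allReduceSum(const float* sendbuff, float* recvbuff, size_t count,
                  ncclComm_t comm, cudaStream_t stream) {
  ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);  /* wait until the result is available */
}
```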
ncclBroadcast
ncclResult_t ncclBroadcast(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream);
The ncclBroadcast function copies count elements from sendbuff on the root rank to recvbuff on all ranks.
ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream);
The ncclBcast function is the in-place variant: buff serves as the source on the root rank and as the destination on all ranks.
Type | Argument Name | Description |
---|---|---|
const void* | sendbuff | Pointer to the data to read from. |
void* | recvbuff | Pointer to the data to write to. |
size_t | count | Number of elements to process. |
ncclDataType_t | datatype | Type of element. |
int | root | Rank of the root of the operation. |
ncclComm_t | comm | Communicator object. |
cudaStream_t | stream | CUDA stream to run the operation on. |
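The two entry points differ only in buffer handling, as this sketch shows (root, buffers, comm, and stream are assumed valid):

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Out-of-place: sendbuff is read on root, recvbuff written on all ranks. */
void bcastOutOfPlace(const float* sendbuff, float* recvbuff, size_t count,
                     int root, ncclComm_t comm, cudaStream_t stream) {
  ncclBroadcast(sendbuff, recvbuff, count, ncclFloat, root, comm, stream);
}

/* In-place legacy form: buff is the source on root and the destination
 * on every rank. */
void bcastInPlace(float* buff, size_t count, int root,
                  ncclComm_t comm, cudaStream_t stream) {
  ncclBcast(buff, count, ncclFloat, root, comm, stream);
}
```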
ncclReduce
ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream);
The ncclReduce function reduces data arrays of length count in sendbuff using the op operation and leaves the result in recvbuff on the root rank only.
Type | Argument Name | Description |
---|---|---|
const void* | sendbuff | Pointer to the data to read from. |
void* | recvbuff | Pointer to the data to write to. |
size_t | count | Number of elements to process. |
ncclDataType_t | datatype | Type of element. |
ncclRedOp_t | op | Operation to perform on each element. |
int | root | Rank of the root of the operation. |
ncclComm_t | comm | Communicator object. |
cudaStream_t | stream | CUDA stream to run the operation on. |
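A sketch emphasizing that only the root's recvbuff is written; on the other ranks it is ignored:

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Sketch: sum `count` floats from every rank into recvbuff on `root`.
 * recvbuff is only used on the root rank. */
void reduceSumToRoot(const float* sendbuff, float* recvbuff, size_t count,
                     int root, ncclComm_t comm, cudaStream_t stream) {
  ncclReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, root,
             comm, stream);
}
```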
ncclAllGather
ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream);
The ncclAllGather function gathers sendcount values from each rank and leaves identical copies of the result, ordered by rank index, in recvbuff on every rank.
Type | Argument Name | Description |
---|---|---|
const void* | sendbuff | Pointer to the data to read from. |
void* | recvbuff | Pointer to the data to write to. It must hold at least sendcount*nranks elements. |
size_t | sendcount | Number of elements sent per rank. |
ncclDataType_t | datatype | Type of element. |
ncclComm_t | comm | Communicator object. |
cudaStream_t | stream | CUDA stream to run the operation on. |
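A sketch showing the sizing rule: each rank contributes sendcount elements, so the receive buffer must hold sendcount*nranks elements, ordered by rank index. The helper name and allocation strategy are illustrative:

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Sketch: allocate a correctly sized receive buffer, then gather. */
float* allGatherFloats(const float* sendbuff, size_t sendcount,
                       ncclComm_t comm, cudaStream_t stream) {
  int nranks;
  ncclCommCount(comm, &nranks);
  float* recvbuff;
  cudaMalloc((void**)&recvbuff, sendcount * nranks * sizeof(float));
  ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);
  return recvbuff;  /* block i of sendcount elements came from rank i */
}
```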
ncclReduceScatter
ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff, size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);
The ncclReduceScatter function reduces data in sendbuff using the op operation and leaves the result scattered over the ranks, so that recvbuff on rank i contains the i-th block of the result.
Type | Argument Name | Description |
---|---|---|
const void* | sendbuff | Pointer to the data to read from. It must hold at least recvcount*nranks elements. |
void* | recvbuff | Pointer to the data to write to. |
size_t | recvcount | Number of elements to receive by each rank. |
ncclDataType_t | datatype | Type of element. |
ncclRedOp_t | op | Operation to perform on each element. |
ncclComm_t | comm | Communicator object. |
cudaStream_t | stream | CUDA stream to run the operation on. |
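The mirror image of ncclAllGather, as a sketch: sendbuff holds recvcount*nranks elements on every rank, and rank i receives the reduction of everyone's i-th block (buffers, comm, and stream assumed valid):

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Sketch: sendbuff has recvcount*nranks floats on every rank;
 * rank i receives the element-wise sum of all ranks' i-th blocks. */
void reduceScatterSum(const float* sendbuff, float* recvbuff,
                      size_t recvcount, ncclComm_t comm,
                      cudaStream_t stream) {
  ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat, ncclSum,
                    comm, stream);
}
```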
Group Calls
Group primitives change the behavior of the current thread so that NCCL calls do not block. They can therefore be used from multiple threads independently.
ncclGroupStart
The ncclGroupStart call starts a group call.
All subsequent calls to NCCL, until ncclGroupEnd, will not block due to inter-CPU synchronization.
ncclResult_t ncclGroupStart();
ncclGroupEnd
The ncclGroupEnd call ends a group call.
The ncclGroupEnd call returns when all operations since ncclGroupStart have been processed. This means communication primitives have been enqueued to the provided streams, but are not necessarily complete. When used with ncclCommInitRank, it means all communicators have been initialized and are ready to be used.
ncclResult_t ncclGroupEnd();
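A sketch of the common single-thread, multi-GPU pattern: outside a group, each ncclAllReduce would block waiting for the other devices; inside a group, the calls are merged and ncclGroupEnd issues them together. Communicators, streams, and buffers are assumed to be set up already, for example with ncclCommInitAll:

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Sketch: one thread drives ndev devices without deadlocking. */
void multiGpuAllReduce(ncclComm_t* comms, cudaStream_t* streams,
                       float** sendbufs, float** recvbufs,
                       size_t count, int ndev) {
  ncclGroupStart();
  for (int i = 0; i < ndev; i++) {
    ncclAllReduce(sendbufs[i], recvbufs[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();  /* returns once every operation has been enqueued */
}
```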
Types
The following types are used by the NCCL library. These types are useful when configuring your collective operations.
ncclDataType_t
Data-Type | Description |
---|---|
ncclInt8, ncclChar | Signed 8-bit integer. |
ncclUint8 | Unsigned 8-bit integer. |
ncclInt32, ncclInt | Signed 32-bit integer. |
ncclUint32 | Unsigned 32-bit integer. |
ncclInt64 | Signed 64-bit integer. |
ncclUint64 | Unsigned 64-bit integer. |
ncclFloat16, ncclHalf | 16-bit floating-point number (half precision). |
ncclFloat32, ncclFloat | 32-bit floating-point number (single precision). |
ncclFloat64, ncclDouble | 64-bit floating-point number (double precision). |
ncclRedOp_t
Reduction Operation | Description |
---|---|
ncclSum | Perform a sum (+) operation. |
ncclProd | Perform a product (*) operation. |
ncclMin | Perform a min operation. |
ncclMax | Perform a max operation. |
ncclResult_t
NCCL functions always return an error code of type ncclResult_t.
If the NCCL_DEBUG environment variable is set to WARN, NCCL prints an explanation whenever a function returns an error.
Return Code | Description |
---|---|
ncclSuccess | The operation completed successfully. |
ncclUnhandledCudaError | A call to CUDA returned a fatal error for the NCCL operation. |
ncclSystemError | A call to the system returned a fatal error for the NCCL operation. |
ncclInternalError | NCCL experienced an internal error. |
ncclInvalidArgument | The user has supplied an invalid argument. |
ncclInvalidUsage | The user has used NCCL in an invalid manner. |
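Since every NCCL function returns an ncclResult_t, a checking wrapper is a common convenience. A sketch; the macro name NCCLCHECK is ours, while ncclGetErrorString is part of the library and returns a human-readable string for a result code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <nccl.h>

/* Hypothetical convenience macro: abort with a readable message on error. */
#define NCCLCHECK(cmd) do {                                  \
  ncclResult_t r = (cmd);                                    \
  if (r != ncclSuccess) {                                    \
    fprintf(stderr, "NCCL error %s:%d: %s\n",                \
            __FILE__, __LINE__, ncclGetErrorString(r));      \
    exit(EXIT_FAILURE);                                      \
  }                                                          \
} while (0)
```

For example: NCCLCHECK(ncclCommInitRank(&comm, nranks, id, rank));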
Constants
NCCL defines two constants, NCCL_MAJOR and NCCL_MINOR, to help distinguish between API changes, in particular between the NCCL 1.x and NCCL 2.x APIs.
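Because NCCL 1.x does not define these macros, their absence can itself be tested. A minimal compile-time guard sketch:

```c
#include <nccl.h>

/* Fail the build unless we are compiling against the NCCL 2.x API. */
#if !defined(NCCL_MAJOR) || (NCCL_MAJOR < 2)
#error "This code requires NCCL 2.x or later"
#endif
```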