NCCL API

The following sections describe the collective communication methods and operations.

Communicator Creation And Management Functions

The following functions are public APIs exposed by the NVIDIA® Collective Communications Library™ (NCCL) to create and manage the communicator objects used by collective communication operations.

ncclGetUniqueId

The ncclGetUniqueId function generates an Id to be used in the ncclCommInitRank function.

The ncclGetUniqueId function should be called once. The Id should be distributed to all of the ranks in the communicator before calling the ncclCommInitRank function.
ncclResult_t  ncclGetUniqueId(ncclUniqueId* uniqueId);
The following table lists the arguments that are passed to the ncclGetUniqueId function.
Type            Argument Name   Description
ncclUniqueId*   uniqueId        Pointer to an already allocated unique Id.
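
For illustration, a minimal sketch of the usual pattern: rank 0 generates the Id and sends it to every other rank out-of-band. MPI is assumed here purely as the distribution mechanism; NCCL itself does not require it.

#include <mpi.h>
#include <nccl.h>

ncclUniqueId id;
int myRank;
MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
if (myRank == 0) ncclGetUniqueId(&id);                    /* generate once, on a single rank */
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);  /* distribute the Id to all ranks */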

ncclCommInitRank

The ncclCommInitRank function creates a new communicator object for the current CUDA® device. This function allows for multi-process initialization.
ncclResult_t  ncclCommInitRank(ncclComm_t* comm, int nranks, ncclUniqueId commId, int rank);

The ncclCommInitRank function implicitly synchronizes with other ranks, so it must be called by different threads or processes, or from within a ncclGroupStart/ncclGroupEnd section.

The following table lists the arguments that are passed to the ncclCommInitRank function.
Type            Argument Name   Description
ncclComm_t*     comm            Returned communicator.
int             nranks          Number of ranks in the communicator.
ncclUniqueId    commId          Unique Id to use for this communicator, as returned by ncclGetUniqueId.
int             rank            Rank associated with the current device. The rank must be between 0 and nranks-1 and unique within the communicator clique.
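
Continuing the sketch above, each rank then initializes its communicator with the shared Id. One GPU per rank is assumed, mapped round-robin onto the locally visible devices; this mapping is an assumption of the sketch, not a NCCL requirement.

#include <cuda_runtime.h>

ncclComm_t comm;
int nRanks, nDev;
MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
cudaGetDeviceCount(&nDev);
cudaSetDevice(myRank % nDev);                 /* assumed rank-to-GPU mapping */
ncclCommInitRank(&comm, nRanks, id, myRank);  /* synchronizes with the other nRanks-1 ranks */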

ncclCommInitAll

The ncclCommInitAll function creates a clique of communicator objects in a single call. The resulting communicators only work within a single process.
ncclResult_t  ncclCommInitAll(ncclComm_t* comm, int ndev, const int* devlist);

The ncclCommInitAll function returns an array of ndev newly initialized communicators in comm. The comm argument should be pre-allocated with a size of at least ndev*sizeof(ncclComm_t). If devlist is NULL, the first ndev CUDA devices are used. The order of devlist defines the user order of the devices within the communicator.

The following table lists the arguments that are passed to the ncclCommInitAll function.
Type            Argument Name   Description
ncclComm_t*     comm            Returned array of communicators. Should be pre-allocated with a size of at least ndev*sizeof(ncclComm_t).
int             ndev            Number of ranks (devices) in the communicator.
const int*      devlist         List of CUDA devices to associate with each rank. Should be an array of ndev integers.
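
A minimal single-process sketch that creates one communicator per visible GPU and later frees them with ncclCommDestroy (described below):

#include <cuda_runtime.h>
#include <nccl.h>
#include <stdlib.h>

int ndev;
cudaGetDeviceCount(&ndev);
ncclComm_t* comms = malloc(ndev * sizeof(ncclComm_t));
ncclCommInitAll(comms, ndev, NULL);   /* NULL devlist: use the first ndev CUDA devices */
/* ... launch collectives on the communicators ... */
for (int i = 0; i < ndev; i++) ncclCommDestroy(comms[i]);
free(comms);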

ncclCommDestroy

The ncclCommDestroy function frees resources that are allocated to a communicator object.
ncclResult_t  ncclCommDestroy(ncclComm_t comm);
The following table lists the arguments that are passed to the ncclCommDestroy function.
Type            Argument Name   Description
ncclComm_t      comm            Communicator object to free.

ncclCommCount

The ncclCommCount function returns the number of ranks in a communicator.
ncclResult_t  ncclCommCount(const ncclComm_t comm, int* count);
The following table lists the arguments that are passed to the ncclCommCount function.
Type            Argument Name   Description
ncclComm_t      comm            Communicator object.
int*            count           Number of ranks returned.

ncclCommCuDevice

The ncclCommCuDevice function returns the CUDA device associated with a communicator object.
ncclResult_t  ncclCommCuDevice(const ncclComm_t comm, int* device);
The following table lists the arguments that are passed to the ncclCommCuDevice function.
Type            Argument Name   Description
ncclComm_t      comm            Communicator object.
int*            device          CUDA device returned.

ncclCommUserRank

The ncclCommUserRank function returns the rank of a communicator object.
ncclResult_t  ncclCommUserRank(const ncclComm_t comm, int* rank);
The following table lists the arguments that are passed to the ncclCommUserRank function.
Type            Argument Name   Description
ncclComm_t      comm            Communicator object.
int*            rank            Rank returned.
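
As a sketch, the three query functions are typically used together to describe a communicator; comm is assumed to be an already initialized ncclComm_t.

#include <stdio.h>

int count, device, rank;
ncclCommCount(comm, &count);       /* number of ranks in the communicator */
ncclCommCuDevice(comm, &device);   /* CUDA device this communicator is attached to */
ncclCommUserRank(comm, &rank);     /* rank of this communicator within the clique */
printf("rank %d of %d on CUDA device %d\n", rank, count, device);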

Collective Communication Functions

The following NCCL APIs provide some commonly used collective operations.

ncclAllReduce

The ncclAllReduce function reduces data arrays of length count in sendbuff using op operation and leaves identical copies of the result on each recvbuff.
ncclResult_t  ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
    ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);
The following table lists the arguments that are passed to the ncclAllReduce function.
Type            Argument Name   Description
const void*     sendbuff        Pointer to the data to read from.
void*           recvbuff        Pointer to the data to write to.
size_t          count           Number of elements to process.
ncclDataType_t  datatype        Type of element.
ncclRedOp_t     op              Operation to perform on each element.
ncclComm_t      comm            Communicator object.
cudaStream_t    stream          CUDA stream to run the operation on.
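
A sketch of a float sum all-reduce; sendbuff and recvbuff are assumed to be device buffers of count floats (for example, allocated with cudaMalloc), and comm an initialized communicator.

cudaStream_t stream;
cudaStreamCreate(&stream);
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
cudaStreamSynchronize(stream);   /* the collective is asynchronous; wait on the stream for the result */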

ncclBroadcast

The ncclBroadcast function copies count values from the root rank to all ranks.
ncclResult_t  ncclBroadcast(const void* sendbuff, void* recvbuff, size_t count,
    ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream);
The ncclBcast function is a legacy in-place version of ncclBroadcast, in a similar fashion to MPI_Bcast. A call to ncclBcast(buff, count, datatype, root, comm, stream) is equivalent to ncclBroadcast(buff, buff, count, datatype, root, comm, stream).
ncclResult_t  ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream);
The following table lists the arguments that are passed to the ncclBroadcast function.
Type            Argument Name   Description
const void*     sendbuff        Pointer to the data to read from.
void*           recvbuff        Pointer to the data to write to.
size_t          count           Number of elements to process.
ncclDataType_t  datatype        Type of element.
int             root            Rank of the root of the operation.
ncclComm_t      comm            Communicator object.
cudaStream_t    stream          CUDA stream to run the operation on.
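
A sketch of broadcasting count floats from rank 0, under the same buffer assumptions as above; the comment shows the legacy in-place form.

ncclBroadcast(sendbuff, recvbuff, count, ncclFloat, 0, comm, stream);
/* In-place legacy variant, equivalent to ncclBroadcast(buff, buff, ...):
   ncclBcast(buff, count, ncclFloat, 0, comm, stream); */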

ncclReduce

The ncclReduce function reduces data arrays of length count in sendbuff into recvbuff using the op operation.
ncclResult_t  ncclReduce(const void* sendbuff, void* recvbuff, size_t count,
    ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream);
The following table lists the arguments that are passed to the ncclReduce function.
Type            Argument Name   Description
const void*     sendbuff        Pointer to the data to read from.
void*           recvbuff        Pointer to the data to write to.
size_t          count           Number of elements to process.
ncclDataType_t  datatype        Type of element.
ncclRedOp_t     op              Operation to perform on each element.
int             root            Rank of the root of the operation.
ncclComm_t      comm            Communicator object.
cudaStream_t    stream          CUDA stream to run the operation on.
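
A sketch of a rooted sum reduction, with buffers and comm assumed as in the previous examples; only the root rank's recvbuff receives the result.

ncclReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, 0, comm, stream);
cudaStreamSynchronize(stream);   /* result is valid on rank 0 once the stream completes */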

ncclAllGather

The ncclAllGather function gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount.
Note: This assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements.
ncclResult_t  ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
    ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream);
The following table lists the arguments that are passed to the ncclAllGather function.
Type            Argument Name   Description
const void*     sendbuff        Pointer to the data to read from.
void*           recvbuff        Pointer to the data to write to. Should have a size of sendcount*nranks elements.
size_t          sendcount       Number of elements sent per rank.
ncclDataType_t  datatype        Type of element.
ncclComm_t      comm            Communicator object.
cudaStream_t    stream          CUDA stream to run the operation on.
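
A sketch under the same assumptions; recvbuff must hold nranks*sendcount floats.

ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);
/* After completion, recvbuff[i*sendcount .. (i+1)*sendcount-1] holds rank i's contribution. */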

ncclReduceScatter

The ncclReduceScatter function reduces data in sendbuff using the op operation and leaves the reduced result scattered over the devices so that the recvbuff on rank i will contain the i-th block of the result.
Note: This assumes sendcount is equal to nranks*recvcount, which means that sendbuff should have a size of at least nranks*recvcount elements.
ncclResult_t  ncclReduceScatter(const void* sendbuff, void* recvbuff, size_t recvcount,
    ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);
The following table lists the arguments that are passed to the ncclReduceScatter function.
Type            Argument Name   Description
const void*     sendbuff        Pointer to the data to read from. Should have a size of recvcount*nranks elements.
void*           recvbuff        Pointer to the data to write to.
size_t          recvcount       Number of elements received by each rank.
ncclDataType_t  datatype        Type of element.
ncclRedOp_t     op              Operation to perform on each element.
ncclComm_t      comm            Communicator object.
cudaStream_t    stream          CUDA stream to run the operation on.
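
A sketch under the same assumptions; sendbuff must hold nranks*recvcount floats.

ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat, ncclSum, comm, stream);
/* Rank i's recvbuff holds the i-th recvcount-element block of the reduced result. */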

Group Calls

Group primitives change the blocking behavior of NCCL calls issued by the current thread. Because their effect is per-thread, they can be used from multiple threads independently.

ncclGroupStart

The ncclGroupStart call starts a group call.

All subsequent NCCL calls, up to the matching ncclGroupEnd, will not block waiting for inter-CPU synchronization.

ncclResult_t ncclGroupStart();

ncclGroupEnd

The ncclGroupEnd call ends a group call.

The ncclGroupEnd call returns when all operations issued since ncclGroupStart have been processed. This means communication primitives have been enqueued to the provided streams, but are not necessarily complete. When used with ncclCommInitRank, ncclGroupEnd waits for all communicators to be initialized and ready to use.
ncclResult_t ncclGroupEnd();
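
A sketch of a single thread driving one all-reduce across ndev GPUs; comms, streams, sendbuffs, and recvbuffs are assumed per-device arrays, with communicator i attached to CUDA device i.

ncclGroupStart();
for (int i = 0; i < ndev; i++)
  ncclAllReduce(sendbuffs[i], recvbuffs[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
ncclGroupEnd();                      /* returns once every operation is enqueued to its stream */
for (int i = 0; i < ndev; i++) {
  cudaSetDevice(i);                  /* assumption: communicator i uses device i */
  cudaStreamSynchronize(streams[i]);
}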

Types

The following types are used by the NCCL library. These types are useful when configuring your collective operations.

ncclDataType_t

NCCL defines the following integer and floating-point data types.
Data Type                Description
ncclInt8, ncclChar       Signed 8-bit integer.
ncclUint8                Unsigned 8-bit integer.
ncclInt32, ncclInt       Signed 32-bit integer.
ncclUint32               Unsigned 32-bit integer.
ncclInt64                Signed 64-bit integer.
ncclUint64               Unsigned 64-bit integer.
ncclFloat16, ncclHalf    16-bit floating-point number (half precision).
ncclFloat32, ncclFloat   32-bit floating-point number (single precision).
ncclFloat64, ncclDouble  64-bit floating-point number (double precision).

ncclRedOp_t

NCCL defines the following reduction operations.
Reduction Operation  Description
ncclSum              Perform a sum (+) operation.
ncclProd             Perform a product (*) operation.
ncclMin              Perform a min operation.
ncclMax              Perform a max operation.

ncclResult_t

NCCL functions always return an error code of type ncclResult_t.

If the NCCL_DEBUG environment variable is set to WARN, NCCL prints the reason whenever a function returns an error.

Return Code             Description
ncclSuccess             The operation completed successfully.
ncclUnhandledCudaError  A call to CUDA returned a fatal error for the NCCL operation.
ncclSystemError         A call to the system returned a fatal error for the NCCL operation.
ncclInternalError       NCCL experienced an internal error.
ncclInvalidArgument     The user supplied an invalid argument.
ncclInvalidUsage        The user used NCCL in an invalid manner.
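
A common pattern is to wrap every NCCL call in a checking macro. NCCLCHECK below is an illustrative name, not part of the NCCL API; it uses ncclGetErrorString to obtain a human-readable message.

#include <stdio.h>
#include <stdlib.h>
#include <nccl.h>

#define NCCLCHECK(cmd) do {                                   \
  ncclResult_t res = (cmd);                                   \
  if (res != ncclSuccess) {                                   \
    fprintf(stderr, "NCCL error %s:%d '%s'\n",                \
            __FILE__, __LINE__, ncclGetErrorString(res));     \
    exit(EXIT_FAILURE);                                       \
  }                                                           \
} while (0)

/* Usage: NCCLCHECK(ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream)); */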

Constants

NCCL defines two constants, NCCL_MAJOR and NCCL_MINOR, to help distinguish between API changes, in particular between NCCL 1.x and NCCL 2.x.