NVIDIA Collective Communications Library (NCCL) APIs
This section describes the Fortran interfaces to the NCCL library. The NCCL functions are only accessible from host code. Most of the runtime API routines, other than some utilities, are functions that return an error code; they return a value of ncclSuccess if the call was successful, or another value if there was an error. Unlike earlier Fortran modules, we have created a ncclResult derived type for the return values. We have also overloaded the .eq. and .ne. logical operators for testing the return status.
The NCCL interfaces and definitions described in this chapter can be exposed in host code by adding the line
use nccl
to your program unit.
Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4) and the plain real type implies real(4).
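The status-checking convention described above can be sketched as follows. This is a minimal illustration that uses only interfaces documented in this chapter (ncclGetVersion and ncclGetErrorString); the program name is hypothetical, and building it is assumed to require a CUDA-capable nvfortran installation with the NCCL library linked in.

```fortran
program check_nccl_status
  use nccl
  implicit none
  type(ncclResult) :: res
  integer(4) :: version

  ! Runtime routines return a type(ncclResult); test it with the
  ! overloaded .ne. (or .eq.) operator instead of an integer comparison.
  res = ncclGetVersion(version)
  if (res .ne. ncclSuccess) then
     print *, 'ncclGetVersion failed: ', trim(ncclGetErrorString(res))
     stop 1
  end if

  print *, 'NCCL version code: ', version
end program check_nccl_status
```

The same pattern applies to any function in this chapter that returns type(ncclResult).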
NCCL Definitions and Helper Functions
This section contains definitions and data types used in the NCCL library and interfaces to the NCCL communicator creation and management functions.
The Fortran NCCL module contains the following constants and derived type definitions:
! Definitions from nccl.h
integer, parameter :: NCCL_MAJOR = 2
integer, parameter :: NCCL_MINOR = 19
integer, parameter :: NCCL_PATCH = 3
integer, parameter :: NCCL_VERSION = &
(NCCL_MAJOR * 10000 + NCCL_MINOR * 100 + NCCL_PATCH)
integer, parameter :: NCCL_SPLIT_NOCOLOR = -1
! Types from nccl.h
! ncclUniqueId
type, bind(c) :: ncclUniqueId
character(c_char) :: internal(NCCL_UNIQUE_ID_BYTES)
end type ncclUniqueId
! ncclComm
type, bind(c) :: ncclComm
type(c_ptr) :: member
end type ncclComm
! ncclResult
type, bind(c) :: ncclResult
integer(c_int) :: member
end type ncclResult
type(ncclResult), parameter :: &
ncclSuccess = ncclResult(0), &
ncclUnhandledCudaError = ncclResult(1), &
ncclSystemError = ncclResult(2), &
ncclInternalError = ncclResult(3), &
ncclInvalidArgument = ncclResult(4), &
ncclInvalidUsage = ncclResult(5), &
ncclRemoteError = ncclResult(6), &
ncclInProgress = ncclResult(7), &
ncclNumResults = ncclResult(8)
! ncclDataType
type, bind(c) :: ncclDataType
integer(c_int) :: member
end type ncclDataType
type(ncclDataType), parameter :: &
ncclInt8 = ncclDataType(0), &
ncclChar = ncclDataType(0), &
ncclUint8 = ncclDataType(1), &
ncclInt32 = ncclDataType(2), &
ncclInt = ncclDataType(2), &
ncclUint32 = ncclDataType(3), &
ncclInt64 = ncclDataType(4), &
ncclUint64 = ncclDataType(5), &
ncclFloat16 = ncclDataType(6), &
ncclHalf = ncclDataType(6), &
ncclFloat32 = ncclDataType(7), &
ncclFloat = ncclDataType(7), &
ncclFloat64 = ncclDataType(8), &
ncclDouble = ncclDataType(8), &
ncclNumTypes = ncclDataType(9)
! ncclRedOp
type, bind(c) :: ncclRedOp
integer(c_int) :: member
end type ncclRedOp
type(ncclRedOp), parameter :: &
ncclSum = ncclRedOp(0), &
ncclProd = ncclRedOp(1), &
ncclMax = ncclRedOp(2), &
ncclMin = ncclRedOp(3), &
ncclAvg = ncclRedOp(4), &
ncclNumOps = ncclRedOp(5)
! ncclConfig
type, bind(c) :: ncclConfig
integer(c_size_t) :: size = 48
integer(c_int) :: magic = z'cafebeef'
integer(c_int) :: version = NCCL_VERSION
integer(c_int) :: blocking = z'80000000'
integer(c_int) :: cgaClusterSize = z'80000000'
integer(c_int) :: minCTAs = z'80000000'
integer(c_int) :: maxCTAs = z'80000000'
type(c_ptr) :: netName = c_null_ptr
integer(c_int) :: splitShare = z'80000000'
end type ncclConfig
ncclGetVersion
This function returns the version number of the NCCL library.
type(ncclResult) function ncclGetVersion(version)
integer(4) :: version
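A version code can be split back into its components with integer arithmetic. The sketch below assumes the value follows the same encoding as the NCCL_VERSION parameter shown earlier (major * 10000 + minor * 100 + patch); the sample value is built by hand so that the program needs no GPU, whereas in practice it would come from ncclGetVersion.

```fortran
program decode_nccl_version
  implicit none
  integer(4) :: version, major, minor, patch

  ! Sample code equal to NCCL_VERSION for 2.19.3 (assumed encoding:
  ! major * 10000 + minor * 100 + patch).
  version = 2 * 10000 + 19 * 100 + 3

  major = version / 10000
  minor = mod(version, 10000) / 100
  patch = mod(version, 100)

  ! Prints 2.19.3
  print '(i0, ".", i0, ".", i0)', major, minor, patch
end program decode_nccl_version
```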
ncclGetUniqueId
This function generates an ID to be used with ncclCommInitRank. This routine should be called once, and the generated ID should be distributed to all ranks.
type(ncclResult) function ncclGetUniqueId(uniqueId)
type(ncclUniqueId) :: uniqueId
ncclCommInitRank
This function generates a new NCCL communicator, of type(ncclComm). The rank argument must be between 0 and nranks-1. The uniqueId argument should be generated with ncclGetUniqueId.
type(ncclResult) function ncclCommInitRank(comm, nranks, uniqueId, rank)
type(ncclComm) :: comm
integer(4) :: nranks
type(ncclUniqueId) :: uniqueId
integer(4) :: rank
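The relationship between ncclGetUniqueId and ncclCommInitRank can be sketched with a single process that owns the only rank. This is a simplifying assumption made so the example needs no launcher: in a multi-process job, one process would generate the ID and distribute it to the others (for example with MPI) before every process calls ncclCommInitRank with its own rank. The program and variable names are hypothetical.

```fortran
program init_rank_single
  use nccl
  implicit none
  type(ncclResult) :: res
  type(ncclUniqueId) :: uid
  type(ncclComm) :: comm
  integer(4) :: nranks, rank, myrank, dev

  nranks = 1   ! assumption: one process, one rank
  rank = 0     ! must lie between 0 and nranks-1

  ! Generate the ID once; with several processes only one of them
  ! would do this and then share uid with all other ranks.
  res = ncclGetUniqueId(uid)
  if (res .ne. ncclSuccess) stop 'ncclGetUniqueId failed'

  res = ncclCommInitRank(comm, nranks, uid, rank)
  if (res .ne. ncclSuccess) then
     print *, 'ncclCommInitRank failed: ', trim(ncclGetErrorString(res))
     stop 1
  end if

  ! Query the communicator that was just created.
  res = ncclCommUserRank(comm, myrank)
  res = ncclCommCuDevice(comm, dev)
  print *, 'rank', myrank, 'of', nranks, 'on CUDA device', dev

  res = ncclCommDestroy(comm)
end program init_rank_single
```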
ncclCommInitRankConfig
This function generates a new NCCL communicator, of type(ncclComm), using a configuration argument. The rank argument must be between 0 and nranks-1. The uniqueId argument should be generated with ncclGetUniqueId. One variation of this interface will accept c_null_ptr for the last argument.
type(ncclResult) function ncclCommInitRankConfig(comm, nranks, &
commId, rank, config)
type(ncclComm) :: comm
integer(4) :: nranks
type(ncclUniqueId) :: commId
integer(4) :: rank
type(ncclConfig) :: config
ncclCommInitAll
This function creates a single-process communicator clique, an array of type(ncclComm).
type(ncclResult) function ncclCommInitAll(comms, ndev, devlist)
type(ncclComm) :: comms(*)
integer(4) :: ndev
integer(4) :: devlist(*)
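A single-process clique covering every visible GPU might be set up as sketched below. The example assumes the cudafor module is available for cudaGetDeviceCount and that devlist holds 0-based CUDA device numbers; the program and variable names are illustrative only.

```fortran
program init_all_devices
  use cudafor
  use nccl
  implicit none
  type(ncclResult) :: res
  type(ncclComm), allocatable :: comms(:)
  integer(4), allocatable :: devlist(:)
  integer(4) :: ndev, nranks, i, istat

  istat = cudaGetDeviceCount(ndev)
  if (istat /= cudaSuccess .or. ndev < 1) stop 'no CUDA device found'

  allocate(comms(ndev), devlist(ndev))
  devlist = [(i - 1, i = 1, ndev)]   ! CUDA device numbers start at 0

  ! One communicator per device, all created by this process.
  res = ncclCommInitAll(comms, ndev, devlist)
  if (res .ne. ncclSuccess) then
     print *, 'ncclCommInitAll failed: ', trim(ncclGetErrorString(res))
     stop 1
  end if

  do i = 1, ndev
     res = ncclCommCount(comms(i), nranks)
     print *, 'communicator', i, 'spans', nranks, 'ranks'
  end do

  do i = 1, ndev
     res = ncclCommDestroy(comms(i))
  end do
end program init_all_devices
```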
ncclCommDestroy
This function frees resources allocated to a NCCL communicator. It will wait for uncompleted operations.
type(ncclResult) function ncclCommDestroy(comm)
type(ncclComm) :: comm
ncclCommFinalize
This function finalizes the NCCL communicator object.
type(ncclResult) function ncclCommFinalize(comm)
type(ncclComm) :: comm
ncclCommSplit
This function creates a new NCCL communicator, of type(ncclComm), from an existing one, based on the input color and key. A variation of this interface will accept c_null_ptr for the last argument.
type(ncclResult) function ncclCommSplit(comm, color, &
key, newcomm, config)
type(ncclComm) :: comm
integer(4) :: color, key
type(ncclComm) :: newcomm
type(ncclConfig) :: config
ncclCommAbort
This function frees resources allocated to a NCCL communicator. It will abort uncompleted operations.
type(ncclResult) function ncclCommAbort(comm)
type(ncclComm) :: comm
ncclCommRegister
This function registers a CUDA buffer, making it available for a zero-copy operation. The buffer can be of any type.
type(ncclResult) function ncclCommRegister(comm, buff, size, handle)
type(ncclComm) :: comm
real(4), device :: buff(*) ! Any type of device array
integer(8) :: size
type(c_ptr), intent(out) :: handle
ncclCommDeregister
This function deregisters a CUDA buffer used for a zero-copy operation.
type(ncclResult) function ncclCommDeregister(comm, handle)
type(ncclComm) :: comm
type(c_ptr), intent(in) :: handle
ncclGetErrorString
This function returns an error string for a given ncclResult value.
character*128 function ncclGetErrorString(ierr)
type(ncclResult) :: ierr
ncclGetLastError
This function returns a string for the last error that occurred.
character*128 function ncclGetLastError(comm)
type(ncclComm) :: comm
ncclCommGetAsyncError
This function queries whether the communicator has encountered any asynchronous errors.
type(ncclResult) function ncclCommGetAsyncError(comm, asyncError)
type(ncclComm) :: comm
type(ncclResult) :: asyncError
ncclCommCount
This function sets the count argument to the number of ranks in the NCCL communicator.
type(ncclResult) function ncclCommCount(comm, count)
type(ncclComm) :: comm
integer(4) :: count
ncclCommCuDevice
This function sets the device argument to the CUDA device associated with a NCCL communicator.
type(ncclResult) function ncclCommCuDevice(comm, device)
type(ncclComm) :: comm
integer(4) :: device
ncclCommUserRank
This function sets the rank argument to the rank within a NCCL communicator.
type(ncclResult) function ncclCommUserRank(comm, rank)
type(ncclComm) :: comm
integer(4) :: rank
NCCL Collective Communication Functions
This section contains interfaces for the NCCL functions that perform collective communication operations on device data. All functions can take either CUDA Fortran device arrays, OpenACC arrays within a host_data use_device data directive, or Fortran type(c_devptr) arguments.
ncclAllReduce
This function performs the specified reduction on data across devices and writes the results into the receive buffer of every rank.
type(ncclResult) function ncclAllReduce(sendbuff, recvbuff, &
count, datatype, op, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
! These combinations of sendbuff, recvbuff are also accepted:
! integer(4), device :: sendbuff(*), recvbuff(*)
! integer(8), device :: sendbuff(*), recvbuff(*)
! real(2), device :: sendbuff(*), recvbuff(*)
! real(4), device :: sendbuff(*), recvbuff(*)
! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
type(ncclRedOp) :: op
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
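The call sequence for ncclAllReduce with CUDA Fortran device arrays can be sketched on a one-rank communicator, an assumption made so that the example runs on a single GPU. With only one rank the reduced values simply equal the input; a real job would involve one communicator per participating device. All names other than the documented NCCL and CUDA Fortran interfaces are hypothetical.

```fortran
program allreduce_single_rank
  use cudafor
  use nccl
  implicit none
  integer, parameter :: n = 4
  type(ncclResult) :: res
  type(ncclComm) :: comms(1)
  integer(4) :: devlist(1), istat
  integer(cuda_count_kind) :: count
  integer(cuda_stream_kind) :: stream
  real(4), device :: send_d(n), recv_d(n)
  real(4) :: recv_h(n)

  devlist(1) = 0                       ! assumption: use CUDA device 0
  res = ncclCommInitAll(comms, 1, devlist)
  if (res .ne. ncclSuccess) stop 'ncclCommInitAll failed'

  istat = cudaStreamCreate(stream)
  send_d = 1.0
  recv_d = 0.0
  count = n                            ! element count, not bytes

  ! The call enqueues the reduction on the stream and returns.
  res = ncclAllReduce(send_d, recv_d, count, ncclFloat, ncclSum, &
                      comms(1), stream)
  if (res .ne. ncclSuccess) stop 'ncclAllReduce failed'

  ! Wait for the stream before reading the result on the host.
  istat = cudaStreamSynchronize(stream)
  recv_h = recv_d
  print *, recv_h                      ! one rank: the sum equals the input

  res = ncclCommDestroy(comms(1))
  istat = cudaStreamDestroy(stream)
end program allreduce_single_rank
```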
ncclBroadcast
This function copies the send buffer on the root rank to all other ranks in the NCCL communicator. An in-place operation will happen if sendbuff and recvbuff are the same address.
type(ncclResult) function ncclBroadcast(sendbuff, recvbuff, &
count, datatype, root, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
! These combinations of sendbuff, recvbuff are also accepted:
! integer(4), device :: sendbuff(*), recvbuff(*)
! integer(8), device :: sendbuff(*), recvbuff(*)
! real(2), device :: sendbuff(*), recvbuff(*)
! real(4), device :: sendbuff(*), recvbuff(*)
! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
integer(4) :: root
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
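For OpenACC code, the buffers are passed from inside a host_data use_device region, as stated at the start of this section. The sketch below is an illustration under several assumptions: a one-rank communicator on device 0, a stream created through cudafor, and compilation with both OpenACC and CUDA Fortran enabled. The program and variable names are hypothetical.

```fortran
program broadcast_openacc
  use cudafor
  use nccl
  implicit none
  integer, parameter :: n = 8
  type(ncclResult) :: res
  type(ncclComm) :: comms(1)
  integer(4) :: devlist(1), root, istat
  integer(cuda_count_kind) :: count
  integer(cuda_stream_kind) :: stream
  real(4) :: a(n), b(n)

  devlist(1) = 0                       ! assumption: use CUDA device 0
  res = ncclCommInitAll(comms, 1, devlist)
  if (res .ne. ncclSuccess) stop 'ncclCommInitAll failed'
  istat = cudaStreamCreate(stream)

  a = 3.0
  b = 0.0
  count = n
  root = 0

  !$acc data copyin(a) copyout(b)
  ! Inside host_data, a and b refer to their device copies.
  !$acc host_data use_device(a, b)
  res = ncclBroadcast(a, b, count, ncclFloat, root, comms(1), stream)
  !$acc end host_data
  ! Finish the broadcast before the data region copies b back.
  istat = cudaStreamSynchronize(stream)
  !$acc end data

  if (res .ne. ncclSuccess) stop 'ncclBroadcast failed'
  print *, b                           ! one rank: b holds the root's a

  res = ncclCommDestroy(comms(1))
  istat = cudaStreamDestroy(stream)
end program broadcast_openacc
```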
ncclReduce
This function performs the same operation as ncclAllReduce, but writes the results only to the receive buffer of the specified root rank.
type(ncclResult) function ncclReduce(sendbuff, recvbuff, &
count, datatype, op, root, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
! These combinations of sendbuff, recvbuff are also accepted:
! integer(4), device :: sendbuff(*), recvbuff(*)
! integer(8), device :: sendbuff(*), recvbuff(*)
! real(2), device :: sendbuff(*), recvbuff(*)
! real(4), device :: sendbuff(*), recvbuff(*)
! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
type(ncclRedOp) :: op
integer(4) :: root
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
ncclAllGather
This function gathers the send buffers from each rank and stores them in rank order in the receive buffer of all ranks.
type(ncclResult) function ncclAllGather(sendbuff, recvbuff, &
sendcount, datatype, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
! These combinations of sendbuff, recvbuff are also accepted:
! integer(4), device :: sendbuff(*), recvbuff(*)
! integer(8), device :: sendbuff(*), recvbuff(*)
! real(2), device :: sendbuff(*), recvbuff(*)
! real(4), device :: sendbuff(*), recvbuff(*)
! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: sendcount
type(ncclDataType) :: datatype
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
ncclReduceScatter
This function performs the specified reduction on the data, and leaves the result scattered in equal blocks among the ranks, based on the rank index.
type(ncclResult) function ncclReduceScatter(sendbuff, recvbuff, &
recvcount, datatype, op, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
! These combinations of sendbuff, recvbuff are also accepted:
! integer(4), device :: sendbuff(*), recvbuff(*)
! integer(8), device :: sendbuff(*), recvbuff(*)
! real(2), device :: sendbuff(*), recvbuff(*)
! real(4), device :: sendbuff(*), recvbuff(*)
! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: recvcount
type(ncclDataType) :: datatype
type(ncclRedOp) :: op
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
NCCL Point To Point Communication Functions
This section contains interfaces for the NCCL functions that perform point-to-point communication operations on device data. All functions can take either CUDA Fortran device arrays, OpenACC arrays within a host_data use_device data directive, or Fortran type(c_devptr) arguments. The point-to-point operations were added in NCCL 2.7.
ncclSend
This function sends data from the send buffer to a communicator peer. This operation blocks the GPU. The receiving peer must call ncclRecv, with the same datatype and count.
type(ncclResult) function ncclSend(sendbuff, &
count, datatype, peer, comm, stream)
type(c_devptr) :: sendbuff
! These types for sendbuff are also accepted:
! integer(4), device :: sendbuff(*)
! integer(8), device :: sendbuff(*)
! real(2), device :: sendbuff(*)
! real(4), device :: sendbuff(*)
! real(8), device :: sendbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
integer(4) :: peer
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
ncclRecv
This function receives data from a communicator peer. This operation blocks the GPU. The sending peer must call ncclSend, with the same datatype and count.
type(ncclResult) function ncclRecv(recvbuff, &
count, datatype, peer, comm, stream)
type(c_devptr) :: recvbuff
! These types for recvbuff are also accepted:
! integer(4), device :: recvbuff(*)
! integer(8), device :: recvbuff(*)
! real(2), device :: recvbuff(*)
! real(4), device :: recvbuff(*)
! real(8), device :: recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
integer(4) :: peer
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
NCCL Group Calls
This section contains interfaces for the NCCL functions that begin and end a group such that multiple calls can be merged.
ncclGroupStart
This function starts a group call. Subsequent calls to NCCL functions, up to the matching ncclGroupEnd, will not block due to inter-CPU synchronization.
type(ncclResult) function ncclGroupStart()
ncclGroupEnd
This function ends a group call. It returns when all operations since the corresponding call to ncclGroupStart have been processed, but not necessarily completed.
type(ncclResult) function ncclGroupEnd()
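As an illustration of merging calls, the sketch below places two independent reductions on the same communicator between ncclGroupStart and ncclGroupEnd. It assumes a one-rank communicator on device 0 so that it can run on a single GPU; the buffer and program names are hypothetical.

```fortran
program group_two_allreduces
  use cudafor
  use nccl
  implicit none
  integer, parameter :: n = 4
  type(ncclResult) :: res
  type(ncclComm) :: comms(1)
  integer(4) :: devlist(1), istat
  integer(cuda_count_kind) :: count
  integer(cuda_stream_kind) :: stream
  real(4), device :: a_d(n), b_d(n), ra_d(n), rb_d(n)
  real(4) :: ra_h(n), rb_h(n)

  devlist(1) = 0                       ! assumption: use CUDA device 0
  res = ncclCommInitAll(comms, 1, devlist)
  if (res .ne. ncclSuccess) stop 'ncclCommInitAll failed'
  istat = cudaStreamCreate(stream)

  a_d = 1.0
  b_d = 2.0
  count = n

  ! Both reductions are submitted as one group.
  res = ncclGroupStart()
  res = ncclAllReduce(a_d, ra_d, count, ncclFloat, ncclSum, &
                      comms(1), stream)
  res = ncclAllReduce(b_d, rb_d, count, ncclFloat, ncclMax, &
                      comms(1), stream)
  res = ncclGroupEnd()
  if (res .ne. ncclSuccess) stop 'grouped NCCL calls failed'

  ! ncclGroupEnd returns once the operations are processed, not
  ! necessarily completed, so synchronize before using the results.
  istat = cudaStreamSynchronize(stream)
  ra_h = ra_d
  rb_h = rb_d
  print *, ra_h
  print *, rb_h

  res = ncclCommDestroy(comms(1))
  istat = cudaStreamDestroy(stream)
end program group_two_allreduces
```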