NVIDIA Collective Communications Library (NCCL) APIs

This section describes the Fortran interfaces to the NCCL library. The NCCL functions are only accessible from host code. Most of the runtime API routines, other than some utilities, are functions that return an error code; they return a value of ncclSuccess if the call was successful, or another value if there was an error. Unlike earlier Fortran modules, we have created a ncclResult derived type for the return values. We have also overloaded the .eq. and .ne. logical operators for testing the return status.

The NCCL interfaces and definitions described in this chapter can be exposed in host code by adding the line

use nccl

to your program unit.

Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4) and the plain real type implies real(4).
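
As a minimal sketch of the status-checking convention described above, the following program calls one NCCL routine and tests its return value with the overloaded .ne. operator. The program name is illustrative only; every NCCL name used here is documented later in this chapter.

```fortran
program nccl_status_check
  use nccl
  implicit none
  type(ncclResult) :: res
  integer(4) :: version

  ! Every NCCL runtime routine returns a type(ncclResult) value
  res = ncclGetVersion(version)

  ! Compare against ncclSuccess with the overloaded .ne. operator
  if (res .ne. ncclSuccess) then
     print *, 'ncclGetVersion failed: ', trim(ncclGetErrorString(res))
     stop 1
  end if

  print *, 'NCCL version code: ', version
end program nccl_status_check
```

Because ncclResult is a derived type rather than a plain integer, a comparison such as res /= 0 would not compile; the status has to be compared against one of the ncclResult parameters.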

NCCL Definitions and Helper Functions

This section contains definitions and data types used in the NCCL library and interfaces to the NCCL communicator creation and management functions.

The Fortran NCCL module contains the following constants and derived type definitions:

! Definitions from nccl.h
integer, parameter :: NCCL_MAJOR = 2
integer, parameter :: NCCL_MINOR = 19
integer, parameter :: NCCL_PATCH = 3
integer, parameter :: NCCL_VERSION = &
    (NCCL_MAJOR * 10000 + NCCL_MINOR * 100 + NCCL_PATCH)

integer, parameter :: NCCL_SPLIT_NOCOLOR = -1
! Types from nccl.h
! ncclUniqueId
type, bind(c) :: ncclUniqueId
  character(c_char) :: internal(NCCL_UNIQUE_ID_BYTES)
end type ncclUniqueId

! ncclComm
type, bind(c) :: ncclComm
  type(c_ptr) :: member
end type ncclComm

! ncclResult
type, bind(c) :: ncclResult
  integer(c_int) :: member
end type ncclResult

type(ncclResult), parameter :: &
    ncclSuccess             = ncclResult(0), &
    ncclUnhandledCudaError  = ncclResult(1), &
    ncclSystemError         = ncclResult(2), &
    ncclInternalError       = ncclResult(3), &
    ncclInvalidArgument     = ncclResult(4), &
    ncclInvalidUsage        = ncclResult(5), &
    ncclRemoteError         = ncclResult(6), &
    ncclInProgress          = ncclResult(7), &
    ncclNumResults          = ncclResult(8)
! ncclDataType
type, bind(c) :: ncclDataType
  integer(c_int) :: member
end type ncclDataType

type(ncclDataType), parameter :: &
    ncclInt8     = ncclDataType(0), &
    ncclChar     = ncclDataType(0), &
    ncclUint8    = ncclDataType(1), &
    ncclInt32    = ncclDataType(2), &
    ncclInt      = ncclDataType(2), &
    ncclUint32   = ncclDataType(3), &
    ncclInt64    = ncclDataType(4), &
    ncclUint64   = ncclDataType(5), &
    ncclFloat16  = ncclDataType(6), &
    ncclHalf     = ncclDataType(6), &
    ncclFloat32  = ncclDataType(7), &
    ncclFloat    = ncclDataType(7), &
    ncclFloat64  = ncclDataType(8), &
    ncclDouble   = ncclDataType(8), &
    ncclNumTypes = ncclDataType(9)
! ncclRedOp
type, bind(c) :: ncclRedOp
  integer(c_int) :: member
end type ncclRedOp

type(ncclRedOp), parameter :: &
    ncclSum    = ncclRedOp(0), &
    ncclProd   = ncclRedOp(1), &
    ncclMax    = ncclRedOp(2), &
    ncclMin    = ncclRedOp(3), &
    ncclAvg    = ncclRedOp(4), &
    ncclNumOps = ncclRedOp(5)
! ncclConfig
type, bind(c) :: ncclConfig
  integer(c_size_t) :: size = 48
  integer(c_int) :: magic = z'cafebeef'
  integer(c_int) :: version = NCCL_VERSION
  integer(c_int) :: blocking = z'80000000'
  integer(c_int) :: cgaClusterSize = z'80000000'
  integer(c_int) :: minCTAs = z'80000000'
  integer(c_int) :: maxCTAs = z'80000000'
  type(c_ptr)    :: netName = c_null_ptr
  integer(c_int) :: splitShare = z'80000000'
end type ncclConfig

ncclGetVersion

This function returns the version number of the NCCL library.

type(ncclResult) function ncclGetVersion(version)
integer(4) :: version
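
The sketch below shows one way to split the returned version number into its components. It assumes the library reports the version in the same encoding as the NCCL_VERSION constant defined in the module (major * 10000 + minor * 100 + patch); the program and variable names are illustrative.

```fortran
program nccl_version_decode
  use nccl
  implicit none
  type(ncclResult) :: res
  integer(4) :: version, major, minor, patch

  res = ncclGetVersion(version)
  if (res .ne. ncclSuccess) then
     print *, 'ncclGetVersion failed: ', trim(ncclGetErrorString(res))
     stop 1
  end if

  ! Invert version = major*10000 + minor*100 + patch (assumed encoding)
  major = version / 10000
  minor = mod(version, 10000) / 100
  patch = mod(version, 100)

  print '(a,i0,a,i0,a,i0)', 'NCCL ', major, '.', minor, '.', patch
end program nccl_version_decode
```

With the constants listed above, NCCL_VERSION evaluates to 2 * 10000 + 19 * 100 + 3 = 21903, which this arithmetic decodes back to 2, 19, and 3.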

ncclGetUniqueId

This function generates an ID to be used with ncclCommInitRank. Call this routine once, on a single rank, and distribute the generated ID to all ranks before they call ncclCommInitRank.

type(ncclResult) function ncclGetUniqueId(uniqueId)
type(ncclUniqueId) :: uniqueId

ncclCommInitRank

This function generates a new NCCL communicator, of type(ncclComm). The rank argument must be between 0 and nranks-1. The uniqueId argument should be generated with ncclGetUniqueId.

type(ncclResult) function ncclCommInitRank(comm, nranks, uniqueId, rank)
type(ncclComm) :: comm
integer(4) :: nranks
type(ncclUniqueId) :: uniqueId
integer(4) :: rank
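
A common way to distribute the unique ID is through MPI. The following is a sketch under several assumptions that are not part of this chapter: an MPI library with the Fortran mpi module is available, each MPI rank drives one GPU, and GPUs are assigned round-robin by rank. The program name and the device-selection policy are illustrative.

```fortran
program nccl_init_rank_sketch
  use mpi
  use cudafor
  use nccl
  implicit none
  type(ncclUniqueId) :: uid
  type(ncclComm)     :: comm
  type(ncclResult)   :: res
  integer :: mpirank, mpisize, ierr, ndev, istat

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, mpirank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, mpisize, ierr)

  ! Assumption: one GPU per MPI rank, assigned round-robin
  istat = cudaGetDeviceCount(ndev)
  istat = cudaSetDevice(mod(mpirank, ndev))

  ! Generate the ID once, on rank 0 only
  if (mpirank == 0) then
     res = ncclGetUniqueId(uid)
     if (res .ne. ncclSuccess) call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! Distribute the raw bytes of the ID to all ranks
  call MPI_Bcast(uid%internal, size(uid%internal), MPI_BYTE, 0, &
                 MPI_COMM_WORLD, ierr)

  ! NCCL ranks are zero-based, matching the MPI ranks
  res = ncclCommInitRank(comm, mpisize, uid, mpirank)
  if (res .ne. ncclSuccess) then
     print *, 'ncclCommInitRank failed: ', trim(ncclGetErrorString(res))
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! ... collective or point-to-point calls would go here ...

  res = ncclCommDestroy(comm)
  call MPI_Finalize(ierr)
end program nccl_init_rank_sketch
```

Reusing the MPI rank as the NCCL rank satisfies the requirement that rank lie between 0 and nranks-1 without any extra bookkeeping.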

ncclCommInitRankConfig

This function generates a new NCCL communicator, of type(ncclComm), using a configuration argument. The rank argument must be between 0 and nranks-1. The uniqueId argument should be generated with ncclGetUniqueId. One variation of this interface will accept c_null_ptr for the last argument.

type(ncclResult) function ncclCommInitRankConfig(comm, nranks, &
        commId, rank, config)
type(ncclComm) :: comm
integer(4) :: nranks
type(ncclUniqueId) :: commId
integer(4) :: rank
type(ncclConfig) :: config

ncclCommInitAll

This function creates a single-process communicator clique, an array of type(ncclComm).

type(ncclResult) function ncclCommInitAll(comms, ndev, devlist)
type(ncclComm) :: comms(*)
integer(4) :: ndev
integer(4) :: devlist(*)
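
As a sketch of single-process use, the program below creates one communicator for each visible GPU and then releases them. The program name and the choice to use every available device are illustrative assumptions.

```fortran
program nccl_init_all_sketch
  use cudafor
  use nccl
  implicit none
  type(ncclComm), allocatable :: comms(:)
  integer(4), allocatable :: devlist(:)
  type(ncclResult) :: res
  integer(4) :: ndev, i
  integer :: istat

  istat = cudaGetDeviceCount(ndev)
  if (istat /= cudaSuccess .or. ndev < 1) stop 'no CUDA devices found'

  allocate(comms(ndev), devlist(ndev))
  ! CUDA device IDs are zero-based: 0, 1, ..., ndev-1
  devlist = [(i - 1, i = 1, ndev)]

  ! One call creates the whole clique; comms(i) is tied to devlist(i)
  res = ncclCommInitAll(comms, ndev, devlist)
  if (res .ne. ncclSuccess) then
     print *, 'ncclCommInitAll failed: ', trim(ncclGetErrorString(res))
     stop 1
  end if

  ! Release each communicator when it is no longer needed
  do i = 1, ndev
     res = ncclCommDestroy(comms(i))
  end do
end program nccl_init_all_sketch
```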

ncclCommDestroy

This function frees resources allocated to a NCCL communicator. It will wait for uncompleted operations.

type(ncclResult) function ncclCommDestroy(comm)
type(ncclComm) :: comm

ncclCommFinalize

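This function finalizes the NCCL communicator object, flushing its uncompleted operations. The communicator's resources can then be released with ncclCommDestroy.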

type(ncclResult) function ncclCommFinalize(comm)
type(ncclComm) :: comm

ncclCommSplit

This function creates a new NCCL communicator, of type(ncclComm), from an existing one, based on the input color and key. A variation of this interface will accept c_null_ptr for the last argument.

type(ncclResult) function ncclCommSplit(comm, color, &
        key, newcomm, config)
type(ncclComm) :: comm
integer(4) :: color, key
type(ncclComm) :: newcomm
type(ncclConfig) :: config
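
The following sketch splits a communicator into two groups, one for even ranks and one for odd ranks. The module and subroutine names are hypothetical, and the even/odd policy is only an example. It relies on the default initialization of type(ncclConfig) shown in the definitions above, and it assumes, as in the NCCL C API, that ranks sharing a color land in the same new communicator and are ordered by key.

```fortran
module comm_split_sketch
  use nccl
  implicit none
contains
  ! Hypothetical helper: split comm into even-rank and odd-rank groups
  subroutine split_even_odd(comm, newcomm)
    type(ncclComm) :: comm, newcomm
    type(ncclConfig) :: config     ! default-initialized by the module
    type(ncclResult) :: res
    integer(4) :: rank, color, key

    res = ncclCommUserRank(comm, rank)
    if (res .ne. ncclSuccess) stop 'ncclCommUserRank failed'

    color = mod(rank, 2)   ! 0 for even ranks, 1 for odd ranks
    key   = rank           ! keep the original relative ordering

    res = ncclCommSplit(comm, color, key, newcomm, config)
    if (res .ne. ncclSuccess) then
       print *, 'ncclCommSplit failed: ', trim(ncclGetErrorString(res))
       stop 1
    end if
  end subroutine split_even_odd
end module comm_split_sketch
```

In the C API, a rank that passes NCCL_SPLIT_NOCOLOR as its color joins no group; the same behavior is assumed, but not verified here, for the Fortran constant of that name.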

ncclCommAbort

This function frees resources allocated to a NCCL communicator. It will abort uncompleted operations.

type(ncclResult) function ncclCommAbort(comm)
type(ncclComm) :: comm

ncclCommRegister

This function registers a CUDA buffer, making it available for a zero-copy operation. The buffer can be of any type.

type(ncclResult) function ncclCommRegister(comm, buff, size, handle)
type(ncclComm) :: comm
real(4), device :: buff(*)  ! Any type of device array
integer(8)      :: size
type(c_ptr), intent(out) :: handle

ncclCommDeregister

This function deregisters a CUDA buffer used for a zero-copy operation.

type(ncclResult) function ncclCommDeregister(comm, handle)
type(ncclComm) :: comm
type(c_ptr), intent(in) :: handle

ncclGetErrorString

This function returns an error string for a given ncclResult value.

character*128 function ncclGetErrorString(ierr)
type(ncclResult) :: ierr

ncclGetLastError

This function returns a string for the last error that occurred.

character*128 function ncclGetLastError(comm)
type(ncclComm) :: comm

ncclCommGetAsyncError

This function queries whether the communicator has encountered any asynchronous errors.

type(ncclResult) function ncclCommGetAsyncError(comm, asyncError)
type(ncclComm) :: comm
type(ncclResult) :: asyncError
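
One use of this query is to wait for a nonblocking communicator to finish initializing. The sketch below assumes, following the NCCL C API, that setting the blocking field of type(ncclConfig) to 0 requests nonblocking behavior and that initialization is complete once the asynchronous state is no longer ncclInProgress. The module and subroutine names are hypothetical.

```fortran
module nonblocking_init_sketch
  use nccl
  implicit none
contains
  ! Hypothetical helper: create a communicator without blocking the host,
  ! then poll until initialization has finished
  subroutine init_nonblocking(comm, nranks, uid, rank)
    type(ncclComm) :: comm
    integer(4) :: nranks, rank
    type(ncclUniqueId) :: uid
    type(ncclConfig) :: config     ! other fields keep their defaults
    type(ncclResult) :: res, state

    config%blocking = 0            ! assumption: 0 selects nonblocking mode

    res = ncclCommInitRankConfig(comm, nranks, uid, rank, config)
    if (res .ne. ncclSuccess .and. res .ne. ncclInProgress) then
       print *, 'init failed: ', trim(ncclGetErrorString(res))
       stop 1
    end if

    ! Poll the communicator until it leaves the in-progress state
    do
       res = ncclCommGetAsyncError(comm, state)
       if (res .ne. ncclSuccess) stop 'ncclCommGetAsyncError failed'
       if (state .ne. ncclInProgress) exit
    end do

    if (state .ne. ncclSuccess) then
       print *, 'async error: ', trim(ncclGetErrorString(state))
       stop 1
    end if
  end subroutine init_nonblocking
end module nonblocking_init_sketch
```

A real application would typically do useful host work inside the polling loop instead of spinning.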

ncclCommCount

This function sets the count argument to the number of ranks in the NCCL communicator.

type(ncclResult) function ncclCommCount(comm, count)
type(ncclComm) :: comm
integer(4) :: count

ncclCommCuDevice

This function sets the device argument to the CUDA device associated with a NCCL communicator.

type(ncclResult) function ncclCommCuDevice(comm, device)
type(ncclComm) :: comm
integer(4) :: device

ncclCommUserRank

This function sets the rank argument to the rank within a NCCL communicator.

type(ncclResult) function ncclCommUserRank(comm, rank)
type(ncclComm) :: comm
integer(4) :: rank

NCCL Collective Communication Functions

This section contains interfaces for the NCCL functions that perform collective communication operations on device data. All functions can take either CUDA Fortran device arrays, OpenACC arrays within a host_data use_device data directive, or Fortran type(c_devptr) arguments.

ncclAllReduce

This function performs the specified reduction on data across devices and writes the results into the receive buffer of every rank.

type(ncclResult) function ncclAllReduce(sendbuff, recvbuff, &
    count, datatype, op, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
  ! These combinations of sendbuff, recvbuff are also accepted:
  ! integer(4), device :: sendbuff(*), recvbuff(*)
  ! integer(8), device :: sendbuff(*), recvbuff(*)
  ! real(2), device :: sendbuff(*), recvbuff(*)
  ! real(4), device :: sendbuff(*), recvbuff(*)
  ! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
type(ncclRedOp) :: op
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
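
As a sketch of a typical call with CUDA Fortran device arrays, the helper below sums a real(4) buffer across all ranks of an existing communicator. The module and subroutine names are hypothetical, and the communicator and stream are assumed to have been created by the caller.

```fortran
module allreduce_sketch
  use cudafor
  use nccl
  implicit none
contains
  ! Hypothetical helper: element-wise sum of send_d over all ranks
  subroutine sum_across_ranks(send_d, recv_d, n, comm, stream)
    real(4), device :: send_d(*), recv_d(*)
    integer(4), intent(in) :: n
    type(ncclComm) :: comm
    integer(cuda_stream_kind) :: stream
    type(ncclResult) :: res
    integer :: istat

    ! ncclFloat matches real(4); count is converted to cuda_count_kind
    res = ncclAllReduce(send_d, recv_d, int(n, cuda_count_kind), &
                        ncclFloat, ncclSum, comm, stream)
    if (res .ne. ncclSuccess) then
       print *, 'ncclAllReduce failed: ', trim(ncclGetErrorString(res))
       stop 1
    end if

    ! The collective is enqueued on the stream; wait before using recv_d
    istat = cudaStreamSynchronize(stream)
  end subroutine sum_across_ranks
end module allreduce_sketch
```

Every rank of the communicator must make the matching call with the same count, datatype, and operation; otherwise the collective cannot complete.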

ncclBroadcast

This function copies the send buffer on the root rank to all other ranks in the NCCL communicator. An in-place operation will happen if sendbuff and recvbuff are the same address.

type(ncclResult) function ncclBroadcast(sendbuff, recvbuff, &
    count, datatype, root, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
  ! These combinations of sendbuff, recvbuff are also accepted:
  ! integer(4), device :: sendbuff(*), recvbuff(*)
  ! integer(8), device :: sendbuff(*), recvbuff(*)
  ! real(2), device :: sendbuff(*), recvbuff(*)
  ! real(4), device :: sendbuff(*), recvbuff(*)
  ! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
integer(4) :: root
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream

ncclReduce

This function performs the same operation as ncclAllReduce, but writes the results only to the receive buffer of the specified root rank.

type(ncclResult) function ncclReduce(sendbuff, recvbuff, &
    count, datatype, op, root, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
  ! These combinations of sendbuff, recvbuff are also accepted:
  ! integer(4), device :: sendbuff(*), recvbuff(*)
  ! integer(8), device :: sendbuff(*), recvbuff(*)
  ! real(2), device :: sendbuff(*), recvbuff(*)
  ! real(4), device :: sendbuff(*), recvbuff(*)
  ! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
type(ncclRedOp) :: op
integer(4) :: root
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream

ncclAllGather

This function gathers the send buffers from each rank and stores them in rank order in the receive buffer of all ranks.

type(ncclResult) function ncclAllGather(sendbuff, recvbuff, &
    sendcount, datatype, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
  ! These combinations of sendbuff, recvbuff are also accepted:
  ! integer(4), device :: sendbuff(*), recvbuff(*)
  ! integer(8), device :: sendbuff(*), recvbuff(*)
  ! real(2), device :: sendbuff(*), recvbuff(*)
  ! real(4), device :: sendbuff(*), recvbuff(*)
  ! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: sendcount
type(ncclDataType) :: datatype
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
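
The following sketch passes OpenACC-managed arrays to ncclAllGather through a host_data use_device region, one of the argument forms described at the start of this section. The module and subroutine names are hypothetical, and the code assumes compilation with both OpenACC and CUDA Fortran enabled. The receive array is sized to hold sendcount elements from each of the nranks ranks.

```fortran
module allgather_acc_sketch
  use cudafor
  use nccl
  implicit none
contains
  ! Hypothetical helper: gather n values from every rank into recv
  subroutine gather_from_all(send, recv, n, nranks, comm, stream)
    integer(4), intent(in) :: n, nranks
    real(8), intent(in)  :: send(n)
    real(8), intent(out) :: recv(n*nranks)   ! one block of n per rank
    type(ncclComm) :: comm
    integer(cuda_stream_kind) :: stream
    type(ncclResult) :: res
    integer :: istat

    !$acc data copyin(send) copyout(recv)
    ! Inside host_data, send and recv refer to their device copies
    !$acc host_data use_device(send, recv)
    res = ncclAllGather(send, recv, int(n, cuda_count_kind), &
                        ncclDouble, comm, stream)
    !$acc end host_data
    ! Wait for the collective before recv is copied back to the host
    istat = cudaStreamSynchronize(stream)
    !$acc end data

    if (res .ne. ncclSuccess) then
       print *, 'ncclAllGather failed: ', trim(ncclGetErrorString(res))
       stop 1
    end if
  end subroutine gather_from_all
end module allgather_acc_sketch
```

The stream synchronization is placed before the end of the data region so that the copyout does not race with the still-running collective.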

ncclReduceScatter

This function performs the specified reduction on the data, and leaves the result scattered in equal blocks among the ranks, based on the rank index.

type(ncclResult) function ncclReduceScatter(sendbuff, recvbuff, &
    recvcount, datatype, op, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
  ! These combinations of sendbuff, recvbuff are also accepted:
  ! integer(4), device :: sendbuff(*), recvbuff(*)
  ! integer(8), device :: sendbuff(*), recvbuff(*)
  ! real(2), device :: sendbuff(*), recvbuff(*)
  ! real(4), device :: sendbuff(*), recvbuff(*)
  ! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: recvcount
type(ncclDataType) :: datatype
type(ncclRedOp) :: op
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream

NCCL Point To Point Communication Functions

This section contains interfaces for the NCCL functions that perform point to point communication operations on device data. All functions can take either CUDA Fortran device arrays, OpenACC arrays within a host_data use_device data directive, or Fortran type(c_devptr) arguments. The point to point operations were added in NCCL 2.7.

ncclSend

This function sends data from the send buffer to a communicator peer. This operation blocks the GPU. The receiving peer must call ncclRecv, with the same datatype and count.

type(ncclResult) function ncclSend(sendbuff, &
    count, datatype, peer, comm, stream)
type(c_devptr) :: sendbuff
  ! These types for sendbuff are also accepted:
  ! integer(4), device :: sendbuff(*)
  ! integer(8), device :: sendbuff(*)
  ! real(2), device :: sendbuff(*)
  ! real(4), device :: sendbuff(*)
  ! real(8), device :: sendbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
integer(4) :: peer
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream

ncclRecv

This function receives data from a communicator peer. This operation blocks the GPU. The sending peer must call ncclSend, with the same datatype and count.

type(ncclResult) function ncclRecv(recvbuff, &
    count, datatype, peer, comm, stream)
type(c_devptr) :: recvbuff
  ! These types for recvbuff are also accepted:
  ! integer(4), device :: recvbuff(*)
  ! integer(8), device :: recvbuff(*)
  ! real(2), device :: recvbuff(*)
  ! real(4), device :: recvbuff(*)
  ! real(8), device :: recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
integer(4) :: peer
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream

NCCL Group Calls

This section contains interfaces for the NCCL functions that begin and end a group such that multiple calls can be merged.

ncclGroupStart

This function starts a group call. Subsequent calls to NCCL functions, up to the matching ncclGroupEnd, will not block due to inter-CPU synchronization.

type(ncclResult) function ncclGroupStart()

ncclGroupEnd

This function ends a group call. It returns when all operations since the corresponding call to ncclGroupStart have been processed, but not necessarily completed.

type(ncclResult) function ncclGroupEnd()
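
Group calls are commonly combined with the point-to-point functions so that a send and its matching receive are issued together. The sketch below performs a ring exchange in which each rank sends to its successor and receives from its predecessor. The module and subroutine names are hypothetical, and the communicator and stream are assumed to exist already.

```fortran
module ring_exchange_sketch
  use cudafor
  use nccl
  implicit none
contains
  ! Hypothetical helper: send n values to rank+1, receive n from rank-1
  subroutine ring_exchange(send_d, recv_d, n, comm, stream)
    real(4), device :: send_d(*), recv_d(*)
    integer(4), intent(in) :: n
    type(ncclComm) :: comm
    integer(cuda_stream_kind) :: stream
    type(ncclResult) :: res
    integer(4) :: rank, nranks, next, prev
    integer :: istat

    res = ncclCommUserRank(comm, rank)
    res = ncclCommCount(comm, nranks)
    next = mod(rank + 1, nranks)            ! wraps to 0 on the last rank
    prev = mod(rank - 1 + nranks, nranks)   ! wraps to nranks-1 on rank 0

    ! Issue the send and the receive as one group so neither call
    ! waits on the other
    res = ncclGroupStart()
    res = ncclSend(send_d, int(n, cuda_count_kind), ncclFloat, &
                   next, comm, stream)
    res = ncclRecv(recv_d, int(n, cuda_count_kind), ncclFloat, &
                   prev, comm, stream)
    res = ncclGroupEnd()
    if (res .ne. ncclSuccess) then
       print *, 'ring exchange failed: ', trim(ncclGetErrorString(res))
       stop 1
    end if

    ! ncclGroupEnd returns once the operations are processed, not
    ! necessarily completed, so synchronize before reading recv_d
    istat = cudaStreamSynchronize(stream)
  end subroutine ring_exchange
end module ring_exchange_sketch
```

For brevity only the result of ncclGroupEnd is checked here; production code would test each return value.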