Device API

Host-Side Setup

ncclDevComm

ncclDevComm

A structure describing a device communicator, as created on the host side using ncclDevCommCreate(). The structure is used primarily on the device side; elements that could be of particular interest include:

rank
nRanks

Rank and size of the communicator.

lsaRank
lsaSize

Rank and size within the LSA team (the subset of communicator ranks that are load/store accessible). For now these hold the same values as the first group, but things will change with the addition of networking support.

ncclDevCommCreate

ncclResult_t ncclDevCommCreate(ncclComm_t comm, struct ncclDevCommRequirements const* reqs, struct ncclDevComm* outDevComm)

Creates a new device communicator (see ncclDevComm) corresponding to the supplied host-side communicator comm. The result is returned in the outDevComm buffer (which needs to be supplied by the caller). The caller needs to also provide a filled-in list of requirements via the reqs argument (see ncclDevCommRequirements); the function will allocate any needed resources to meet them. The function can fail and return an error code if the communicator does not support symmetric memory or if the list of requirements cannot be met (e.g., if the multimem capability is requested on a system lacking the necessary hardware support).

Note that this is a host-side function.

ncclDevCommDestroy

ncclResult_t ncclDevCommDestroy(ncclComm_t comm, struct ncclDevComm const* devComm)

Destroys a device communicator (see ncclDevComm) previously created using ncclDevCommCreate() and releases any allocated resources. The caller must ensure that no device kernel that uses this device communicator could be running at the time this function is invoked.

Note that this is a host-side function.

ncclDevCommRequirements

ncclDevCommRequirements

A host-side structure specifying the list of requirements when creating device communicators (see ncclDevComm).

lsaMultimem

Specifies whether multimem support is required for all LSA ranks.

lsaBarrierCount

Specifies the number of memory barriers to allocate (see ncclLsaBarrierSession).

resourceRequirementsList

Specifies a list of resource requirements. This is best set to NULL for now.

teamRequirementsList

Specifies a list of requirements for particular teams. This is best set to NULL for now.

LSA

All functionality described from this point on is available on the device side only.

ncclLsaBarrierSession

ncclLsaBarrierSession

A class representing a memory barrier session.

ncclLsaBarrierSession(Coop coop, ncclDevComm const& comm, ncclTeamTagLsa, uint32_t index, bool multimem=false)

Initializes a new memory barrier session. coop represents a cooperative group (typically ncclCoopCta() for all threads within the current CTA). comm is the device communicator created using ncclDevCommCreate(). ncclTeamTagLsa is here to indicate which subset of ranks the barrier will apply to. The identifier of the underlying barrier to use is provided by index (it should be different for each coop; typically set to blockIdx.x to ensure uniqueness between CTAs). multimem requests a hardware-accelerated implementation using memory multicast.

void arrive(Coop, cuda::memory_order order)

Signals the arrival of the thread at the barrier session.

void wait(Coop, cuda::memory_order order)

Blocks until all threads arrive at the barrier session.

void sync(Coop, cuda::memory_order order)

Synchronizes all threads that participate in the barrier session (combines arrive and wait).

ncclGetPeerPointer

void* ncclGetPeerPointer(ncclWindow_t w, size_t offset, int peer)

Returns a load/store accessible pointer to the memory buffer of device peer within the window w. offset is byte-based. peer is a rank index within the world team (the rank within the communicator that was used when creating the window – see ncclCommWindowRegister()). This function will return NULL if the peer is not within the LSA team.

ncclGetLsaPointer

void* ncclGetLsaPointer(ncclWindow_t w, size_t offset, int lsaPeer)

Returns a load/store accessible pointer to the memory buffer of device lsaPeer within the window w. offset is byte-based. This is similar to ncclGetPeerPointer(), but here lsaPeer is a rank index with the LSA team (the subset of communicator ranks that are load/store accessible). This is only a theoretical distinction for now but it will become significant when the networking support for symmetric kernels is complete.

ncclGetLocalPointer

void* ncclGetLocalPointer(ncclWindow_t w, size_t offset)

Returns a load-store accessible pointer to the memory buffer of the current device within the window w. offset is byte-based. This is just a shortcut version of ncclGetPeerPointer() with devComm.rank as peer, or ncclGetLsaPointer() with devComm.lsaRank as lsaPeer.

Multimem

ncclGetLsaMultimemPointer

void* ncclGetLsaMultimemPointer(ncclWindow_t w, size_t offset, ncclDevComm const& devComm)

Returns a multicast memory pointer associated with the window w and device communicator devComm. offset is byte-based. Availability of multicast memory is hardware-dependent.