Device API¶
Host-Side Setup¶
ncclDevComm¶
-
ncclDevComm
¶
A structure describing a device communicator, as created on the host side using ncclDevCommCreate()
. The
structure is used primarily on the device side; elements that could be of particular interest include:
ncclDevCommCreate¶
-
ncclResult_t
ncclDevCommCreate
(ncclComm_t comm, struct ncclDevCommRequirements const* reqs, struct ncclDevComm* outDevComm)¶
Creates a new device communicator (see ncclDevComm
) corresponding to the supplied host-side communicator
comm. The result is returned in the outDevComm buffer (which needs to be supplied by the caller). The caller needs
to also provide a filled-in list of requirements via the reqs argument (see ncclDevCommRequirements
); the
function will allocate any needed resources to meet them. The function can fail and return an error code if the
communicator does not support symmetric memory or if the list of requirements cannot be met (e.g., if the multimem
capability is requested on a system lacking the necessary hardware support).
Note that this is a host-side function.
ncclDevCommDestroy¶
-
ncclResult_t
ncclDevCommDestroy
(ncclComm_t comm, struct ncclDevComm const* devComm)¶
Destroys a device communicator (see ncclDevComm
) previously created using ncclDevCommCreate()
and
releases any allocated resources. The caller must ensure that no device kernel that uses this device communicator could
be running at the time this function is invoked.
Note that this is a host-side function.
ncclDevCommRequirements¶
-
ncclDevCommRequirements
¶
A host-side structure specifying the list of requirements when creating device communicators (see
ncclDevComm
).
lsaMultimem
¶Specifies whether multimem support is required for all LSA ranks.
lsaBarrierCount
¶Specifies the number of memory barriers to allocate (see
ncclLsaBarrierSession
).
resourceRequirementsList
¶Specifies a list of resource requirements. This is best set to NULL for now.
teamRequirementsList
¶Specifies a list of requirements for particular teams. This is best set to NULL for now.
LSA¶
All functionality described from this point on is available on the device side only.
ncclLsaBarrierSession¶
-
ncclLsaBarrierSession
¶
A class representing a memory barrier session.
ncclLsaBarrierSession
(Coop coop, ncclDevComm const& comm, ncclTeamTagLsa, uint32_t index, bool multimem=false)Initializes a new memory barrier session. coop represents a cooperative group (typically
ncclCoopCta()
for all threads within the current CTA). comm is the device communicator created usingncclDevCommCreate()
. ncclTeamTagLsa is here to indicate which subset of ranks the barrier will apply to. The identifier of the underlying barrier to use is provided by index (it should be different for each coop; typically set toblockIdx.x
to ensure uniqueness between CTAs). multimem requests a hardware-accelerated implementation using memory multicast.
- void
arrive
(Coop, cuda::memory_order order)¶Signals the arrival of the thread at the barrier session.
- void
wait
(Coop, cuda::memory_order order)¶Blocks until all threads arrive at the barrier session.
- void
sync
(Coop, cuda::memory_order order)¶Synchronizes all threads that participate in the barrier session (combines
arrive
andwait
).
ncclGetPeerPointer¶
-
void*
ncclGetPeerPointer
(ncclWindow_t w, size_t offset, int peer)¶
Returns a load/store accessible pointer to the memory buffer of device peer within the window w. offset is
byte-based. peer is a rank index within the world team (the rank within the communicator that was used when creating
the window – see ncclCommWindowRegister()
). This function will return NULL if the peer is not within the LSA
team.
ncclGetLsaPointer¶
-
void*
ncclGetLsaPointer
(ncclWindow_t w, size_t offset, int lsaPeer)¶
Returns a load/store accessible pointer to the memory buffer of device lsaPeer within the window w. offset is
byte-based. This is similar to ncclGetPeerPointer()
, but here lsaPeer is a rank index with the LSA team (the
subset of communicator ranks that are load/store accessible). This is only a theoretical distinction for now but it
will become significant when the networking support for symmetric kernels is complete.
ncclGetLocalPointer¶
-
void*
ncclGetLocalPointer
(ncclWindow_t w, size_t offset)¶
Returns a load-store accessible pointer to the memory buffer of the current device within the window w. offset is
byte-based. This is just a shortcut version of ncclGetPeerPointer()
with devComm.rank as peer, or ncclGetLsaPointer()
with devComm.lsaRank as lsaPeer.
Multimem¶
ncclGetLsaMultimemPointer¶
-
void*
ncclGetLsaMultimemPointer
(ncclWindow_t w, size_t offset, ncclDevComm const& devComm)¶
Returns a multicast memory pointer associated with the window w and device communicator devComm. offset is byte-based. Availability of multicast memory is hardware-dependent.