Device API – Memory and LSA
This page documents device-side memory and LSA (load/store accessible) functionality. For host-accessible device pointer functions, see Host-Accessible Device Pointer Functions in the setup guide.
LSA
Device functions. The following are callable from device (GPU) code only. LSA is used by the pointer accessors below.
ncclLsaBarrierSession
-
template<typename Coop>
class ncclLsaBarrierSession A class representing a memory barrier session.
-
ncclLsaBarrierSession(Coop coop, ncclDevComm const &comm, ncclTeamTagLsa tag, uint32_t index, bool multimem = false)
Initializes a new memory barrier session. coop represents a cooperative group (see Thread Groups). comm is the device communicator created using
ncclDevCommCreate(). ncclTeamTagLsa is here to indicate which subset of ranks the barrier will apply to. The identifier of the underlying barrier to use is provided by index (it should be different for each coop; typically set toblockIdx.xto ensure uniqueness between CTAs). multimem requests a hardware-accelerated implementation using memory multicast.
-
void arrive(Coop, cuda::memory_order order)
Signals the arrival of the thread at the barrier session.
-
ncclLsaBarrierSession(Coop coop, ncclDevComm const &comm, ncclTeamTagLsa tag, uint32_t index, bool multimem = false)
ncclGetPeerPointer
-
void *ncclGetPeerPointer(ncclWindow_t w, size_t offset, int peer)
Returns a load/store accessible pointer to the memory buffer of device peer within the window w. offset is byte-based. peer is a rank index within the world team (see Teams). This function will return NULL if the peer is not within the LSA team.
ncclGetLsaPointer
-
void *ncclGetLsaPointer(ncclWindow_t w, size_t offset, int lsaPeer)
Returns a load/store accessible pointer to the memory buffer of device lsaPeer within the window w. offset is byte-based. This is similar to
ncclGetPeerPointer, but here lsaPeer is a rank index within the LSA team (see Teams). For high-level reduce and copy operations over LSA memory, see Device API – Remote Reduce and Copy: Building Blocks for Custom Communication Kernels.
ncclGetLocalPointer
-
void *ncclGetLocalPointer(ncclWindow_t w, size_t offset)
Returns a load-store accessible pointer to the memory buffer of the current device within the window w. offset is byte-based. This is just a shortcut version of
ncclGetPeerPointerwith devComm.rank as peer, orncclGetLsaPointerwith devComm.lsaRank as lsaPeer.
Multimem
ncclGetLsaMultimemPointer
-
void *ncclGetLsaMultimemPointer(ncclWindow_t w, size_t offset, ncclDevComm const &devComm)
Returns a multicast memory pointer associated with the window w and device communicator devComm. offset is byte-based. Availability of multicast memory is hardware-dependent.