.. _subsec:coll:

Collective Communication
------------------------

*Collective routines* are defined as coordinated communication or synchronization
operations performed by a group of PEs. NVSHMEM provides the following types of
collective routines:

#. Collective routines that operate on active sets use a set of parameters to
   determine which PEs will participate and what resources are used to perform
   operations.

#. Collective routines that accept no active set parameters implicitly operate
   on all PEs.

Concurrent accesses to symmetric memory by an NVSHMEM collective routine and by
any other means of access, where at least one access updates the symmetric
memory, result in undefined behavior. Since PEs can enter and exit collectives
at different times, accessing such memory remotely may require additional
synchronization.

Active-set-based collectives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The active-set-based collective routines require all PEs in the active set to
simultaneously call the routine. If a PE that is not in the active set calls
the collective routine, the behavior is undefined.

The active set is defined by the arguments *PE_start*, *logPE_stride*, and
*PE_size*. *PE_start* specifies the starting PE number and is the lowest
numbered PE in the active set. The stride between successive PEs in the active
set is :math:`2^{logPE\_stride}` and *logPE_stride* must be greater than or
equal to zero. *PE_size* specifies the number of PEs in the active set and must
be greater than zero. The active set must satisfy the requirement that its last
member corresponds to a valid PE number, that is
:math:`0 \le PE\_start + (PE\_size - 1) \cdot 2^{logPE\_stride} < npes`. All
PEs participating in the active-set-based collective routine must provide the
same values for these arguments. If any of these requirements are not met, the
behavior is undefined.

Another argument important to active-set-based collective routines is *pSync*,
which is a symmetric work array. All PEs participating in an active-set-based
collective must pass the same *pSync* array. Every element of the *pSync* array
must be initialized to *NVSHMEM_SYNC_VALUE* before it is used as an argument to
any active-set-based collective routine. On completion of such a collective
call, the *pSync* array is restored to its original contents. The user is
permitted to reuse a *pSync* array if all previous collective routines using
the *pSync* array have completed on all participating PEs. One can use a
synchronization collective routine such as *nvshmem_barrier* to ensure
completion of previous active-set-based collective routines. The
*nvshmem_barrier* and *nvshmem_sync* routines allow the same *pSync* array to
be used on consecutive calls as long as the PEs in the active set do not
change.

All collective routines defined by NVSHMEM are blocking; they return only on
completion. The active-set-based collective routines defined by NVSHMEM are:

- *nvshmem_barrier*
- *nvshmem_sync*
- *nvshmem_alltoall{32, 64}*
- *nvshmem_broadcast{32, 64}*
- *nvshmem_collect{32, 64}*
- *nvshmem_TYPENAME_{and, or, xor, max, min, sum, prod}_to_all*

Implicit active set collectives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some NVSHMEM collective routines implicitly operate on all PEs. These routines
include:

- *nvshmem_sync_all*, which synchronizes all PEs in the computation.
- *nvshmem_barrier_all*, which synchronizes all PEs in the computation and
  ensures completion of all local and remote memory updates.
- NVSHMEM memory-management routines, which imply one or more calls to a
  routine equivalent to *nvshmem_barrier_all*.
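The following host-side sketch, written for this discussion rather than taken
from the NVSHMEM distribution, shows the pieces described above: a symmetric
*pSync* array initialized to *NVSHMEM_SYNC_VALUE*, and an active set consisting
of the even-numbered PEs (*PE_start* = 0, *logPE_stride* = 1,
*PE_size* = *npes*/2, assuming an even number of PEs) used with
*nvshmem_barrier*.

.. code-block:: c

   #include <nvshmem.h>
   #include <cuda_runtime.h>

   int main(void) {
       nvshmem_init();
       int mype = nvshmem_my_pe();
       int npes = nvshmem_n_pes();   /* assume npes is even for this sketch */

       /* Symmetric pSync work array, sized for barrier operations. */
       long *pSync = (long *) nvshmem_malloc(NVSHMEM_BARRIER_SYNC_SIZE * sizeof(long));

       /* Symmetric memory lives on the GPU, so initialize pSync to
        * NVSHMEM_SYNC_VALUE by staging through a host buffer. */
       long h_init[NVSHMEM_BARRIER_SYNC_SIZE];
       for (int i = 0; i < NVSHMEM_BARRIER_SYNC_SIZE; i++)
           h_init[i] = NVSHMEM_SYNC_VALUE;
       cudaMemcpy(pSync, h_init, sizeof(h_init), cudaMemcpyHostToDevice);
       cudaDeviceSynchronize();

       /* Make sure every PE has initialized its pSync before any PE uses it. */
       nvshmem_barrier_all();

       /* Active set of the even-numbered PEs:
        * PE_start = 0, stride = 2^1 = 2, PE_size = npes / 2.
        * Only PEs in the active set may make the call. */
       if (mype % 2 == 0)
           nvshmem_barrier(0 /*PE_start*/, 1 /*logPE_stride*/, npes / 2 /*PE_size*/, pSync);

       nvshmem_free(pSync);
       nvshmem_finalize();
       return 0;
   }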
.. _subsec:shmem_barrier_all:

**NVSHMEM_BARRIER_ALL**
~~~~~~~~~~~~~~~~~~~~~~~~

.. c:function:: void nvshmem_barrier_all(void)

.. c:function:: void nvshmemx_barrier_all_on_stream(cudaStream_t stream)

.. c:function:: __device__ void nvshmem_barrier_all(void)

.. c:function:: __device__ void nvshmemx_barrier_all_block(void)

.. c:function:: __device__ void nvshmemx_barrier_all_warp(void)

**Description**

The *nvshmem_barrier_all* routine is a mechanism for synchronizing all PEs at
once. This routine blocks the calling PE until all PEs have called
*nvshmem_barrier_all*. In a multithreaded NVSHMEM program, only the calling
thread is blocked; however, the routine may not be called concurrently by
multiple threads in the same PE.

Prior to synchronizing with other PEs, *nvshmem_barrier_all* ensures completion
of all previously issued memory stores and remote memory updates issued via
NVSHMEM AMO and RMA routine calls such as *nvshmem_int_add*, *nvshmem_put32*,
*nvshmem_put_nbi*, and *nvshmem_get_nbi*.

**Returns**

None.

**Notes**

The *nvshmem_barrier_all* routine can be used to portably ensure that memory
access operations observe remote updates in the order enforced by initiator
PEs.

Ordering APIs (*nvshmem_fence*, *nvshmem_quiet*, *nvshmem_barrier*, and
*nvshmem_barrier_all*) issued on the CPU and the GPU only order communication
operations that were issued from the CPU and the GPU, respectively. To ensure
completion of GPU-side operations from the CPU, the developer must ensure
completion of the CUDA kernel from which the GPU-side operations were issued,
using operations like *cudaStreamSynchronize* or *cudaDeviceSynchronize*.
Stream-based quiet or barrier operations have the effect of a barrier being
executed on the GPU in stream order, ensuring completion and ordering of only
GPU-side operations.
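**Example**

A minimal host-side sketch of the completion guarantee described above; the
buffer names and the neighbor pattern are illustrative and not part of the API.
Each PE deposits a value into its right neighbor, and *nvshmem_barrier_all*
both completes the transfers and synchronizes the PEs before the result is read
back.

.. code-block:: c

   #include <stdio.h>
   #include <nvshmem.h>
   #include <cuda_runtime.h>

   int main(void) {
       nvshmem_init();
       int mype = nvshmem_my_pe();
       int npes = nvshmem_n_pes();
       int peer = (mype + 1) % npes;

       /* Symmetric source and destination, one int each. */
       int *source = (int *) nvshmem_malloc(sizeof(int));
       int *dest   = (int *) nvshmem_malloc(sizeof(int));

       /* Stage this PE's value into the symmetric source buffer. */
       cudaMemcpy(source, &mype, sizeof(int), cudaMemcpyHostToDevice);
       cudaDeviceSynchronize();   /* the copy must land before the put reads it */

       /* Host-initiated RMA put of one int into the right neighbor's dest. */
       nvshmem_int_put(dest, source, 1, peer);

       /* Completes the put issued by every PE and synchronizes all PEs, so it
        * is now safe for each PE to read the value deposited in its dest. */
       nvshmem_barrier_all();

       int received;
       cudaMemcpy(&received, dest, sizeof(int), cudaMemcpyDeviceToHost);
       printf("PE %d received %d from PE %d\n", mype, received, (mype - 1 + npes) % npes);

       nvshmem_free(dest);
       nvshmem_free(source);
       nvshmem_finalize();
       return 0;
   }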
.. _subsec:shmem_barrier:

**NVSHMEM_BARRIER**
~~~~~~~~~~~~~~~~~~~~

.. c:function:: void nvshmem_barrier(int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmemx_barrier_on_stream(int PE_start, int logPE_stride, int PE_size, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_barrier(int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_barrier_block(int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_barrier_warp(int PE_start, int logPE_stride, int PE_size, long *pSync)

*PE_start [IN]*
   The lowest PE number of the active set of PEs.

*logPE_stride [IN]*
   The log (base 2) of the stride between consecutive PE numbers in the active
   set.

*PE_size [IN]*
   The number of PEs in the active set.

*pSync [IN]*
   Symmetric address of a work array of size at least
   *NVSHMEM_BARRIER_SYNC_SIZE*.

**Description**

*nvshmem_barrier* is a collective synchronization routine over an active set.
Control returns from *nvshmem_barrier* after all PEs in the active set
(specified by *PE_start*, *logPE_stride*, and *PE_size*) have called
*nvshmem_barrier*.

As with all NVSHMEM collective routines, each of these routines assumes that
only PEs in the active set call the routine. If a PE not in the active set
calls an NVSHMEM collective routine, the behavior is undefined.

The values of arguments *PE_start*, *logPE_stride*, and *PE_size* must be the
same value on all PEs in the active set. The same work array must be passed in
*pSync* to all PEs in the active set.

*nvshmem_barrier* ensures that all previously issued stores and remote memory
updates, including AMOs and RMA operations, done by any of the PEs in the
active set are complete before returning.

The same *pSync* array may be reused on consecutive calls to *nvshmem_barrier*
if the same active set is used.

**Returns**

None.

**Notes**

The *nvshmem_barrier* routine can be used to portably ensure that memory access
operations observe remote updates in the order enforced by initiator PEs.

Ordering APIs (*nvshmem_fence*, *nvshmem_quiet*, *nvshmem_barrier*, and
*nvshmem_barrier_all*) issued on the CPU and the GPU only order communication
operations that were issued from the CPU and the GPU, respectively. To ensure
completion of GPU-side operations from the CPU, the developer must ensure
completion of the CUDA kernel from which the GPU-side operations were issued,
using operations like *cudaStreamSynchronize* or *cudaDeviceSynchronize*.
Stream-based quiet or barrier operations have the effect of a barrier being
executed on the GPU in stream order, ensuring completion and ordering of only
GPU-side operations.
.. _subsec:shmem_sync:

**NVSHMEM_SYNC**
~~~~~~~~~~~~~~~~~

.. c:function:: void nvshmem_sync(int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmemx_sync_on_stream(int PE_start, int logPE_stride, int PE_size, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_sync(int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_sync_block(int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_sync_warp(int PE_start, int logPE_stride, int PE_size, long *pSync)

*PE_start [IN]*
   The lowest PE number of the active set of PEs.

*logPE_stride [IN]*
   The log (base 2) of the stride between consecutive PE numbers in the active
   set.

*PE_size [IN]*
   The number of PEs in the active set.

*pSync [IN]*
   Symmetric address of a work array of size at least *NVSHMEM_SYNC_SIZE*.

**Description**

*nvshmem_sync* is a collective synchronization routine over an active set. The
routine registers the arrival of a PE at a synchronization point in the
program. This is a fast mechanism for synchronizing all PEs that participate in
this collective call. The routine blocks the calling PE until all PEs in the
specified active set have called *nvshmem_sync*. In a multithreaded NVSHMEM
program, only the calling thread is blocked.

Active-set-based sync routines operate over all PEs in the active set defined
by the *PE_start*, *logPE_stride*, *PE_size* triplet. As with all
active-set-based collective routines, each of these routines assumes that only
PEs in the active set call the routine. If a PE not in the active set calls an
active-set-based collective routine, the behavior is undefined.

The values of arguments *PE_start*, *logPE_stride*, and *PE_size* must be equal
on all PEs in the active set. The same work array must be passed in *pSync* to
all PEs in the active set.

In contrast with the *nvshmem_barrier* routine, *nvshmem_sync* only ensures
completion and visibility of previously issued memory stores and does not
ensure completion of remote memory updates issued via NVSHMEM routines.

The same *pSync* array may be reused on consecutive calls to *nvshmem_sync* if
the same active set is used.

**Returns**

None.

**Notes**

The *nvshmem_sync* routine can be used to portably ensure that memory access
operations observe remote updates in the order enforced by the initiator PEs,
provided that the initiator PE ensures completion of remote updates with a call
to *nvshmem_quiet* prior to the call to the *nvshmem_sync* routine.
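**Example**

A host-side sketch of the note above, assuming at least two PEs; the buffer
names are illustrative. Because *nvshmem_sync* does not complete remote
updates, the initiating PE calls *nvshmem_quiet* before entering the
synchronization.

.. code-block:: c

   #include <nvshmem.h>
   #include <cuda_runtime.h>

   int main(void) {
       nvshmem_init();
       int mype = nvshmem_my_pe();
       int npes = nvshmem_n_pes();

       int  *flag  = (int *)  nvshmem_malloc(sizeof(int));
       int  *src   = (int *)  nvshmem_malloc(sizeof(int));
       long *pSync = (long *) nvshmem_malloc(NVSHMEM_SYNC_SIZE * sizeof(long));

       /* Initialize the symmetric buffers from the host. */
       long h_sync[NVSHMEM_SYNC_SIZE];
       for (int i = 0; i < NVSHMEM_SYNC_SIZE; i++) h_sync[i] = NVSHMEM_SYNC_VALUE;
       cudaMemcpy(pSync, h_sync, sizeof(h_sync), cudaMemcpyHostToDevice);
       int value = 42;
       cudaMemcpy(src, &value, sizeof(int), cudaMemcpyHostToDevice);
       cudaDeviceSynchronize();
       nvshmem_barrier_all();                  /* every PE's pSync is initialized */

       if (mype == 0) {
           nvshmem_int_put(flag, src, 1, 1);   /* remote update of PE 1's flag  */
           nvshmem_quiet();                    /* complete it before the sync   */
       }

       /* Active set = all PEs: PE_start = 0, logPE_stride = 0, PE_size = npes. */
       nvshmem_sync(0, 0, npes, pSync);

       if (mype == 1) {
           int observed;
           cudaMemcpy(&observed, flag, sizeof(int), cudaMemcpyDeviceToHost);
           /* observed == 42: PE 0's update is visible after the sync. */
       }

       nvshmem_free(pSync);
       nvshmem_free(src);
       nvshmem_free(flag);
       nvshmem_finalize();
       return 0;
   }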
.. _subsec:shmem_sync_all:

**NVSHMEM_SYNC_ALL**
~~~~~~~~~~~~~~~~~~~~~

.. c:function:: void nvshmem_sync_all(void)

.. c:function:: void nvshmemx_sync_all_on_stream(cudaStream_t stream)

.. c:function:: __device__ void nvshmem_sync_all(void)

.. c:function:: __device__ void nvshmemx_sync_all_block(void)

.. c:function:: __device__ void nvshmemx_sync_all_warp(void)

**Description**

This routine blocks the calling PE until all PEs in the world team have called
*nvshmem_sync_all*. In a multithreaded NVSHMEM program, only the calling thread
is blocked.

In contrast with the *nvshmem_barrier_all* routine, *nvshmem_sync_all* only
ensures completion and visibility of previously issued memory stores and does
not ensure completion of remote memory updates issued via NVSHMEM routines.

**Returns**

None.

**Notes**

The *nvshmem_sync_all* routine is equivalent to calling *nvshmem_team_sync* on
the world team.

.. _subsec:shmem_alltoall:

**NVSHMEM_ALLTOALL**
~~~~~~~~~~~~~~~~~~~~~

.. c:function:: void nvshmem_alltoall32(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmemx_alltoall32_on_stream(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_alltoall32(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_alltoall32_block(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_alltoall32_warp(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmem_alltoall64(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmemx_alltoall64_on_stream(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_alltoall64(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_alltoall64_block(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_alltoall64_warp(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

*dest [OUT]*
   Symmetric address of a data object large enough to receive the combined
   total of *nelems* elements from each PE in the active set. The type of
   *dest* should match that implied in the SYNOPSIS section.

*source [IN]*
   Symmetric address of a data object that contains *nelems* elements of data
   for each PE in the active set, ordered according to destination PE. The type
   of *source* should match that implied in the SYNOPSIS section.

*nelems [IN]*
   The number of elements to exchange for each PE. For *nvshmem_alltoallmem*,
   elements are bytes; for *nvshmem_alltoall{32,64}*, elements are 4 or 8
   bytes, respectively.

**Description**

The *nvshmem_alltoall* routines are collective routines. Each PE participating
in the operation exchanges *nelems* data elements with all other PEs
participating in the operation. The size of a data element is:

- 32 bits for *nvshmem_alltoall32*
- 64 bits for *nvshmem_alltoall64*

The data being sent and received are stored in a contiguous symmetric data
object. The total size of each PE's *source* object and *dest* object is
*nelems* times the size of an element times *N*, where *N* equals the number of
PEs participating in the operation. The *source* object contains *N* blocks of
data (where the size of each block is defined by *nelems*) and each block of
data is sent to a different PE. The same *dest* and *source* arrays and the
same value of *nelems* must be passed by all PEs that participate in the
collective.

Given a PE *i* that is the :math:`k^{\text{th}}` PE participating in the
operation and a PE *j* that is the :math:`l^{\text{th}}` PE participating in
the operation, PE *i* sends the :math:`l^{\text{th}}` block of its *source*
object to the :math:`k^{\text{th}}` block of the *dest* object of PE *j*.

Active-set-based collective routines operate over all PEs in the active set
defined by the *PE_start*, *logPE_stride*, *PE_size* triplet. As with all
active-set-based collective routines, this routine assumes that only PEs in the
active set call the routine. If a PE not in the active set calls an
active-set-based collective routine, the behavior is undefined.

The values of arguments *PE_start*, *logPE_stride*, and *PE_size* must be equal
on all PEs in the active set. The same *pSync* work array must be passed to all
PEs in the active set.

Before any PE calls an *nvshmem_alltoall* routine, the following conditions
must be ensured:

- The *dest* data object on all PEs in the active set is ready to accept the
  *nvshmem_alltoall* data.
- For active-set-based routines, the *pSync* array on all PEs in the active set
  is not still in use from a prior call to an *nvshmem_alltoall* routine.

Otherwise, the behavior is undefined.

Upon return from an *nvshmem_alltoall* routine, the following is true for the
local PE:

- Its *dest* symmetric data object is completely updated and the data has been
  copied out of the *source* data object.
- For active-set-based routines, the values in the *pSync* array are restored
  to the original values.

**Returns**

None.
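**Example**

A host-side sketch of a 32-bit all-to-all over all PEs; the data values are
illustrative, and *NVSHMEM_ALLTOALL_SYNC_SIZE* is assumed to be the usual
constant provided for sizing the *pSync* array of this routine.

.. code-block:: c

   #include <stdlib.h>
   #include <nvshmem.h>
   #include <cuda_runtime.h>

   int main(void) {
       nvshmem_init();
       int mype = nvshmem_my_pe();
       int npes = nvshmem_n_pes();
       size_t nelems = 2;            /* 32-bit elements sent to each destination PE */

       /* source holds npes blocks of nelems ints (one block per destination);
        * dest receives one block of nelems ints from every PE. */
       int  *source = (int *)  nvshmem_malloc(npes * nelems * sizeof(int));
       int  *dest   = (int *)  nvshmem_malloc(npes * nelems * sizeof(int));
       long *pSync  = (long *) nvshmem_malloc(NVSHMEM_ALLTOALL_SYNC_SIZE * sizeof(long));

       /* Stage initial values through host buffers (symmetric memory is GPU memory). */
       int  *h_src = (int *) malloc(npes * nelems * sizeof(int));
       long  h_sync[NVSHMEM_ALLTOALL_SYNC_SIZE];
       for (size_t i = 0; i < npes * nelems; i++) h_src[i] = mype * 100 + (int) i;
       for (int i = 0; i < NVSHMEM_ALLTOALL_SYNC_SIZE; i++) h_sync[i] = NVSHMEM_SYNC_VALUE;
       cudaMemcpy(source, h_src, npes * nelems * sizeof(int), cudaMemcpyHostToDevice);
       cudaMemcpy(pSync, h_sync, sizeof(h_sync), cudaMemcpyHostToDevice);
       cudaDeviceSynchronize();
       nvshmem_barrier_all();        /* dest and pSync are ready on every PE */

       /* Active set = all PEs (PE_start = 0, logPE_stride = 0, PE_size = npes).
        * Block j of this PE's source ends up in this PE's slot of PE j's dest. */
       nvshmem_alltoall32(dest, source, nelems, 0, 0, npes, pSync);

       free(h_src);
       nvshmem_free(pSync); nvshmem_free(dest); nvshmem_free(source);
       nvshmem_finalize();
       return 0;
   }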
.. _subsec:shmem_broadcast:

**NVSHMEM_BROADCAST**
~~~~~~~~~~~~~~~~~~~~~~

.. c:function:: void nvshmem_broadcast32(void *dest, const void *source, size_t nelems, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmemx_broadcast32_on_stream(void *dest, const void *source, size_t nelems, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_broadcast32(void *dest, const void *source, size_t nelems, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_broadcast32_block(void *dest, const void *source, size_t nelems, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_broadcast32_warp(void *dest, const void *source, size_t nelems, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmem_broadcast64(void *dest, const void *source, size_t nelems, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmemx_broadcast64_on_stream(void *dest, const void *source, size_t nelems, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_broadcast64(void *dest, const void *source, size_t nelems, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_broadcast64_block(void *dest, const void *source, size_t nelems, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_broadcast64_warp(void *dest, const void *source, size_t nelems, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync)

*dest [OUT]*
   Symmetric address of the destination data object. The type of *dest* should
   match that implied in the SYNOPSIS section.

*source [IN]*
   Symmetric address of the source data object. The type of *source* should
   match that implied in the SYNOPSIS section.

*nelems [IN]*
   The number of elements in the *source* and *dest* arrays. For
   *nvshmem_broadcastmem*, elements are bytes; for *nvshmem_broadcast{32,64}*,
   elements are 4 or 8 bytes, respectively.

*PE_root [IN]*
   Zero-based ordinal of the PE, with respect to the active set, from which the
   data is copied.

**Description**

NVSHMEM broadcast routines are collective routines over an active set. They
copy the *source* data object on the PE specified by *PE_root* to the *dest*
data object on the PEs participating in the collective operation. The same
*dest* and *source* data objects and the same value of *PE_root* must be passed
by all PEs participating in the collective operation.

For active-set-based broadcasts:

- The *dest* object is updated on all PEs other than the root PE.
- All PEs in the active set defined by the *PE_start*, *logPE_stride*,
  *PE_size* triplet must participate in the operation.
- Only PEs in the active set may call the routine. If a PE not in the active
  set calls an active-set-based collective routine, the behavior is undefined.
- The values of arguments *PE_root*, *PE_start*, *logPE_stride*, and *PE_size*
  must be the same value on all PEs in the active set.
- The value of *PE_root* must be between *0* and *PE_size* - 1.
- The same *pSync* work array must be passed by all PEs in the active set.

Before any PE calls a broadcast routine, the following conditions must be
ensured:

- The *dest* array on all PEs participating in the broadcast is ready to accept
  the broadcast data.
- For active-set-based broadcasts, the *pSync* array on all PEs in the active
  set is not still in use from a prior call to an NVSHMEM collective routine.

Otherwise, the behavior is undefined.

Upon return from a broadcast routine, the following are true for the local PE:

- For active-set-based broadcasts:

  - If the current PE is not the root PE, the *dest* data object is updated.
  - The values in the *pSync* array are restored to the original values.

- The *source* data object may be safely reused.

**Returns**

For active-set-based broadcasts, none.

**Notes**

Team handle error checking and integer return codes are currently undefined.
Implementations may define these behaviors as needed, but programs should
ensure portability by doing their own checks for invalid team handles and for
*NVSHMEM_TEAM_INVALID*.
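**Example**

A host-side sketch of a 32-bit broadcast from the first PE of the active set to
all other PEs; the data values are illustrative, and *NVSHMEM_BCAST_SYNC_SIZE*
is assumed to be the usual constant provided for sizing the broadcast *pSync*
array.

.. code-block:: c

   #include <nvshmem.h>
   #include <cuda_runtime.h>

   #define NELEMS 8

   int main(void) {
       nvshmem_init();
       int mype = nvshmem_my_pe();
       int npes = nvshmem_n_pes();

       int  *source = (int *)  nvshmem_malloc(NELEMS * sizeof(int));
       int  *dest   = (int *)  nvshmem_malloc(NELEMS * sizeof(int));
       long *pSync  = (long *) nvshmem_malloc(NVSHMEM_BCAST_SYNC_SIZE * sizeof(long));

       /* The root (ordinal 0 of the active set, i.e. PE 0 here) fills its source. */
       int  h_src[NELEMS];
       long h_sync[NVSHMEM_BCAST_SYNC_SIZE];
       for (int i = 0; i < NELEMS; i++) h_src[i] = i + 1;
       for (int i = 0; i < NVSHMEM_BCAST_SYNC_SIZE; i++) h_sync[i] = NVSHMEM_SYNC_VALUE;
       if (mype == 0)
           cudaMemcpy(source, h_src, sizeof(h_src), cudaMemcpyHostToDevice);
       cudaMemcpy(pSync, h_sync, sizeof(h_sync), cudaMemcpyHostToDevice);
       cudaDeviceSynchronize();
       nvshmem_barrier_all();        /* dest and pSync are ready on every PE */

       /* Broadcast NELEMS 32-bit elements from active-set ordinal 0 to the
        * dest array of every other PE in the active set (all PEs here). */
       nvshmem_broadcast32(dest, source, NELEMS, 0 /*PE_root*/,
                           0 /*PE_start*/, 0 /*logPE_stride*/, npes /*PE_size*/, pSync);

       /* Note: dest on the root PE itself is not updated by the broadcast. */

       nvshmem_free(pSync); nvshmem_free(dest); nvshmem_free(source);
       nvshmem_finalize();
       return 0;
   }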
.. _subsec:shmem_collect:

**NVSHMEM_COLLECT**
~~~~~~~~~~~~~~~~~~~~

.. c:function:: void nvshmem_collect32(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmemx_collect32_on_stream(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_collect32(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_collect32_block(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_collect32_warp(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmem_collect64(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: void nvshmemx_collect64_on_stream(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_collect64(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_collect64_block(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

.. c:function:: __device__ void nvshmemx_collect64_warp(void *dest, const void *source, size_t nelems, int PE_start, int logPE_stride, int PE_size, long *pSync)

*dest [OUT]*
   Symmetric address of an array large enough to accept the concatenation of
   the *source* arrays on all participating PEs. The type of *dest* should
   match that implied in the SYNOPSIS section.

*source [IN]*
   Symmetric address of the source data object. The type of *source* should
   match that implied in the SYNOPSIS section.

*nelems [IN]*
   The number of elements in the *source* array. For *nvshmem_[f]collectmem*,
   elements are bytes; for *nvshmem_[f]collect{32,64}*, elements are 4 or 8
   bytes, respectively.

**Description**

NVSHMEM *collect* routines perform a collective operation to concatenate
*nelems* data items from the *source* array into the *dest* array, over an
active set in processor number order. The resultant *dest* array contains the
contribution from PEs as follows:

- For an active set, the data from PE *PE_start* is first, then the
  contribution from PE *PE_start* + :math:`2^{logPE\_stride}` second, and so
  on.

The collected result is written to the *dest* array for all PEs that
participate in the operation. The same *dest* and *source* arrays must be
passed by all PEs that participate in the operation.

The *collect* routines allow *nelems* to vary from PE to PE.

Active-set-based collective routines operate over all PEs in the active set
defined by the *PE_start*, *logPE_stride*, *PE_size* triplet. As with all
active-set-based collective routines, each of these routines assumes that only
PEs in the active set call the routine. If a PE not in the active set calls
this collective routine, the behavior is undefined.

The values of arguments *PE_start*, *logPE_stride*, and *PE_size* must be the
same value on all PEs in the active set. The same *pSync* work array must be
passed by all PEs in the active set.

Upon return from a collective routine, the following are true for the local PE:

- The *dest* array is updated and the *source* array may be safely reused.
- For active-set-based collective routines, the values in the *pSync* array are
  restored to the original values.

**Returns**

None.

**Notes**

The collective routines operate on active PE sets that have a non-power-of-two
*PE_size* with some performance degradation. They operate with no performance
degradation when *nelems* is a non-power-of-two value.
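**Example**

A host-side sketch of a 32-bit collect in which each PE contributes a different
number of elements; the names and values are illustrative, and
*NVSHMEM_COLLECT_SYNC_SIZE* is assumed to be the usual constant provided for
sizing the collect *pSync* array.

.. code-block:: c

   #include <stdlib.h>
   #include <nvshmem.h>
   #include <cuda_runtime.h>

   int main(void) {
       nvshmem_init();
       int mype = nvshmem_my_pe();
       int npes = nvshmem_n_pes();

       size_t my_nelems    = (size_t) mype + 1;              /* varies per PE */
       size_t total_nelems = (size_t) npes * (npes + 1) / 2; /* sum over the active set */

       /* nvshmem_malloc is collective and must be called with the same size on
        * every PE, so size the symmetric buffers for the largest need. */
       int  *source = (int *)  nvshmem_malloc(npes * sizeof(int));
       int  *dest   = (int *)  nvshmem_malloc(total_nelems * sizeof(int));
       long *pSync  = (long *) nvshmem_malloc(NVSHMEM_COLLECT_SYNC_SIZE * sizeof(long));

       int *h_src = (int *) malloc(my_nelems * sizeof(int));
       long h_sync[NVSHMEM_COLLECT_SYNC_SIZE];
       for (size_t i = 0; i < my_nelems; i++) h_src[i] = mype;
       for (int i = 0; i < NVSHMEM_COLLECT_SYNC_SIZE; i++) h_sync[i] = NVSHMEM_SYNC_VALUE;
       cudaMemcpy(source, h_src, my_nelems * sizeof(int), cudaMemcpyHostToDevice);
       cudaMemcpy(pSync, h_sync, sizeof(h_sync), cudaMemcpyHostToDevice);
       cudaDeviceSynchronize();
       nvshmem_barrier_all();          /* dest and pSync are ready everywhere */

       /* Concatenate each PE's contribution, in PE order, into dest on all PEs.
        * dest ends up as { 0, 1, 1, 2, 2, 2, ... } on every PE. */
       nvshmem_collect32(dest, source, my_nelems, 0, 0, npes, pSync);

       free(h_src);
       nvshmem_free(pSync); nvshmem_free(dest); nvshmem_free(source);
       nvshmem_finalize();
       return 0;
   }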
.. _subsec:shmem_reductions:

**NVSHMEM_REDUCTIONS**
~~~~~~~~~~~~~~~~~~~~~~~

.. _reducetypes:

.. table:: Reduction Types, Names and Supporting Operations

   ================ ========== ============ ======== =========
   *TYPE*           *TYPENAME*
   ================ ========== ============ ======== =========
   short            short      AND, OR, XOR MAX, MIN SUM, PROD
   int              int        AND, OR, XOR MAX, MIN SUM, PROD
   long             long       AND, OR, XOR MAX, MIN SUM, PROD
   long long        longlong   AND, OR, XOR MAX, MIN SUM, PROD
   float            float                   MAX, MIN SUM, PROD
   double           double                  MAX, MIN SUM, PROD
   double \_Complex complexd                         SUM, PROD
   float \_Complex  complexf                         SUM, PROD
   ================ ========== ============ ======== =========

.. _subsec:shmem_and_reduce:

AND
^^^

Performs a bitwise AND reduction across a set of PEs.

.. c:function:: void nvshmem_TYPENAME_and_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: void nvshmemx_TYPENAME_and_to_all_on_stream(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_TYPENAME_and_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_and_to_all_block(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_and_to_all_warp(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

where *TYPE* is one of the integer types supported for the AND operation and
has a corresponding *TYPENAME* as specified by Table :ref:`asetreducetypes`.

.. _subsec:shmem_or_reduce:

OR
^^

Performs a bitwise OR reduction across a set of PEs.

.. c:function:: void nvshmem_TYPENAME_or_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: void nvshmemx_TYPENAME_or_to_all_on_stream(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_TYPENAME_or_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_or_to_all_block(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_or_to_all_warp(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

where *TYPE* is one of the integer types supported for the OR operation and has
a corresponding *TYPENAME* as specified by Table :ref:`asetreducetypes`.

.. _subsec:shmem_xor_reduce:

XOR
^^^

Performs a bitwise exclusive OR (XOR) reduction across a set of PEs.
.. c:function:: void nvshmem_TYPENAME_xor_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: void nvshmemx_TYPENAME_xor_to_all_on_stream(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_TYPENAME_xor_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_xor_to_all_block(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_xor_to_all_warp(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

where *TYPE* is one of the integer types supported for the XOR operation and
has a corresponding *TYPENAME* as specified by Table :ref:`asetreducetypes`.

.. _subsec:shmem_max_reduce:

MAX
^^^

Performs a maximum-value reduction across a set of PEs.

.. c:function:: void nvshmem_TYPENAME_max_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: void nvshmemx_TYPENAME_max_to_all_on_stream(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_TYPENAME_max_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_max_to_all_block(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_max_to_all_warp(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

where *TYPE* is one of the integer or real types supported for the MAX
operation and has a corresponding *TYPENAME* as specified by Table
:ref:`asetreducetypes`.

.. _subsec:shmem_min_reduce:

MIN
^^^

Performs a minimum-value reduction across a set of PEs.

.. c:function:: void nvshmem_TYPENAME_min_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: void nvshmemx_TYPENAME_min_to_all_on_stream(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_TYPENAME_min_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_min_to_all_block(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_min_to_all_warp(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

where *TYPE* is one of the integer or real types supported for the MIN
operation and has a corresponding *TYPENAME* as specified by Table
:ref:`asetreducetypes`.

.. _subsec:shmem_sum_reduce:

SUM
^^^

Performs a sum reduction across a set of PEs.
.. c:function:: void nvshmem_TYPENAME_sum_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: void nvshmemx_TYPENAME_sum_to_all_on_stream(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_TYPENAME_sum_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_sum_to_all_block(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_sum_to_all_warp(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

where *TYPE* is one of the integer, real, or complex types supported for the
SUM operation and has a corresponding *TYPENAME* as specified by Table
:ref:`asetreducetypes`.

.. _subsec:shmem_prod_reduce:

PROD
^^^^

Performs a product reduction across a set of PEs.

.. c:function:: void nvshmem_TYPENAME_prod_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: void nvshmemx_TYPENAME_prod_to_all_on_stream(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync, cudaStream_t stream)

.. c:function:: __device__ void nvshmem_TYPENAME_prod_to_all(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_prod_to_all_block(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

.. c:function:: __device__ void nvshmemx_TYPENAME_prod_to_all_warp(TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync)

where *TYPE* is one of the integer, real, or complex types supported for the
PROD operation and has a corresponding *TYPENAME* as specified by Table
:ref:`asetreducetypes`.

.. _asetreducetypes:

.. table:: Reduction Types, Names and Supporting Operations

   ================ ========== ============ ======== =========
   *TYPE*           *TYPENAME*
   ================ ========== ============ ======== =========
   short            short      AND, OR, XOR MAX, MIN SUM, PROD
   int              int        AND, OR, XOR MAX, MIN SUM, PROD
   long             long       AND, OR, XOR MAX, MIN SUM, PROD
   long long        longlong   AND, OR, XOR MAX, MIN SUM, PROD
   float            float                   MAX, MIN SUM, PROD
   double           double                  MAX, MIN SUM, PROD
   long double      longdouble              MAX, MIN SUM, PROD
   double \_Complex complexd                         SUM, PROD
   float \_Complex  complexf                         SUM, PROD
   ================ ========== ============ ======== =========

*dest [OUT]*
   Symmetric address of an array, of length *nreduce* elements, to receive the
   result of the reduction routines. The type of *dest* should match that
   implied in the SYNOPSIS section.

*source [IN]*
   Symmetric address of an array, of length *nreduce* elements, that contains
   one element for each separate reduction routine. The type of *source* should
   match that implied in the SYNOPSIS section.

*nreduce [IN]*
   The number of elements in the *dest* and *source* arrays.

**Description**

NVSHMEM reduction routines are collective routines over an active set that
compute one or more reductions across symmetric arrays on multiple PEs.
A reduction performs an associative binary routine across a set of values. The
*nreduce* argument determines the number of separate reductions to perform. The
*source* array on all PEs participating in the reduction provides one element
for each reduction. The results of the reductions are placed in the *dest*
array on all PEs participating in the reduction.

The *source* and *dest* arguments must either be the same symmetric address, or
two different symmetric addresses corresponding to buffers that do not overlap
in memory. That is, they must be completely overlapping or completely disjoint.

Active-set-based reduction routines operate over all PEs in the active set
defined by the *PE_start*, *logPE_stride*, *PE_size* triplet. As with all
active-set-based collective routines, each of these routines assumes that only
PEs in the active set call the routine. If a PE not in the active set calls an
active-set-based collective routine, the behavior is undefined.

The values of arguments *nreduce*, *PE_start*, *logPE_stride*, and *PE_size*
must be equal on all PEs in the active set. The same *pWrk* and *pSync* work
arrays must be passed to all PEs in the active set.

Before any PE calls a reduction routine, the following conditions must be
ensured:

- The *dest* array on all PEs participating in the reduction is ready to accept
  the results of the reduction.
- If using active-set-based routines, the *pWrk* and *pSync* arrays on all PEs
  in the active set are not still in use from a prior call to a collective
  NVSHMEM routine.

Otherwise, the behavior is undefined.

Upon return from a reduction routine, the following are true for the local PE:

- The *dest* array is updated and the *source* array may be safely reused.
- If using active-set-based routines, the values in the *pSync* array are
  restored to the original values.

The complex-typed interfaces are only provided for sum and product reductions.
When the *C* translation environment does not support complex types [1]_, an
NVSHMEM implementation is not required to provide support for these
complex-typed interfaces.

**Returns**

None.

.. [1] That is, under *C* language standards prior to *C99* or under *C11* when
   ``__STDC_NO_COMPLEX__`` is defined to 1.
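**Example**

A host-side sketch of an active-set sum reduction that ties together the *pWrk*
and *pSync* requirements above. The constants *NVSHMEM_REDUCE_SYNC_SIZE* and
*NVSHMEM_REDUCE_MIN_WRKDATA_SIZE*, and the *pWrk* sizing rule of
max(*nreduce*/2 + 1, *NVSHMEM_REDUCE_MIN_WRKDATA_SIZE*) elements, are assumed
to follow the usual OpenSHMEM-style conventions.

.. code-block:: c

   #include <nvshmem.h>
   #include <cuda_runtime.h>

   #define NREDUCE 4

   int main(void) {
       nvshmem_init();
       int mype = nvshmem_my_pe();
       int npes = nvshmem_n_pes();

       /* pWrk sizing: max(NREDUCE/2 + 1, NVSHMEM_REDUCE_MIN_WRKDATA_SIZE). */
       size_t pwrk_elems = NREDUCE / 2 + 1;
       if (pwrk_elems < NVSHMEM_REDUCE_MIN_WRKDATA_SIZE)
           pwrk_elems = NVSHMEM_REDUCE_MIN_WRKDATA_SIZE;

       int  *source = (int *)  nvshmem_malloc(NREDUCE * sizeof(int));
       int  *dest   = (int *)  nvshmem_malloc(NREDUCE * sizeof(int));
       int  *pWrk   = (int *)  nvshmem_malloc(pwrk_elems * sizeof(int));
       long *pSync  = (long *) nvshmem_malloc(NVSHMEM_REDUCE_SYNC_SIZE * sizeof(long));

       /* Initialize source (each PE contributes mype + i to reduction i) and
        * pSync (every element set to NVSHMEM_SYNC_VALUE). */
       int  h_src[NREDUCE];
       long h_sync[NVSHMEM_REDUCE_SYNC_SIZE];
       for (int i = 0; i < NREDUCE; i++) h_src[i] = mype + i;
       for (int i = 0; i < NVSHMEM_REDUCE_SYNC_SIZE; i++) h_sync[i] = NVSHMEM_SYNC_VALUE;
       cudaMemcpy(source, h_src, sizeof(h_src), cudaMemcpyHostToDevice);
       cudaMemcpy(pSync, h_sync, sizeof(h_sync), cudaMemcpyHostToDevice);
       cudaDeviceSynchronize();
       nvshmem_barrier_all();        /* dest, pWrk, and pSync ready everywhere */

       /* NREDUCE separate sum reductions over all PEs; dest[i] on every PE
        * becomes the sum over all PEs of (mype + i). */
       nvshmem_int_sum_to_all(dest, source, NREDUCE, 0, 0, npes, pWrk, pSync);

       nvshmem_free(pSync); nvshmem_free(pWrk);
       nvshmem_free(dest);  nvshmem_free(source);
       nvshmem_finalize();
       return 0;
   }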