Collective Communication

Collective routines are defined as coordinated communication or synchronization operations performed by a group of PEs.

NVSHMEM provides the following types of collective routines:

  1. Collective routines that operate on teams use a team handle parameter to determine which PEs will participate in the routine, and use resources encapsulated by the team object to perform operations. See Section Team Management for details on team management.
  2. Collective routines that accept no team parameters implicitly operate on all PEs.

Concurrent accesses to symmetric memory by an NVSHMEM collective routine and any other means of access—where at least one updates the symmetric memory—results in undefined behavior. Since PEs can enter and exit collectives at different times, accessing such memory remotely may require additional synchronization.

Team-based collectives

The team-based collective routines are performed with respect to a valid NVSHMEM team, which is specified by a team handle argument. Team-based collective operations require all PEs in the team to call the routine in order for the operation to complete. If an invalid team handle or NVSHMEM_TEAM_INVALID is passed to a team-based collective routine, the behavior is undefined.

All NVSHMEM teams-based collective operations are blocking routines. On return from a team-based collective call, the PE may immediately call another collective operation on that same team. Team-based collectives must occur in the same program order across all PEs in a team.

While NVSHMEM routines provide thread support according to the thread-support level provided at initialization (see Section Thread Support), team-based collective routines may not be called simultaneously by multiple threads on a given team.

The team-based collective routines defined in the NVSHMEM Specification are:

  • nvshmem_team_sync
  • nvshmem_[TYPENAME_]alltoall
  • nvshmem_[TYPENAME_]broadcast
  • nvshmem_[TYPENAME_]collect
  • nvshmem_[TYPENAME_]{and, or, xor, max, min, sum, prod}_reduce

In addition, all team creation functions are collective operations. In addition to the ordering and thread safety requirements described here, there are additional synchronization requirements on team creation operations. See Section Team Management for more details.

Implicit team collectives

Some NVSHMEM collective routines implicitly operate on all PEs. These routines include:

  • nvshmem_sync_all, which synchronizes all PEs in the computation.
  • nvshmem_barrier_all, which synchronizes all PEs in the computation and ensures completion of all local and remote memory updates.
  • NVSHMEM memory-management routines, which imply one or more calls to a routine equivalent to nvshmem_barrier_all.

Error codes returned from team-based collectives

Collective operations involving multiple PEs may return values indicating success while other PEs are still executing the collective operation. Return values indicating success or failure of a collective routine on one PE may not indicate that all PEs involved in the collective operation will return the same value. Some operations, such as team creation, must return identical return codes across multiple PEs.

NVSHMEM_BARRIER_ALL

void nvshmem_barrier_all(void)
void nvshmemx_barrier_all_on_stream(void, cudaStream_t stream)
__device__ void nvshmem_barrier_all(void)
__device__ void nvshmemx_barrier_all_block(void)
__device__ void nvshmemx_barrier_all_warp(void)

Description

The nvshmem_barrier_all routine is a mechanism for synchronizing all PEs at once. This routine blocks the calling PE until all PEs have called nvshmem_barrier_all. In a multithreaded NVSHMEM program, only the calling thread is blocked, however, it may not be called concurrently by multiple threads in the same PE.

Prior to synchronizing with other PEs, nvshmem_barrier_all ensures completion of all previously issued memory stores and remote memory updates issued NVSHMEMAMOs and RMA routine calls such as nvshmem_int_add, nvshmem_put32, nvshmem_put_nbi, and nvshmem_get_nbi.

Returns

None.

Notes

The nvshmem_barrier_all routine can be used to portably ensure that memory access operations observe remote updates in the order enforced by initiator PEs.

Ordering APIs (nvshmem_fence, nvshmem_quiet, nvshmem_barrier, and nvshmem_barrier_all) issued on the CPU and the GPU only order communication operations that were issued from the CPU and the GPU, respectively. To ensure completion of GPU-side operations from the CPU, the developer must perform a GPU-side quiet operation and ensure completion of the CUDA kernel from which the GPU-side operations were issued, using operations like cudaStreamSynchronize or cudaDeviceSynchronize. Alternatively, a stream-based quiet operation can be used. Stream-based quiet operations have the effect of a quiet being executed on the GPU in stream order, ensuring completion and ordering of only GPU-side operations.

NVSHMEM_BARRIER

void nvshmem_barrier(nvshmem_team_t team)
void nvshmemx_barrier_on_stream(nvshmem_team_t team, cudaStream_t stream)
__device__ void nvshmem_barrier(nvshmem_team_t team)
__device__ void nvshmemx_barrier_block(nvshmem_team_t team)
__device__ void nvshmemx_barrier_warp(nvshmem_team_t team)
team [IN]
The lowest PE number of the active set of PEs.

Description

nvshmem_barrier is a collective synchronization routine over a team. Control returns from nvshmem_barrier after all PEs in the team have called nvshmem_barrier.

As with all NVSHMEM collective routines, each of these routines assumes that only PEs in the active set call the routine. If a PE not in the active set calls an NVSHMEM collective routine, the behavior is undefined.

nvshmem_barrier ensures that all previously issued stores and remote memory updates, including AMOs and RMA operations, done by any of the PEs in the active set are complete before returning. On systems with only NVLink, updates are globally visible, whereas on systems with both NVLink and InfiniBand, NVSHMEM only guarantees that updates to the memory of a given PE are visible to that PE.

Returns

None.

Notes

The nvshmem_barrier routine can be used to portably ensure that memory access operations observe remote updates in the order enforced by initiator PEs.

Ordering APIs (nvshmem_fence, nvshmem_quiet, nvshmem_barrier, and nvshmem_barrier_all) issued on the CPU and the GPU only order communication operations that were issued from the CPU and the GPU, respectively. To ensure completion of GPU-side operations from the CPU, the developer must perform a GPU-side quiet operation and ensure completion of the CUDA kernel from which the GPU-side operations were issued, using operations like cudaStreamSynchronize or cudaDeviceSynchronize. Alternatively, a stream-based quiet operation can be used. Stream-based quiet operations have the effect of a quiet being executed on the GPU in stream order, ensuring completion and ordering of only GPU-side operations.

NVSHMEM_SYNC

int nvshmem_sync(nvshmem_team_t team)
int nvshmemx_sync_on_stream(nvshmem_team_t team, cudaStream_t stream)
__device__ int nvshmem_sync(nvshmem_team_t team)
__device__ int nvshmemx_sync_block(nvshmem_team_t team)
__device__ int nvshmemx_sync_warp(nvshmem_team_t team)
int nvshmem_team_sync(nvshmem_team_t team)
int nvshmemx_team_sync_on_stream(nvshmem_team_t team, cudaStream_t stream)
__device__ int nvshmem_team_sync(nvshmem_team_t team)
__device__ int nvshmemx_team_sync_block(nvshmem_team_t team)
__device__ int nvshmemx_team_sync_warp(nvshmem_team_t team)
team [IN]
The team over which to perform the operation.

Description

nvshmem_sync is a collective synchronization routine over a team.

The routine registers the arrival of a PE at a synchronization point in the program. This is a fast mechanism for synchronizing all PEs that participate in this collective call. The routine blocks the calling PE until all PEs in the specified team have called nvshmem_sync. In a multithreaded NVSHMEM program, only the calling thread is blocked.

Team-based sync routines operate over all PEs in the provided team argument. All PEs in the provided team must participate in the sync operation. If team compares equal to NVSHMEM_TEAM_INVALID or is otherwise invalid, the behavior is undefined.

In contrast with the nvshmem_barrier routine, nvshmem_sync only ensures completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates issued via NVSHMEM routines.

Returns

Zero on successful local completion. Nonzero otherwise.

Notes

The nvshmem_sync routine can be used to portably ensure that memory access operations observe remote updates in the order enforced by the initiator PEs, provided that the initiator PE ensures completion of remote updates with a call to nvshmem_quiet prior to the call to the nvshmem_sync routine.

NVSHMEM_SYNC_ALL

void nvshmem_sync_all(void)
void nvshmemx_sync_all_on_stream(void, cudaStream_t stream)
__device__ void nvshmem_sync_all(void)
__device__ void nvshmemx_sync_all_block(void)
__device__ void nvshmemx_sync_all_warp(void)

Description

This routine blocks the calling PE until all PEs in the world team have called nvshmem_sync_all.

In a multithreaded NVSHMEM program, only the calling thread is blocked.

In contrast with the nvshmem_barrier_all routine, nvshmem_sync_all only ensures completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates issued via NVSHMEM routines.

Returns

None.

Notes

The nvshmem_sync_all routine is equivalent to calling nvshmem_team_sync on the world team.

NVSHMEM_ALLTOALL

int nvshmem_TYPENAME_alltoall(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
int nvshmemx_TYPENAME_alltoall_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, cudaStream_t stream)
__device__ int nvshmem_TYPENAME_alltoall(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
__device__ int nvshmemx_TYPENAME_alltoall_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
__device__ int nvshmemx_TYPENAME_alltoall_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)

where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types and Names.

int nvshmem_alltoallmem(shmem_team_t team, void *dest, const void *source, size_t nelems)
int nvshmemx_alltoallmem_on_stream(shmem_team_t team, void *dest, const void *source, size_t nelems, cudaStream_t stream)
__device__ int nvshmem_alltoallmem(shmem_team_t team, void *dest, const void *source, size_t nelems)
__device__ int nvshmemx_alltoallmem_block(shmem_team_t team, void *dest, const void *source, size_t nelems)
__device__ int nvshmemx_alltoallmem_warp(shmem_team_t team, void *dest, const void *source, size_t nelems)
team [IN]
A valid NVSHMEM team handle to a team.
dest [OUT]
Symmetric address of a data object large enough to receive the combined total of nelems elements from each PE in the team. The type of dest should match that implied in the SYNOPSIS section.
source [IN]
Symmetric address of a data object that contains nelems elements of data for each PE in the team, ordered according to destination PE. The type of source should match that implied in the SYNOPSIS section.
nelems [IN]
The number of elements to exchange for each PE. For nvshmem_alltoallmem, elements are bytes; for nvshmem_alltoall{32,64}, elements are 4 or 8 bytes, respectively.

Description

The nvshmem_alltoall routines are collective routines. Each PE participating in the operation exchanges nelems data elements with all other PEs participating in the operation.

The data being sent and received are stored in a contiguous symmetric data object. The total size of each PE’s source object and dest object is nelems times the size of an element times N, where N equals the number of PEs participating in the operation. The source object contains N blocks of data (where the size of each block is defined by nelems) and each block of data is sent to a different PE.

The same dest and source arrays, and same value for nelems must be passed by all PEs that participate in the collective.

Given a PE i that is the \({\textit{k}^{\text{\tiny th}}}\) PE participating in the operation and a PE j that is the \({\textit{l}^{\text{\tiny th}}}\) PE participating in the operation,

PE i sends the \({\textit{l}^{\text{\tiny th}}}\) block of its source object to the \({\textit{k}^{\text{\tiny th}}}\) block of the dest object of PE j.

Team-based collect routines operate over all PEs in the provided team argument. All PEs in the provided team must participate in the collective. If team compares equal to NVSHMEM_TEAM_INVALID or is otherwise invalid, the behavior is undefined.

Before any PE calls a nvshmem_alltoall routine, the following conditions must be ensured:

  • The dest data object on all PEs in the active set is ready to accept the nvshmem_alltoall data.

Otherwise, the behavior is undefined.

Upon return from a nvshmem_alltoall routine, the following is true for the local PE:

  • Its dest symmetric data object is completely updated and the data has been copied out of the source data object.

Returns

Zero on successful local completion. Nonzero otherwise.

NVSHMEM_BROADCAST

int nvshmem_TYPENAME_broadcast(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, int PE_root)
int nvshmemx_TYPENAME_broadcast_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, int PE_root, cudaStream_t stream)
__device__ int nvshmem_TYPENAME_broadcast(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, int PE_root)
__device__ int nvshmemx_TYPENAME_broadcast_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, int PE_root)
__device__ int nvshmemx_TYPENAME_broadcast_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, int PE_root)

where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types and Names.

team [IN]
The team over which to perform the operation.
dest [OUT]
Symmetric address of destination data object. The type of dest should match that implied in the SYNOPSIS section.
source [IN]
Symmetric address of the source data object. The type of source should match that implied in the SYNOPSIS section.
nelems [IN]
The number of elements in source and dest arrays. For nvshmem_broadcastmem, elements are bytes; for nvshmem_broadcast{32,64}, elements are 4 or 8 bytes, respectively.
PE_root [IN]
Zero-based ordinal of the PE, with respect to the team, from which the data is copied.

Description

NVSHMEM broadcast routines are collective routines over a team. They copy the source data object on the PE specified by PE_root to the dest data object on the PEs participating in the collective operation. The same dest and source data objects and the same value of PE_root must be passed by all PEs participating in the collective operation.

For team-based broadcasts:

  • The dest object is updated on all PEs.
  • All PEs in the team argument must participate in the operation.
  • If team compares equal to NVSHMEM_TEAM_INVALID or is otherwise invalid, the behavior is undefined.
  • PE numbering is relative to the team. The specified root PE must be a valid PE number for the team, between 0 and N-1, where N is the size of the team.

Before any PE calls a broadcast routine, the following conditions must be ensured:

  • The dest array on all PEs participating in the broadcast is ready to accept the broadcast data.

Otherwise, the behavior is undefined.

Upon return from a broadcast routine, the following are true for the local PE:

  • For team-based broadcasts, the dest data object is updated.
  • The source data object may be safely reused.

Returns

For team-based broadcasts, zero on successful local completion; otherwise, nonzero.

Notes

Team handle error checking and integer return codes are currently undefined. Implementations may define these behaviors as needed, but programs should ensure portability by doing their own checks for invalid team handles and for NVSHMEM_TEAM_INVALID.

NVSHMEM_FCOLLECT

int nvshmem_TYPENAME_fcollect(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
int nvshmemx_TYPENAME_fcollect_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, cudaStream_t stream)
__device__ int nvshmem_TYPENAME_fcollect(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
__device__ int nvshmemx_TYPENAME_fcollect_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
__device__ int nvshmemx_TYPENAME_fcollect_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)

where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types and Names.

team [IN]
A valid NVSHMEM team handle.
dest [OUT]
Symmetric address of an array large enough to accept the concatenation of the source arrays on all participating PEs. The type of dest should match that implied in the SYNOPSIS section.
source [IN]
Symmetric address of the source data object. The type of source should match that implied in the SYNOPSIS section.
nelems [IN]
The number of elements in source array. For nvshmem_[f]collectmem, elements are bytes; for nvshmem_[f]collect{32,64}, elements are 4 or 8 bytes, respectively.

Description

NVSHMEM fcollect routines perform a collective operation to concatenate nelems data items from the source array into the dest array, over a team in processor number order. The resultant dest array contains the contribution from PEs as follows:

  • For a team, the data from PE number 0 in the team is first, then the contribution from PE 1 in the team, and so on.

The collected result is written to the dest array for all PEs that participate in the operation. The same dest and source arrays must be passed by all PEs that participate in the operation.

The

Team-based collect routines operate over all PEs in the provided team argument. All PEs in the provided team must participate in the operation. If team compares equal to NVSHMEM_TEAM_INVALID or is otherwise invalid, the behavior is undefined.

Upon return from a collective routine, the following are true for the local PE:

  • The dest array is updated and the source array may be safely reused.

Returns

Zero on successful local completion. Nonzero otherwise.

NVSHMEM_REDUCTIONS

Types, Names, and Operations for Team-Based Reductions
TYPE TYPENAME      
char char   MAX, MIN SUM, PROD
signed char schar   MAX, MIN SUM, PROD
short short   MAX, MIN SUM, PROD
int int   MAX, MIN SUM, PROD
long long   MAX, MIN SUM, PROD
long long longlong   MAX, MIN SUM, PROD
ptrdiff_t ptrdiff   MAX, MIN SUM, PROD
unsigned char uchar AND, OR, XOR MAX, MIN SUM, PROD
unsigned short ushort AND, OR, XOR MAX, MIN SUM, PROD
unsigned int uint AND, OR, XOR MAX, MIN SUM, PROD
unsigned long ulong AND, OR, XOR MAX, MIN SUM, PROD
unsigned long long ulonglong AND, OR, XOR MAX, MIN SUM, PROD
int8_t int8 AND, OR, XOR MAX, MIN SUM, PROD
int16_t int16 AND, OR, XOR MAX, MIN SUM, PROD
int32_t int32 AND, OR, XOR MAX, MIN SUM, PROD
int64_t int64 AND, OR, XOR MAX, MIN SUM, PROD
uint8_t uint8 AND, OR, XOR MAX, MIN SUM, PROD
uint16_t uint16 AND, OR, XOR MAX, MIN SUM, PROD
uint32_t uint32 AND, OR, XOR MAX, MIN SUM, PROD
uint64_t uint64 AND, OR, XOR MAX, MIN SUM, PROD
size_t size AND, OR, XOR MAX, MIN SUM, PROD
float float   MAX, MIN SUM, PROD
double double   MAX, MIN SUM, PROD

AND

Performs a bitwise AND reduction across a set of PEs.

int nvshmem_TYPENAME_and_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
int nvshmemx_TYPENAME_and_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
__device__ int nvshmem_TYPENAME_and_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_and_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_and_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)

where TYPE is one of the integer types supported for the AND operation as specified by Table Types, Names, and Operations for Team-Based Reductions.

OR

Performs a bitwise OR reduction across a set of PEs.

int nvshmem_TYPENAME_or_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
int nvshmemx_TYPENAME_or_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
__device__ int nvshmem_TYPENAME_or_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_or_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_or_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)

where TYPE is one of the integer types supported for the OR operation as specified by Table Types, Names, and Operations for Team-Based Reductions.

XOR

Performs a bitwise exclusive OR (XOR) reduction across a set of PEs.

int nvshmem_TYPENAME_xor_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
int nvshmemx_TYPENAME_xor_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
__device__ int nvshmem_TYPENAME_xor_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_xor_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_xor_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)

where TYPE is one of the integer types supported for the XOR operation as specified by Table Types, Names, and Operations for Team-Based Reductions.

MAX

Performs a maximum-value reduction across a set of PEs.

int nvshmem_TYPENAME_max_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
int nvshmemx_TYPENAME_max_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
__device__ int nvshmem_TYPENAME_max_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_max_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_max_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)

where TYPE is one of the integer or real types supported for the MAX operation as specified by Table Types, Names, and Operations for Team-Based Reductions.

MIN

Performs a minimum-value reduction across a set of PEs.

int nvshmem_TYPENAME_min_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
int nvshmemx_TYPENAME_min_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
__device__ int nvshmem_TYPENAME_min_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_min_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_min_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)

where TYPE is one of the integer or real types supported for the MIN operation as specified by Table Types, Names, and Operations for Team-Based Reductions.

SUM

Performs a sum reduction across a set of PEs.

int nvshmem_TYPENAME_sum_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
int nvshmemx_TYPENAME_sum_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
__device__ int nvshmem_TYPENAME_sum_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_sum_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_sum_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)

where TYPE is one of the integer, real, or complex types supported for the SUM operation as specified by Table Types, Names, and Operations for Team-Based Reductions.

PROD

Performs a product reduction across a set of PEs.

int nvshmem_TYPENAME_prod_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
int nvshmemx_TYPENAME_prod_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
__device__ int nvshmem_TYPENAME_prod_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_prod_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
__device__ int nvshmemx_TYPENAME_prod_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)

where TYPE is one of the integer, real, or complex types supported for the PROD operation as specified by Table Types, Names, and Operations for Team-Based Reductions.

team [IN]
The team over which to perform the operation.
dest [OUT]
Symmetric address of an array, of length nreduce elements, to receive the result of the reduction routines. The type of dest should match that implied in the SYNOPSIS section.
source [IN]
Symmetric address of an array, of length nreduce elements, that contains one element for each separate reduction routine. The type of source should match that implied in the SYNOPSIS section.
nreduce [IN]
The number of elements in the dest and source arrays.

Description

NVSHMEM reduction routines are collective routines over an active set that compute one or more reductions across symmetric arrays on multiple PEs. A reduction performs an associative binary routine across a set of values.

The nreduce argument determines the number of separate reductions to perform. The source array on all PEs participating in the reduction provides one element for each reduction. The results of the reductions are placed in the dest array on all PEs participating in the reduction.

The source and dest arguments must either be the same symmetric address, or two different symmetric addresses corresponding to buffers that do not overlap in memory. That is, they must be completely overlapping or completely disjoint.

Team-based reduction routines operate over all PEs in the provided team argument. All PEs in the provided team must participate in the reduction. If team compares equal to NVSHMEM_TEAM_INVALID or is otherwise invalid, the behavior is undefined.

The value of argument nreduce must be equal on all PEs in the active set.

Before any PE calls a reduction routine, the following conditions must be ensured:

  • The dest array on all PEs participating in the reduction is ready to accept the results of the reduction.

Otherwise, the behavior is undefined.

Upon return from a reduction routine, the following are true for the local PE:

  • The dest array is updated and the source array may be safely reused.

The complex-typed interfaces are only provided for sum and product reductions. When the C translation environment does not support complex types [1], an NVSHMEM implementation is not required to provide support for these complex-typed interfaces.

Returns

None.

[1]That is, under C language standards prior to C or under C when __STDC_NO_COMPLEX__ is defined to 1