Collective Communication
Collective routines are defined as coordinated communication or synchronization operations performed by a group of PEs.
NVSHMEM provides the following types of collective routines:
- Collective routines that operate on teams use a team handle parameter to determine which PEs will participate in the routine, and use resources encapsulated by the team object to perform operations. See Section Team Management for details on team management.
- Collective routines that accept no team parameters implicitly operate on all PEs.
Concurrent accesses to symmetric memory by an NVSHMEM collective routine and by any other means of access—where at least one access updates the symmetric memory—result in undefined behavior. Since PEs can enter and exit collectives at different times, accessing such memory remotely may require additional synchronization.
Team-based collectives
The team-based collective routines are performed with respect to a valid
NVSHMEM team, which is specified by a team handle argument. Team-based
collective operations require all PEs in the team to call the routine in
order for the operation to complete. If an invalid team handle or
NVSHMEM_TEAM_INVALID
is passed to a team-based collective routine,
the behavior is undefined.
All NVSHMEM team-based collective operations are blocking routines. On return from a team-based collective call, the PE may immediately call another collective operation on that same team. Team-based collectives must occur in the same program order across all PEs in a team.
While NVSHMEM routines provide thread support according to the thread-support level provided at initialization (see Section Thread Support), team-based collective routines may not be called simultaneously by multiple threads on a given team.
The team-based collective routines defined in the NVSHMEM Specification are:
nvshmem_team_sync
nvshmem_[TYPENAME_]alltoall
nvshmem_[TYPENAME_]broadcast
nvshmem_[TYPENAME_]collect
nvshmem_[TYPENAME_]{and, or, xor, max, min, sum, prod}_reduce
All team creation functions are also collective operations. Beyond the ordering and thread safety requirements described here, team creation operations carry additional synchronization requirements. See Section Team Management for more details.
Implicit team collectives
Some NVSHMEM collective routines implicitly operate on all PEs. These routines include:
- nvshmem_sync_all, which synchronizes all PEs in the computation.
- nvshmem_barrier_all, which synchronizes all PEs in the computation and ensures completion of all local and remote memory updates.
- NVSHMEM memory-management routines, which imply one or more calls to a routine equivalent to nvshmem_barrier_all.
Error codes returned from team-based collectives
Collective operations involving multiple PEs may return values indicating success while other PEs are still executing the collective operation. A return value indicating success or failure on one PE does not guarantee that all PEs involved in the collective operation return the same value. Some operations, such as team creation, must return identical return codes across all participating PEs.
NVSHMEM_BARRIER_ALL

- void nvshmem_barrier_all(void)
- void nvshmemx_barrier_all_on_stream(cudaStream_t stream)
- __device__ void nvshmem_barrier_all(void)
- __device__ void nvshmemx_barrier_all_block(void)
- __device__ void nvshmemx_barrier_all_warp(void)
Description
The nvshmem_barrier_all routine is a mechanism for synchronizing all PEs at once. This routine blocks the calling PE until all PEs have called nvshmem_barrier_all. In a multithreaded NVSHMEM program, only the calling thread is blocked; however, the routine may not be called concurrently by multiple threads in the same PE.

Prior to synchronizing with other PEs, nvshmem_barrier_all ensures completion of all previously issued memory stores and remote memory updates issued via NVSHMEM AMOs and RMA routine calls such as nvshmem_int_add, nvshmem_put32, nvshmem_put_nbi, and nvshmem_get_nbi.
Returns
None.
Notes
The nvshmem_barrier_all
routine can be used to portably ensure that
memory access operations observe remote updates in the order enforced by
initiator PEs.
Ordering APIs (nvshmem_fence
, nvshmem_quiet
,
nvshmem_barrier
, and nvshmem_barrier_all
) issued on the CPU and
the GPU only order communication operations that were issued from the
CPU and the GPU, respectively. To ensure completion of GPU-side
operations from the CPU, the developer must perform a GPU-side quiet
operation and ensure completion of the CUDA kernel from which the
GPU-side operations were issued, using operations like
cudaStreamSynchronize
or cudaDeviceSynchronize
. Alternatively, a
stream-based quiet operation can be used. Stream-based quiet operations
have the effect of a quiet being executed on the GPU in stream order,
ensuring completion and ordering of only GPU-side operations.
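As a hedged illustration of this note (buffer layout, kernel name, and launch configuration are hypothetical), the following sketch completes GPU-issued puts with a device-side quiet, synchronizes the kernel from the host, and only then performs the host-side barrier:

```c
#include <cuda_runtime.h>
#include <nvshmem.h>

/* One thread issues a non-blocking put and completes it with a GPU-side
 * quiet before the kernel exits. */
__global__ void put_kernel(int *dst, const int *src, int peer) {
    if (threadIdx.x == 0) {
        nvshmem_int_put_nbi(dst, src, 1, peer); /* GPU-issued RMA */
        nvshmem_quiet();                        /* complete GPU-issued operations */
    }
}

int main(void) {
    nvshmem_init();
    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();
    int *buf = (int *) nvshmem_malloc(2 * sizeof(int)); /* symmetric: buf[0]=src, buf[1]=dst */

    put_kernel<<<1, 1>>>(buf + 1, buf, peer);
    cudaDeviceSynchronize();  /* ensure the kernel, and its quiet, have finished */
    nvshmem_barrier_all();    /* all PEs now observe the completed updates */

    nvshmem_free(buf);
    nvshmem_finalize();
    return 0;
}
```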
NVSHMEM_BARRIER

- void nvshmem_barrier(nvshmem_team_t team)
- void nvshmemx_barrier_on_stream(nvshmem_team_t team, cudaStream_t stream)
- __device__ void nvshmem_barrier(nvshmem_team_t team)
- __device__ void nvshmemx_barrier_block(nvshmem_team_t team)
- __device__ void nvshmemx_barrier_warp(nvshmem_team_t team)
- team [IN]
- The team over which to perform the operation.
Description
nvshmem_barrier
is a collective synchronization routine over a team.
Control returns from nvshmem_barrier
after all PEs in the team have
called nvshmem_barrier
.
As with all NVSHMEM collective routines, each of these routines assumes that only PEs in the given team call the routine. If a PE not in the team calls an NVSHMEM team-based collective routine, the behavior is undefined.
nvshmem_barrier ensures that all previously issued stores and remote memory updates, including AMOs and RMA operations, done by any of the PEs in the team are complete before returning. On systems with only NVLink, updates are globally visible, whereas on systems with both NVLink and InfiniBand, NVSHMEM only guarantees that updates to the memory of a given PE are visible to that PE.
Returns
None.
Notes
The nvshmem_barrier
routine can be used to portably ensure that
memory access operations observe remote updates in the order enforced by
initiator PEs.
Ordering APIs (nvshmem_fence
, nvshmem_quiet
,
nvshmem_barrier
, and nvshmem_barrier_all
) issued on the CPU and
the GPU only order communication operations that were issued from the
CPU and the GPU, respectively. To ensure completion of GPU-side
operations from the CPU, the developer must perform a GPU-side quiet
operation and ensure completion of the CUDA kernel from which the
GPU-side operations were issued, using operations like
cudaStreamSynchronize
or cudaDeviceSynchronize
. Alternatively, a
stream-based quiet operation can be used. Stream-based quiet operations
have the effect of a quiet being executed on the GPU in stream order,
ensuring completion and ordering of only GPU-side operations.
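A minimal sketch of the stream-ordered variant (function name and stream setup are assumed): the barrier is enqueued on a CUDA stream so that previously enqueued NVSHMEM work completes, in stream order, before any PE in the team exits the barrier:

```c
#include <nvshmem.h>
#include <nvshmemx.h>

/* Enqueue a team barrier in stream order: NVSHMEM operations enqueued on
 * `s` before this call complete before any PE in `team` exits the barrier,
 * without blocking the host thread at enqueue time. */
void barrier_phase(nvshmem_team_t team, cudaStream_t s) {
    /* ... kernels issuing NVSHMEM operations enqueued on s ... */
    nvshmemx_barrier_on_stream(team, s);
    cudaStreamSynchronize(s); /* optional: make completion visible to the host */
}
```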
NVSHMEM_SYNC

- int nvshmem_sync(nvshmem_team_t team)
- int nvshmemx_sync_on_stream(nvshmem_team_t team, cudaStream_t stream)
- __device__ int nvshmem_sync(nvshmem_team_t team)
- __device__ int nvshmemx_sync_block(nvshmem_team_t team)
- __device__ int nvshmemx_sync_warp(nvshmem_team_t team)
- int nvshmem_team_sync(nvshmem_team_t team)
- int nvshmemx_team_sync_on_stream(nvshmem_team_t team, cudaStream_t stream)
- __device__ int nvshmem_team_sync(nvshmem_team_t team)
- __device__ int nvshmemx_team_sync_block(nvshmem_team_t team)
- __device__ int nvshmemx_team_sync_warp(nvshmem_team_t team)
- team [IN]
- The team over which to perform the operation.
Description
nvshmem_sync
is a collective synchronization routine over a team.
The routine registers the arrival of a PE at a synchronization point in
the program. This is a fast mechanism for synchronizing all PEs that
participate in this collective call. The routine blocks the calling PE
until all PEs in the specified team have called nvshmem_sync
. In a
multithreaded NVSHMEM program, only the calling thread is blocked.
Team-based sync routines operate over all PEs in the provided team
argument. All PEs in the provided team must participate in the sync
operation. If team
compares equal to NVSHMEM_TEAM_INVALID
or is
otherwise invalid, the behavior is undefined.
In contrast with the nvshmem_barrier
routine, nvshmem_sync
only
ensures completion and visibility of previously issued memory stores and
does not ensure completion of remote memory updates issued via NVSHMEM
routines.
Returns
Zero on successful local completion. Nonzero otherwise.
Notes
The nvshmem_sync
routine can be used to portably ensure that memory
access operations observe remote updates in the order enforced by the
initiator PEs, provided that the initiator PE ensures completion of
remote updates with a call to nvshmem_quiet
prior to the call to the
nvshmem_sync
routine.
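The following device-side sketch (hypothetical kernel; note that kernels calling device-side collectives are typically launched with nvshmemx_collective_launch) shows this pattern, completing a put with nvshmem_quiet before the team enters nvshmem_sync:

```c
#include <nvshmem.h>

/* Each PE deposits one value in its right neighbor's dst slot, completes
 * the put with a quiet, then synchronizes; after the sync, dst is safe to
 * read on every PE. */
__global__ void exchange(int *dst, const int *src) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();
        nvshmem_int_put_nbi(dst, src, 1, peer);
        nvshmem_quiet();                  /* complete the remote update first */
        nvshmem_sync(NVSHMEM_TEAM_WORLD); /* then register arrival at the sync point */
    }
}
```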
NVSHMEM_SYNC_ALL

- void nvshmem_sync_all(void)
- void nvshmemx_sync_all_on_stream(cudaStream_t stream)
- __device__ void nvshmem_sync_all(void)
- __device__ void nvshmemx_sync_all_block(void)
- __device__ void nvshmemx_sync_all_warp(void)
Description
This routine blocks the calling PE until all PEs in the world team have
called nvshmem_sync_all
.
In a multithreaded NVSHMEM program, only the calling thread is blocked.
In contrast with the nvshmem_barrier_all
routine,
nvshmem_sync_all
only ensures completion and visibility of
previously issued memory stores and does not ensure completion of remote
memory updates issued via NVSHMEM routines.
Returns
None.
Notes
The nvshmem_sync_all
routine is equivalent to calling
nvshmem_team_sync
on the world team.
NVSHMEM_ALLTOALL

- int nvshmem_TYPENAME_alltoall(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
- int nvshmemx_TYPENAME_alltoall_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, cudaStream_t stream)
- __device__ int nvshmem_TYPENAME_alltoall(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
- __device__ int nvshmemx_TYPENAME_alltoall_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
- __device__ int nvshmemx_TYPENAME_alltoall_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types and Names.
- team [IN]
- A valid NVSHMEM team handle to a team.
- dest [OUT]
- Symmetric address of a data object large enough to receive the combined total of nelems elements from each PE in the team. The type of dest should match that implied in the SYNOPSIS section.
- source [IN]
- Symmetric address of a data object that contains nelems elements of data for each PE in the team, ordered according to destination PE. The type of source should match that implied in the SYNOPSIS section.
- nelems [IN]
- The number of elements to exchange for each PE. For nvshmem_alltoallmem, elements are bytes; for nvshmem_alltoall{32,64}, elements are 4 or 8 bytes, respectively.
Description
The nvshmem_alltoall
routines are collective routines. Each PE
participating in the operation exchanges nelems
data elements with
all other PEs participating in the operation.
The data being sent and received are stored in a contiguous symmetric
data object. The total size of each PE’s source
object and dest
object is nelems
times the size of an element times N
, where
N
equals the number of PEs participating in the operation. The
source
object contains N
blocks of data (where the size of each
block is defined by nelems
) and each block of data is sent to a
different PE.
The same dest and source arrays, and the same value for nelems, must be passed by all PEs that participate in the collective.
Given a PE i that is the k-th PE participating in the operation and a PE j that is the l-th PE participating in the operation, PE i sends the l-th block of its source object to the k-th block of the dest object of PE j.
Team-based alltoall routines operate over all PEs in the provided team
argument. All PEs in the provided team must participate in the
collective. If team
compares equal to NVSHMEM_TEAM_INVALID
or is
otherwise invalid, the behavior is undefined.
Before any PE calls an nvshmem_alltoall routine, the following conditions must be ensured:

- The dest data object on all PEs in the team is ready to accept the nvshmem_alltoall data.

Otherwise, the behavior is undefined.

Upon return from an nvshmem_alltoall routine, the following is true for the local PE:

- Its dest symmetric data object is completely updated and the data has been copied out of the source data object.
Returns
Zero on successful local completion. Nonzero otherwise.
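A minimal host-side sketch of an alltoall over the world team (buffer names and values are illustrative): each PE contributes one element per peer and receives one element from every peer:

```c
#include <cuda_runtime.h>
#include <stdlib.h>
#include <nvshmem.h>

int main(void) {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    int *src = (int *) nvshmem_malloc(npes * sizeof(int));
    int *dst = (int *) nvshmem_malloc(npes * sizeof(int));

    /* Stage source blocks through host memory: symmetric memory returned by
     * nvshmem_malloc is GPU memory, so it is filled with a copy. Block k of
     * src is destined for the k-th PE in the team. */
    int *h = (int *) malloc(npes * sizeof(int));
    for (int k = 0; k < npes; ++k) h[k] = mype * npes + k;
    cudaMemcpy(src, h, npes * sizeof(int), cudaMemcpyHostToDevice);

    nvshmem_int_alltoall(NVSHMEM_TEAM_WORLD, dst, src, 1); /* nelems per PE */
    /* dst[j] now holds the block that PE j addressed to this PE. */

    free(h);
    nvshmem_free(src);
    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```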
NVSHMEM_BROADCAST

- int nvshmem_TYPENAME_broadcast(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, int PE_root)
- int nvshmemx_TYPENAME_broadcast_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, int PE_root, cudaStream_t stream)
- __device__ int nvshmem_TYPENAME_broadcast(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, int PE_root)
- __device__ int nvshmemx_TYPENAME_broadcast_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, int PE_root)
- __device__ int nvshmemx_TYPENAME_broadcast_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, int PE_root)
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types and Names.
- team [IN]
- The team over which to perform the operation.
- dest [OUT]
- Symmetric address of the destination data object. The type of dest should match that implied in the SYNOPSIS section.
- source [IN]
- Symmetric address of the source data object. The type of source should match that implied in the SYNOPSIS section.
- nelems [IN]
- The number of elements in the source and dest arrays. For nvshmem_broadcastmem, elements are bytes; for nvshmem_broadcast{32,64}, elements are 4 or 8 bytes, respectively.
- PE_root [IN]
- Zero-based ordinal of the PE, with respect to the team, from which the data is copied.
Description
NVSHMEM broadcast routines are collective routines over a team. They
copy the source
data object on the PE specified by PE_root
to
the dest
data object on the PEs participating in the collective
operation. The same dest
and source
data objects and the same
value of PE_root
must be passed by all PEs participating in the
collective operation.
For team-based broadcasts:
- The dest object is updated on all PEs.
- All PEs in the team argument must participate in the operation.
- If team compares equal to NVSHMEM_TEAM_INVALID or is otherwise invalid, the behavior is undefined.
- PE numbering is relative to the team. The specified root PE must be a valid PE number for the team, between 0 and N-1, where N is the size of the team.
Before any PE calls a broadcast routine, the following conditions must be ensured:

- The dest array on all PEs participating in the broadcast is ready to accept the broadcast data.

Otherwise, the behavior is undefined.

Upon return from a broadcast routine, the following are true for the local PE:

- For team-based broadcasts, the dest data object is updated.
- The source data object may be safely reused.
Returns
For team-based broadcasts, zero on successful local completion; otherwise, nonzero.
Notes
Team handle error checking and integer return codes are currently
undefined. Implementations may define these behaviors as needed, but
programs should ensure portability by doing their own checks for invalid
team handles and for NVSHMEM_TEAM_INVALID
.
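A minimal sketch of a team broadcast (wrapper name and buffer sizes are assumed): team-relative PE 0 acts as the root, and all PEs pass identical arguments:

```c
#include <nvshmem.h>

/* Broadcast N longs from team-relative rank 0 to all PEs in the world
 * team. Every PE passes the same dest, source, nelems, and PE_root. */
int bcast_params(long *dest, const long *source, size_t N) {
    /* On return, dest on each PE mirrors source on the root, and the
     * root's source buffer may be safely reused. */
    return nvshmem_long_broadcast(NVSHMEM_TEAM_WORLD, dest, source, N, 0 /* PE_root */);
}
```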
NVSHMEM_COLLECT

- int nvshmem_TYPENAME_fcollect(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
- int nvshmemx_TYPENAME_fcollect_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems, cudaStream_t stream)
- __device__ int nvshmem_TYPENAME_fcollect(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
- __device__ int nvshmemx_TYPENAME_fcollect_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
- __device__ int nvshmemx_TYPENAME_fcollect_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nelems)
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table Standard RMA Types and Names.
- team [IN]
- A valid NVSHMEM team handle.
- dest [OUT]
- Symmetric address of an array large enough to accept the concatenation of the source arrays on all participating PEs. The type of dest should match that implied in the SYNOPSIS section.
- source [IN]
- Symmetric address of the source data object. The type of source should match that implied in the SYNOPSIS section.
- nelems [IN]
- The number of elements in the source array. For nvshmem_[f]collectmem, elements are bytes; for nvshmem_[f]collect{32,64}, elements are 4 or 8 bytes, respectively.
Description
NVSHMEM fcollect
routines perform a collective operation to
concatenate nelems
data items from the source
array into the
dest
array, over a team in processor number order. The resultant
dest
array contains the contribution from PEs as follows:
- For a team, the data from PE number 0 in the team is first, then the contribution from PE 1 in the team, and so on.
The collected result is written to the dest
array for all PEs that
participate in the operation. The same dest
and source
arrays
must be passed by all PEs that participate in the operation.
Team-based collect routines operate over all PEs in the provided team
argument. All PEs in the provided team must participate in the
operation. If team
compares equal to NVSHMEM_TEAM_INVALID
or is
otherwise invalid, the behavior is undefined.
Upon return from a collective routine, the following are true for the local PE:
- The dest array is updated and the source array may be safely reused.
Returns
Zero on successful local completion. Nonzero otherwise.
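A minimal sketch of fcollect (wrapper and buffer names are assumed): each PE contributes one element, and every PE receives the contributions concatenated in team PE order:

```c
#include <nvshmem.h>

/* Gather one double from every PE in team PE order: dst must have room for
 * nvshmem_team_n_pes(team) elements; dst[i] receives team PE i's value. */
int gather_all(nvshmem_team_t team, double *dst, const double *my_val) {
    return nvshmem_double_fcollect(team, dst, my_val, 1); /* nelems = 1 per PE */
}
```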
NVSHMEM_REDUCTIONS

Table: Types, Names, and Operations for Team-Based Reductions

TYPE | TYPENAME | Supported operations
---|---|---
char | char | MAX, MIN, SUM, PROD
signed char | schar | MAX, MIN, SUM, PROD
short | short | MAX, MIN, SUM, PROD
int | int | MAX, MIN, SUM, PROD
long | long | MAX, MIN, SUM, PROD
long long | longlong | MAX, MIN, SUM, PROD
ptrdiff_t | ptrdiff | MAX, MIN, SUM, PROD
unsigned char | uchar | AND, OR, XOR, MAX, MIN, SUM, PROD
unsigned short | ushort | AND, OR, XOR, MAX, MIN, SUM, PROD
unsigned int | uint | AND, OR, XOR, MAX, MIN, SUM, PROD
unsigned long | ulong | AND, OR, XOR, MAX, MIN, SUM, PROD
unsigned long long | ulonglong | AND, OR, XOR, MAX, MIN, SUM, PROD
int8_t | int8 | AND, OR, XOR, MAX, MIN, SUM, PROD
int16_t | int16 | AND, OR, XOR, MAX, MIN, SUM, PROD
int32_t | int32 | AND, OR, XOR, MAX, MIN, SUM, PROD
int64_t | int64 | AND, OR, XOR, MAX, MIN, SUM, PROD
uint8_t | uint8 | AND, OR, XOR, MAX, MIN, SUM, PROD
uint16_t | uint16 | AND, OR, XOR, MAX, MIN, SUM, PROD
uint32_t | uint32 | AND, OR, XOR, MAX, MIN, SUM, PROD
uint64_t | uint64 | AND, OR, XOR, MAX, MIN, SUM, PROD
size_t | size | AND, OR, XOR, MAX, MIN, SUM, PROD
float | float | MAX, MIN, SUM, PROD
double | double | MAX, MIN, SUM, PROD
AND

Performs a bitwise AND reduction across a set of PEs.

- int nvshmem_TYPENAME_and_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- int nvshmemx_TYPENAME_and_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
- __device__ int nvshmem_TYPENAME_and_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_and_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_and_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
where TYPE is one of the integer types supported for the AND operation as specified by Table Types, Names, and Operations for Team-Based Reductions.
OR

Performs a bitwise OR reduction across a set of PEs.

- int nvshmem_TYPENAME_or_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- int nvshmemx_TYPENAME_or_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
- __device__ int nvshmem_TYPENAME_or_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_or_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_or_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
where TYPE is one of the integer types supported for the OR operation as specified by Table Types, Names, and Operations for Team-Based Reductions.
XOR

Performs a bitwise exclusive OR (XOR) reduction across a set of PEs.

- int nvshmem_TYPENAME_xor_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- int nvshmemx_TYPENAME_xor_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
- __device__ int nvshmem_TYPENAME_xor_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_xor_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_xor_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
where TYPE is one of the integer types supported for the XOR operation as specified by Table Types, Names, and Operations for Team-Based Reductions.
MAX

Performs a maximum-value reduction across a set of PEs.

- int nvshmem_TYPENAME_max_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- int nvshmemx_TYPENAME_max_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
- __device__ int nvshmem_TYPENAME_max_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_max_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_max_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
where TYPE is one of the integer or real types supported for the MAX operation as specified by Table Types, Names, and Operations for Team-Based Reductions.
MIN

Performs a minimum-value reduction across a set of PEs.

- int nvshmem_TYPENAME_min_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- int nvshmemx_TYPENAME_min_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
- __device__ int nvshmem_TYPENAME_min_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_min_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_min_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
where TYPE is one of the integer or real types supported for the MIN operation as specified by Table Types, Names, and Operations for Team-Based Reductions.
SUM

Performs a sum reduction across a set of PEs.

- int nvshmem_TYPENAME_sum_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- int nvshmemx_TYPENAME_sum_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
- __device__ int nvshmem_TYPENAME_sum_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_sum_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_sum_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
where TYPE is one of the integer, real, or complex types supported for the SUM operation as specified by Table Types, Names, and Operations for Team-Based Reductions.
PROD

Performs a product reduction across a set of PEs.

- int nvshmem_TYPENAME_prod_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- int nvshmemx_TYPENAME_prod_reduce_on_stream(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce, cudaStream_t stream)
- __device__ int nvshmem_TYPENAME_prod_reduce(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_prod_reduce_block(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
- __device__ int nvshmemx_TYPENAME_prod_reduce_warp(nvshmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce)
where TYPE is one of the integer, real, or complex types supported for the PROD operation as specified by Table Types, Names, and Operations for Team-Based Reductions.
- team [IN]
- The team over which to perform the operation.
- dest [OUT]
- Symmetric address of an array, of length nreduce elements, to receive the result of the reduction routines. The type of dest should match that implied in the SYNOPSIS section.
- source [IN]
- Symmetric address of an array, of length nreduce elements, that contains one element for each separate reduction routine. The type of source should match that implied in the SYNOPSIS section.
- nreduce [IN]
- The number of elements in the dest and source arrays.
Description
NVSHMEM reduction routines are collective routines over a team that compute one or more reductions across symmetric arrays on multiple PEs. A reduction performs an associative binary routine across a set of values.
The nreduce
argument determines the number of separate reductions to
perform. The source
array on all PEs participating in the reduction
provides one element for each reduction. The results of the reductions
are placed in the dest
array on all PEs participating in the
reduction.
The source
and dest
arguments must either be the same symmetric
address, or two different symmetric addresses corresponding to buffers
that do not overlap in memory. That is, they must be completely
overlapping or completely disjoint.
Team-based reduction routines operate over all PEs in the provided team
argument. All PEs in the provided team must participate in the
reduction. If team
compares equal to NVSHMEM_TEAM_INVALID
or is
otherwise invalid, the behavior is undefined.
The value of argument nreduce must be equal on all PEs in the team.
Before any PE calls a reduction routine, the following conditions must be ensured:

- The dest array on all PEs participating in the reduction is ready to accept the results of the reduction.

Otherwise, the behavior is undefined.
Upon return from a reduction routine, the following are true for the local PE:

- The dest array is updated and the source array may be safely reused.
The complex-typed interfaces are only provided for sum and product reductions. When the C translation environment does not support complex types [1], an NVSHMEM implementation is not required to provide support for these complex-typed interfaces.
Returns
Zero on successful local completion. Nonzero otherwise.
[1] That is, under C language standards prior to C99 or under C11 when __STDC_NO_COMPLEX__ is defined to 1.
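A minimal sketch of a team-based sum reduction (wrapper name is hypothetical): every PE contributes nreduce elements and receives the element-wise sums:

```c
#include <nvshmem.h>

/* Element-wise sum across all PEs: each PE contributes n ints in src and
 * receives the per-element sums in dst. src and dst must be symmetric and
 * either identical or completely non-overlapping. */
int sum_all(int *dst, const int *src, size_t n) {
    return nvshmem_int_sum_reduce(NVSHMEM_TEAM_WORLD, dst, src, n);
}
```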