Memory Ordering¶
The following section discusses NVSHMEMAPIs that provide mechanisms to ensure ordering and/or delivery of completion on memory store, blocking, and nonblocking NVSHMEM routines. Table [mem-order] lists the operations affected by NVSHMEM memory ordering routines.
Operations | Fence | Quiet |
Memory Store | X | X |
Blocking Put | X | X |
Blocking Get | ||
Blocking AMO | X | X |
Non-blocking Put | X | X |
Non-blocking Get | X | |
Non-blocking AMO | X [1] | X |
NVSHMEM_FENCE¶
-
void
nvshmem_fence
(void)¶
-
__device__ void
nvshmem_fence
(void)
Description
This routine ensures ordering of delivery of operations on symmetric
data objects. Table [mem-order] lists the operations
that are ordered by the nvshmem_fence
routine. All operations on
symmetric data objects issued to a particular PE prior to the call to
nvshmem_fence
are guaranteed to be delivered before any subsequent
operations on symmetric data objects to the same PE. nvshmem_fence
guarantees order of delivery, not completion. It does not guarantee
order of delivery of nonblocking Get or values fetched by nonblocking
AMO routines.
Fence operations issued on the CPU and the GPU only order communication operations that were issued from the CPU and the GPU, respectively.
Returns
None.
Notes
nvshmem_fence
only provides per-PE ordering guarantees and does not
guarantee completion of delivery. nvshmem_fence
also does not have
an effect on the ordering between memory accesses issued by the target
PE. nvshmem_wait_until
, nvshmem_test
, nvshmem_barrier
,
nvshmem_barrier_all
routines can be called by the target PE to
guarantee ordering of its memory accesses. There is a subtle difference
between nvshmem_fence
and nvshmem_quiet
, in that,
nvshmem_quiet
guarantees completion of all operations on symmetric
data objects which makes the updates visible to all other PEs.
The nvshmem_quiet
routine should be called if completion of
operations on symmetric data objects is desired when multiple PEs are
involved.
In an NVSHMEM program with multithreaded PEs, it is the user’s
responsibility to ensure ordering between operations issued by the
threads in a PE that target symmetric memory and calls by threads in
that PE to nvshmem_fence
. The nvshmem_fence
routine can enforce
memory store ordering only for the calling thread. Thus, to ensure
ordering for memory stores performed by a thread that is not the thread
calling nvshmem_fence
, the update must be made visible to the
calling thread according to the rules of the memory model associated
with the threading environment.
See Ring Broadcast Example for example usage of nvshmem_fence
.
NVSHMEM_QUIET¶
-
void
nvshmem_quiet
(void)¶
-
void
nvshmemx_quiet_on_stream
(void, cudaStream_t stream)¶
-
__device__ void
nvshmem_quiet
(void)
Description
The nvshmem_quiet
routine ensures completion of all operations on
symmetric data objects issued by the calling PE.
Table [mem-order] lists the operations for which the
nvshmem_quiet
routine ensures completion. On systems with only
NVLink, all operations on symmetric data objects are guaranteed to be
complete and visible to all PEs when nvshmem_quiet
returns. On
systems with both NVLink and InfiniBand, visibility is only guaranteed
at the destination PE.
Quiet operations issued on the CPU and the GPU only complete
communication operations that were issued from the CPU and the GPU,
respectively. To ensure completion of GPU-side operations from the CPU,
the developer must perform a GPU-side quiet operation and ensure
completion of the CUDA kernel from which the GPU-side operations were
issued, using operations like cudaStreamSynchronize
or
cudaDeviceSynchronize
. Alternatively, a stream-based quiet operation
can be used. Stream-based quiet operations have the effect of a quiet
being executed on the GPU in stream order, ensuring completion and
ordering of only GPU-side operations.
Returns
None.
Notes
nvshmem_quiet
is most useful as a way of ensuring completion of
several operations on symmetric data objects initiated by the calling
PE. For example, one might use nvshmem_quiet
to await delivery of a
block of data before issuing another Put or nonblocking Put routine,
which sets a completion flag on another PE. nvshmem_quiet
is not
usually needed if nvshmem_barrier_all
or nvshmem_barrier
are
called. The barrier routines wait for the completion of outstanding
operations to symmetric data objects on all PEs.
In an NVSHMEM program with multithreaded PEs, it is the user’s
responsibility to ensure ordering between operations issued by the
threads in a PE that target symmetric memory and calls by threads in
that PE to nvshmem_quiet
. The nvshmem_quiet
routine can enforce
memory store ordering only for the calling thread. Thus, to ensure
ordering for memory stores performed by a thread that is not the thread
calling nvshmem_quiet
, the update must be made visible to the
calling thread according to the rules of the memory model associated
with the threading environment.
A call to nvshmem_quiet
by a thread completes the operations posted
prior to calling nvshmem_quiet
. If the user intends to also complete
operations issued by a thread that is not the thread calling
nvshmem_quiet
, the user must ensure that the operations are
performed prior to the call to nvshmem_quiet
. This may require the
use of a synchronization operation provided by the threading package.
For example, when using POSIX Threads, the user may call the
pthread_barrier_wait
routine to ensure that all threads have issued
operations before a thread calls nvshmem_quiet
.
nvshmem_quiet
does not have an effect on the ordering between memory
accesses issued by the target PE. nvshmem_wait_until
,
nvshmem_test
, nvshmem_barrier
, nvshmem_barrier_all
routines
can be called by the target PE to guarantee ordering of its memory
accesses.
[1] | NVSHMEM fence routines does not guarantee order of delivery of values fetched by nonblocking AMO routines. |