Memory Ordering

The following section discusses NVSHMEMAPIs that provide mechanisms to ensure ordering and/or delivery of completion on memory store, blocking, and nonblocking NVSHMEM routines. Table [mem-order] lists the operations affected by NVSHMEM memory ordering routines.

Operations affected by NVSHMEM Memory Ordering routines
Operations Fence Quiet
Memory Store X X
Blocking Put X X
Blocking Get    
Blocking AMO X X
Non-blocking Put X X
Non-blocking Get   X
Non-blocking AMO X [1] X

NVSHMEM_FENCE

void nvshmem_fence(void)
void nvshmemx_fence_on_stream(void, cudaStream_t stream)
__device__ void nvshmem_fence(void)

Description

This routine ensures ordering of delivery of operations on symmetric data objects. Table [mem-order] lists the operations that are ordered by the nvshmem_fence routine. All operations on symmetric data objects issued to a particular PE prior to the call to nvshmem_fence are guaranteed to be delivered before any subsequent operations on symmetric data objects to the same PE. nvshmem_fence guarantees order of delivery, not completion. It does not guarantee order of delivery of nonblocking Get or values fetched by nonblocking AMO routines.

Fence operations issued on the CPU and the GPU only order communication operations that were issued from the CPU and the GPU, respectively.

Returns

None.

Notes

nvshmem_fence only provides per-PE ordering guarantees and does not guarantee completion of delivery. nvshmem_fence also does not have an effect on the ordering between memory accesses issued by the target PE. nvshmem_wait_until, nvshmem_test, nvshmem_barrier, nvshmem_barrier_all routines can be called by the target PE to guarantee ordering of its memory accesses. There is a subtle difference between nvshmem_fence and nvshmem_quiet, in that, nvshmem_quiet guarantees completion of all operations on symmetric data objects which makes the updates visible to all other PEs.

The nvshmem_quiet routine should be called if completion of operations on symmetric data objects is desired when multiple PEs are involved.

In an NVSHMEM program with multithreaded PEs, it is the user’s responsibility to ensure ordering between operations issued by the threads in a PE that target symmetric memory and calls by threads in that PE to nvshmem_fence. The nvshmem_fence routine can enforce memory store ordering only for the calling thread. Thus, to ensure ordering for memory stores performed by a thread that is not the thread calling nvshmem_fence, the update must be made visible to the calling thread according to the rules of the memory model associated with the threading environment.

See Ring Broadcast Example for example usage of nvshmem_fence.

NVSHMEM_QUIET

void nvshmem_quiet(void)
void nvshmemx_quiet_on_stream(void, cudaStream_t stream)
__device__ void nvshmem_quiet(void)

Description

The nvshmem_quiet routine ensures completion of all operations on symmetric data objects issued by the calling PE. Table [mem-order] lists the operations for which the nvshmem_quiet routine ensures completion. On systems with only NVLink, all operations on symmetric data objects are guaranteed to be complete and visible to all PEs when nvshmem_quiet returns. On systems with both NVLink and InfiniBand, visibility is only guaranteed at the destination PE.

Quiet operations issued on the CPU and the GPU only complete communication operations that were issued from the CPU and the GPU, respectively. To ensure completion of GPU-side operations from the CPU, the developer must perform a GPU-side quiet operation and ensure completion of the CUDA kernel from which the GPU-side operations were issued, using operations like cudaStreamSynchronize or cudaDeviceSynchronize. Alternatively, a stream-based quiet operation can be used. Stream-based quiet operations have the effect of a quiet being executed on the GPU in stream order, ensuring completion and ordering of only GPU-side operations.

Returns

None.

Notes

nvshmem_quiet is most useful as a way of ensuring completion of several operations on symmetric data objects initiated by the calling PE. For example, one might use nvshmem_quiet to await delivery of a block of data before issuing another Put or nonblocking Put routine, which sets a completion flag on another PE. nvshmem_quiet is not usually needed if nvshmem_barrier_all or nvshmem_barrier are called. The barrier routines wait for the completion of outstanding operations to symmetric data objects on all PEs.

In an NVSHMEM program with multithreaded PEs, it is the user’s responsibility to ensure ordering between operations issued by the threads in a PE that target symmetric memory and calls by threads in that PE to nvshmem_quiet. The nvshmem_quiet routine can enforce memory store ordering only for the calling thread. Thus, to ensure ordering for memory stores performed by a thread that is not the thread calling nvshmem_quiet, the update must be made visible to the calling thread according to the rules of the memory model associated with the threading environment.

A call to nvshmem_quiet by a thread completes the operations posted prior to calling nvshmem_quiet. If the user intends to also complete operations issued by a thread that is not the thread calling nvshmem_quiet, the user must ensure that the operations are performed prior to the call to nvshmem_quiet. This may require the use of a synchronization operation provided by the threading package. For example, when using POSIX Threads, the user may call the pthread_barrier_wait routine to ensure that all threads have issued operations before a thread calls nvshmem_quiet.

nvshmem_quiet does not have an effect on the ordering between memory accesses issued by the target PE. nvshmem_wait_until, nvshmem_test, nvshmem_barrier, nvshmem_barrier_all routines can be called by the target PE to guarantee ordering of its memory accesses.

[1]NVSHMEM fence routines does not guarantee order of delivery of values fetched by nonblocking AMO routines.