.. _subsec:memory_order: Memory Ordering --------------- The following section discusses NVSHMEMAPIs that provide mechanisms to ensure ordering and/or delivery of completion on memory store, blocking, and nonblocking NVSHMEM routines. Table `[mem-order] <#mem-order>`__ lists the operations affected by NVSHMEM memory ordering routines. ================== ====== ===== Operations Fence Quiet Memory Store X X Blocking *Put* X X Blocking *Get* Blocking *AMO* X X Non-blocking *Put* X X Non-blocking *Get* X Non-blocking *AMO* X [1]_ X ================== ====== ===== .. _subsec:shmem_fence: **NVSHMEM_FENCE** ~~~~~~~~~~~~~~~~~ .. c:function:: void nvshmem\_fence(void) .. c:function:: __device__ void nvshmem\_fence(void) **Description** This routine ensures ordering of delivery of operations on symmetric data objects. Table `[mem-order] <#mem-order>`__ lists the operations that are ordered by the ``nvshmem_fence`` routine. All operations on symmetric data objects issued to a particular PE prior to the call to ``nvshmem_fence`` are guaranteed to be delivered before any subsequent operations on symmetric data objects to the same PE. ``nvshmem_fence`` guarantees order of delivery, not completion. It does not guarantee order of delivery of nonblocking *Get* or values fetched by nonblocking AMO routines. Fence operations issued on the CPU and the GPU only order communication operations that were issued from the CPU and the GPU, respectively. **Returns** None. **Notes** ``nvshmem_fence`` only provides per-PE ordering guarantees and does not guarantee completion of delivery. ``nvshmem_fence`` also does not have an effect on the ordering between memory accesses issued by the target PE. ``nvshmem_wait_until``, ``nvshmem_test``, ``nvshmem_barrier``, ``nvshmem_barrier_all`` routines can be called by the target PE to guarantee ordering of its memory accesses. There is a subtle difference between ``nvshmem_fence`` and ``nvshmem_quiet``, in that, ``nvshmem_quiet`` guarantees completion of all operations on symmetric data objects which makes the updates visible to all other PEs. The ``nvshmem_quiet`` routine should be called if completion of operations on symmetric data objects is desired when multiple PEs are involved. In an NVSHMEM program with multithreaded PEs, it is the user’s responsibility to ensure ordering between operations issued by the threads in a PE that target symmetric memory and calls by threads in that PE to ``nvshmem_fence``. The ``nvshmem_fence`` routine can enforce memory store ordering only for the calling thread. Thus, to ensure ordering for memory stores performed by a thread that is not the thread calling ``nvshmem_fence``, the update must be made visible to the calling thread according to the rules of the memory model associated with the threading environment. .. _subsec:shmem_quiet: **NVSHMEM_QUIET** ~~~~~~~~~~~~~~~~~ .. c:function:: void nvshmem\_quiet(void) .. c:function:: void nvshmemx\_quiet_on_stream(void, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_quiet(void) **Description** The ``nvshmem_quiet`` routine ensures completion of all operations on symmetric data objects issued by the calling PE. Table `[mem-order] <#mem-order>`__ lists the operations for which the ``nvshmem_quiet`` routine ensures completion. On systems with only NVLink, all operations on symmetric data objects are guaranteed to be complete and visible to all PEs when ``nvshmem_quiet`` returns. On systems with both NVLink and InfiniBand, visibility is only guaranteed at the destination PE. Quiet operations issued on the CPU and the GPU only complete communication operations that were issued from the CPU and the GPU, respectively. To ensure completion of GPU-side operations from the CPU, the developer must perform a GPU-side quiet operation and ensure completion of the CUDA kernel from which the GPU-side operations were issued, using operations like ``cudaStreamSynchronize`` or ``cudaDeviceSynchronize``. Alternatively, a stream-based quiet operation can be used. Stream-based quiet operations have the effect of a quiet being executed on the GPU in stream order, ensuring completion and ordering of only GPU-side operations. **Returns** None. **Notes** ``nvshmem_quiet`` is most useful as a way of ensuring completion of several operations on symmetric data objects initiated by the calling PE. For example, one might use ``nvshmem_quiet`` to await delivery of a block of data before issuing another *Put* or nonblocking *Put* routine, which sets a completion flag on another PE. ``nvshmem_quiet`` is not usually needed if ``nvshmem_barrier_all`` or ``nvshmem_barrier`` are called. The barrier routines wait for the completion of outstanding operations to symmetric data objects on all PEs. In an NVSHMEM program with multithreaded PEs, it is the user’s responsibility to ensure ordering between operations issued by the threads in a PE that target symmetric memory and calls by threads in that PE to ``nvshmem_quiet``. The ``nvshmem_quiet`` routine can enforce memory store ordering only for the calling thread. Thus, to ensure ordering for memory stores performed by a thread that is not the thread calling ``nvshmem_quiet``, the update must be made visible to the calling thread according to the rules of the memory model associated with the threading environment. A call to ``nvshmem_quiet`` by a thread completes the operations posted prior to calling ``nvshmem_quiet``. If the user intends to also complete operations issued by a thread that is not the thread calling ``nvshmem_quiet``, the user must ensure that the operations are performed prior to the call to ``nvshmem_quiet``. This may require the use of a synchronization operation provided by the threading package. For example, when using POSIX Threads, the user may call the ``pthread_barrier_wait`` routine to ensure that all threads have issued operations before a thread calls ``nvshmem_quiet``. ``nvshmem_quiet`` does not have an effect on the ordering between memory accesses issued by the target PE. ``nvshmem_wait_until``, ``nvshmem_test``, ``nvshmem_barrier``, ``nvshmem_barrier_all`` routines can be called by the target PE to guarantee ordering of its memory accesses. .. [1] NVSHMEM fence routines does not guarantee order of delivery of values fetched by nonblocking AMO routines.