.. _subsec:memory_order:

Memory Ordering
---------------

The following section discusses NVSHMEMAPIs that provide mechanisms to
ensure ordering and/or delivery of completion on memory store, blocking,
and nonblocking NVSHMEM routines. Table `[mem-order] <#mem-order>`__
lists the operations affected by NVSHMEM memory ordering routines.

================== ====== =====
Operations         Fence  Quiet
Memory Store       X      X
Blocking *Put*     X      X
Blocking *Get*            
Blocking *AMO*     X      X
Non-blocking *Put* X      X
Non-blocking *Get*        X
Non-blocking *AMO* X [1]_ X
================== ====== =====

.. _subsec:shmem_fence:

**NVSHMEM_FENCE**
~~~~~~~~~~~~~~~~~


.. c:function:: void nvshmem\_fence(void)


.. c:function:: __device__ void nvshmem\_fence(void)

**Description**

This routine ensures ordering of delivery of operations on symmetric
data objects. Table `[mem-order] <#mem-order>`__ lists the operations
that are ordered by the ``nvshmem_fence`` routine. All operations on
symmetric data objects issued to a particular PE prior to the call to
``nvshmem_fence`` are guaranteed to be delivered before any subsequent
operations on symmetric data objects to the same PE. ``nvshmem_fence``
guarantees order of delivery, not completion. It does not guarantee
order of delivery of nonblocking *Get* or values fetched by nonblocking
AMO routines.

Fence operations issued on the CPU and the GPU only order communication
operations that were issued from the CPU and the GPU, respectively.

**Returns**

None.

**Notes**

``nvshmem_fence`` only provides per-PE ordering guarantees and does not
guarantee completion of delivery. ``nvshmem_fence`` also does not have
an effect on the ordering between memory accesses issued by the target
PE. ``nvshmem_wait_until``, ``nvshmem_test``, ``nvshmem_barrier``,
``nvshmem_barrier_all`` routines can be called by the target PE to
guarantee ordering of its memory accesses. There is a subtle difference
between ``nvshmem_fence`` and ``nvshmem_quiet``, in that,
``nvshmem_quiet`` guarantees completion of all operations on symmetric
data objects which makes the updates visible to all other PEs.

The ``nvshmem_quiet`` routine should be called if completion of
operations on symmetric data objects is desired when multiple PEs are
involved.

In an NVSHMEM program with multithreaded PEs, it is the user’s
responsibility to ensure ordering between operations issued by the
threads in a PE that target symmetric memory and calls by threads in
that PE to ``nvshmem_fence``. The ``nvshmem_fence`` routine can enforce
memory store ordering only for the calling thread. Thus, to ensure
ordering for memory stores performed by a thread that is not the thread
calling ``nvshmem_fence``, the update must be made visible to the
calling thread according to the rules of the memory model associated
with the threading environment.

.. _subsec:shmem_quiet:

**NVSHMEM_QUIET**
~~~~~~~~~~~~~~~~~


.. c:function:: void nvshmem\_quiet(void)


.. c:function:: void nvshmemx\_quiet_on_stream(void, cudaStream_t stream)


.. c:function:: __device__ void nvshmem\_quiet(void)

**Description**

The ``nvshmem_quiet`` routine ensures completion of all operations on
symmetric data objects issued by the calling PE.
Table `[mem-order] <#mem-order>`__ lists the operations for which the
``nvshmem_quiet`` routine ensures completion. On systems with only
NVLink, all operations on symmetric data objects are guaranteed to be
complete and visible to all PEs when ``nvshmem_quiet`` returns. On
systems with both NVLink and InfiniBand, visibility is only guaranteed
at the destination PE.

Quiet operations issued on the CPU and the GPU only complete
communication operations that were issued from the CPU and the GPU,
respectively. To ensure completion of GPU-side operations from the CPU,
the developer must perform a GPU-side quiet operation and ensure
completion of the CUDA kernel from which the GPU-side operations were
issued, using operations like ``cudaStreamSynchronize`` or
``cudaDeviceSynchronize``. Alternatively, a stream-based quiet operation
can be used. Stream-based quiet operations have the effect of a quiet
being executed on the GPU in stream order, ensuring completion and
ordering of only GPU-side operations.

**Returns**

None.

**Notes**

``nvshmem_quiet`` is most useful as a way of ensuring completion of
several operations on symmetric data objects initiated by the calling
PE. For example, one might use ``nvshmem_quiet`` to await delivery of a
block of data before issuing another *Put* or nonblocking *Put* routine,
which sets a completion flag on another PE. ``nvshmem_quiet`` is not
usually needed if ``nvshmem_barrier_all`` or ``nvshmem_barrier`` are
called. The barrier routines wait for the completion of outstanding
operations to symmetric data objects on all PEs.

In an NVSHMEM program with multithreaded PEs, it is the user’s
responsibility to ensure ordering between operations issued by the
threads in a PE that target symmetric memory and calls by threads in
that PE to ``nvshmem_quiet``. The ``nvshmem_quiet`` routine can enforce
memory store ordering only for the calling thread. Thus, to ensure
ordering for memory stores performed by a thread that is not the thread
calling ``nvshmem_quiet``, the update must be made visible to the
calling thread according to the rules of the memory model associated
with the threading environment.

A call to ``nvshmem_quiet`` by a thread completes the operations posted
prior to calling ``nvshmem_quiet``. If the user intends to also complete
operations issued by a thread that is not the thread calling
``nvshmem_quiet``, the user must ensure that the operations are
performed prior to the call to ``nvshmem_quiet``. This may require the
use of a synchronization operation provided by the threading package.
For example, when using POSIX Threads, the user may call the
``pthread_barrier_wait`` routine to ensure that all threads have issued
operations before a thread calls ``nvshmem_quiet``.

``nvshmem_quiet`` does not have an effect on the ordering between memory
accesses issued by the target PE. ``nvshmem_wait_until``,
``nvshmem_test``, ``nvshmem_barrier``, ``nvshmem_barrier_all`` routines
can be called by the target PE to guarantee ordering of its memory
accesses.

.. [1]
   NVSHMEM fence routines does not guarantee order of delivery of values
   fetched by nonblocking AMO routines.