.. _subsec:programming_model:

Programming Model Overview
==========================

.. include:: ../overview.in

.. _subsec:memory_model:

Memory Model
============

.. figure:: figures/mem_model.png
   :alt: NVSHMEM Memory Model
   :name: fig:mem_model
   :width: 95.0%

   NVSHMEM Memory Model

An NVSHMEM program consists of data objects that are private to each PE and
data objects that are remotely accessible by all PEs. Private data objects
are stored in the local memory of each PE and can be accessed only by the PE
itself; these data objects cannot be accessed by other PEs via NVSHMEM
routines. Private data objects follow the memory model of *C*. Remotely
accessible objects, however, can be accessed by remote PEs using NVSHMEM
routines. Remotely accessible data objects are called *Symmetric Data
Objects*. Each symmetric data object has a corresponding object with the
same name, type, and size on all PEs where that object is accessible via the
NVSHMEM API [1]_. In NVSHMEM, GPU memory allocated by NVSHMEM memory
management routines is symmetric. See Section :ref:`sec:memory_management`
for information on allocating symmetric memory.

NVSHMEM dynamic memory allocation routines (e.g., *nvshmem_malloc*) allow
collective allocation of *Symmetric Data Objects* on a special memory region
called the *Symmetric Heap*. The Symmetric Heap is created during the
execution of a program at a memory location determined by the NVSHMEM
library and may reside in different memory regions on different PEs.
Figure :ref:`fig:mem_model` shows an example NVSHMEM memory layout,
illustrating the location of remotely accessible symmetric objects and
private data objects.

.. _subsec:pointers_to_symmetric_objects:

Pointers to Symmetric Objects
-----------------------------

Symmetric data objects are referenced in NVSHMEM operations through the
local pointer to the desired remotely accessible object. The address
contained in this pointer is referred to as a *symmetric address*. Every
symmetric address is also a *local address* that is valid for direct memory
access; however, not all local addresses are symmetric. Manipulation of
symmetric addresses passed to NVSHMEM routines, including pointer
arithmetic, array indexing, and access of structure or union members, is
permitted as long as the resulting local pointer remains within the same
symmetric allocation or object. Symmetric addresses are valid only at the PE
where they were generated; using a symmetric address generated by a
different PE for direct memory access, or as an argument to an NVSHMEM
routine, results in undefined behavior.

Symmetric addresses provided to typed interfaces must be naturally aligned
based on their type and any requirements of the underlying architecture.
Symmetric addresses provided to fixed-size NVSHMEM interfaces (e.g.,
*nvshmem_put32*) must also be aligned to the given size. Symmetric objects
provided to fixed-size NVSHMEM interfaces must have a storage size equal to
the bit-width of the given operation [2]_. Because *C/C++* structures may
contain implementation-defined padding, the fixed-size interfaces should not
be used with *C/C++* structures. The “mem” interfaces (e.g.,
*nvshmem_putmem*) have no alignment requirements.

The *nvshmem_ptr* routine allows the programmer to query a *local address*
to a remotely accessible data object at a specified PE. The resulting
pointer is valid for direct memory access; however, providing this address
as an argument to an NVSHMEM routine that requires a symmetric address
results in undefined behavior.
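As an illustration only (this sketch is not part of the NVSHMEM
specification), the following program allocates a symmetric integer
collectively and queries *nvshmem_ptr* inside a CUDA kernel to obtain a
local address to the peer's copy of the object, falling back to
*nvshmem_int_p* when the peer is not directly load/store accessible. The
ring-style choice of peer, the single-thread launch, and the device
selection via *NVSHMEMX_TEAM_NODE* are assumptions of the sketch, not
requirements of the memory model.

::

   #include <cuda_runtime.h>
   #include <nvshmem.h>
   #include <nvshmemx.h>

   __global__ void write_to_peer(int *sym) {
       int mype = nvshmem_my_pe();
       int peer = (mype + 1) % nvshmem_n_pes();

       /* Query a local address for the peer's copy of the symmetric object. */
       int *peer_ptr = (int *) nvshmem_ptr(sym, peer);

       if (peer_ptr != NULL) {
           /* The peer's memory is directly accessible. Note that peer_ptr is
            * a local address, not a symmetric address, and must not be passed
            * to an NVSHMEM routine that requires a symmetric address. */
           *peer_ptr = mype;
       } else {
           /* Fall back to a put that uses the symmetric address. */
           nvshmem_int_p(sym, mype, peer);
       }
   }

   int main(void) {
       nvshmem_init();
       cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

       /* Collective allocation: every PE obtains a same-sized object on the
        * symmetric heap. */
       int *sym = (int *) nvshmem_malloc(sizeof(int));

       write_to_peer<<<1, 1>>>(sym);
       cudaDeviceSynchronize();
       nvshmem_barrier_all();

       nvshmem_free(sym);
       nvshmem_finalize();
       return 0;
   }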
.. _subsec:mem_order:

Ordering of Operations
----------------------

Blocking operations in NVSHMEM that read data (for example, get or atomic
fetch-and-add) are expected to return data according to the order in which
the operations are performed. For example, consider a program that performs
atomic fetch-and-add of the value :math:`1` to the symmetric variable
:math:`x` on PE 0.

::

   a = shmem_int_fadd(x, 1, 0);
   b = shmem_int_fadd(x, 1, 0);

In this example, the OpenSHMEM specification guarantees that :math:`b > a`.
However, this strong ordering can incur significant overheads on weakly
ordered architectures by requiring memory barriers to be performed before
any such operation returns. NVSHMEM relaxes this requirement to provide a
more efficient implementation on NVIDIA GPUs. Thus, NVSHMEM does not
guarantee :math:`b > a`. Where such ordering is required, programmers can
use an *nvshmem_fence* operation to enforce ordering for blocking operations
(for example, between the two statements above). Non-blocking operations are
not ordered by calls to *nvshmem_fence*. Instead, they must be completed
using the *nvshmem_quiet* operation. The completion semantics of fetching
operations remain unchanged from the OpenSHMEM specification: the result of
the get or AMO is available for any dependent operation that appears after
it, in program order.
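As a hedged illustration of the fence semantics above (not an excerpt from
the NVSHMEM distribution), the sketch below has PE 0 deliver a payload
followed by a flag, with *nvshmem_fence* ordering the two blocking puts, so
that PE 1 can wait on the flag and then safely read the payload. The
assumption of exactly two PEs, the fixed payload value, and the collective
single-thread kernel launch are choices of the sketch rather than
requirements of NVSHMEM.

::

   #include <stdio.h>
   #include <stdint.h>
   #include <cuda_runtime.h>
   #include <nvshmem.h>
   #include <nvshmemx.h>

   /* PE 0 delivers a payload and then a flag; nvshmem_fence() ensures the
    * payload is delivered to PE 1 before the flag. PE 1 waits on the flag
    * and can then read the payload. Run with exactly two PEs. */
   __global__ void ordered_update(int *data, uint64_t *flag) {
       int mype = nvshmem_my_pe();

       if (mype == 0) {
           nvshmem_int_p(data, 42, 1);
           nvshmem_fence();                  /* order the two puts */
           nvshmem_uint64_p(flag, 1, 1);
       } else if (mype == 1) {
           nvshmem_uint64_wait_until(flag, NVSHMEM_CMP_EQ, 1);
           printf("PE 1 received %d\n", *data);   /* payload is visible */
       }
   }

   int main(void) {
       nvshmem_init();
       cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

       int      *data = (int *)      nvshmem_malloc(sizeof(int));
       uint64_t *flag = (uint64_t *) nvshmem_calloc(1, sizeof(uint64_t));

       /* A kernel that blocks in nvshmem_uint64_wait_until must be resident
        * on all PEs at the same time, so launch it collectively. */
       void *args[] = { &data, &flag };
       nvshmemx_collective_launch((const void *) ordered_update, 1, 1,
                                  args, 0, 0);
       cudaDeviceSynchronize();

       nvshmem_free(data);
       nvshmem_free(flag);
       nvshmem_finalize();
       return 0;
   }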
.. _subsec:amo_guarantees:

Atomicity Guarantees
--------------------

NVSHMEM contains a number of routines that perform atomic operations on
symmetric data objects, which are defined in Section :ref:`sec:amo`. The
atomic routines guarantee that concurrent accesses by any of these routines
to the same location using the same datatype (specified in Tables
:ref:`stdamotypes` and :ref:`extamotypes`) will be exclusive. Exclusivity is
also guaranteed when the target PE performs a wait or test operation on the
same location and with the same datatype as one or more atomic operations.

NVSHMEM atomic operations do not guarantee exclusivity in the following
scenarios, all of which result in undefined behavior.

#. When concurrent accesses to the same location are performed using NVSHMEM
   atomic operations with different datatypes.

#. When atomic and non-atomic NVSHMEM operations are used to access the same
   location concurrently.

#. When NVSHMEM atomic operations and non-NVSHMEM operations (e.g., load and
   store operations) are used to access the same location concurrently.

.. _subsec:execution_model:

Execution Model
===============

An NVSHMEM program consists of a set of NVSHMEM processes called PEs. While
not required by NVSHMEM, in typical usage, PEs are executed using a single
program, multiple data (SPMD) model. SPMD requires each PE to use the same
executable; however, PEs are able to follow divergent control paths. PEs are
implemented using OS processes and are permitted to create additional
threads when threading support is enabled. PE execution is loosely coupled,
relying on NVSHMEM operations to communicate and synchronize among executing
PEs.

The NVSHMEM phase in a program begins with a call to the initialization
routine *nvshmem_init* or *nvshmem_init_thread*, which must be performed
before using any of the other NVSHMEM library routines. An NVSHMEM program
concludes its use of the NVSHMEM library when all PEs call
*nvshmem_finalize* or any PE calls *nvshmem_global_exit*. During a call to
*nvshmem_finalize*, the NVSHMEM library must complete all pending
communication and release all resources associated with the library using an
implicit collective synchronization across PEs. Calling any NVSHMEM routine
before initialization or after *nvshmem_finalize* leads to undefined
behavior. After finalization, a subsequent initialization call also leads to
undefined behavior.

The PEs of the NVSHMEM program are identified by unique integers. The
identifiers are integers assigned in a monotonically increasing manner from
zero to one less than the total number of PEs. PE identifiers are used in
NVSHMEM calls (e.g., to specify the target of *put* or *get* routines on
symmetric data objects or of collective synchronization calls) or to dictate
the control flow of a PE using *C* constructs. The identifiers are fixed for
the duration of the NVSHMEM phase of a program.

.. _subsec:progress:

Progress of NVSHMEM Operations
------------------------------

The NVSHMEM model assumes that computation and communication are naturally
overlapped. NVSHMEM programs are expected to exhibit progression of
communication both with and without NVSHMEM calls. Consider a PE that is
engaged in a computation with no NVSHMEM calls. Other PEs should be able to
communicate (e.g., through *put*, *get*, and *atomic* operations) and
complete communication operations with that computation-bound PE without
that PE issuing any explicit NVSHMEM calls. One-sided NVSHMEM communication
calls involving that PE should progress regardless of when that PE next
engages in an NVSHMEM call.

.. _subsec:invoking_openshmem_operations:

Invoking NVSHMEM Operations
---------------------------

Pointer arguments to NVSHMEM routines that point to non-*const* data must
not overlap in memory with other arguments to the same NVSHMEM operation,
with the exception of in-place reductions as described in Section
:ref:`subsec:shmem_reductions`. Otherwise, the behavior is undefined. Two
arguments overlap in memory if any of their data elements are contained in
the same physical memory locations. For example, consider an address
:math:`a` returned by the *nvshmem_ptr* operation for symmetric object
:math:`A` on PE :math:`i`. Providing the local address :math:`a` and the
symmetric address of object :math:`A` to an NVSHMEM operation targeting PE
:math:`i` results in undefined behavior.

Buffers provided to NVSHMEM routines are *in-use* until the corresponding
NVSHMEM operation has completed at the calling PE. Updates to a buffer that
is in-use, including updates performed through locally and remotely issued
NVSHMEM operations, result in undefined behavior. Similarly, reads from a
buffer that is in-use are allowed only when the buffer was provided as a
*const*-qualified argument to the NVSHMEM routine for which it is in-use.
Otherwise, the behavior is undefined. Exceptions are made for buffers that
are in-use by AMOs, as described in Section :ref:`subsec:amo_guarantees`.
For information regarding the completion of NVSHMEM operations, see Section
:ref:`subsec:mem_order`.

NVSHMEM routines with multiple symmetric object arguments do not require
these symmetric objects to be located within the same symmetric memory
segment. For example, objects located in the symmetric data segment and
objects located in the symmetric heap can be provided as arguments to the
same NVSHMEM operation.
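As a hedged illustration of the buffer in-use rules above (assuming *dest*
and *src* are symmetric objects of at least *n* integers allocated with
*nvshmem_malloc*, and that the kernel is launched with a single thread on
every PE), the sketch below issues a nonblocking put and completes it with
*nvshmem_quiet* before reusing the source buffer. The ring-style choice of
peer and the final store are illustrative choices, not part of the NVSHMEM
specification.

::

   #include <nvshmem.h>

   __global__ void shift_nbi(int *dest, int *src, int n) {
       int mype = nvshmem_my_pe();
       int peer = (mype + 1) % nvshmem_n_pes();

       /* Nonblocking put: 'src' is in-use until the operation completes at
        * the calling PE, so it must not be modified until then. */
       nvshmem_int_put_nbi(dest, src, n, peer);

       /* nvshmem_quiet() completes all pending operations issued by the
        * calling PE; only after it returns may 'src' be reused. */
       nvshmem_quiet();

       src[0] = 0;   /* safe: the nonblocking put has completed */
   }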
.. [1] For efficiency reasons, the same offset (from an arbitrary memory
   address) for symmetric data objects might be used on all PEs. Further
   discussion about symmetric heap layout and implementation efficiency can
   be found in Section :ref:`subsec:shfree`.

.. [2] The bit-width of a byte is implementation-defined in *C*. The
   *CHAR_BIT* constant in *limits.h* can be used to portably calculate the
   bit-width of a *C* object.