Programming Model Overview

NVSHMEM is a software library that implements the OpenSHMEM application programming interface (API) for clusters of NVIDIA GPUs. OpenSHMEM is a community standard, one-sided communication API that provides a partitioned global address space (PGAS) parallel programming model. A key goal of the OpenSHMEM specification – and also of NVSHMEM – is to provide an interface that is convenient to use, while also providing high performance with minimal software overheads.

The OpenSHMEM specification is under active development, with regular releases that expand its feature set and extend its ability to utilize emerging node and cluster architectures. The current version of NVSHMEM is based upon the OpenSHMEM version 1.3 APIs and also includes many features from later versions of OpenSHMEM. While NVSHMEM is based on OpenSHMEM, there are important differences that are detailed in this reference.

NVSHMEM provides an easy-to-use host-side interface for allocating symmetric memory, which can be distributed across a cluster of NVIDIA GPUs interconnected with NVLink, PCIe, and InfiniBand. Device-side APIs can be called by CUDA kernel threads to efficiently access locations in symmetric memory through one-sided read (get), write (put), and atomic update API calls. In addition, symmetric memory that is directly accessible to a given GPU (for example, a memory region that is located on the local GPU or a peer GPU connected via NVLink) can be queried and accessed directly via a pointer provided by the NVSHMEM library.
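
To make this model concrete, the following is a minimal sketch (the ring pattern, buffer name, and launch configuration are illustrative, not prescribed by the API): each PE allocates a symmetric integer, and a kernel writes its PE number into that integer on the next PE.

#include <cuda_runtime.h>
#include <nvshmem.h>

__global__ void ring_put(int *data) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    // One-sided write of this PE's ID into the symmetric buffer on the next PE.
    nvshmem_int_p(data, mype, (mype + 1) % npes);
}

int main(void) {
    nvshmem_init();
    int *data = (int *) nvshmem_malloc(sizeof(int));  // symmetric allocation
    ring_put<<<1, 1>>>(data);
    cudaDeviceSynchronize();  // wait for the kernel to finish
    nvshmem_barrier_all();    // complete the puts and synchronize all PEs
    nvshmem_free(data);
    nvshmem_finalize();
    return 0;
}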

NVSHMEM extends the OpenSHMEM APIs to support clusters of NVIDIA GPUs. The following provides a brief summary of several key extensions:

  • Support for symmetric allocation of GPU memory.
  • Support for GPU-initiated communication, including support for CUDA types.
  • A new API call to collectively launch CUDA kernels across a set of GPUs.
  • Stream-based APIs that allow data movement operations initiated from the CPU to be offloaded onto the GPU and ordered with respect to a CUDA stream (see the sketch following this list).
  • Threadgroup communication, in which the threads of a whole warp or a whole thread block in a CUDA kernel collectively participate in a single communication operation.
  • Differentiation between synchronizing and non-synchronizing operations to take advantage of both the weak and the strong operations in the GPU memory model.
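
As a hedged sketch of the stream-based and threadgroup extensions above (the buffer names, element count, and peer argument are illustrative, not part of the API):

#include <nvshmem.h>
#include <nvshmemx.h>

#define NELEMS 1024

// Device-side threadgroup put: all threads of the block cooperate in a
// single put operation.
__global__ void block_put(int *dest, const int *src, int peer) {
    nvshmemx_int_put_block(dest, src, NELEMS, peer);
}

// Host-initiated put that is offloaded to the GPU and executes in CUDA
// stream order.
void put_on_stream(int *dest, const int *src, int peer, cudaStream_t stream) {
    nvshmemx_int_put_on_stream(dest, src, NELEMS, peer, stream);
}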

The following provides a brief summary of the differences between NVSHMEM and OpenSHMEM.

  • API names are prefixed with “nv” to enable hybrid usage of NVSHMEM with an existing OpenSHMEM library.
  • All buffer arguments to NVSHMEM communication routines must be symmetric.
  • NVSHMEM provides weak ordering for data returned by blocking operations that fetch data. Ordering can be enforced via the nvshmem_fence operation.

More information on the OpenSHMEM specification can be found at openshmem.org.

Memory Model

Figure: NVSHMEM Memory Model

An NVSHMEM program consists of data objects that are private to each PE and data objects that are remotely accessible by all PEs. Private data objects are stored in the local memory of each PE and can be accessed only by the PE itself; these data objects cannot be accessed by other PEs via NVSHMEM routines. Private data objects follow the memory model of C. Remotely accessible objects, however, can be accessed by remote PEs using NVSHMEM routines. Remotely accessible data objects are called Symmetric Data Objects. Each symmetric data object has a corresponding object with the same name, type, and size on all PEs where that object is accessible via the NVSHMEM API [1].

In NVSHMEM, GPU memory allocated by NVSHMEM memory management routines is symmetric. See Section Memory Management for information on allocating symmetric memory.

NVSHMEM dynamic memory allocation routines (e.g., nvshmem_malloc) allow collective allocation of Symmetric Data Objects on a special memory region called the Symmetric Heap. The Symmetric Heap is created during the execution of a program at a memory location determined by the NVSHMEM library. The Symmetric Heap may reside in different memory regions on different PEs. Figure NVSHMEM Memory Model shows an example NVSHMEM memory layout, illustrating the location of remotely accessible symmetric objects and private data objects.
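
For example, a collective allocation from the symmetric heap might look as follows (the size is illustrative; every PE must pass the same value):

size_t bytes = 1024 * sizeof(double);            // identical on every PE
double *buf = (double *) nvshmem_malloc(bytes);  // returns a symmetric address
/* ... use buf in NVSHMEM operations ... */
nvshmem_free(buf);                               // deallocation is also collective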

Pointers to Symmetric Objects

Symmetric data objects are referenced in NVSHMEM operations through the local pointer to the desired remotely accessible object. The address contained in this pointer is referred to as a symmetric address. Every symmetric address is also a local address that is valid for direct memory access; however, not all local addresses are symmetric. Manipulation of symmetric addresses passed to NVSHMEM routines, including pointer arithmetic, array indexing, and access of structure or union members, is permitted as long as the resulting local pointer remains within the same symmetric allocation or object. Symmetric addresses are valid only at the PE where they were generated; using a symmetric address generated by a different PE for direct memory access, or as an argument to an NVSHMEM routine, results in undefined behavior.
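
For instance, indexing into a symmetric array yields another valid symmetric address, as in this sketch (the array, element index, and peer argument are illustrative):

__global__ void put_element(int *arr, int peer) {
    // arr + 10 stays within the same symmetric allocation, so it is a valid
    // symmetric address and targets element 10 of arr on the remote PE.
    nvshmem_int_p(arr + 10, 7, peer);
}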

Symmetric addresses provided to typed interfaces must be naturally aligned based on their type and any requirements of the underlying architecture. Symmetric addresses provided to fixed-size NVSHMEM interfaces (e.g., nvshmem_put32) must also be aligned to the given size. Symmetric objects provided to fixed-size NVSHMEM interfaces must have storage size equal to the bit-width of the given operation [2]. Because C/C++ structures may contain implementation-defined padding, the fixed-size interfaces should not be used with C/C++ structures. The “mem” interfaces (e.g., nvshmem_putmem) have no alignment requirements.
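
As an illustration of these constraints (the function and buffer names are assumptions for the sketch), a 32-bit fixed-size put can be used with float buffers, which are 32 bits wide and naturally aligned:

void copy_f32(float *dest, const float *src, size_t n, int peer) {
    // dest and src satisfy nvshmem_put32's 4-byte alignment and 32-bit
    // storage-size requirements.
    nvshmem_put32(dest, src, n, peer);
}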

The nvshmem_ptr routine allows the programmer to obtain a local address for a remotely accessible data object on a specified PE. The resulting pointer is valid for direct memory access; however, providing this address as an argument to an NVSHMEM routine that requires a symmetric address results in undefined behavior.
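
A hedged sketch of this query (the target PE and stored value are illustrative):

__global__ void direct_store(int *sym, int peer) {
    // Returns a pointer valid for direct loads and stores when the peer's
    // memory is directly accessible (e.g., over NVLink); NULL otherwise.
    int *p = (int *) nvshmem_ptr(sym, peer);
    if (p != NULL)
        *p = 42;  // plain store; do not pass p back to NVSHMEM routines
}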

Ordering of Operations

Blocking operations in NVSHMEM that read data (for example, get or atomic fetch-and-add) are expected to return data according to the order in which the operations are performed. For example, consider a program that performs atomic fetch-and-add of the value 1 to the symmetric variable x on PE 0.

a = shmem_int_fadd(x, 1, 0);
b = shmem_int_fadd(x, 1, 0);

In this example, the OpenSHMEM specification guarantees that b > a. However, this strong ordering can incur significant overheads on weakly ordered architectures by requiring memory barriers to be performed before any such operation returns. NVSHMEM relaxes this requirement in order to provide a more efficient implementation on NVIDIA GPUs. Thus, NVSHMEM does not guarantee b > a.

Where such ordering is required, programmers can use an nvshmem_fence operation to enforce ordering for blocking operations (for example, between the two statements above). Non-blocking operations are not ordered by calls to nvshmem_fence. Instead, they must be completed using the nvshmem_quiet operation. The completion semantics of fetching operations remain unchanged from the OpenSHMEM specification: the result of the get or AMO is available for any dependent operation that appears after it, in program order.
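
For example, writing the calls above with their NVSHMEM names (assuming the OpenSHMEM 1.3-style fadd naming), a fence between the two operations enforces the b > a ordering:

a = nvshmem_int_fadd(x, 1, 0);
nvshmem_fence();  // orders the two blocking fetch-and-add operations
b = nvshmem_int_fadd(x, 1, 0);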

Atomicity Guarantees

NVSHMEM contains a number of routines that perform atomic operations on symmetric data objects, which are defined in Section Atomic Memory Operations. The atomic routines guarantee that concurrent accesses to the same location by any of these routines, using the same datatype (specified in Tables Standard AMO Types and Names and Extended AMO Types and Names), will be exclusive. Exclusivity is also guaranteed when the target PE performs a wait or test operation on the same location and with the same datatype as one or more atomic operations.

NVSHMEM atomic operations do not guarantee exclusivity in the following scenarios, all of which result in undefined behavior.

  1. When concurrent accesses to the same location are performed using NVSHMEM atomic operations with different datatypes.
  2. When atomic and non-atomic NVSHMEM operations are used to access the same location concurrently.
  3. When NVSHMEM atomic operations and non-NVSHMEM operations (e.g., load and store operations) are used to access the same location concurrently.
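
By contrast, the following pattern preserves exclusivity, because every concurrent access uses an NVSHMEM atomic or wait operation with the same int datatype (the counter name and comparison threshold are illustrative, and the atomic_add/wait_until names assume the table naming mentioned above):

// On an origin PE: atomically increment a symmetric counter on the target.
__global__ void signal(int *ctr, int target_pe) {
    nvshmem_int_atomic_add(ctr, 1, target_pe);
}

// On the target PE: poll the same location with a matching-typed wait.
__global__ void consume(int *ctr, int expected) {
    nvshmem_int_wait_until(ctr, NVSHMEM_CMP_GE, expected);
}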

Execution Model

An NVSHMEM program consists of a set of NVSHMEM processes called PEs. While not required by NVSHMEM, in typical usage, PEs are executed using a single program, multiple data (SPMD) model. SPMD requires each PE to use the same executable; however, PEs are able to follow divergent control paths. PEs are implemented using OS processes, and PEs are permitted to create additional threads when threading support is enabled.

PE execution is loosely coupled, relying on NVSHMEM operations to communicate and synchronize among executing PEs. The NVSHMEM phase in a program begins with a call to the initialization routine nvshmem_init or nvshmem_init_thread, which must be performed before using any of the other NVSHMEM library routines. An NVSHMEM program concludes its use of the NVSHMEM library when all PEs call nvshmem_finalize or any PE calls nvshmem_global_exit. During a call to nvshmem_finalize, the NVSHMEM library must complete all pending communication and release all resources associated with the library using an implicit collective synchronization across PEs. Calling any NVSHMEM routine before initialization or after nvshmem_finalize leads to undefined behavior. After finalization, a subsequent initialization call also leads to undefined behavior.

The PEs of the NVSHMEM program are identified by unique integers. The identifiers are assigned in a monotonically increasing manner from zero to one less than the total number of PEs. PE identifiers are used in NVSHMEM calls (e.g., to specify the target PE of put or get operations on symmetric data objects, or in collective synchronization calls) or to direct per-PE control flow using standard C constructs. The identifiers are fixed for the duration of the NVSHMEM phase of a program.
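
A minimal host program illustrating this lifecycle:

#include <stdio.h>
#include <nvshmem.h>

int main(void) {
    nvshmem_init();              // begins the NVSHMEM phase
    int mype = nvshmem_my_pe();  // fixed identifier in [0, npes)
    int npes = nvshmem_n_pes();
    printf("Hello from PE %d of %d\n", mype, npes);
    nvshmem_finalize();          // collective; completes pending communication
    return 0;
}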

Progress of NVSHMEM Operations

The NVSHMEM model assumes that computation and communication are naturally overlapped. NVSHMEM programs are expected to make communication progress both with and without NVSHMEM calls. Consider a PE that is engaged in a computation with no NVSHMEM calls. Other PEs should be able to communicate (e.g., put, get, atomic, etc.) and complete communication operations with that computation-bound PE without that PE issuing any explicit NVSHMEM calls. One-sided NVSHMEM communication calls involving that PE should progress regardless of when that PE next engages in an NVSHMEM call.

Invoking NVSHMEM Operations

Pointer arguments to NVSHMEM routines that point to non-const data must not overlap in memory with other arguments to the same NVSHMEM operation, with the exception of in-place reductions as described in Section NVSHMEM_REDUCTIONS. Otherwise, the behavior is undefined. Two arguments overlap in memory if any of their data elements are contained in the same physical memory locations. For example, consider an address a returned by the nvshmem_ptr operation for symmetric object A on PE i. Providing the local address a and the symmetric address of object A to an NVSHMEM operation targeting PE i results in undefined behavior.
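
A sketch of the erroneous aliasing case described above (the names and the element count are illustrative):

void overlap_example(void) {
    size_t N = 128;
    int *A = (int *) nvshmem_malloc(N * sizeof(int));  // symmetric object A
    int *a = (int *) nvshmem_ptr(A, nvshmem_my_pe());  // local alias of A
    // Undefined behavior: the source alias and the symmetric destination
    // overlap in memory at the target PE.
    // nvshmem_int_put(A, a, N, nvshmem_my_pe());
    nvshmem_free(A);
}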

Buffers provided to NVSHMEM routines are in-use until the corresponding NVSHMEM operation has completed at the calling PE. Updates to a buffer that is in-use, including updates performed through locally and remotely issued NVSHMEM operations, result in undefined behavior. Similarly, reads from a buffer that is in-use are allowed only when the buffer was provided as a const-qualified argument to the NVSHMEM routine for which it is in-use. Otherwise, the behavior is undefined. Exceptions are made for buffers that are in-use by AMOs, as described in Section Atomicity Guarantees. For information regarding the completion of NVSHMEM operations, see Section Memory Ordering.
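
For example, with a non-blocking put, the source buffer remains in-use until a quiet completes the operation (the buffer names are illustrative):

__global__ void send(int *dest, int *src, size_t n, int peer) {
    nvshmem_int_put_nbi(dest, src, n, peer);  // src is now in-use
    // src may be read here (it is passed to the put as a const-qualified
    // argument), but it must not be updated until the operation completes.
    nvshmem_quiet();  // completes the put at the calling PE
    src[0] = 0;       // safe: src is no longer in-use
}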

NVSHMEM routines with multiple symmetric object arguments do not require these symmetric objects to be located within the same symmetric memory segment. For example, objects located in the symmetric data segment and objects located in the symmetric heap can be provided as arguments to the same NVSHMEM operation.

[1] For efficiency reasons, the same offset (from an arbitrary memory address) for symmetric data objects might be used on all PEs. Further discussion about symmetric heap layout and implementation efficiency can be found in Section NVSHMEM_MALLOC, NVSHMEM_FREE, NVSHMEM_ALIGN.
[2] The bit-width of a byte is implementation-defined in C. The CHAR_BIT constant in limits.h can be used to portably calculate the bit-width of a C object.