Overview of the APIs
====================

NVSHMEM includes the OpenSHMEM 1.3 API and several APIs that are defined in
OpenSHMEM versions 1.4 and 1.5. The naming conventions are as follows:

* OpenSHMEM function names are prefixed with ``nv``, for example, ``nvshmem_init``.
* Type names are prefixed with ``nv``, for example, ``nvshmem_team_t``.
* Constants are prefixed with ``NV``, for example, ``NVSHMEM_VENDOR_STRING``.
* Environment variables are prefixed with ``NVSHMEM``, for example,
  ``NVSHMEM_SYMMETRIC_SIZE``.

NVSHMEM functions are classified based on where they can be invoked:

* On the host.
* On the GPU.
* On both the host and the GPU.

The following APIs can be invoked only on the host:

* Initialization and termination, for example, ``nvshmem_init`` and
  ``nvshmem_finalize``.
* Memory management, for example, ``nvshmem_malloc`` and ``nvshmem_free``.
* Collective kernel launch, for example, ``nvshmemx_collective_launch``.
* Stream-based operations.

The following APIs can be invoked only on the GPU:

* Thread block scoped operations, for example, ``nvshmemx_putmem_block``.
* Thread warp scoped operations, for example, ``nvshmemx_putmem_warp``.

The remaining operations, including one-sided remote memory access, one-sided
remote atomic memory access, memory ordering, point-to-point synchronization,
collectives, pointer query, and PE information query operations, are supported
from both the host and the device.

Unsupported OpenSHMEM 1.3 APIs
------------------------------

The following OpenSHMEM 1.3 APIs are not currently supported in NVSHMEM:

* OpenSHMEM Fortran API
* ``shmem_global_exit``
* ``shmem_pe_accessible``
* ``shmem_addr_accessible``
* ``shmem_realloc``
* ``shmem_fcollect``
* ``shmem_alltoalls``
* ``shmem_lock``
* ``_SHMEM_*`` constants (deprecated)
* ``SMA_*`` environment variables (deprecated)

In addition, CUDA supports only a subset of the atomic operations in the
OpenSHMEM specification. In NVSHMEM, the ``long long``, ``int64_t``, and
``ptrdiff_t`` types are currently unsupported for the ``add``, ``fetch_add``,
``inc``, and ``fetch_inc`` atomic operations.

OpenSHMEM 1.3 APIs Not Supported Over InfiniBand
------------------------------------------------

The following OpenSHMEM 1.3 APIs are not currently supported over InfiniBand
in NVSHMEM:

* ``shmem_iput``
* ``shmem_iget``
* ``shmem_atomic_``

Supported OpenSHMEM APIs (OpenSHMEM 1.4 and 1.5)
------------------------------------------------

The following OpenSHMEM 1.5 APIs are supported by NVSHMEM:

* ``nvshmem_wait_until_{any, all, some}`` and
  ``nvshmem_wait_until_{any, all, some}_vector``
* ``nvshmem_test_{any, all, some}`` and ``nvshmem_test_{any, all, some}_vector``

The following OpenSHMEM 1.4 APIs are supported by NVSHMEM:

* Threading support, with the ``nvshmem_init_thread`` and
  ``nvshmem_query_thread`` functions and the ``NVSHMEM_THREAD_SINGLE``,
  ``NVSHMEM_THREAD_FUNNELED``, ``NVSHMEM_THREAD_SERIALIZED``, and
  ``NVSHMEM_THREAD_MULTIPLE`` constants.
* ``NVSHMEM_SYNC_SIZE`` constant
* ``nvshmem_calloc`` (host)
* Bitwise atomic memory operations ``and``, ``fetch_and``, ``or``,
  ``fetch_or``, ``xor``, and ``fetch_xor`` (host and device)
* ``nvshmem_sync`` and ``nvshmem_sync_all`` (host and device)
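
To make the host/device classification above concrete, the following is a
minimal host-side sketch (buffer sizes and the printed message are
illustrative, and build and launch steps are omitted) that uses only
host-callable APIs, including the OpenSHMEM 1.4 ``nvshmem_calloc``:

.. code-block:: cuda

   #include <stdio.h>
   #include <nvshmem.h>

   int main(void) {
       // Host-only API: initialization.
       nvshmem_init();

       // PE information queries (callable from both host and device).
       int mype = nvshmem_my_pe();
       int npes = nvshmem_n_pes();

       // Host-only symmetric memory management; nvshmem_calloc is the
       // OpenSHMEM 1.4 host API listed above.
       float  *buf   = (float *)  nvshmem_malloc(1024 * sizeof(float));
       double *zeros = (double *) nvshmem_calloc(1024, sizeof(double));

       printf("PE %d of %d initialized\n", mype, npes);

       nvshmem_free(zeros);
       nvshmem_free(buf);

       // Host-only API: termination.
       nvshmem_finalize();
       return 0;
   }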
NVSHMEM API Extensions For CPU Threads
--------------------------------------

The following NVSHMEM extension APIs can be invoked only by CPU threads:

Initialization
    ``nvshmemx_init_attr``

CUDA kernel launch
    ``nvshmemx_collective_launch``

    CUDA kernels that invoke synchronizing NVSHMEM APIs such as
    ``nvshmem_barrier``, ``nvshmem_wait``, collective operations, and others
    must be launched using this API; otherwise, the behavior is undefined.

Collective launch grid size
    ``nvshmemx_collective_launch_query_gridsize``

    Queries the largest grid size that can be used for the given kernel with
    CUDA cooperative launch on the current GPU.

Remote memory access
    ``nvshmemx_put__on_stream``, ``nvshmemx_get__on_stream``

    Asynchronous with respect to the calling CPU thread; takes a
    ``cudaStream_t`` as an argument and is ordered on that CUDA stream.

Memory ordering
    ``nvshmemx_quiet_on_stream``

Collective communication
    ``nvshmemx_broadcast__on_stream``, ``nvshmemx_collect___on_stream``,
    ``nvshmemx_alltoall__on_stream``, and ``nvshmemx_to_all__on_stream``
    (reductions)

Collective synchronization
    ``nvshmemx_barrier_all_on_stream``, ``nvshmemx_barrier_on_stream``,
    ``nvshmemx_sync_all_on_stream``, and ``nvshmemx_sync_on_stream``

NVSHMEM extends the remote memory access (get and put), memory ordering,
collective communication, and collective synchronization APIs with support
for CUDA streams. Each stream-based function performs the same operation as
described in the OpenSHMEM specification. An additional argument of
``cudaStream_t`` type is added as the last argument to each function and
indicates the stream on which the operation is enqueued.

Ordering APIs (fence, quiet, and barrier) that are issued on the CPU and the
GPU order only the communication operations that were issued from the CPU and
the GPU, respectively. To ensure completion of GPU-side operations from the
CPU, the developer must perform a GPU-side quiet operation and ensure
completion of the CUDA kernel from which the GPU-side operations were issued,
using operations such as ``cudaStreamSynchronize`` or
``cudaDeviceSynchronize``. Alternatively, a stream-based quiet operation can
be used. A stream-based quiet operation has the effect of a quiet being
executed on the GPU in stream order, ensuring completion and ordering of only
the GPU-side operations.
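
As a rough illustration of these host-side extensions, the sketch below (the
``exchange`` kernel and ``host_side`` wrapper are hypothetical names) launches
a kernel that contains a synchronizing NVSHMEM call through
``nvshmemx_collective_launch``, then enqueues a stream-ordered put, a
stream-based quiet, and a CPU-side wait on the stream:

.. code-block:: cuda

   #include <nvshmem.h>
   #include <nvshmemx.h>

   // The kernel calls nvshmem_barrier_all(), a synchronizing NVSHMEM API, so
   // it must be launched with nvshmemx_collective_launch.
   __global__ void exchange(float *dst, float *src, size_t nelems, int peer) {
       if (blockIdx.x == 0 && threadIdx.x == 0)
           nvshmem_float_put(dst, src, nelems, peer);  // one-sided put from the GPU
       nvshmem_barrier_all();                          // synchronizing call in the kernel
   }

   void host_side(float *dst, float *src, size_t nelems, int peer,
                  cudaStream_t stream) {
       void *args[] = { &dst, &src, &nelems, &peer };

       // Collective kernel launch (host-only API).
       nvshmemx_collective_launch((const void *)exchange, 1, 64, args, 0, stream);

       // Stream-based put: asynchronous with respect to this CPU thread and
       // ordered on 'stream'.
       nvshmemx_float_put_on_stream(dst, src, nelems, peer, stream);

       // Stream-based quiet: completes the GPU-side operations in stream order.
       nvshmemx_quiet_on_stream(stream);

       // The CPU waits for everything enqueued on the stream above.
       cudaStreamSynchronize(stream);
   }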
NVSHMEM API Extensions For GPU Threads
--------------------------------------

New APIs for GPU-side invocation are provided that can be called collectively
by a thread block or a warp:

RMA write
    ``nvshmemx_put_block``, ``nvshmemx_put_warp``

RMA read
    ``nvshmemx_get_block``, ``nvshmemx_get_warp``

Asynchronous RMA write
    ``nvshmemx_put_nbi_block``, ``nvshmemx_put_nbi_warp``

Asynchronous RMA read
    ``nvshmemx_get_nbi_block``, ``nvshmemx_get_nbi_warp``

Collective communication
    ``nvshmemx_broadcast__block``, ``nvshmemx_broadcast__warp``,
    ``nvshmemx_collect___block``, ``nvshmemx_collect___warp``,
    ``nvshmemx_alltoall__block``, ``nvshmemx_alltoall__warp``,
    ``nvshmemx_to_all__block``, and ``nvshmemx_to_all__warp`` (reductions)

Collective synchronization
    ``nvshmemx_barrier_all_block``, ``nvshmemx_barrier_all_warp``,
    ``nvshmemx_barrier_block``, ``nvshmemx_barrier_warp``,
    ``nvshmemx_sync_all_block``, ``nvshmemx_sync_all_warp``,
    ``nvshmemx_sync_block``, and ``nvshmemx_sync_warp``

These extension APIs can be invoked by GPU threads. Each API has two
variants: one with the ``_block`` suffix and one with the ``_warp`` suffix.
For example, the OpenSHMEM API ``shmem_float_put`` has two extension APIs in
NVSHMEM, ``nvshmemx_float_put_block`` and ``nvshmemx_float_put_warp``.

These extension APIs are collective calls that must be made by every thread
in the scope of the API and with exactly the same arguments. The scope of the
``*_block`` extension APIs is the block in which the thread resides.
Similarly, the scope of the ``*_warp`` extension APIs is the warp in which
the thread resides. For example, if thread 0 calls
``nvshmemx_float_put_block``, then every other thread in the same block as
thread 0 must also call ``nvshmemx_float_put_block`` with the same arguments.
Otherwise, the call results in erroneous behavior or a deadlock in the
program. The NVSHMEM runtime might or might not leverage the multiple threads
in the scope of the API to execute the API call. A sketch of a block-scoped
put appears after the list below.

The extension APIs are useful in the following situations:

* Converting ``nvshmem_float_put`` to ``nvshmemx_float_put_block`` enables
  the NVSHMEM runtime to leverage all the threads in the block to
  concurrently copy the data to the destination PE if the destination GPU of
  the put call is P2P connected. If the destination GPU is connected via
  InfiniBand, a single thread in the block can issue an RMA write operation
  to the destination GPU.
* The ``*_block`` and ``*_warp`` extensions of the collective APIs can use
  multiple threads to perform collective operations, for example, parallel
  reduction operations or multiple threads sending data in parallel.
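
For example, a minimal device-side sketch (the ``block_put`` kernel name and
the placement of the quiet are illustrative) in which every thread of the
block participates in one block-scoped put:

.. code-block:: cuda

   #include <nvshmem.h>
   #include <nvshmemx.h>

   // Every thread in the block collectively calls nvshmemx_float_put_block
   // with identical arguments; the runtime may use some or all of these
   // threads to move the data.
   __global__ void block_put(float *dst, const float *src, size_t nelems, int peer) {
       nvshmemx_float_put_block(dst, src, nelems, peer);

       // If completion must be observed from the GPU, a quiet can follow; here
       // a single thread issues it after the collective call has returned.
       if (threadIdx.x == 0)
           nvshmem_quiet();
   }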