.. _sec:rma: Remote Memory Access -------------------- The RMA routines described in this section can be used to perform reads from and writes to symmetric data objects. These operations are one-sided, meaning that the PE invoking an operation provides all communication parameters and the targeted PE is passive. A characteristic of one-sided communication is that it decouples communication from synchronization. One-sided communication mechanisms transfer data; however, they do not synchronize the sender of the data with the receiver of the data. NVSHMEMRMA routines are performed on symmetric data objects. The initiator PE of a call is designated as the *origin* PE and the PE targeted by an operation is designated as the *destination* PE. The ``source`` and ``dest`` designators refer to the data objects that an operation reads from and writes to. In the case of the remote update routine, *Put*, the origin PE provides the ``source`` data object and the destination PE provides the ``dest`` data object. In the case of the remote read routine, *Get*, the origin PE provides the ``dest`` data object and the destination PE provides the ``source`` data object. The standard RMA types include the exact-width integer types defined in ``stdint.h`` by *C*  §7.18.1.1 and *C* §7.20.1.1. When the *C* translation environment does not provide exact-width integer types with ``stdint.h``, an NVSHMEM implemementation is not required to provide support for these types. .. _stdrmatypes: .. table:: Standard RMA Types and Names ================== ========== *TYPE* *TYPENAME* ================== ========== float float double double char char signed char schar short short int int long long long long longlong unsigned char uchar unsigned short ushort unsigned int uint unsigned long ulong unsigned long long ulonglong int8_t int8 int16_t int16 int32_t int32 int64_t int64 uint8_t uint8 uint16_t uint16 uint32_t uint32 uint64_t uint64 size_t size ptrdiff_t ptrdiff ================== ========== .. _subsec:rma: Blocking RMA ~~~~~~~~~~~~ .. _subsec:shmem_put: **NVSHMEM_PUT** ^^^^^^^^^^^^^^^ .. c:function:: void nvshmem\_TYPENAME\_put(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_TYPENAME\_put_on_stream(TYPE *dest, const TYPE *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_TYPENAME\_put(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_put_block(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_put_warp(TYPE *dest, const TYPE *source, size_t nelems, int pe) where *TYPE* is one of the standard RMA types and has a corresponding *TYPENAME* specified by Table :ref:`stdrmatypes`. .. c:function:: void nvshmem\_putSIZE(void *dest, const void *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_putSIZE_on_stream(void *dest, const void *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_putSIZE(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_putSIZE_block(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_putSIZE_warp(void *dest, const void *source, size_t nelems, int pe) where *SIZE* is one of ``8, 16, 32, 64, 128``. .. c:function:: void nvshmem\_putmem(void *dest, const void *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_putmem_on_stream(void *dest, const void *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_putmem(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_putmem_block(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_putmem_warp(void *dest, const void *source, size_t nelems, int pe) *dest [OUT]* Symmetric address of the destination data object. The type of ``dest`` should match that implied in the SYNOPSIS section. *source [IN]* Local address of the data object containing the data to be copied. The type of ``source`` should match that implied in the SYNOPSIS section. *nelems [IN]* Number of elements in the ``dest`` and ``source`` arrays. For ``nvshmem_putmem`` and ``nvshmem_ctx_putmem``, elements are bytes. *pe [IN]* PE number of the remote PE. **Description** The routines return after the data has been copied out of the ``source`` array on the local PE. The delivery of data words into the data object on the destination PE may occur in any order. Furthermore, two successive put routines may deliver data out of order unless a call to ``nvshmem_fence`` is introduced between the two calls. **Returns** None. .. _subsec:shmem_p: **NVSHMEM_P** ^^^^^^^^^^^^^ .. c:function:: void nvshmem\_TYPENAME\_p(TYPE *dest, TYPE value, int pe) .. c:function:: __device__ void nvshmem\_TYPENAME\_p(TYPE *dest, TYPE value, int pe) where *TYPE* is one of the standard RMA types and has a corresponding *TYPENAME* specified by Table :ref:`stdrmatypes`. *dest [OUT]* Symmetric address of the destination data object. The type of ``dest`` should match that implied in the SYNOPSIS section. *value [IN]* The value to be transferred to ``dest``. The type of ``value`` should match that implied in the SYNOPSIS section. *pe [IN]* The number of the remote PE. **Description** These routines provide a very low latency put capability for single elements of most basic types. As with ``nvshmem_put``, these routines start the remote transfer and may return before the data is delivered to the remote PE. Use ``nvshmem_quiet`` to force completion of all remote *Put* transfers. **Returns** None. .. _subsec:shmem_iput: **NVSHMEM_IPUT** ^^^^^^^^^^^^^^^^ .. c:function:: void nvshmem\_TYPENAME\_iput(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: void nvshmemx\_TYPENAME\_iput_on_stream(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_TYPENAME\_iput(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_iput_block(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_iput_warp(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) where *TYPE* is one of the standard RMA types and has a corresponding *TYPENAME* specified by Table :ref:`stdrmatypes`. .. c:function:: void nvshmem\_iputSIZE(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: void nvshmemx\_iputSIZE_on_stream(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_iputSIZE(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_iputSIZE_block(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_iputSIZE_warp(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) where *SIZE* is one of ``8, 16, 32, 64, 128``. *dest [OUT]* Symmetric address of the destination array data object. The type of ``dest`` should match that implied in the SYNOPSIS section. *source [IN]* Local address of the array containing the data to be copied. The type of ``source`` should match that implied in the SYNOPSIS section. *dst [IN]* The stride between consecutive elements of the ``dest`` array. The stride is scaled by the element size of the ``dest`` array. A value of ``1`` indicates contiguous data. *sst [IN]* The stride between consecutive elements of the ``source`` array. The stride is scaled by the element size of the ``source`` array. A value of ``1`` indicates contiguous data. *nelems [IN]* Number of elements in the ``dest`` and ``source`` arrays. *pe [IN]* PE number of the remote PE. **Description** The ``iput`` routines provide a method for copying strided data elements (specified by ``sst``) of an array from a ``source`` array on the local PE to locations specified by stride ``dst`` on a ``dest`` array on specified remote PE. Both strides, ``dst`` and ``sst``, must be greater than or equal to ``1``. The routines return when the data has been copied out of the ``source`` array on the local PE but not necessarily before the data has been delivered to the remote data object. **Returns** None. **Notes** See Section :ref:`subsec:memory_model` for a definition of the term remotely accessible. .. _subsec:shmem_get: **NVSHMEM_GET** ^^^^^^^^^^^^^^^ .. c:function:: void nvshmem\_TYPENAME\_get(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_TYPENAME\_get_on_stream(TYPE *dest, const TYPE *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_TYPENAME\_get(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_get_block(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_get_warp(TYPE *dest, const TYPE *source, size_t nelems, int pe) where *TYPE* is one of the standard RMA types and has a corresponding *TYPENAME* specified by Table :ref:`stdrmatypes`. .. c:function:: void nvshmem\_getSIZE(void *dest, const void *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_getSIZE_on_stream(void *dest, const void *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_getSIZE(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_getSIZE_block(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_getSIZE_warp(void *dest, const void *source, size_t nelems, int pe) where *SIZE* is one of ``8, 16, 32, 64, 128``. .. c:function:: void nvshmem\_getmem(void *dest, const void *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_getmem_on_stream(void *dest, const void *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_getmem(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_getmem_block(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_getmem_warp(void *dest, const void *source, size_t nelems, int pe) *dest [OUT]* Local address of the data object to be updated. The type of ``dest`` should match that implied in the SYNOPSIS section. *source [IN]* Symmetric address of the source data object. The type of ``source`` should match that implied in the SYNOPSIS section. *nelems [IN]* Number of elements in the ``dest`` and ``source`` arrays. For ``nvshmem_getmem`` and ``nvshmem_ctx_getmem``, elements are bytes. *pe [IN]* PE number of the remote PE. **Description** The get routines provide a method for copying a contiguous symmetric data object from a different PE to a contiguous data object on the local PE. The routines return after the data has been delivered to the ``dest`` array on the local PE. **Returns** None. **Notes** See Section :ref:`subsec:memory_model` for a definition of the term remotely accessible. .. _subsec:shmem_g: **NVSHMEM_G** ^^^^^^^^^^^^^ .. c:function:: TYPE nvshmem\_TYPENAME\_g(const TYPE *source, int pe) .. c:function:: __device__ TYPE nvshmem\_TYPENAME\_g(const TYPE *source, int pe) where *TYPE* is one of the standard RMA types and has a corresponding *TYPENAME* specified by Table :ref:`stdrmatypes`. *source [IN]* Symmetric address of the source data object. The type of ``source`` should match that implied in the SYNOPSIS section. *pe [IN]* The number of the remote PE on which ``source`` resides. **Description** These routines provide a very low latency get capability for single elements of most basic types. **Returns** Returns a single element of type specified in the synopsis. .. _subsec:shmem_iget: **NVSHMEM_IGET** ^^^^^^^^^^^^^^^^ .. c:function:: void nvshmem\_TYPENAME\_iget(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: void nvshmemx\_TYPENAME\_iget_on_stream(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_TYPENAME\_iget(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_iget_block(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_iget_warp(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) where *TYPE* is one of the standard RMA types and has a corresponding *TYPENAME* specified by Table :ref:`stdrmatypes`. .. c:function:: void nvshmem\_igetSIZE(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: void nvshmemx\_igetSIZE_on_stream(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_igetSIZE(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_igetSIZE_block(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_igetSIZE_warp(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe) where *SIZE* is one of ``8, 16, 32, 64, 128``. *dest [OUT]* Local address of the array to be updated. The type of ``dest`` should match that implied in the SYNOPSIS section. *source [IN]* Symmetric address of the source array data object. The type of ``source`` should match that implied in the SYNOPSIS section. *dst [IN]* The stride between consecutive elements of the ``dest`` array. The stride is scaled by the element size of the ``dest`` array. A value of ``1`` indicates contiguous data. *sst [IN]* The stride between consecutive elements of the ``source`` array. The stride is scaled by the element size of the ``source`` array. A value of ``1`` indicates contiguous data. *nelems [IN]* Number of elements in the ``dest`` and ``source`` arrays. *pe [IN]* PE number of the remote PE. **Description** The ``iget`` routines provide a method for copying strided data elements from a symmetric array from a specified remote PE to strided locations on a local array. The routines return when the data has been copied into the local ``dest`` array. **Returns** None. .. _subsec:rma_nbi: Nonblocking RMA ~~~~~~~~~~~~~~~ .. _subsec:shmem_put_nbi: **NVSHMEM_PUT_NBI** ^^^^^^^^^^^^^^^^^^^ .. c:function:: void nvshmem\_TYPENAME\_put\_nbi(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_TYPENAME\_put\_nbi_on_stream(TYPE *dest, const TYPE *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_TYPENAME\_put\_nbi(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_put\_nbi_block(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_put\_nbi_warp(TYPE *dest, const TYPE *source, size_t nelems, int pe) where *TYPE* is one of the standard RMA types and has a corresponding *TYPENAME* specified by Table :ref:`stdrmatypes`. .. c:function:: void nvshmem\_putSIZE\_nbi(void *dest, const void *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_putSIZE\_nbi_on_stream(void *dest, const void *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_putSIZE\_nbi(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_putSIZE\_nbi_block(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_putSIZE\_nbi_warp(void *dest, const void *source, size_t nelems, int pe) where *SIZE* is one of ``8, 16, 32, 64, 128``. .. c:function:: void nvshmem\_putmem\_nbi(void *dest, const void *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_putmem\_nbi_on_stream(void *dest, const void *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_putmem\_nbi(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_putmem\_nbi_block(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_putmem\_nbi_warp(void *dest, const void *source, size_t nelems, int pe) *dest [OUT]* Symmetric address of the destination data object. The type of ``dest`` should match that implied in the SYNOPSIS section. *source [IN]* Local address of the object containing the data to be copied. The type of ``source`` should match that implied in the SYNOPSIS section. *nelems [IN]* Number of elements in the ``dest`` and ``source`` arrays. For ``nvshmem_putmem_nbi`` and ``nvshmem_ctx_putmem_nbi``, elements are bytes. *pe [IN]* PE number of the remote PE. **Description** The routines return after initiating the operation. The operation is considered complete after a subsequent call to ``nvshmem_quiet``. At the completion of ``nvshmem_quiet``, the data has been copied into the ``dest`` array on the destination PE. The delivery of data words into the data object on the destination PE may occur in any order. Furthermore, two successive put routines may deliver data out of order unless a call to ``nvshmem_fence`` is introduced between the two calls. **Returns** None. .. _subsec:shmem_get_nbi: **NVSHMEM_GET_NBI** ^^^^^^^^^^^^^^^^^^^ .. c:function:: void nvshmem\_TYPENAME\_get\_nbi(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_TYPENAME\_get\_nbi_on_stream(TYPE *dest, const TYPE *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_TYPENAME\_get\_nbi(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_get\_nbi_block(TYPE *dest, const TYPE *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_TYPENAME\_get\_nbi_warp(TYPE *dest, const TYPE *source, size_t nelems, int pe) where *TYPE* is one of the standard RMA types and has a corresponding *TYPENAME* specified by Table :ref:`stdrmatypes`. .. c:function:: void nvshmem\_getSIZE\_nbi(void *dest, const void *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_getSIZE\_nbi_on_stream(void *dest, const void *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_getSIZE\_nbi(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_getSIZE\_nbi_block(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_getSIZE\_nbi_warp(void *dest, const void *source, size_t nelems, int pe) where *SIZE* is one of ``8, 16, 32, 64, 128``. .. c:function:: void nvshmem\_getmem\_nbi(void *dest, const void *source, size_t nelems, int pe) .. c:function:: void nvshmemx\_getmem\_nbi_on_stream(void *dest, const void *source, size_t nelems, int pe, cudaStream_t stream) .. c:function:: __device__ void nvshmem\_getmem\_nbi(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_getmem\_nbi_block(void *dest, const void *source, size_t nelems, int pe) .. c:function:: __device__ void nvshmemx\_getmem\_nbi_warp(void *dest, const void *source, size_t nelems, int pe) *dest [OUT]* Local address of the data object to be updated. The type of ``dest`` should match that implied in the SYNOPSIS section. *source [IN]* Symmetric address of the source data object. The type of ``source`` should match that implied in the SYNOPSIS section. *nelems [IN]* Number of elements in the ``dest`` and ``source`` arrays. For ``nvshmem_getmem_nbi`` and ``nvshmem_ctx_getmem_nbi``, elements are bytes. *pe [IN]* PE number of the remote PE. **Description** The get routines provide a method for copying a contiguous symmetric data object from a different PE to a contiguous data object on the local PE. The routines return after initiating the operation. The operation is considered complete after a subsequent call to ``nvshmem_quiet``. At the completion of ``nvshmem_quiet``, the data has been delivered to the ``dest`` array on the local PE. **Returns** None. **Notes** See Section :ref:`subsec:memory_model` for a definition of the term remotely accessible.