NVSHMEM Communication Library APIs

This section describes the Fortran interfaces to the NVSHMEM library. NVSHMEM is a software library that implements the OpenSHMEM application programming interface (API) for clusters of NVIDIA GPUs. OpenSHMEM is a community standard, one-sided communication API that provides a partitioned global address space (PGAS) parallel programming model. NVSHMEM provides an easy-to-use host-side interface for allocating symmetric memory, which can be distributed across a cluster of NVIDIA GPUs interconnected with NVLink, PCIe, and InfiniBand. The NVSHMEM communication functions are accessible from both host and device code. Most of the runtime API routines are written as C void functions, and we have implemented their Fortran wrappers as subroutines.

The NVSHMEM interfaces and definitions described in this chapter can be exposed by adding the line

use nvshmem

to your program unit. The same module is used for both host and device code. Device functions which run on a thread block or warp are declared as acc vector nohost routines. Others are acc seq routines.

Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4) and the plain real type implies real(4).

NVSHMEM Definitions, Setup, Exit, and Query Functions

This section contains definitions and data types used in the NVSHMEM library and interfaces to the NVSHMEM initialization and access to the parallel environment of the PEs.

The Fortran NVSHMEM module contains the following constant and derived type definitions:

! These are not available to the user, internal only
! defines, from nvshmemx_api.h
#define INIT_HANDLE_BYTES 128

! defines, from nvshmem_constants.h
#define SYNC_SIZE 27648

! Constant Definitions
integer, parameter :: NVSHMEM_SYNC_VALUE = 0
integer, parameter :: NVSHMEM_SYNC_SIZE = (2 * SYNC_SIZE)
integer, parameter :: NVSHMEM_BARRIER_SYNC_SIZE = (2 * SYNC_SIZE)
integer, parameter :: NVSHMEM_BCAST_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_REDUCE_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_REDUCE_MIN_WRKDATA_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_COLLECT_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_ALLTOALL_SYNC_SIZE = SYNC_SIZE

integer, parameter :: NVSHMEMX_CMP_EQ = 0
integer, parameter :: NVSHMEMX_CMP_NE = 1
integer, parameter :: NVSHMEMX_CMP_GT = 2
integer, parameter :: NVSHMEMX_CMP_LE = 3
integer, parameter :: NVSHMEMX_CMP_LT = 4
integer, parameter :: NVSHMEMX_CMP_GE = 5

integer, parameter :: NVSHMEMX_THREAD_SINGLE     = 0
integer, parameter :: NVSHMEMX_THREAD_FUNNELED   = 1
integer, parameter :: NVSHMEMX_THREAD_SERIALIZED = 2
integer, parameter :: NVSHMEMX_THREAD_MULTIPLE   = 3

integer, parameter :: NVSHMEM_TEAM_INVALID = -1
integer, parameter :: NVSHMEM_TEAM_WORLD   = 0
integer, parameter :: NVSHMEM_TEAM_SHARED  = 1
integer, parameter :: NVSHMEMX_TEAM_NODE   = 2

integer, parameter :: NVSHMEMX_INIT_THREAD_PES    = 1
integer, parameter :: NVSHMEMX_INIT_WITH_MPI_COMM = 2
integer, parameter :: NVSHMEMX_INIT_WITH_SHMEM    = 4
integer, parameter :: NVSHMEMX_INIT_WITH_HANDLE   = 8

! Types from nvshmemx_api.h
type, bind(c) :: nvshmemx_init_handle
  character(c_char) :: content(INIT_HANDLE_BYTES)
end type nvshmemx_init_handle

! Types from nvshmemx_api.h
type, bind(c) :: nvshmemx_init_attr_type
  integer(8)  heap_size
  integer(4)  num_threads
  integer(4)  n_pes
  integer(4)  my_pe
  type(c_ptr) mpi_comm
  type(nvshmemx_init_handle) handle
end type nvshmemx_init_attr_type

! Types from nvshmem_types.h
type, bind(c) :: nvshmem_team_config
  integer(c_int) :: num_contexts
end type nvshmem_team_config

! nvshmemx_status, from nvshmem_error.h
type, bind(c) :: nvshmemx_status
  integer(c_int) :: member
end type nvshmemx_status

type(nvshmemx_status), parameter :: &
  NVSHMEMX_SUCCESS = nvshmemx_status(0), &
  NVSHMEMX_ERROR_INVALID_VALUE = nvshmemx_status(1), &
  NVSHMEMX_ERROR_OUT_OF_MEMORY = nvshmemx_status(2), &
  NVSHMEMX_ERROR_NOT_SUPPORTED = nvshmemx_status(3), &
  NVSHMEMX_ERROR_SYMMETRY = nvshmemx_status(4), &
  NVSHMEMX_ERROR_GPU_NOT_SELECTED = nvshmemx_status(5), &
  NVSHMEMX_ERROR_COLLECTIVE_LAUNCH_FAILED = nvshmemx_status(6), &
  NVSHMEMX_ERROR_INTERNAL = nvshmemx_status(7)

nvshmem_init

This subroutine allocates and initializes resources used by the NVSHMEM library.

subroutine nvshmem_init()

nvshmemx_init_attr

This function initializes the NVSHMEM library based on an existing MPI communicator. Since the C and Fortran mpi_comm objects differ, this function has a different argument list than the corresponding C library entry point.

type(nvshmemx_status) function nvshmemx_init_attr(flags, comm)
    integer(4) :: flags, comm

Here is an example of using this function with MPI

use nvshmem
type(nvshmemx_status) :: nvstat
. . .
! Setup MPI
call MPI_Init(ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierror)
!
nvstat = nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, MPI_COMM_WORLD)

nvshmem_my_pe

This function returns the PE number of the calling PE, a number between 0 and npes-1.

integer(4) function nvshmem_my_pe()

nvshmem_n_pes

This function returns the number of PEs running in the program.

integer(4) function nvshmem_n_pes()

nvshmem_team_my_pe

This function returns the PE number of the calling PE, within the specified team.

integer(4) function nvshmem_team_my_pe(team)
    integer(4) :: team

nvshmem_team_n_pes

This function returns the number of PEs in the specified team.

integer(4) function nvshmem_team_n_pes(team)
    integer(4) :: team

nvshmem_team_get_config

This function returns the configuration parameters as described by the mask in the config argument.

integer(4) function nvshmem_team_get_config(team, mask, config)
    integer(4) :: team
    integer(8) :: mask
    type(nvshmem_team_config) :: config

nvshmem_team_translate_pe

This function returns the translated destination pe given the source team, the source pe, and the destination team.

integer(4) function nvshmem_team_translate_pe(src_team, src_pe, dest_team)
    integer(4) :: src_team, src_pe, dest_team

nvshmem_team_split_strided

This function performs a collective operation and creates a new team given a parent team and a desired slice (start, stride, and size) from the parent team.

integer(4) function nvshmem_team_split_strided(parent_team, &
        start, stride, size, config, mask, new_team)
    integer(4) :: parent_team
    integer(4) :: start, stride, size
    type(nvshmem_team_config) :: config
    integer(8) :: mask
    integer(4) :: new_team

nvshmem_team_split_2d

This function performs a collective operation and creates two new teams given a parent team and a specification of the 2D space. The result is two teams containing the PEs which map to the 2D space’s row and column.

integer(4) function nvshmem_team_split_2d(parent_team, &
        xrange, xaxis_config, xaxis_mask, xaxis_team, &
        yaxis_config, yaxis_mask, yaxis_team)
    integer(4) :: parent_team
    integer(4) :: xrange
    type(nvshmem_team_config) :: xaxis_config, yaxis_config
    integer(8) :: xaxis_mask, yaxis_mask
    integer(4), intent(out) :: xaxis_team, yaxis_team

nvshmem_team_destroy

This function is a collective operation which destroys the team and frees the resouces associated with it.

integer(4) function nvshmem_team_destroy(team)
    integer(4) :: team

nvshmem_info_get_version

This subroutine returns the major and minor version number of the NVSHMEM library.

subroutine nvshmem_info_get_version(major, minor)
    integer(4) :: major, minor

nvshmem_info_get_name

This subroutine returns the vendor-defined name string for the library.

subroutine nvshmem_info_get_name(name)
    character*256, intent(out) :: name

nvshmem_finalize

This subroutine releases resources and ends the NVSHMEM portion of a program started with nvshmem_init().

subroutine nvshmem_finalize()

nvshmem_ptr

This function returns a local address that may be used to directly reference the destination data on the specified PE. The function nvshmem_ptr is implemented as a Fortran generic function, and can take any datatype, as long as it is a symmetric address.

type(c_devptr) function nvshmem_ptr(dest, pe)
    ! dest can be of type integer, logical, real, complex, character,
    ! or a type(c_devptr)
    integer(4) :: pe

The following specific functions are also supported:

type(c_devptr) function nvshmem_ptri(dest, pe)
    integer :: dest ! Any kind and rank
    integer(4) :: pe

type(c_devptr) function nvshmem_ptrl(dest, pe)
    logical :: dest ! Any kind and rank
    integer(4) :: pe

type(c_devptr) function nvshmem_ptrr(dest, pe)
    real :: dest ! Any kind and rank
    integer(4) :: pe

type(c_devptr) function nvshmem_ptrc(dest, pe)
    complex :: dest ! Any kind and rank
    integer(4) :: pe

type(c_devptr) function nvshmem_ptrc1(dest, pe)
    character :: dest ! Any kind and rank
    integer(4) :: pe

type(c_devptr) function nvshmem_ptrcd(dest, pe)
    type(c_devptr) :: dest
    integer(4) :: pe

NVSHMEM Memory Management Functions

This section contains the Fortran interfaces to NVSHMEM functions used to manage the symmetric heap.

nvshmem_malloc

This function allocates a block containing the specified number of bytes from the symmetric heap. This routine is a collective operation and requires participation by all PEs.

type(c_devptr) function nvshmem_malloc(size)
    integer(8) :: size  ! Size is in bytes

Entities of type(c_devptr) can be cast as Fortran arrays in a few ways. Here are some examples:

use nvshmem
! Contiguous will avoid some runtime checks
real(8), device, pointer, contiguous :: array(:)
. . .
call c_f_pointer(nvshmem_malloc(N*8), array, [N])

use nvshmem
! Cray Pointer
real(8), device :: array(N); pointer(pa,array)
. . .
pa = transfer(nvshmem_malloc(N*8), pa)

nvshmem_free

This subroutine frees a block of symmetric data which was previously allocated.

subroutine nvshmem_free(ptr)
    ! ptr can be of type(c_devptr), or other types if it was cast to a Fortran
    ! array using the techniques described in the nvshmem_malloc section.

nvshmem_align

This function allocates a block from the symmetric heap that has a byte alignment specified by the alignment argument.

type(c_devptr) function nvshmem_align(alignment, size)
    integer(8) :: alignment
    integer(8) :: size  ! Size is in bytes

nvshmem_calloc

This function allocates a block containing the specified number of bytes from the symmetric heap. This routine is a collective operation and requires participation by all PEs. The space is also initialized to zero.

type(c_devptr) function nvshmem_calloc(size)
    integer(8) :: size  ! Size is in bytes

nvshmemx_buffer_register

This function registers the given buffer with the remote transport and with CUDA for subsequent nvshmem operations. The address passed as the first argument can be of any type, with host or device attributes.

integer(4) function nvshmemx_buffer_register(addr, len)
    real(4) :: addr(*)  ! Any type is accepted
    integer(8) :: len   ! Size is in bytes

nvshmemx_buffer_unregister

This function unregisters the buffer which was previously registered with the nvshmemx_buffer_register library function.

integer(4) function nvshmemx_buffer_unregister(addr)
    real(4) :: addr(*)  ! Any type is accepted

nvshmemx_buffer_unregister_all

This subroutine unregisters all buffers which were previously registered with the nvshmemx_buffer_register library function.

subroutine nvshmemx_buffer_unregister_all()

NVSHMEM Remote Memory Access Functions

This section contains the Fortran interfaces to NVSHMEM functions used to perform reads and writes to symmetric data objects. The CUDA C library contains a number of functions for each C type. We have tried to distill those down into a useful, but non-redundant set for Fortran programmers. In addition, we have provided the following generic interfaces which are overloaded to take multiple types:

nvshmem_put
nvshmem_p
nvshmem_iput
nvshmem_put_nbi
nvshmemx_put_block
nvshmemx_put_warp
nvshmem_get
nvshmem_g
nvshmem_iget
nvshmem_get_nbi
nvshmemx_get_block
nvshmemx_get_warp

Many of these functions are available on both the host and device. Certain programming models may not currently support generic functions on the device. Some of the functions are only available on the device (most notably, those performed by a whole block or whole warp).

nvshmem_put

This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_put is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.

subroutine nvshmem_putmem(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_putmem_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int8_put(dest, source, nelems, pe)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int8_put_on_stream(dest, source, nelems, pe, stream)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int16_put(dest, source, nelems, pe)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int16_put_on_stream(dest, source, nelems, pe, stream)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int32_put(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int32_put_on_stream(dest, source, nelems, pe, stream)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int64_put(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int64_put_on_stream(dest, source, nelems, pe, stream)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_float_put(dest, source, nelems, pe)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_float_put_on_stream(dest, source, nelems, pe, stream)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_double_put(dest, source, nelems, pe)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_double_put_on_stream(dest, source, nelems, pe, stream)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_complex_put(dest, source, nelems, pe)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_complex_put_on_stream(dest, source, nelems, pe, stream)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_dcomplex_put(dest, source, nelems, pe)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_dcomplex_put_on_stream(dest, source, nelems, pe, stream)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

The following nvshmem put subroutines are not part of the generic nvshmem_put group, but are provided for flexibility and for compatibility with the C names:

subroutine nvshmem_put8(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put8_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_put16(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put16_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_put32(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put32_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_put64(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put64_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_put128(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put128_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

nvshmem_p

This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_p is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device, and the source is passed by value and should be host-resident or device-resident, respectively. The specific names and argument lists are below.

subroutine nvshmem_int8_p(dest, source, nelems, pe)
    integer(1), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int8_p_on_stream(dest, source, nelems, pe, stream)
    integer(1), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int16_p(dest, source, nelems, pe)
    integer(2), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int16_p_on_stream(dest, source, nelems, pe, stream)
    integer(2), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int32_p(dest, source, nelems, pe)
    integer(4), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int32_p_on_stream(dest, source, nelems, pe, stream)
    integer(4), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int64_p(dest, source, nelems, pe)
    integer(8), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int64_p_on_stream(dest, source, nelems, pe, stream)
    integer(8), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_float_p(dest, source, nelems, pe)
    real(4), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_float_p_on_stream(dest, source, nelems, pe, stream)
    real(4), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_double_p(dest, source, nelems, pe)
    real(8), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_double_p_on_stream(dest, source, nelems, pe, stream)
    real(8), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

nvshmem_iput

This subroutine provides a way to copy strided data elements to a destination. This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_iput is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.

subroutine nvshmem_int8_iput(dest, source, dst, sst, nelems, pe)
    integer(1), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_int8_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
    integer(1), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int16_iput(dest, source, dst, sst, nelems, pe)
    integer(2), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_int16_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
    integer(2), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int32_iput(dest, source, dst, sst, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_int32_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
    integer(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int64_iput(dest, source, dst, sst, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_int64_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
    integer(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_float_iput(dest, source, dst, sst, nelems, pe)
    real(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_float_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
    real(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_double_iput(dest, source, dst, sst, nelems, pe)
    real(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_double_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
    real(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_complex_iput(dest, source, dst, sst, nelems, pe)
    complex(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_complex_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
    complex(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_dcomplex_iput(dest, source, dst, sst, nelems, pe)
    complex(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_dcomplex_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
    complex(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

The following nvshmem iput subroutines are not part of the generic nvshmem_iput group, but are provided for flexibility and for compatibility with the C names:

subroutine nvshmem_iput8(dest, source, dst, sst, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_iput8_on_stream(dest, source, dst, sst, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_iput16(dest, source, dst, sst, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_iput16_on_stream(dest, source, dst, sst, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_iput32(dest, source, dst, sst, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_iput32_on_stream(dest, source, dst, sst, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_iput64(dest, source, dst, sst, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_iput64_on_stream(dest, source, dst, sst, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_iput128(dest, source, dst, sst, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_iput128_on_stream(dest, source, dst, sst, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

nvshmem_put_nbi

This subroutine returns after initiating the put operation. The subroutine nvshmem_put_nbi is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.

subroutine nvshmem_int8_put_nbi(dest, source, nelems, pe)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int8_put_nbi_on_stream(dest, source, nelems, pe, stream)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int16_put_nbi(dest, source, nelems, pe)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int16_put_nbi_on_stream(dest, source, nelems, pe, stream)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int32_put_nbi(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int32_put_nbi_on_stream(dest, source, nelems, pe, stream)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int64_put_nbi(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int64_put_nbi_on_stream(dest, source, nelems, pe, stream)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_float_put_nbi(dest, source, nelems, pe)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_float_put_nbi_on_stream(dest, source, nelems, pe, stream)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_double_put_nbi(dest, source, nelems, pe)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_double_put_nbi_on_stream(dest, source, nelems, pe, stream)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_complex_put_nbi(dest, source, nelems, pe)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_complex_put_nbi_on_stream(dest, source, nelems, pe, stream)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_dcomplex_put_nbi(dest, source, nelems, pe)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_dcomplex_put_nbi_on_stream(dest, source, nelems, pe, stream)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

The following nvshmem put_nbi subroutines are not part of the generic nvshmem_put_nbi group, but are provided for flexibility and for compatibility with the C names:

subroutine nvshmem_put8_nbi(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put8_nbi_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_put16_nbi(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put16_nbi_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_put32_nbi(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put32_nbi_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_put64_nbi(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put64_nbi_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_put128_nbi(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put128_nbi_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

nvshmemx_put_block

This subroutine returns after the data has been copied out of the source array on the local PE. It is only available from device code. The subroutine nvshmemx_put_block is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.

subroutine nvshmemx_putmem_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int8_put_block(dest, source, nelems, pe)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int16_put_block(dest, source, nelems, pe)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int32_put_block(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int64_put_block(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_fp16_put_block(dest, source, nelems, pe)
    real(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_float_put_block(dest, source, nelems, pe)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_double_put_block(dest, source, nelems, pe)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_complex_put_block(dest, source, nelems, pe)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_dcomplex_put_block(dest, source, nelems, pe)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

The following nvshmem put block subroutines are not part of the generic nvshmemx_put_block group, but are provided for flexibility and for compatibility with the C names:

subroutine nvshmemx_int_put_block(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_long_put_block(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put8_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put16_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put32_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put64_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put128_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

nvshmemx_put_warp

This subroutine returns after the data has been copied out of the source array on the local PE. It is only available from device code. The subroutine nvshmemx_put_warp is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.

subroutine nvshmemx_putmem_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int8_put_warp(dest, source, nelems, pe)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int16_put_warp(dest, source, nelems, pe)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int32_put_warp(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int64_put_warp(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_fp16_put_warp(dest, source, nelems, pe)
    real(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_float_put_warp(dest, source, nelems, pe)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_double_put_warp(dest, source, nelems, pe)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_complex_put_warp(dest, source, nelems, pe)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_dcomplex_put_warp(dest, source, nelems, pe)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

The following nvshmem put warp subroutines are not part of the generic nvshmemx_put_warp group, but are provided for flexibility and for compatibility with the C names:

subroutine nvshmemx_int_put_warp(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_long_put_warp(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put8_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put16_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put32_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put64_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_put128_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

nvshmem_get

This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_get is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.

subroutine nvshmem_getmem(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_getmem_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int8_get(dest, source, nelems, pe)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int8_get_on_stream(dest, source, nelems, pe, stream)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int16_get(dest, source, nelems, pe)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int16_get_on_stream(dest, source, nelems, pe, stream)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int32_get(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int32_get_on_stream(dest, source, nelems, pe, stream)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int64_get(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int64_get_on_stream(dest, source, nelems, pe, stream)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_float_get(dest, source, nelems, pe)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_float_get_on_stream(dest, source, nelems, pe, stream)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_double_get(dest, source, nelems, pe)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_double_get_on_stream(dest, source, nelems, pe, stream)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_complex_get(dest, source, nelems, pe)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_complex_get_on_stream(dest, source, nelems, pe, stream)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_dcomplex_get(dest, source, nelems, pe)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_dcomplex_get_on_stream(dest, source, nelems, pe, stream)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

The following nvshmem get subroutines are not part of the generic nvshmem_get group, but are provided for flexibility and for compatibility with the C names:

subroutine nvshmem_get8(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get8_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_get16(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get16_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_get32(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get32_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_get64(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get64_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_get128(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get128_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

nvshmem_g

This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_g is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device, and the source is passed by value and should be host-resident or device-resident, respectively. The specific names and argument lists are below.

subroutine nvshmem_int8_g(dest, source, nelems, pe)
    integer(1), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int8_g_on_stream(dest, source, nelems, pe, stream)
    integer(1), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int16_g(dest, source, nelems, pe)
    integer(2), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int16_g_on_stream(dest, source, nelems, pe, stream)
    integer(2), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int32_g(dest, source, nelems, pe)
    integer(4), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int32_g_on_stream(dest, source, nelems, pe, stream)
    integer(4), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int64_g(dest, source, nelems, pe)
    integer(8), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int64_g_on_stream(dest, source, nelems, pe, stream)
    integer(8), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_float_g(dest, source, nelems, pe)
    real(4), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_float_g_on_stream(dest, source, nelems, pe, stream)
    real(4), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_double_g(dest, source, nelems, pe)
    real(8), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_double_g_on_stream(dest, source, nelems, pe, stream)
    real(8), device :: dest(*), source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

nvshmem_iget

This subroutine provides a way to copy strided data elements to a destination. This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_iget is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.

subroutine nvshmem_int8_iget(dest, source, dst, sst, nelems, pe)
    integer(1), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_int8_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
    integer(1), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int16_iget(dest, source, dst, sst, nelems, pe)
    integer(2), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_int16_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
    integer(2), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int32_iget(dest, source, dst, sst, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_int32_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
    integer(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int64_iget(dest, source, dst, sst, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_int64_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
    integer(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_float_iget(dest, source, dst, sst, nelems, pe)
    real(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_float_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
    real(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_double_iget(dest, source, dst, sst, nelems, pe)
    real(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_double_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
    real(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_complex_iget(dest, source, dst, sst, nelems, pe)
    complex(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_complex_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
    complex(4), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_dcomplex_iget(dest, source, dst, sst, nelems, pe)
    complex(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_dcomplex_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
    complex(8), device :: dest(*), source(*)
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

The following nvshmem iget subroutines are not part of the generic nvshmem_iget group, but are provided for flexibility and for compatibility with the C names:

subroutine nvshmem_iget8(dest, source, dst, sst, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_iget8_on_stream(dest, source, dst, sst, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_iget16(dest, source, dst, sst, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_iget16_on_stream(dest, source, dst, sst, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_iget32(dest, source, dst, sst, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_iget32_on_stream(dest, source, dst, sst, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_iget64(dest, source, dst, sst, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_iget64_on_stream(dest, source, dst, sst, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_iget128(dest, source, dst, sst, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe

subroutine nvshmemx_iget128_on_stream(dest, source, dst, sst, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: dst, sst, nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

nvshmem_get_nbi

This subroutine returns after initiating the get operation. The subroutine nvshmem_get_nbi is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.

subroutine nvshmem_int8_get_nbi(dest, source, nelems, pe)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int8_get_nbi_on_stream(dest, source, nelems, pe, stream)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int16_get_nbi(dest, source, nelems, pe)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int16_get_nbi_on_stream(dest, source, nelems, pe, stream)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int32_get_nbi(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int32_get_nbi_on_stream(dest, source, nelems, pe, stream)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int64_get_nbi(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int64_get_nbi_on_stream(dest, source, nelems, pe, stream)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_float_get_nbi(dest, source, nelems, pe)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_float_get_nbi_on_stream(dest, source, nelems, pe, stream)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_double_get_nbi(dest, source, nelems, pe)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_double_get_nbi_on_stream(dest, source, nelems, pe, stream)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_complex_get_nbi(dest, source, nelems, pe)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_complex_get_nbi_on_stream(dest, source, nelems, pe, stream)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_dcomplex_get_nbi(dest, source, nelems, pe)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_dcomplex_get_nbi_on_stream(dest, source, nelems, pe, stream)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

The following nvshmem get_nbi subroutines are not part of the generic nvshmem_get_nbi group, but are provided for flexibility and for compatibility with the C names:

subroutine nvshmem_get8_nbi(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get8_nbi_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_get16_nbi(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get16_nbi_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_get32_nbi(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get32_nbi_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_get64_nbi(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get64_nbi_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_get128_nbi(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get128_nbi_on_stream(dest, source, nelems, pe, stream)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe
    integer(cuda_stream_kind) :: stream

nvshmemx_get_block

This subroutine returns after the data has been copied out of the source array on the local PE. It is only available from device code. The subroutine nvshmemx_get_block is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.

subroutine nvshmemx_getmem_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int8_get_block(dest, source, nelems, pe)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int16_get_block(dest, source, nelems, pe)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int32_get_block(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int64_get_block(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_fp16_get_block(dest, source, nelems, pe)
    real(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_float_get_block(dest, source, nelems, pe)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_double_get_block(dest, source, nelems, pe)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_complex_get_block(dest, source, nelems, pe)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_dcomplex_get_block(dest, source, nelems, pe)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

The following nvshmem get block subroutines are not part of the generic nvshmemx_get_block group, but are provided for flexibility and for compatibility with the C names:

subroutine nvshmemx_int_get_block(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_long_get_block(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get8_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get16_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get32_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get64_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get128_block(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

nvshmemx_get_warp

This subroutine returns after the data has been copied out of the source array on the local PE. It is only available from device code. The subroutine nvshmemx_get_warp is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.

subroutine nvshmemx_getmem_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int8_get_warp(dest, source, nelems, pe)
    integer(1), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int16_get_warp(dest, source, nelems, pe)
    integer(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int32_get_warp(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_int64_get_warp(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_fp16_get_warp(dest, source, nelems, pe)
    real(2), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_float_get_warp(dest, source, nelems, pe)
    real(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_double_get_warp(dest, source, nelems, pe)
    real(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_complex_get_warp(dest, source, nelems, pe)
    complex(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_dcomplex_get_warp(dest, source, nelems, pe)
    complex(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

The following nvshmem get warp subroutines are not part of the generic nvshmemx_get_warp group, but are provided for flexibility and for compatibility with the C names:

subroutine nvshmemx_int_get_warp(dest, source, nelems, pe)
    integer(4), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_long_get_warp(dest, source, nelems, pe)
    integer(8), device :: dest(*), source(*)
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get8_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get16_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get32_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get64_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

subroutine nvshmemx_get128_warp(dest, source, nelems, pe)
    type(c_devptr) :: dest, source
    integer(8) :: nelems
    integer(4) :: pe

NVSHMEM Collective Communication Functions

This section contains the Fortran interfaces to NVSHMEM functions that perform coordinated communication or synchronization operations within a group of PEs. The section can be further divided between barrier and sync functions, all-to-all, broadcast, and collect functions, and reductions.

nvshmem_barrier, nvshmem_barrier_all

These subroutines perform a collective synchronization over all (nvshmem_barrier_all) or a provided subset (nvshmem_barrier) of PEs. Ordering APIs initiated on the CPU only order communication operations that were issued from the CPU. Use cudaDeviceSynchronize() or something similar to ensure GPU operations have completed. The list of subroutine names and argument lists are below.

subroutine nvshmem_barrier_all()

subroutine nvshmemx_barrier_all_on_stream(stream)
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_barrier(pe_start, pe_stride, pe_size, psync)
    integer(4) :: pe_start, pe_stride, pe_size
    integer(8), device :: psync(*)

subroutine nvshmemx_barrier_on_stream(pe_start, pe_stride, pe_size, psync, stream)
    integer(4) :: pe_start, pe_stride, pe_size
    integer(8), device :: psync(*)
    integer(cuda_stream_kind) :: stream

nvshmem_sync, nvshmem_sync_all

These subroutines perform a collective synchronization over all (nvshmem_sync_all) or a provided subset (nvshmem_sync) of PEs. Unlike the barrier routines, these subroutines only ensure completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates. The list of subroutine names and argument lists are below.

subroutine nvshmem_sync_all()

subroutine nvshmemx_sync_all_on_stream(stream)
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_sync(pe_start, pe_stride, pe_size, psync)
    integer(4) :: pe_start, pe_stride, pe_size
    integer(8), device :: psync(*)

subroutine nvshmemx_sync_on_stream(pe_start, pe_stride, pe_size, psync, stream)
    integer(4) :: pe_start, pe_stride, pe_size
    integer(8), device :: psync(*)
    integer(cuda_stream_kind) :: stream

nvshmem_alltoall

These functions perform a collective all-to-all operation over a team. Starting in nvshmem version 2.0, the specific names for collective operations take a team argument and are specific to the type. These functions exchange the specified number of data elements with all other PEs in the team. These generic names are supported in the Fortran interfaces: nvshmem_alltoall, nvshmemx_alltoall_block, and nvshmemx_alltoall_warp. The nvshmem_alltoall functions are callable from host or device, the nvshmemx_alltoall_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.

integer function nvshmem_int8_alltoall(team, dest, source, nelems)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems

integer function nvshmem_int16_alltoall(team, dest, source, nelems)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems

integer function nvshmem_int32_alltoall(team, dest, source, nelems)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmem_int64_alltoall(team, dest, source, nelems)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems

integer function nvshmem_float_alltoall(team, dest, source, nelems)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmem_double_alltoall(team, dest, source, nelems)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int8_alltoall_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int16_alltoall_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int32_alltoall_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int64_alltoall_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_float_alltoall_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_double_alltoall_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int8_alltoall_block(team, dest, source, nelems)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int16_alltoall_block(team, dest, source, nelems)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int32_alltoall_block(team, dest, source, nelems)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int64_alltoall_block(team, dest, source, nelems)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_float_alltoall_block(team, dest, source, nelems)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_double_alltoall_block(team, dest, source, nelems)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int8_alltoall_warp(team, dest, source, nelems)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int16_alltoall_warp(team, dest, source, nelems)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int32_alltoall_warp(team, dest, source, nelems)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int64_alltoall_warp(team, dest, source, nelems)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_float_alltoall_warp(team, dest, source, nelems)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_double_alltoall_warp(team, dest, source, nelems)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems

nvshmem_broadcast

These functions perform a collective broadcast operation over a team. Starting in nvshmem version 2.0, the specific names for collective operations take a team argument and are specific to the type. These functions send the specified number of elements of source data from the specified root to all other PEs in the team. These generic names are supported in the Fortran interfaces: nvshmem_broadcast, nvshmemx_broadcast_block, and nvshmemx_broadcast_warp. The nvshmem_broadcast functions are callable from host or device, the nvshmemx_broadcast_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.

integer function nvshmem_int8_broadcast(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmem_int16_broadcast(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmem_int32_broadcast(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmem_int64_broadcast(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmem_float_broadcast(team, dest, source, nelems, pe_root)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmem_double_broadcast(team, dest, source, nelems, pe_root)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_int8_broadcast_on_stream(team, &
        dest, source, nelems, pe_root, stream)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int16_broadcast_on_stream(team, &
        dest, source, nelems, pe_root, stream)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int32_broadcast_on_stream(team, &
        dest, source, nelems, pe_root, stream)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int64_broadcast_on_stream(team, &
        dest, source, nelems, pe_root, stream)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_float_broadcast_on_stream(team, &
        dest, source, nelems, pe_root, stream)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_double_broadcast_on_stream(team, &
        dest, source, nelems, pe_root, stream)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int8_broadcast_block(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_int16_broadcast_block(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_int32_broadcast_block(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_int64_broadcast_block(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_float_broadcast_block(team, dest, source, nelems, pe_root)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_double_broadcast_block(team, dest, source, nelems, pe_root)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_int8_broadcast_warp(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_int16_broadcast_warp(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_int32_broadcast_warp(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_int64_broadcast_warp(team, dest, source, nelems, pe_root)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_float_broadcast_warp(team, dest, source, nelems, pe_root)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

integer function nvshmemx_double_broadcast_warp(team, dest, source, nelems, pe_root)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems
    integer(4) :: pe_root

nvshmem_collect

These functions perform a collective operation to concatenate the specified number of elements from each source array into the dest array for each PE in the team. Starting in nvshmem version 2.0, the specific names for collective operations take a team argument and are specific to the type. The collected data is in order of the PE in the team, and nelems can vary from PE to PE. These generic names are supported in the Fortran interfaces: nvshmem_collect, nvshmemx_collect_block, and nvshmemx_collect_warp. The nvshmem_collect functions are callable from host or device, the nvshmemx_collect_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.

integer function nvshmem_int8_collect(team, dest, source, nelems)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems

integer function nvshmem_int16_collect(team, dest, source, nelems)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems

integer function nvshmem_int32_collect(team, dest, source, nelems)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmem_int64_collect(team, dest, source, nelems)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems

integer function nvshmem_float_collect(team, dest, source, nelems)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmem_double_collect(team, dest, source, nelems)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int8_collect_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int16_collect_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int32_collect_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int64_collect_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_float_collect_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_double_collect_on_stream(team, &
        dest, source, nelems, stream)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int8_collect_block(team, dest, source, nelems)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int16_collect_block(team, dest, source, nelems)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int32_collect_block(team, dest, source, nelems)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int64_collect_block(team, dest, source, nelems)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_float_collect_block(team, dest, source, nelems)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_double_collect_block(team, dest, source, nelems)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int8_collect_warp(team, dest, source, nelems)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int16_collect_warp(team, dest, source, nelems)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int32_collect_warp(team, dest, source, nelems)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_int64_collect_warp(team, dest, source, nelems)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_float_collect_warp(team, dest, source, nelems)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nelems

integer function nvshmemx_double_collect_warp(team, dest, source, nelems)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nelems

NVSHMEM Reductions

This section contains the Fortran interfaces to NVSHMEM functions that perform reductions, which are synchronization operations within a group of PEs performing a bitwise or arithmetic operation, reducing a set of values down to one.

nvshmem_and_reduce

These functions perform a bitwise AND reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for the reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_and_reduce, nvshmemx_and_reduce_block, and nvshmemx_and_reduce_warp. The nvshmem_and_reduce functions are callable from host or device, the nvshmemx_and_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.

integer function nvshmem_int8_and_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int16_and_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int32_and_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int64_and_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_and_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int16_and_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int32_and_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int64_and_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int8_and_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_and_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_and_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_and_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_and_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_and_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_and_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_and_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

nvshmem_or_reduce

These functions perform a bitwise OR reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for the reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_or_reduce, nvshmemx_or_reduce_block, and nvshmemx_or_reduce_warp. The nvshmem_or_reduce functions are callable from host or device, the nvshmemx_or_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.

integer function nvshmem_int8_or_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int16_or_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int32_or_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int64_or_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_or_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int16_or_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int32_or_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int64_or_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int8_or_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_or_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_or_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_or_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_or_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_or_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_or_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_or_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

nvshmem_xor_reduce

These functions perform a bitwise XOR reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for the reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_xor_reduce, nvshmemx_xor_reduce_block, and nvshmemx_xor_reduce_warp. The nvshmem_xor_reduce functions are callable from host or device, the nvshmemx_xor_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.

integer function nvshmem_int8_xor_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int16_xor_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int32_xor_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int64_xor_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_xor_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int16_xor_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int32_xor_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int64_xor_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int8_xor_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_xor_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_xor_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_xor_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_xor_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_xor_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_xor_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_xor_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

nvshmem_max_reduce

These functions perform a maximum value, MAX, reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_max_reduce, nvshmemx_max_reduce_block, and nvshmemx_max_reduce_warp. The nvshmem_max_reduce functions are callable from host or device, the nvshmemx_max_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.

integer function nvshmem_int8_max_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int16_max_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int32_max_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int64_max_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_float_max_reduce(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_double_max_reduce(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_max_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int16_max_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int32_max_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int64_max_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_float_max_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_double_max_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int8_max_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_max_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_max_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_max_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_float_max_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_double_max_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_max_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_max_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_max_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_max_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_float_max_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_double_max_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

nvshmem_min_reduce

These functions perform a minimum value, MIN, reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_min_reduce, nvshmemx_min_reduce_block, and nvshmemx_min_reduce_warp. The nvshmem_min_reduce functions are callable from host or device, the nvshmemx_min_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.

integer function nvshmem_int8_min_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int16_min_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int32_min_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int64_min_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_float_min_reduce(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_double_min_reduce(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_min_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int16_min_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int32_min_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int64_min_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_float_min_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_double_min_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int8_min_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_min_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_min_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_min_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_float_min_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_double_min_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_min_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_min_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_min_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_min_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_float_min_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_double_min_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

nvshmem_sum_reduce

These functions perform a summation, or SUM, reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_sum_reduce, nvshmemx_sum_reduce_block, and nvshmemx_sum_reduce_warp. The nvshmem_sum_reduce functions are callable from host or device, the nvshmemx_sum_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.

integer function nvshmem_int8_sum_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int16_sum_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int32_sum_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int64_sum_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_float_sum_reduce(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_double_sum_reduce(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_sum_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int16_sum_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int32_sum_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int64_sum_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_float_sum_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_double_sum_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int8_sum_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_sum_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_sum_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_sum_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_float_sum_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_double_sum_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_sum_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_sum_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_sum_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_sum_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_float_sum_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_double_sum_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

nvshmem_prod_reduce

These functions perform a product reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_prod_reduce, nvshmemx_prod_reduce_block, and nvshmemx_prod_reduce_warp. The nvshmem_prod_reduce functions are callable from host or device, the nvshmemx_prod_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.

integer function nvshmem_int8_prod_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int16_prod_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int32_prod_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_int64_prod_reduce(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_float_prod_reduce(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmem_double_prod_reduce(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_prod_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int16_prod_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int32_prod_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int64_prod_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_float_prod_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_double_prod_reduce_on_stream(team, &
        dest, source, nreduce, stream)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce
    integer(cuda_stream_kind) :: stream

integer function nvshmemx_int8_prod_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_prod_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_prod_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_prod_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_float_prod_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_double_prod_reduce_block(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int8_prod_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(1), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int16_prod_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(2), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int32_prod_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_int64_prod_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    integer(8), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_float_prod_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    real(4), device :: dest, source
    integer(8) :: nreduce

integer function nvshmemx_double_prod_reduce_warp(team, dest, source, nreduce)
    integer(4) :: team
    real(8), device :: dest, source
    integer(8) :: nreduce

NVSHMEM Point to Point Synchronization Functions

This section contains the Fortran interfaces to NVSHMEM functions that provide a mechanism for synchronization between two PEs based on a value in the symmetric memory.

nvshmem_wait_until

This subroutine blocks until the value contained in the symmetric data object at the calling PE satisfies the condition specified by the comparison operator and the comparison value. The subroutine nvshmem_wait_until is overloaded to take a number of different sets of arguments. The specific names and argument lists are below. The variants that take a stream are available from host code. The variants that do not take a stream argument are device code only.

subroutine nvshmem_int32_wait_until(ivar, cmp, cvalue)
    integer(4), device :: ivar
    integer(4) :: cmp, cvalue

subroutine nvshmemx_int32_wait_until_on_stream(ivar, cmp, cvalue, stream)
    integer(4), device :: ivar
    integer(4) :: cmp, cvalue
    integer(cuda_stream_kind) :: stream

subroutine nvshmem_int64_wait_until(ivar, cmp, cvalue)
    integer(8), device :: ivar
    integer(4) :: cmp
    integer(8) :: cvalue

subroutine nvshmemx_int64_wait_until_on_stream(ivar, cmp, cvalue, stream)
    integer(8), device :: ivar
    integer(4) :: cmp
    integer(8) :: cvalue
    integer(cuda_stream_kind) :: stream

nvshmem_wait_until_all

These device subroutines block until all values contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The subroutine nvshmem_wait_until_all is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

subroutine nvshmem_int32_wait_until_all(ivars, nelems, status, cmp, cvalue)
    integer(4), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

subroutine nvshmem_int64_wait_until_all(ivars, nelems, status, cmp, cvalue)
    integer(8), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_wait_until_any

These device functions block until any one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The value returned is the index in the array that satisfies the condition. The function nvshmem_wait_until_any is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

integer(8) function nvshmem_int32_wait_until_any(ivars, nelems, status, cmp, cvalue)
    integer(4), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

integer(8) function nvshmem_int64_wait_until_any(ivars, nelems, status, cmp, cvalue)
    integer(8), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_wait_until_some

These device functions wait until at least one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The indices array, upon return, contains those indices which satisfy the condition. The function return value is the number of elements that satisfied the condition. The function nvshmem_wait_until_some is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

integer(8) function nvshmem_int32_wait_until_some(ivars, nelems, indices, status, cmp, cvalue)
    integer(4), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(8), intent(out) :: indices(nelems)
    integer(4) :: status(*)
    integer(4) :: cmp

integer(8) function nvshmem_int64_wait_until_some(ivars, nelems, indices, status, cmp, cvalue)
    integer(8), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(8), intent(out) :: indices(nelems)
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_wait_until_all_vector

These device subroutines block until all values contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The subroutine nvshmem_wait_until_all_vector is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

subroutine nvshmem_int32_wait_until_all_vector(ivars, nelems, status, cmp, cvalues)
    integer(4), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

subroutine nvshmem_int64_wait_until_all_vector(ivars, nelems, status, cmp, cvalues)
    integer(8), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_wait_until_any_vector

These device functions block until any one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The value returned is the index in the array that satisfies the condition. The function nvshmem_wait_until_any_vector is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

integer(8) function nvshmem_int32_wait_until_any_vector(ivars, nelems, status, cmp, cvalues)
    integer(4), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

integer(8) function nvshmem_int64_wait_until_any_vector(ivars, nelems, status, cmp, cvalues)
    integer(8), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_wait_until_some_vector

These device functions wait until at least one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The indices array, upon return, contains those indices which satisfy the condition. The function return value is the number of elements that satisfied the condition. The function nvshmem_wait_until_some_vector is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

integer(8) function nvshmem_int32_wait_until_some_vector(ivars, nelems, indices, status, cmp, cvalues)
    integer(4), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(8), intent(out) :: indices(nelems)
    integer(4) :: status(*)
    integer(4) :: cmp

integer(8) function nvshmem_int64_wait_until_some_vector(ivars, nelems, indices, status, cmp, cvalues)
    integer(8), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(8), intent(out) :: indices(nelems)
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_test

These device functions test whether the value contained in the symmetric data object at the calling PE satisfies the condition specified by the comparison operator and the comparison value. The functions return “1” if the comparison evaluates to true, “0” otherwise. The function nvshmem_test is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.

integer(4) function nvshmem_int32_test(ivar, cmp, cvalue)
    integer(4), device :: ivar, cvalue
    integer(4) :: cmp

integer(4) function nvshmem_int64_test(ivar, cmp, cvalue)
    integer(8), device :: ivar, cvalue
    integer(4) :: cmp

nvshmem_test_all

These device functions test all values contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The functions return “1” if all comparisons evaluate to true, “0” otherwise. The function nvshmem_test_all is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

integer(4) nvshmem_int32_test_all(ivars, nelems, status, cmp, cvalue)
    integer(4), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

integer(4) nvshmem_int64_test_all(ivars, nelems, status, cmp, cvalue)
    integer(8), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_test_any

These device functions test whether any value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The functions return “1” if any comparisons evaluate to true, “0” otherwise. The value returned is the index in the array that satisfies the condition. The function nvshmem_test_any is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

integer(8) function nvshmem_int32_test_any(ivars, nelems, status, cmp, cvalue)
    integer(4), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

integer(8) function nvshmem_int64_test_any(ivars, nelems, status, cmp, cvalue)
    integer(8), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_test_some

These device functions test whether at least one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The indices array, upon return, contains those indices which satisfy the condition. The function return value is the number of elements that satisfied the condition. The function nvshmem_test_some is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

integer(8) function nvshmem_int32_test_some(ivars, nelems, indices, status, cmp, cvalue)
    integer(4), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(8), intent(out) :: indices(nelems)
    integer(4) :: status(*)
    integer(4) :: cmp

integer(8) function nvshmem_int64_test_some(ivars, nelems, indices, status, cmp, cvalue)
    integer(8), device :: ivars(*), cvalue
    integer(8) :: nelems
    integer(8), intent(out) :: indices(nelems)
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_test_all_vector

These device functions test whether all values contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The functions return “1” if all comparisons evaluate to true, “0” otherwise. The function nvshmem_test_all_vector is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

subroutine nvshmem_int32_test_all_vector(ivars, nelems, status, cmp, cvalues)
    integer(4), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

subroutine nvshmem_int64_test_all_vector(ivars, nelems, status, cmp, cvalues)
    integer(8), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_test_any_vector

These device functions test wheter any one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The value returned is the index in the array that satisfies the condition. The function nvshmem_test_any_vector is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

integer(8) function nvshmem_int32_test_any_vector(ivars, nelems, status, cmp, cvalues)
    integer(4), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

integer(8) function nvshmem_int64_test_any_vector(ivars, nelems, status, cmp, cvalues)
    integer(8), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(4) :: status(*)
    integer(4) :: cmp

nvshmem_test_some_vector

These device functions test whether at least one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The indices array, upon return, contains those indices which satisfy the condition. The function return value is the number of elements that satisfied the condition. The function nvshmem_test_some_vector is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.

integer(8) function nvshmem_int32_test_some_vector(ivars, nelems, indices, status, cmp, cvalues)
    integer(4), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(8), intent(out) :: indices(nelems)
    integer(4) :: status(*)
    integer(4) :: cmp

integer(8) function nvshmem_int64_test_some_vector(ivars, nelems, indices, status, cmp, cvalues)
    integer(8), device :: ivars(*), cvalues(*)
    integer(8) :: nelems
    integer(8), intent(out) :: indices(nelems)
    integer(4) :: status(*)
    integer(4) :: cmp

NVSHMEM Memory Ordering Functions

This section contains the Fortran interfaces to NVSHMEM functions that provide mechanisms to ensure ordering and/or delivery of completion on NVSHMEM operations. Beginning with the 25.3 release, nvshmem_fence and nvshmem_quiet are also callable from Fortran device code.

nvshmem_fence

This subroutine ensures the ordering of delivery of operations on symmetric data objects.

subroutine nvshmem_fence()

nvshmem_quiet

These subroutines ensure completion of all operations on symmetric data objects issued by the calling PE.

subroutine nvshmem_quiet()

subroutine nvshmemx_quiet_on_stream(stream)
    integer(cuda_stream_kind) :: stream