NVSHMEM Communication Library APIs
This section describes the Fortran interfaces to the NVSHMEM library. NVSHMEM is a software library that implements the OpenSHMEM application programming interface (API) for clusters of NVIDIA GPUs. OpenSHMEM is a community standard, one-sided communication API that provides a partitioned global address space (PGAS) parallel programming model. NVSHMEM provides an easy-to-use host-side interface for allocating symmetric memory, which can be distributed across a cluster of NVIDIA GPUs interconnected with NVLink, PCIe, and InfiniBand. The NVSHMEM communication functions are accessible from both host and device code. Most of the runtime API routines are written as C void functions, and we have implemented their Fortran wrappers as subroutines.
The NVSHMEM interfaces and definitions described in this chapter can be exposed by adding the line
use nvshmem
to your program unit. The same module is used for both host and device code. Device functions which run on a thread block or warp are declared as acc vector nohost
routines. Others are acc seq
routines.
Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4) and the plain real type implies real(4).
NVSHMEM Definitions, Setup, Exit, and Query Functions
This section contains definitions and data types used in the NVSHMEM library and interfaces to the NVSHMEM initialization and access to the parallel environment of the PEs.
The Fortran NVSHMEM module contains the following constant and derived type definitions:
! These are not available to the user, internal only
! defines, from nvshmemx_api.h
#define INIT_HANDLE_BYTES 128
! defines, from nvshmem_constants.h
#define SYNC_SIZE 27648
! Constant Definitions
integer, parameter :: NVSHMEM_SYNC_VALUE = 0
integer, parameter :: NVSHMEM_SYNC_SIZE = (2 * SYNC_SIZE)
integer, parameter :: NVSHMEM_BARRIER_SYNC_SIZE = (2 * SYNC_SIZE)
integer, parameter :: NVSHMEM_BCAST_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_REDUCE_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_REDUCE_MIN_WRKDATA_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_COLLECT_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_ALLTOALL_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEMX_CMP_EQ = 0
integer, parameter :: NVSHMEMX_CMP_NE = 1
integer, parameter :: NVSHMEMX_CMP_GT = 2
integer, parameter :: NVSHMEMX_CMP_LE = 3
integer, parameter :: NVSHMEMX_CMP_LT = 4
integer, parameter :: NVSHMEMX_CMP_GE = 5
integer, parameter :: NVSHMEMX_THREAD_SINGLE = 0
integer, parameter :: NVSHMEMX_THREAD_FUNNELED = 1
integer, parameter :: NVSHMEMX_THREAD_SERIALIZED = 2
integer, parameter :: NVSHMEMX_THREAD_MULTIPLE = 3
integer, parameter :: NVSHMEM_TEAM_INVALID = -1
integer, parameter :: NVSHMEM_TEAM_WORLD = 0
integer, parameter :: NVSHMEM_TEAM_SHARED = 1
integer, parameter :: NVSHMEMX_TEAM_NODE = 2
integer, parameter :: NVSHMEMX_INIT_THREAD_PES = 1
integer, parameter :: NVSHMEMX_INIT_WITH_MPI_COMM = 2
integer, parameter :: NVSHMEMX_INIT_WITH_SHMEM = 4
integer, parameter :: NVSHMEMX_INIT_WITH_HANDLE = 8
! Types from nvshmemx_api.h
type, bind(c) :: nvshmemx_init_handle
character(c_char) :: content(INIT_HANDLE_BYTES)
end type nvshmemx_init_handle
! Types from nvshmemx_api.h
type, bind(c) :: nvshmemx_init_attr_type
integer(8) heap_size
integer(4) num_threads
integer(4) n_pes
integer(4) my_pe
type(c_ptr) mpi_comm
type(nvshmemx_init_handle) handle
end type nvshmemx_init_attr_type
! Types from nvshmem_types.h
type, bind(c) :: nvshmem_team_config
integer(c_int) :: num_contexts
end type nvshmem_team_config
! nvshmemx_status, from nvshmem_error.h
type, bind(c) :: nvshmemx_status
integer(c_int) :: member
end type nvshmemx_status
type(nvshmemx_status), parameter :: &
NVSHMEMX_SUCCESS = nvshmemx_status(0), &
NVSHMEMX_ERROR_INVALID_VALUE = nvshmemx_status(1), &
NVSHMEMX_ERROR_OUT_OF_MEMORY = nvshmemx_status(2), &
NVSHMEMX_ERROR_NOT_SUPPORTED = nvshmemx_status(3), &
NVSHMEMX_ERROR_SYMMETRY = nvshmemx_status(4), &
NVSHMEMX_ERROR_GPU_NOT_SELECTED = nvshmemx_status(5), &
NVSHMEMX_ERROR_COLLECTIVE_LAUNCH_FAILED = nvshmemx_status(6), &
NVSHMEMX_ERROR_INTERNAL = nvshmemx_status(7)
nvshmem_init
This subroutine allocates and initializes resources used by the NVSHMEM library.
subroutine nvshmem_init()
nvshmemx_init_attr
This function initializes the NVSHMEM library based on an existing MPI communicator. Since the C and Fortran mpi_comm objects differ, this function has a different argument list than the corresponding C library entry point.
type(nvshmemx_status) function nvshmemx_init_attr(flags, comm)
integer(4) :: flags, comm
Here is an example of using this function with MPI
use nvshmem
type(nvshmemx_status) :: nvstat
. . .
! Setup MPI
call MPI_Init(ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierror)
!
nvstat = nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, MPI_COMM_WORLD)
nvshmem_my_pe
This function returns the PE number of the calling PE, a number between 0 and npes-1.
integer(4) function nvshmem_my_pe()
nvshmem_n_pes
This function returns the number of PEs running in the program.
integer(4) function nvshmem_n_pes()
nvshmem_team_my_pe
This function returns the PE number of the calling PE, within the specified team.
integer(4) function nvshmem_team_my_pe(team)
integer(4) :: team
nvshmem_team_n_pes
This function returns the number of PEs in the specified team.
integer(4) function nvshmem_team_n_pes(team)
integer(4) :: team
nvshmem_team_get_config
This function returns the configuration parameters as described by the mask in the config argument.
integer(4) function nvshmem_team_get_config(team, mask, config)
integer(4) :: team
integer(8) :: mask
type(nvshmem_team_config) :: config
nvshmem_team_translate_pe
This function returns the translated destination pe given the source team, the source pe, and the destination team.
integer(4) function nvshmem_team_translate_pe(src_team, src_pe, dest_team)
integer(4) :: src_team, src_pe, dest_team
nvshmem_team_split_strided
This function performs a collective operation and creates a new team given a parent team and a desired slice (start, stride, and size) from the parent team.
integer(4) function nvshmem_team_split_strided(parent_team, &
start, stride, size, config, mask, new_team)
integer(4) :: parent_team
integer(4) :: start, stride, size
type(nvshmem_team_config) :: config
integer(8) :: mask
integer(4) :: new_team
nvshmem_team_split_2d
This function performs a collective operation and creates two new teams given a parent team and a specification of the 2D space. The result is two teams containing the PEs which map to the 2D space’s row and column.
integer(4) function nvshmem_team_split_2d(parent_team, &
xrange, xaxis_config, xaxis_mask, xaxis_team, &
yaxis_config, yaxis_mask, yaxis_team)
integer(4) :: parent_team
integer(4) :: xrange
type(nvshmem_team_config) :: xaxis_config, yaxis_config
integer(8) :: xaxis_mask, yaxis_mask
integer(4), intent(out) :: xaxis_team, yaxis_team
nvshmem_team_destroy
This function is a collective operation which destroys the team and frees the resouces associated with it.
integer(4) function nvshmem_team_destroy(team)
integer(4) :: team
nvshmem_info_get_version
This subroutine returns the major and minor version number of the NVSHMEM library.
subroutine nvshmem_info_get_version(major, minor)
integer(4) :: major, minor
nvshmem_info_get_name
This subroutine returns the vendor-defined name string for the library.
subroutine nvshmem_info_get_name(name)
character*256, intent(out) :: name
nvshmem_finalize
This subroutine releases resources and ends the NVSHMEM portion of a program started with nvshmem_init().
subroutine nvshmem_finalize()
nvshmem_ptr
This function returns a local address that may be used to directly reference the destination data on the specified PE. The function nvshmem_ptr is implemented as a Fortran generic function, and can take any datatype, as long as it is a symmetric address.
type(c_devptr) function nvshmem_ptr(dest, pe)
! dest can be of type integer, logical, real, complex, character,
! or a type(c_devptr)
integer(4) :: pe
The following specific functions are also supported:
type(c_devptr) function nvshmem_ptri(dest, pe)
integer :: dest ! Any kind and rank
integer(4) :: pe
type(c_devptr) function nvshmem_ptrl(dest, pe)
logical :: dest ! Any kind and rank
integer(4) :: pe
type(c_devptr) function nvshmem_ptrr(dest, pe)
real :: dest ! Any kind and rank
integer(4) :: pe
type(c_devptr) function nvshmem_ptrc(dest, pe)
complex :: dest ! Any kind and rank
integer(4) :: pe
type(c_devptr) function nvshmem_ptrc1(dest, pe)
character :: dest ! Any kind and rank
integer(4) :: pe
type(c_devptr) function nvshmem_ptrcd(dest, pe)
type(c_devptr) :: dest
integer(4) :: pe
NVSHMEM Memory Management Functions
This section contains the Fortran interfaces to NVSHMEM functions used to manage the symmetric heap.
nvshmem_malloc
This function allocates a block containing the specified number of bytes from the symmetric heap. This routine is a collective operation and requires participation by all PEs.
type(c_devptr) function nvshmem_malloc(size)
integer(8) :: size ! Size is in bytes
Entities of type(c_devptr) can be cast as Fortran arrays in a few ways. Here are some examples:
use nvshmem
! Contiguous will avoid some runtime checks
real(8), device, pointer, contiguous :: array(:)
. . .
call c_f_pointer(nvshmem_malloc(N*8), array, [N])
use nvshmem
! Cray Pointer
real(8), device :: array(N); pointer(pa,array)
. . .
pa = transfer(nvshmem_malloc(N*8), pa)
nvshmem_free
This subroutine frees a block of symmetric data which was previously allocated.
subroutine nvshmem_free(ptr)
! ptr can be of type(c_devptr), or other types if it was cast to a Fortran
! array using the techniques described in the nvshmem_malloc section.
nvshmem_align
This function allocates a block from the symmetric heap that has a byte alignment specified by the alignment argument.
type(c_devptr) function nvshmem_align(alignment, size)
integer(8) :: alignment
integer(8) :: size ! Size is in bytes
nvshmem_calloc
This function allocates a block containing the specified number of bytes from the symmetric heap. This routine is a collective operation and requires participation by all PEs. The space is also initialized to zero.
type(c_devptr) function nvshmem_calloc(size)
integer(8) :: size ! Size is in bytes
nvshmemx_buffer_register
This function registers the given buffer with the remote transport and with CUDA for subsequent nvshmem operations. The address passed as the first argument can be of any type, with host or device attributes.
integer(4) function nvshmemx_buffer_register(addr, len)
real(4) :: addr(*) ! Any type is accepted
integer(8) :: len ! Size is in bytes
nvshmemx_buffer_unregister
This function unregisters the buffer which was previously registered with the nvshmemx_buffer_register library function.
integer(4) function nvshmemx_buffer_unregister(addr)
real(4) :: addr(*) ! Any type is accepted
nvshmemx_buffer_unregister_all
This subroutine unregisters all buffers which were previously registered with the nvshmemx_buffer_register library function.
subroutine nvshmemx_buffer_unregister_all()
NVSHMEM Remote Memory Access Functions
This section contains the Fortran interfaces to NVSHMEM functions used to perform reads and writes to symmetric data objects. The CUDA C library contains a number of functions for each C type. We have tried to distill those down into a useful, but non-redundant set for Fortran programmers. In addition, we have provided the following generic interfaces which are overloaded to take multiple types:
nvshmem_put
nvshmem_p
nvshmem_iput
nvshmem_put_nbi
nvshmemx_put_block
nvshmemx_put_warp
nvshmem_get
nvshmem_g
nvshmem_iget
nvshmem_get_nbi
nvshmemx_get_block
nvshmemx_get_warp
Many of these functions are available on both the host and device. Certain programming models may not currently support generic functions on the device. Some of the functions are only available on the device (most notably, those performed by a whole block or whole warp).
nvshmem_put
This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_put
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmem_putmem(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_putmem_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int8_put(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_put_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_put(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_put_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_put(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_put_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_put(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_put_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_put(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_put_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_put(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_put_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_put(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_put_on_stream(dest, source, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_put(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_put_on_stream(dest, source, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem put subroutines are not part of the generic nvshmem_put group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_put8(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put8_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put16(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put16_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put32(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put32_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put64(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put64_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put128(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put128_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
nvshmem_p
This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_p
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device, and the source is passed by value and should be host-resident or device-resident, respectively. The specific names and argument lists are below.
subroutine nvshmem_int8_p(dest, source, nelems, pe)
integer(1), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_p_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_p(dest, source, nelems, pe)
integer(2), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_p_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_p(dest, source, nelems, pe)
integer(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_p_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_p(dest, source, nelems, pe)
integer(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_p_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_p(dest, source, nelems, pe)
real(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_p_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_p(dest, source, nelems, pe)
real(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_p_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
nvshmem_iput
This subroutine provides a way to copy strided data elements to a destination. This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_iput
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.
subroutine nvshmem_int8_iput(dest, source, dst, sst, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int8_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_iput(dest, source, dst, sst, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int16_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_iput(dest, source, dst, sst, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int32_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_iput(dest, source, dst, sst, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int64_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_iput(dest, source, dst, sst, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_float_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_iput(dest, source, dst, sst, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_double_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_iput(dest, source, dst, sst, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_complex_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_iput(dest, source, dst, sst, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem iput subroutines are not part of the generic nvshmem_iput group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_iput8(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iput8_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iput16(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iput16_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iput32(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iput32_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iput64(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iput64_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iput128(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iput128_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
nvshmem_put_nbi
This subroutine returns after initiating the put operation. The subroutine nvshmem_put_nbi
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.
subroutine nvshmem_int8_put_nbi(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_put_nbi_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_put_nbi(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_put_nbi_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_put_nbi(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_put_nbi_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_put_nbi(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_put_nbi_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_put_nbi(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_put_nbi_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_put_nbi(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_put_nbi_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_put_nbi(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_put_nbi_on_stream(dest, source, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_put_nbi(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_put_nbi_on_stream(dest, source, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem put_nbi subroutines are not part of the generic nvshmem_put_nbi group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_put8_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put8_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put16_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put16_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put32_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put32_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put64_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put64_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put128_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put128_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
nvshmemx_put_block
This subroutine returns after the data has been copied out of the source array on the local PE. It is only available from device code. The subroutine nvshmemx_put_block
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmemx_putmem_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_put_block(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_put_block(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_put_block(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_put_block(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_fp16_put_block(dest, source, nelems, pe)
real(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_put_block(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_put_block(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_put_block(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_put_block(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
The following nvshmem put block subroutines are not part of the generic nvshmemx_put_block group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmemx_int_put_block(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_long_put_block(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put8_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put16_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put32_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put64_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put128_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
nvshmemx_put_warp
This subroutine returns after the data has been copied out of the source array on the local PE. It is only available from device code. The subroutine nvshmemx_put_warp
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmemx_putmem_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_put_warp(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_put_warp(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_put_warp(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_put_warp(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_fp16_put_warp(dest, source, nelems, pe)
real(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_put_warp(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_put_warp(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_put_warp(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_put_warp(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
The following nvshmem put warp subroutines are not part of the generic nvshmemx_put_warp group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmemx_int_put_warp(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_long_put_warp(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put8_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put16_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put32_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put64_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put128_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
nvshmem_get
This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_get
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmem_getmem(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_getmem_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int8_get(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_get_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_get(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_get_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_get(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_get_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_get(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_get_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_get(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_get_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_get(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_get_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_get(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_get_on_stream(dest, source, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_get(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_get_on_stream(dest, source, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem get subroutines are not part of the generic nvshmem_get group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_get8(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get8_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get16(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get16_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get32(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get32_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get64(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get64_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get128(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get128_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
nvshmem_g
This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_g
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device, and the source is passed by value and should be host-resident or device-resident, respectively. The specific names and argument lists are below.
subroutine nvshmem_int8_g(dest, source, nelems, pe)
integer(1), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_g_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_g(dest, source, nelems, pe)
integer(2), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_g_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_g(dest, source, nelems, pe)
integer(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_g_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_g(dest, source, nelems, pe)
integer(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_g_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_g(dest, source, nelems, pe)
real(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_g_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_g(dest, source, nelems, pe)
real(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_g_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
nvshmem_iget
This subroutine provides a way to copy strided data elements to a destination. This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_iget
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.
subroutine nvshmem_int8_iget(dest, source, dst, sst, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int8_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_iget(dest, source, dst, sst, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int16_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_iget(dest, source, dst, sst, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int32_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_iget(dest, source, dst, sst, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int64_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_iget(dest, source, dst, sst, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_float_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_iget(dest, source, dst, sst, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_double_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_iget(dest, source, dst, sst, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_complex_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_iget(dest, source, dst, sst, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem iget subroutines are not part of the generic nvshmem_iget group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_iget8(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iget8_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iget16(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iget16_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iget32(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iget32_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iget64(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iget64_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iget128(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iget128_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
nvshmem_get_nbi
This subroutine returns after initiating the get operation. The subroutine nvshmem_get_nbi
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.
subroutine nvshmem_int8_get_nbi(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_get_nbi_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_get_nbi(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_get_nbi_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_get_nbi(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_get_nbi_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_get_nbi(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_get_nbi_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_get_nbi(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_get_nbi_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_get_nbi(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_get_nbi_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_get_nbi(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_get_nbi_on_stream(dest, source, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_get_nbi(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_get_nbi_on_stream(dest, source, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem get_nbi subroutines are not part of the generic nvshmem_get_nbi group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_get8_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get8_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get16_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get16_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get32_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get32_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get64_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get64_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get128_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get128_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
nvshmemx_get_block
This subroutine returns after the data has been copied out of the source array on the local PE. It is only available from device code. The subroutine nvshmemx_get_block
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmemx_getmem_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_get_block(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_get_block(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_get_block(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_get_block(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_fp16_get_block(dest, source, nelems, pe)
real(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_get_block(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_get_block(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_get_block(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_get_block(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
The following nvshmem get block subroutines are not part of the generic nvshmemx_get_block group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmemx_int_get_block(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_long_get_block(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get8_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get16_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get32_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get64_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get128_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
nvshmemx_get_warp
This subroutine returns after the data has been copied out of the source array on the local PE. It is only available from device code. The subroutine nvshmemx_get_warp
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmemx_getmem_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_get_warp(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_get_warp(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_get_warp(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_get_warp(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_fp16_get_warp(dest, source, nelems, pe)
real(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_get_warp(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_get_warp(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_get_warp(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_get_warp(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
The following nvshmem get warp subroutines are not part of the generic nvshmemx_get_warp group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmemx_int_get_warp(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_long_get_warp(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get8_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get16_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get32_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get64_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get128_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
NVSHMEM Collective Communication Functions
This section contains the Fortran interfaces to NVSHMEM functions that perform coordinated communication or synchronization operations within a group of PEs. The section can be further divided between barrier and sync functions, all-to-all, broadcast, and collect functions, and reductions.
nvshmem_barrier, nvshmem_barrier_all
These subroutines perform a collective synchronization over all (nvshmem_barrier_all
) or a provided subset (nvshmem_barrier
) of PEs. Ordering APIs initiated on the CPU only order communication operations that were issued from the CPU. Use cudaDeviceSynchronize() or something similar to ensure GPU operations have completed. The list of subroutine names and argument lists are below.
subroutine nvshmem_barrier_all()
subroutine nvshmemx_barrier_all_on_stream(stream)
integer(cuda_stream_kind) :: stream
subroutine nvshmem_barrier(pe_start, pe_stride, pe_size, psync)
integer(4) :: pe_start, pe_stride, pe_size
integer(8), device :: psync(*)
subroutine nvshmemx_barrier_on_stream(pe_start, pe_stride, pe_size, psync, stream)
integer(4) :: pe_start, pe_stride, pe_size
integer(8), device :: psync(*)
integer(cuda_stream_kind) :: stream
nvshmem_sync, nvshmem_sync_all
These subroutines perform a collective synchronization over all (nvshmem_sync_all
) or a provided subset (nvshmem_sync
) of PEs. Unlike the barrier routines, these subroutines only ensure completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates. The list of subroutine names and argument lists are below.
subroutine nvshmem_sync_all()
subroutine nvshmemx_sync_all_on_stream(stream)
integer(cuda_stream_kind) :: stream
subroutine nvshmem_sync(pe_start, pe_stride, pe_size, psync)
integer(4) :: pe_start, pe_stride, pe_size
integer(8), device :: psync(*)
subroutine nvshmemx_sync_on_stream(pe_start, pe_stride, pe_size, psync, stream)
integer(4) :: pe_start, pe_stride, pe_size
integer(8), device :: psync(*)
integer(cuda_stream_kind) :: stream
nvshmem_alltoall
These functions perform a collective all-to-all operation over a team. Starting in nvshmem version 2.0, the specific names for collective operations take a team argument and are specific to the type. These functions exchange the specified number of data elements with all other PEs in the team. These generic names are supported in the Fortran interfaces: nvshmem_alltoall, nvshmemx_alltoall_block, and nvshmemx_alltoall_warp. The nvshmem_alltoall functions are callable from host or device, the nvshmemx_alltoall_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_alltoall(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int16_alltoall(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int32_alltoall(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int64_alltoall(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmem_float_alltoall(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmem_double_alltoall(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int8_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_alltoall_block(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int16_alltoall_block(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int32_alltoall_block(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int64_alltoall_block(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_float_alltoall_block(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_double_alltoall_block(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int8_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int16_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int32_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int64_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_float_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_double_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
nvshmem_broadcast
These functions perform a collective broadcast operation over a team. Starting in nvshmem version 2.0, the specific names for collective operations take a team argument and are specific to the type. These functions send the specified number of elements of source data from the specified root to all other PEs in the team. These generic names are supported in the Fortran interfaces: nvshmem_broadcast, nvshmemx_broadcast_block, and nvshmemx_broadcast_warp. The nvshmem_broadcast functions are callable from host or device, the nvshmemx_broadcast_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmem_int16_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmem_int32_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmem_int64_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmem_float_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmem_double_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int8_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int16_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int32_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int64_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_float_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_double_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int8_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int16_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int32_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int64_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_float_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_double_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
nvshmem_collect
These functions perform a collective operation to concatenate the specified number of elements from each source array into the dest array for each PE in the team. Starting in nvshmem version 2.0, the specific names for collective operations take a team argument and are specific to the type. The collected data is in order of the PE in the team, and nelems can vary from PE to PE. These generic names are supported in the Fortran interfaces: nvshmem_collect, nvshmemx_collect_block, and nvshmemx_collect_warp. The nvshmem_collect functions are callable from host or device, the nvshmemx_collect_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_collect(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int16_collect(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int32_collect(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int64_collect(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmem_float_collect(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmem_double_collect(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int8_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_collect_block(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int16_collect_block(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int32_collect_block(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int64_collect_block(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_float_collect_block(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_double_collect_block(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int8_collect_warp(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int16_collect_warp(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int32_collect_warp(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int64_collect_warp(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_float_collect_warp(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_double_collect_warp(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
NVSHMEM Reductions
This section contains the Fortran interfaces to NVSHMEM functions that perform reductions, which are synchronization operations within a group of PEs performing a bitwise or arithmetic operation, reducing a set of values down to one.
nvshmem_and_reduce
These functions perform a bitwise AND reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for the reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_and_reduce, nvshmemx_and_reduce_block, and nvshmemx_and_reduce_warp. The nvshmem_and_reduce functions are callable from host or device, the nvshmemx_and_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_and_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_and_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_and_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_and_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_and_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_and_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_and_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_and_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_and_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_and_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_and_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_and_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_and_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_and_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_and_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_and_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
nvshmem_or_reduce
These functions perform a bitwise OR reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for the reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_or_reduce, nvshmemx_or_reduce_block, and nvshmemx_or_reduce_warp. The nvshmem_or_reduce functions are callable from host or device, the nvshmemx_or_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_or_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_or_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_or_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_or_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_or_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_or_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_or_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_or_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_or_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_or_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_or_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_or_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_or_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_or_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_or_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_or_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
nvshmem_xor_reduce
These functions perform a bitwise XOR reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for the reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_xor_reduce, nvshmemx_xor_reduce_block, and nvshmemx_xor_reduce_warp. The nvshmem_xor_reduce functions are callable from host or device, the nvshmemx_xor_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_xor_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_xor_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_xor_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_xor_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_xor_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_xor_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_xor_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_xor_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_xor_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_xor_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_xor_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_xor_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_xor_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_xor_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_xor_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_xor_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
nvshmem_max_reduce
These functions perform a maximum value, MAX, reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_max_reduce, nvshmemx_max_reduce_block, and nvshmemx_max_reduce_warp. The nvshmem_max_reduce functions are callable from host or device, the nvshmemx_max_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_max_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_max_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_max_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_max_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_float_max_reduce(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_double_max_reduce(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
nvshmem_min_reduce
These functions perform a minimum value, MIN, reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_min_reduce, nvshmemx_min_reduce_block, and nvshmemx_min_reduce_warp. The nvshmem_min_reduce functions are callable from host or device, the nvshmemx_min_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_min_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_min_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_min_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_min_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_float_min_reduce(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_double_min_reduce(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
nvshmem_sum_reduce
These functions perform a summation, or SUM, reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_sum_reduce, nvshmemx_sum_reduce_block, and nvshmemx_sum_reduce_warp. The nvshmem_sum_reduce functions are callable from host or device, the nvshmemx_sum_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_float_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_double_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
nvshmem_prod_reduce
These functions perform a product reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_prod_reduce, nvshmemx_prod_reduce_block, and nvshmemx_prod_reduce_warp. The nvshmem_prod_reduce functions are callable from host or device, the nvshmemx_prod_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_float_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_double_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
NVSHMEM Point to Point Synchronization Functions
This section contains the Fortran interfaces to NVSHMEM functions that provide a mechanism for synchronization between two PEs based on a value in the symmetric memory.
nvshmem_wait_until
This subroutine blocks until the value contained in the symmetric data object at the calling PE satisfies the condition specified by the comparison operator and the comparison value. The subroutine nvshmem_wait_until
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below. The variants that take a stream are available from host code. The variants that do not take a stream argument are device code only.
subroutine nvshmem_int32_wait_until(ivar, cmp, cvalue)
integer(4), device :: ivar
integer(4) :: cmp, cvalue
subroutine nvshmemx_int32_wait_until_on_stream(ivar, cmp, cvalue, stream)
integer(4), device :: ivar
integer(4) :: cmp, cvalue
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_wait_until(ivar, cmp, cvalue)
integer(8), device :: ivar
integer(4) :: cmp
integer(8) :: cvalue
subroutine nvshmemx_int64_wait_until_on_stream(ivar, cmp, cvalue, stream)
integer(8), device :: ivar
integer(4) :: cmp
integer(8) :: cvalue
integer(cuda_stream_kind) :: stream
nvshmem_wait_until_all
These device subroutines block until all values contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The subroutine nvshmem_wait_until_all
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
subroutine nvshmem_int32_wait_until_all(ivars, nelems, status, cmp, cvalue)
integer(4), device :: ivars(*), cvalue
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
subroutine nvshmem_int64_wait_until_all(ivars, nelems, status, cmp, cvalue)
integer(8), device :: ivars(*), cvalue
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_wait_until_any
These device functions block until any one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The value returned is the index in the array that satisfies the condition. The function nvshmem_wait_until_any
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
integer(8) function nvshmem_int32_wait_until_any(ivars, nelems, status, cmp, cvalue)
integer(4), device :: ivars(*), cvalue
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
integer(8) function nvshmem_int64_wait_until_any(ivars, nelems, status, cmp, cvalue)
integer(8), device :: ivars(*), cvalue
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_wait_until_some
These device functions wait until at least one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The indices array, upon return, contains those indices which satisfy the condition. The function return value is the number of elements that satisfied the condition. The function nvshmem_wait_until_some
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
integer(8) function nvshmem_int32_wait_until_some(ivars, nelems, indices, status, cmp, cvalue)
integer(4), device :: ivars(*), cvalue
integer(8) :: nelems
integer(8), intent(out) :: indices(nelems)
integer(4) :: status(*)
integer(4) :: cmp
integer(8) function nvshmem_int64_wait_until_some(ivars, nelems, indices, status, cmp, cvalue)
integer(8), device :: ivars(*), cvalue
integer(8) :: nelems
integer(8), intent(out) :: indices(nelems)
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_wait_until_all_vector
These device subroutines block until all values contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The subroutine nvshmem_wait_until_all_vector
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
subroutine nvshmem_int32_wait_until_all_vector(ivars, nelems, status, cmp, cvalues)
integer(4), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
subroutine nvshmem_int64_wait_until_all_vector(ivars, nelems, status, cmp, cvalues)
integer(8), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_wait_until_any_vector
These device functions block until any one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The value returned is the index in the array that satisfies the condition. The function nvshmem_wait_until_any_vector
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
integer(8) function nvshmem_int32_wait_until_any_vector(ivars, nelems, status, cmp, cvalues)
integer(4), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
integer(8) function nvshmem_int64_wait_until_any_vector(ivars, nelems, status, cmp, cvalues)
integer(8), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_wait_until_some_vector
These device functions wait until at least one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The indices array, upon return, contains those indices which satisfy the condition. The function return value is the number of elements that satisfied the condition. The function nvshmem_wait_until_some_vector
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
integer(8) function nvshmem_int32_wait_until_some_vector(ivars, nelems, indices, status, cmp, cvalues)
integer(4), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(8), intent(out) :: indices(nelems)
integer(4) :: status(*)
integer(4) :: cmp
integer(8) function nvshmem_int64_wait_until_some_vector(ivars, nelems, indices, status, cmp, cvalues)
integer(8), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(8), intent(out) :: indices(nelems)
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_test
These device functions test whether the value contained in the symmetric data object at the calling PE satisfies the condition specified by the comparison operator and the comparison value. The functions return “1” if the comparison evaluates to true, “0” otherwise. The function nvshmem_test
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
integer(4) function nvshmem_int32_test(ivar, cmp, cvalue)
integer(4), device :: ivar, cvalue
integer(4) :: cmp
integer(4) function nvshmem_int64_test(ivar, cmp, cvalue)
integer(8), device :: ivar, cvalue
integer(4) :: cmp
nvshmem_test_all
These device functions test all values contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The functions return “1” if all comparisons evaluate to true, “0” otherwise. The function nvshmem_test_all
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
integer(4) nvshmem_int32_test_all(ivars, nelems, status, cmp, cvalue)
integer(4), device :: ivars(*), cvalue
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
integer(4) nvshmem_int64_test_all(ivars, nelems, status, cmp, cvalue)
integer(8), device :: ivars(*), cvalue
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_test_any
These device functions test whether any value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The functions return “1” if any comparisons evaluate to true, “0” otherwise. The value returned is the index in the array that satisfies the condition. The function nvshmem_test_any
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
integer(8) function nvshmem_int32_test_any(ivars, nelems, status, cmp, cvalue)
integer(4), device :: ivars(*), cvalue
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
integer(8) function nvshmem_int64_test_any(ivars, nelems, status, cmp, cvalue)
integer(8), device :: ivars(*), cvalue
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_test_some
These device functions test whether at least one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value. The indices array, upon return, contains those indices which satisfy the condition. The function return value is the number of elements that satisfied the condition. The function nvshmem_test_some
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
integer(8) function nvshmem_int32_test_some(ivars, nelems, indices, status, cmp, cvalue)
integer(4), device :: ivars(*), cvalue
integer(8) :: nelems
integer(8), intent(out) :: indices(nelems)
integer(4) :: status(*)
integer(4) :: cmp
integer(8) function nvshmem_int64_test_some(ivars, nelems, indices, status, cmp, cvalue)
integer(8), device :: ivars(*), cvalue
integer(8) :: nelems
integer(8), intent(out) :: indices(nelems)
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_test_all_vector
These device functions test whether all values contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The functions return “1” if all comparisons evaluate to true, “0” otherwise. The function nvshmem_test_all_vector
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
subroutine nvshmem_int32_test_all_vector(ivars, nelems, status, cmp, cvalues)
integer(4), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
subroutine nvshmem_int64_test_all_vector(ivars, nelems, status, cmp, cvalues)
integer(8), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_test_any_vector
These device functions test wheter any one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The value returned is the index in the array that satisfies the condition. The function nvshmem_test_any_vector
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
integer(8) function nvshmem_int32_test_any_vector(ivars, nelems, status, cmp, cvalues)
integer(4), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
integer(8) function nvshmem_int64_test_any_vector(ivars, nelems, status, cmp, cvalues)
integer(8), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(4) :: status(*)
integer(4) :: cmp
nvshmem_test_some_vector
These device functions test whether at least one value contained in the wait set at the calling PE satisfy the condition specified by the comparison operator and the comparison value vector. The indices array, upon return, contains those indices which satisfy the condition. The function return value is the number of elements that satisfied the condition. The function nvshmem_test_some_vector
is overloaded to take two different sets of integer arguments. The specific names and argument lists are below.
integer(8) function nvshmem_int32_test_some_vector(ivars, nelems, indices, status, cmp, cvalues)
integer(4), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(8), intent(out) :: indices(nelems)
integer(4) :: status(*)
integer(4) :: cmp
integer(8) function nvshmem_int64_test_some_vector(ivars, nelems, indices, status, cmp, cvalues)
integer(8), device :: ivars(*), cvalues(*)
integer(8) :: nelems
integer(8), intent(out) :: indices(nelems)
integer(4) :: status(*)
integer(4) :: cmp
NVSHMEM Memory Ordering Functions
This section contains the Fortran interfaces to NVSHMEM functions that provide mechanisms to ensure ordering and/or delivery of completion on NVSHMEM operations. Beginning with the 25.3 release, nvshmem_fence and nvshmem_quiet are also callable from Fortran device code.
nvshmem_fence
This subroutine ensures the ordering of delivery of operations on symmetric data objects.
subroutine nvshmem_fence()
nvshmem_quiet
These subroutines ensure completion of all operations on symmetric data objects issued by the calling PE.
subroutine nvshmem_quiet()
subroutine nvshmemx_quiet_on_stream(stream)
integer(cuda_stream_kind) :: stream