NVSHMEM Release 2.2.1

These are the NVIDIA® NVSHMEM™ 2.2.1 release notes.

Key Features And Enhancements

This NVSHMEM release includes the following key features and enhancements:
  • Implemented dynamic heap memory allocation for runs with P2P GPUs.

    This feature requires CUDA version 11.3 or later and is enabled by setting NVSHMEM_DISABLE_CUDA_VMM=0 (see the sketch after this list). Support for InfiniBand runs will be added in the next release.

  • Improved UCX transport performance for AMO and RMA operations.

  • Improved performance for warp and block put/get operations.

  • Added atomic support for PCIe-connected GPUs over the UCX transport.

  • The UCX transport now supports non-symmetric buffers for use as local buffers in RMA and AMO operations (see the sketch after this list).

  • Added support for initializing NVSHMEM in a CUmodule.

  • Enabled MPI and PMIx bootstrap modules to be compiled externally from the NVSHMEM build.

    This allows multiple builds of these plugins to support various MPI and PMIx libraries. To select a plugin, set NVSHMEM_BOOTSTRAP="plugin" and NVSHMEM_BOOTSTRAP_PLUGIN="plugin_name.so" (see the sketch after this list).

    Note: The plugin sources are installed with the compiled NVSHMEM library.
  • Enabled MPI bootstrap to be used with nvshmem_init.

    You can set NVSHMEM_BOOTSTRAP=MPI or use the bootstrap plugin method (see the sketch after this list).

  • Fixed bugs in nvshmem_<typename>_g and the fetch atomics implementation.
  • Changed nvshmem_<typename>_collect to nvshmem_<typename>_fcollect to match the OpenSHMEM specification.

  • Changed the type of the nreduce argument in the reduction API to size_t to match the OpenSHMEM specification.

  • Improved NVSHMEM build times with a multi-threaded option in the CUDA compiler (requires CUDA version 11.2 or later).

  • Several fixes to address Coverity reports.
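
A minimal sketch of enabling the dynamic symmetric heap described above. It assumes CUDA 11.3 or later and P2P-connected GPUs, and that setting the variable programmatically before nvshmem_init is equivalent to exporting NVSHMEM_DISABLE_CUDA_VMM=0 in the launch environment.

    #include <stdlib.h>
    #include <nvshmem.h>

    int main(void) {
        /* Enable the CUDA VMM-based dynamic symmetric heap (CUDA 11.3+, P2P GPUs).
         * Equivalent to exporting NVSHMEM_DISABLE_CUDA_VMM=0 before launching. */
        setenv("NVSHMEM_DISABLE_CUDA_VMM", "0", 1);

        nvshmem_init();

        /* Symmetric allocations can now grow the heap on demand rather than
         * being limited to a fixed-size preallocation. */
        double *buf = (double *) nvshmem_malloc(1 << 20);

        nvshmem_free(buf);
        nvshmem_finalize();
        return 0;
    }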
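The non-symmetric local buffer support in the UCX transport might be used as in the following sketch. The symmetric destination and the plain cudaMalloc source buffer are illustrative, and the behavior assumes the UCX transport is in use.

    #include <cuda_runtime.h>
    #include <nvshmem.h>

    void put_from_local_buffer(int peer, size_t nelems) {
        /* The destination must still be symmetric (allocated on every PE). */
        float *dest = (float *) nvshmem_malloc(nelems * sizeof(float));

        /* With the UCX transport, the local source buffer may be ordinary,
         * non-symmetric device memory rather than symmetric-heap memory. */
        float *src = NULL;
        cudaMalloc((void **) &src, nelems * sizeof(float));

        nvshmem_float_put(dest, src, nelems, peer);  /* RMA with a non-symmetric local buffer */
        nvshmem_quiet();                             /* ensure completion before reuse */

        cudaFree(src);
        nvshmem_free(dest);
    }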
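To pick up an externally built bootstrap plugin, the two environment variables named above can also be set programmatically before nvshmem_init. The plugin file name below is the same placeholder used in the note and stands for whichever MPI or PMIx plugin was actually built.

    #include <stdlib.h>
    #include <nvshmem.h>

    int main(void) {
        /* Select the plugin bootstrap and point NVSHMEM at an externally built
         * plugin; "plugin_name.so" is a placeholder for the real file name. */
        setenv("NVSHMEM_BOOTSTRAP", "plugin", 1);
        setenv("NVSHMEM_BOOTSTRAP_PLUGIN", "plugin_name.so", 1);

        nvshmem_init();
        /* ... */
        nvshmem_finalize();
        return 0;
    }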
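A sketch of the nvshmem_init-based MPI bootstrap, assuming the program is launched with an MPI launcher and that the MPI bootstrap is available (built in or as a plugin).

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <nvshmem.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Tell nvshmem_init to bootstrap over the MPI job; equivalently,
         * export NVSHMEM_BOOTSTRAP=MPI in the launch environment. */
        setenv("NVSHMEM_BOOTSTRAP", "MPI", 1);
        nvshmem_init();

        printf("PE %d of %d\n", nvshmem_my_pe(), nvshmem_n_pes());

        nvshmem_finalize();
        MPI_Finalize();
        return 0;
    }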

Compatibility

NVSHMEM 2.2.1 has been tested with the following:

Limitations

Systems with PCIe peer-to-peer communication require one of the following:
  • InfiniBand to support NVSHMEM atomics APIs.
  • The use of NVSHMEM’s UCX transport, which, if InfiniBand is absent, will use sockets for atomics.

Fixed Issues

There are no fixed issues in this release.

Breaking Changes

  • Changed nvshmem_<typename>_collect to nvshmem_<typename>_fcollect to match the OpenSHMEM specification.
  • Changed the type of the nreduce argument in the reduction API to size_t to match the OpenSHMEM specification (see the sketch after this list).
  • Removed support for host-side NVSHMEM wait APIs.
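
A minimal migration sketch for the two API changes above, assuming the team-based signatures defined by the OpenSHMEM specification (team, destination, source, count); the float type and element counts are illustrative.

    #include <nvshmem.h>

    void collect_and_reduce(size_t nelems) {
        int npes = nvshmem_n_pes();

        float *src  = (float *) nvshmem_malloc(nelems * sizeof(float));
        float *gath = (float *) nvshmem_malloc(nelems * npes * sizeof(float));
        float *sum  = (float *) nvshmem_malloc(nelems * sizeof(float));

        /* Previously nvshmem_float_collect; now fcollect, matching OpenSHMEM. */
        nvshmem_float_fcollect(NVSHMEM_TEAM_WORLD, gath, src, nelems);

        /* nreduce is now a size_t, matching the OpenSHMEM specification. */
        size_t nreduce = nelems;
        nvshmem_float_sum_reduce(NVSHMEM_TEAM_WORLD, sum, src, nreduce);

        nvshmem_free(sum);
        nvshmem_free(gath);
        nvshmem_free(src);
    }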

Known Issues

  • NVSHMEM can only be linked statically.

    This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand.

    They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
  • When built with GDRCopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space.

    This will be fixed in future CUDA driver releases in the 470 (or later) branch and in the 460 branch.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

    With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
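
A sketch of the synchronization the note above leaves to the user when the symmetric heap is mapped with the VMM APIs: because CU_POINTER_ATTRIBUTE_SYNC_MEMOPS is not applied in that case, an explicit synchronization is issued before the copied data is used in communication. Names and sizes are illustrative.

    #include <cuda_runtime.h>
    #include <nvshmem.h>

    void fill_symmetric_buffer(const float *host_data, size_t nelems) {
        float *sym = (float *) nvshmem_malloc(nelems * sizeof(float));

        /* On a cudaMalloc-backed heap, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS makes this
         * synchronous copy fully synchronized. On a VMM-mapped heap (CUDA 11.3+)
         * the attribute is unavailable, so synchronize explicitly before the data
         * is used in NVSHMEM communication. */
        cudaMemcpy(sym, host_data, nelems * sizeof(float), cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();   /* user-provided synchronization */

        nvshmem_barrier_all();
        /* ... sym can now be used as a source or target of NVSHMEM operations ... */

        nvshmem_free(sym);
    }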