NVSHMEM Release 2.5.0

This is the NVIDIA® NVSHMEM™ 2.5.0 release notes.

Key Features And Enhancements

This NVSHMEM release includes the following key features and enhancements:
  • Added multi-instance support in NVSHMEM.

    NVSHMEM now builds as two libraries, libnvshmem_host.so and libnvshmem_device.a, which allows an application to have multiple components (for example, shared libraries and the application) that use NVSHMEM.

    Note: Support for the libnvshmem.a library still exists for legacy purposes but will eventually be removed.
  • Added the nvshmemx_init_status API to query the initialized state of NVSHMEM.
  • Added support for CUDA_VISIBLE_DEVICES.

    Support for CUDA_VISIBLE_DEVICES is not yet available with CUDA VMM, so you must set NVSHMEM_DISABLE_CUDA_VMM=1.

  • Updated PMI and PMI-2 bootstraps to plug-ins.
  • Added the nvshmem-info utility to display information about the NVSHMEM library.
  • Fixed warnings when using NVSHMEM in applications that compile without the Relocatable Device Code (RDC) option.
  • Renamed internal variables to avoid potential conflicts with variables in the application.
  • Implemented the nvshmem_alltoallmem API.
  • Improved the GPU-to-NIC assignment logic for the Summit/Sierra supercomputer.
  • Fixed the host barrier API implementation for non-blocking on-stream (*_nbi_on_stream) point-to-point operations.
  • Updated descriptions for NVSHMEM environment variables that are displayed by using nvshmem-info or by setting NVSHMEM_INFO=1.
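
As an illustration of the new query API, here is a minimal sketch (assuming the NVSHMEM_STATUS_* constants declared in nvshmemx.h) in which one application component initializes NVSHMEM only if no other component has already done so:

```c
#include <nvshmem.h>
#include <nvshmemx.h>

/* Sketch: initialize NVSHMEM only if another component of the
 * application has not already done so. The status constant used
 * here is assumed from nvshmemx.h. */
static void ensure_nvshmem_initialized(void) {
    if (nvshmemx_init_status() == NVSHMEM_STATUS_NOT_INITIALIZED) {
        nvshmem_init();
    }
}
```

This kind of guard is what makes the multi-instance split into libnvshmem_host.so and libnvshmem_device.a practical: each shared library can call the guard without double-initializing the runtime.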


NVSHMEM 2.5.0 has been tested with the following:


Limitations

  • VMM support is disabled by default on Power 9 systems because of a performance regression.
  • MPG support is not yet available on Power 9 systems.
  • Systems with PCIe peer-to-peer communication require one of the following:
    • InfiniBand to support NVSHMEM atomics APIs.
    • The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics when InfiniBand is absent.
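
As a configuration sketch that ties these settings together (the GPU indices are illustrative), an environment that disables CUDA VMM, as required when selecting GPUs with CUDA_VISIBLE_DEVICES, might look like this:

```shell
# Disable CUDA VMM (required for CUDA_VISIBLE_DEVICES support,
# and the default on Power 9 systems).
export NVSHMEM_DISABLE_CUDA_VMM=1
# Illustrative GPU selection.
export CUDA_VISIBLE_DEVICES=0,1
# Print descriptions of NVSHMEM environment variables at startup.
export NVSHMEM_INFO=1
```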

Fixed Issues

There are no fixed issues in this release.

Breaking Changes

There are no breaking changes in this release.

Known Issues

  • NVSHMEM device APIs can only be statically linked.

    This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand.

    They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
  • When built with GDRCopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space.

    This will be fixed with future CUDA driver releases in the 470 (or later) and 460 branches.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

    With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
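
As a sketch of the resulting application responsibility when the heap is VMM-mapped (the function name is illustrative; heap_dst is assumed to point into the symmetric heap, for example a buffer returned by nvshmem_malloc):

```c
#include <cuda_runtime.h>
#include <nvshmem.h>

/* Sketch: with a VMM-mapped symmetric heap, SYNC_MEMOPS is not set,
 * so the application must order memory operations explicitly rather
 * than relying on automatic synchronization. */
static void stage_to_symmetric_heap(void *heap_dst, const void *host_src,
                                    size_t bytes, cudaStream_t stream) {
    cudaMemcpyAsync(heap_dst, host_src, bytes,
                    cudaMemcpyHostToDevice, stream);
    /* Synchronize explicitly before the buffer is used in
     * NVSHMEM communication. */
    cudaStreamSynchronize(stream);
}
```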