NVIDIA® NVSHMEM 3.5.19 Release Notes#
NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM provides an NVIDIA® CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically distributed memory.
These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 3.5.19 and earlier releases.
Key Features and Enhancements#
This NVSHMEM release includes the following key features and enhancements:
- Added qpair-specific API calls (`nvshmemx_qp_*`) that provide RMA operations on specific queue pairs abstracted via `nvshmemx_qp_handle_t`.
- Added tile-granular RMA routines: the `tile_put`, `tile_get`, and `tile_broadcast` API calls.
- Added LLVM bitcode library support for IBGDA.
- Added an option to pass the CUDA device to `nvshmemx_init_attr` to set a device when using NVSHMEM (see the initialization sketch after this list).
- Added the environment variable `NVSHMEM_MAX_PEER_STREAMS` to set the maximum number of CUDA streams per node.
- Renamed `tile_allreduce` to `tile_reduce` and `tile_reduce` to `tile_rooted_reduce` to align with other NVSHMEM collectives.
- Removed the static-only library `libnvshmem.a`. Link instead to `libnvshmem_host` and `libnvshmem_device`.
- Improved the EFA transport (libfabric) with multiple bug fixes and performance improvements for AWS environments.
- Updated the default number of QPs used from 4 to 8 for full bandwidth with data direct NICs.
- Changed the default `NVSHMEM_MAX_MEMORY_PER_GPU` from 128 GiB to 256 GiB.
- Improved `NVSHMEM_HCA_PREFIX` to accept `^` and updated its default value.
- Added CUTLASS support to the tile API calls.
- Removed the `realloc` and `alltoalls` declarations because these functions are not implemented in NVSHMEM.
- Updated the Hydra installation script to install version 4.3.2.
- Improved error catching and reporting for initialization and synchronization routines.
- Fixed a race condition in barrier that caused hangs on unordered networks.
- Fixed reduce test validation issues when the PE count is not a power of 2.
- Fixed stream memory operations to use `cuStreamWriteValue` only for self-writes.
- Fixed the `nvshmem_calloc` implementation to account for the `count` argument (see the initialization sketch after this list).
- Fixed several minor bugs and memory leaks.
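As a point of reference for the initialization and allocation items above, the following is a minimal host-side sketch, not taken from the NVSHMEM documentation, that assumes the MPI bootstrap. It selects a GPU with the long-standing `cudaSetDevice()` call rather than the new device attribute of `nvshmemx_init_attr` (whose exact field name is not listed here), and it allocates zero-initialized symmetric memory with `nvshmem_calloc`, whose `count` handling was fixed in this release.

```cuda
// Minimal sketch (not from the release notes): MPI-bootstrapped NVSHMEM
// initialization plus a zero-initialized symmetric allocation.
#include <mpi.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);          // pick a GPU before NVSHMEM init

    MPI_Comm comm = MPI_COMM_WORLD;
    nvshmemx_init_attr_t attr;
    attr.mpi_comm = &comm;               // bootstrap NVSHMEM over MPI
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    // nvshmem_calloc(count, size): zero-initialized symmetric memory of
    // count * size bytes on every PE.
    size_t count = 1024;
    int *buf = (int *)nvshmem_calloc(count, sizeof(int));

    nvshmem_barrier_all();

    nvshmem_free(buf);
    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}
```

When MPI is not used, a plain `nvshmem_init()` with the default PMI bootstrap serves the same purpose.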
NVSHMEM4Py release 0.2.1 includes the following:
- Added NVSHMEM4Py device API calls. Using the Numba-CUDA DSL, you can write fused compute-comms kernels in Python. NVSHMEM4Py device API calls cover collectives, one-sided RMA, and atomic memory operations (AMOs).
- Added the ability to allocate Fortran-memory-ordered arrays and tensors.
- Removed the requirement to explicitly set `LD_LIBRARY_PATH` to find NVSHMEM, by using cuda-pathfinder.
- Added support for multicast memory buffers and array/tensor wrappers.
- Numerous bug fixes and minor interoperability enhancements.
Compatibility#
NVSHMEM 3.5.19 has been tested with the following:
- CUDA Toolkit:
  - 12.4
  - 12.9
  - 13.0
  - 13.1
- CPUs:
  - x86 and NVIDIA Grace™ processors
- GPUs:
  - NVIDIA Ampere A100
  - NVIDIA Hopper™
  - NVIDIA Blackwell®
Limitations#
- NVSHMEM is not compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. You can launch jobs with the PMI bootstrap by specifying `--mpi=pmi2` to Slurm and setting `NVSHMEM_BOOTSTRAP_PMI=PMI-2`, or directly by using the MPI or SHMEM bootstraps. You can also set PMI-2 as the default PMI by setting `NVSHMEM_DEFAULT_PMI2=1` when you build NVSHMEM.
- The `libfabric` transport currently does not support VMM, so you must disable VMM by setting `NVSHMEM_DISABLE_CUDA_VMM=1`.
- Systems with PCIe peer-to-peer communication must do one of the following:
  - Provide InfiniBand to support NVSHMEM atomics API calls.
  - Use NVSHMEM’s UCX transport, which uses sockets for atomics if InfiniBand is absent.
- `nvshmem_barrier*`, `nvshmem_quiet`, and `nvshmem_wait_until` only ensure ordering and visibility between the source and destination PEs; they do not ensure global ordering and visibility (see the sketch after this list).
- When built with GDRCopy, and when using InfiniBand with 460-series drivers older than 460.106.00, NVSHMEM cannot allocate the complete device memory because the BAR1 space cannot be reused. This has been fixed in CUDA driver release 460.106.00 and in driver versions 470 and later.
- IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).
- NVSHMEM is not supported on Grace with Ada L40 platforms.
- NVSHMEM is not supported in virtualized environments (VMs).
- User buffers registered with `nvshmemx_buffer_register_symmetric` lack support for the `libfabric` transport to perform GPU-GPU communication over remote networks (EFA, Slingshot, and so on).
- When registering Extended GPU Memory (EGM) user buffers with `nvshmemx_buffer_register_symmetric`, the buffers on different PEs must belong to distinct CPU sockets within a node. This can be achieved by selecting GPUs on different NUMA domains with the `CUDA_VISIBLE_DEVICES` environment variable.
- When using the libfabric transport with `NVSHMEM_LIBFABRIC_PROVIDER=EFA`, you must ensure that the libfabric environment variable `FI_EFA_ENABLE_SHM_TRANSFER` is set to `0` before launching your application. Although NVSHMEM sets this variable during initialization, the EFA provider may ignore it if the provider was already initialized by the launcher, for example when using mpirun.
- Due to the LLVM ecosystem’s CUDA support matrix, the `libnvshmem_device.bc` bitcode library has only been qualified for use with CUDA toolkits with major version 12. Support for CUDA 13 is experimental; it is recommended that users build the NVSHMEM bitcode library from source with LLVM 22 for use with CUDA 13.
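To make the pairwise ordering guarantee above concrete, here is a minimal device-side sketch, not taken from the NVSHMEM documentation. The kernel names are hypothetical, and `data` and `flag` are assumed to be symmetric allocations (for example, from `nvshmem_malloc`); the point is only that `nvshmem_quiet()` followed by a signal orders the put with respect to the destination PE, not globally.

```cuda
// Minimal device-side sketch (not from the release notes) illustrating the
// pairwise ordering guarantee: nvshmem_quiet() completes this PE's prior puts
// before the subsequent signal as observed by the destination PE, but it says
// nothing about what other PEs observe.
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void producer(int *data, uint64_t *flag, int peer) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        nvshmem_int_p(data, 42, peer);        // one-sided put to the peer PE
        nvshmem_quiet();                      // complete the put before signaling
        nvshmemx_signal_op(flag, 1, NVSHMEM_SIGNAL_SET, peer);
    }
}

__global__ void consumer(int *data, uint64_t *flag) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        nvshmem_uint64_wait_until(flag, NVSHMEM_CMP_EQ, 1);  // wait for the signal
        // data == 42 is now visible here, on the destination PE only; a third PE
        // has no such guarantee without its own synchronization (e.g., a barrier).
    }
}
```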
Deprecated Features#
- Support for `libnvshmem.a` is now deprecated.
Known Issues#
- Complex types, which are enabled by setting `NVSHMEM_COMPLEX_SUPPORT` at compile time, are not currently supported.
- When enabling the `libfabric` transport with `NVSHMEM_LIBFABRIC_PROVIDER=EFA`, certain operations are experimental and may cause the application kernel to hang in the following operations:
  - Device-side `nvshmem_put`/`nvshmem_get` with `nvshmem_barrier`
  - Host-side `nvshmem_put_on_stream`/`nvshmem_get_on_stream`
- When you enable the UCX remote transport with `NVSHMEM_REMOTE_TRANSPORT=UCX`, you may observe a data mismatch when scaling to 32 or more PEs on the DGX-2 platform.