NVIDIA NVSHMEM 3.7.0 Release Notes
NVIDIA® NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM provides an NVIDIA® CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically distributed memory.
The release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 3.7.0 and earlier releases.
Key Features and Enhancements
The NVSHMEM release includes the following key features and enhancements:
Features
Added TMA-backed implementations for NVLink put and get operations supporting global-memory and shared-memory local buffers, with shared-memory registration API (nvshmemx_ask_smem, nvshmemx_give_smem, nvshmemx_release_smem), selectable via NVSHMEM_TMA_POLICY.
Added GPUNetIO remote transport (NVSHMEM_REMOTE_TRANSPORT=gpunetio) with GPU-initiated communication (NVSHMEM_GPUNETIO_ENABLE_GDAKI=1) and support for the DOCA SDK, see https://github.com/NVIDIA-DOCA/gpunetio and https://developer.nvidia.com/networking/doca.
Added nvshmemx_flush APIs to provide source-buffer reusability without guaranteeing remote visibility.
Added experimental logical endpoint/CFT handle support for fabric-PTX unicast communication, including introducing NVSHMEM_TEAM_MC_SHARED.
Added floating-point atomic add/fetch_add APIs (nvshmemx_{half,float,double}_atomic_{add,fetch_add}) with P2P and proxy-backed IBRC support.
Added teams-based OpenSHMEM bootstrap support using SHMEM_TEAM_WORLD with fallback to legacy active-set collectives.
Added support for combined CUDA VMM handle flags in symmetric buffer registration.
Added support for GID-based routing on InfiniBand networks with IBRC and IBGDA transports.
Added support for InfiniBand PKey index selection (NVSHMEM_IB_PKEY_INDEX), QP ack timeout (NVSHMEM_IB_TIMEOUT), and retry count (NVSHMEM_IB_RETRY_CNT).
Added CMake target for nvidia-nvshmem-cuXX Python wheels via NVSHMEM_BUILD_LIBS_WHEEL.
Added multi-architecture device library support with a fatbin LTO-IR library and per-architecture LLVM bitcode libraries.
Improved diagnostics with human-readable status strings in error logs.
Improved header search for CUDA-version-agnostic CCCL distribution.
Switched to C++17 as the minimum required C++ version.
Changed licensing to Apache-2.0 and added DCO contributing guidance.
Bug Fixes
Fixed build portability issues around C++17/C++20 compilation, GNU extensions, and device-library CUDA architecture propagation.
Fixed non-RDC compilation for users that include nvshmem.h without compiling with -rdc=true.
Fixed hangs when PEs observed different NCCL availability during initialization.
Fixed NVLS capability detection on older CUDA drivers when cuDeviceGetAttribute returns CUDA_ERROR_INVALID_VALUE.
Fixed legacy OpenSHMEM bootstrap handling of non-4-byte allgather/alltoall payloads.
Fixed nvshmem_ptr handling to return NULL instead of segfaulting when symmetric heap is not initialized.
Fixed IBDevX doorbell UMEM null-check handling.
Fixed IBGDA RC multi-port endpoint setup and CQ indexing across selected devices.
Fixed LTO-IR/bitcode build issues including LLVM 21 NVPTX intrinsic compatibility and CUDA architecture selection.
Fixed build issue causing redefinition of mlx5dv macros in certain environments.
Fixed alltoall block-scoped warp quiet handling when no warps are unused.
Fixed standalone test builds against RPM/DEB installs by using exported find_package variables.
Fixed memory semantics of the ring allreduce example.
Fixed NVLS multimem architecture gating and two-shot tile_allreduce source-data ordering.
Fixed IBRC GDRCopy teardown for sysmem handles used by CPU atomics.
Fixed bootstrap and common IB transport robustness issues, including a bootstrap helper double-free.
External Contributions
Added NUMA-aware CPU affinity pinning controlled via NVSHMEM_CPU_AFFINITY. (AWS)
Added NVSHMEM_NETDEVS_POLICY to control NIC assignment policy. (AWS)
Improved libfabric transport progress, signaling, staged atomics, and ack aggregation for EFA environments. (AWS)
Added libfabric RMA batching support with transport-level batching hints and controls. (AWS)
Improved libfabric transport GDRCopy integration with opportunistic GDRCopy 2.5+ API loading and FORCE_PCIE support on coherent platforms. (AWS)
The NVSHMEM4Py 0.3.1 release includes the following:
Updated Numbast integration and dependency handling for newer Python/Numba-CUDA combinations, including Python 3.14 compatibility.
Removed hardcoded CUDA 13 build requirement.
Updated CuTe DSL RMA tensor tests to use Torch-backed tensors with DLPack conversion.
Fixed CuTe and Numba device collective generation and small-team handling, including reducescatter cooperative-launch bindings.
Fixed several minor bugs in NVSHMEM4Py tests.
Compatibility
NVSHMEM 3.7.0 has been tested with the following:
CUDA Toolkit:
12.8
12.9
13.2
13.3
CPUs:
x86 processors
NVIDIA Grace™ processors
GPUs:
NVIDIA Ampere
NVIDIA Hopper™
NVIDIA Blackwell
NCCL 2.30.4
Limitations
NVSHMEM is not compatible with the PMI client library on Cray systems, and must use the NVSHMEM internal PMI-2 client library.
You must launch jobs with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and NVSHMEM_BOOTSTRAP_PMI=PMI-2, or directly by using the MPI or SHMEM bootstraps.
You must also set PMI-2 as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when you build NVSHMEM.
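The Slurm launch described above can be sketched as a small helper that assembles the command line and the environment override. This is a minimal illustration, not an official launcher: the helper names, the application path ./my_nvshmem_app, and the PE count are hypothetical, and only --mpi=pmi2 and NVSHMEM_BOOTSTRAP_PMI=PMI-2 come from the text. The -n option is srun's standard task-count flag.

```python
import shlex


def build_slurm_pmi2_launch(app, num_pes, app_args=()):
    """Return (env_overrides, argv) for a PMI-2 bootstrapped Slurm launch.

    Hypothetical helper: it only assembles strings and launches nothing.
    """
    if num_pes < 1:
        raise ValueError("num_pes must be at least 1")
    # Value taken from the limitation text.
    env_overrides = {"NVSHMEM_BOOTSTRAP_PMI": "PMI-2"}
    # --mpi=pmi2 is from the text; -n is srun's task-count option.
    argv = ["srun", "--mpi=pmi2", "-n", str(num_pes), app, *app_args]
    return env_overrides, argv


def as_shell_line(env_overrides, argv):
    """Render the launch as one copy-pasteable shell command."""
    assignments = [
        f"{name}={shlex.quote(value)}"
        for name, value in sorted(env_overrides.items())
    ]
    return " ".join(assignments + [shlex.quote(arg) for arg in argv])


# Example with a hypothetical application binary and 8 PEs.
env_overrides, argv = build_slurm_pmi2_launch("./my_nvshmem_app", 8)
print(as_shell_line(env_overrides, argv))
```

Keeping the environment override separate from the argument vector makes it easy to pass both to subprocess.run(argv, env=...) once merged with the current environment.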
The libfabric transport currently does not support VMM, so you must disable VMM by setting NVSHMEM_DISABLE_CUDA_VMM=1.
Systems with PCIe peer-to-peer communication must do one of the following:
Provide InfiniBand to support NVSHMEM atomics API calls.
Use NVSHMEM’s UCX transport, which uses sockets for atomics if InfiniBand is absent.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs. They do not ensure global ordering and visibility.
When built with GDRCopy, and when using InfiniBand on versions of the 460 driver prior to 460.106.00, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in the 460 driver starting with 460.106.00.
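The driver-version condition above can be checked with a short comparison. This is a sketch under assumptions: the function names are hypothetical, the check covers only the 460 driver branch that the text mentions, and it says nothing about other branches or about whether GDRCopy and InfiniBand are actually in use.

```python
# First 460-branch driver that contains the fix, per the text above.
FIXED_460_DRIVER = (460, 106, 0)


def parse_driver_version(text):
    """Parse a version such as '460.91.03' into a tuple of three ints."""
    parts = text.strip().split(".")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unrecognized driver version: {text!r}")
    numbers = [int(p) for p in parts]
    # Pad missing components with zeros so tuples compare consistently.
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def has_bar1_reuse_limitation(driver_version):
    """True only for 460-branch drivers older than 460.106.00."""
    version = parse_driver_version(driver_version)
    return version[0] == 460 and version < FIXED_460_DRIVER


for candidate in ("460.91.03", "460.106.00", "470.57.02"):
    print(candidate, has_bar1_reuse_limitation(candidate))
```

The version string can be obtained, for example, from nvidia-smi --query-gpu=driver_version --format=csv,noheader.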
IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).
NVSHMEM is not supported on Grace with Ada L40 platforms.
NVSHMEM is not supported in virtualized environments (VM).
User buffers registered with nvshmemx_buffer_register_symmetric cannot be used by the libfabric transport for GPU-GPU communication over remote networks (EFA, Slingshot, etc.).
When registering Extended GPU memory (EGM) user buffers with nvshmemx_buffer_register_symmetric, the buffers on different PEs must belong to distinct CPU sockets within a node. You can achieve this by selecting GPUs on a different NUMA domain using the CUDA_VISIBLE_DEVICES environment variable.
When using the libfabric transport with NVSHMEM_LIBFABRIC_PROVIDER=EFA, you must ensure that the libfabric environment variable FI_EFA_ENABLE_SHM_TRANSFER is set to 0 before launching the application. While NVSHMEM sets this variable during initialization, it may be ignored by the EFA provider if it was already initialized by the launcher, for example when using mpirun.
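Because the EFA provider may already be initialized by the launcher, one way to follow the guidance above is to prepare the environment before the launcher process starts. The sketch below is illustrative only: the helper names are hypothetical, and it combines the EFA setting with the NVSHMEM_DISABLE_CUDA_VMM=1 requirement that these notes state for the libfabric transport. It does not select the transport itself.

```python
import os


def efa_launch_env(base_env=None):
    """Return a copy of the environment prepared for a libfabric/EFA run."""
    env = dict(os.environ if base_env is None else base_env)
    env["NVSHMEM_LIBFABRIC_PROVIDER"] = "EFA"
    # Must be 0 before the launcher (for example mpirun) starts.
    env["FI_EFA_ENABLE_SHM_TRANSFER"] = "0"
    # The libfabric transport does not support VMM, per these notes.
    env["NVSHMEM_DISABLE_CUDA_VMM"] = "1"
    return env


def check_efa_env(env):
    """List settings that conflict with the limitations in these notes.

    Assumes the libfabric transport is the one in use.
    """
    problems = []
    uses_efa = env.get("NVSHMEM_LIBFABRIC_PROVIDER", "").upper() == "EFA"
    if uses_efa and env.get("FI_EFA_ENABLE_SHM_TRANSFER") != "0":
        problems.append("FI_EFA_ENABLE_SHM_TRANSFER must be set to 0")
    if env.get("NVSHMEM_DISABLE_CUDA_VMM") != "1":
        problems.append("NVSHMEM_DISABLE_CUDA_VMM must be set to 1")
    return problems


prepared = efa_launch_env({"PATH": "/usr/bin"})
print(check_efa_env(prepared))
```

The prepared mapping can then be passed as the env argument of subprocess.run when starting the launcher. Depending on the launcher, the variables may also need to be forwarded explicitly to remote ranks.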
Known Issues
The internal layout of RC-connected QPs changed starting in 3.5.21, causing ABI compatibility breakage when enabling IBGDA.
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
When you enable UCX remote transport with NVSHMEM_REMOTE_TRANSPORT=UCX, you may observe a data mismatch when scaling to 32 PEs or more on the DGX-2 platform.