NVIDIA® NVSHMEM 3.3.9 Release Notes#

NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides an NVIDIA CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically distributed memory.
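
As a concrete illustration of this model, the following minimal sketch (not part of the release notes; error handling omitted) allocates a symmetric integer on every PE and has each PE write into its right neighbor's copy from a CUDA kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void write_to_peer(int *sym, int peer) {
    // Device-side put: write directly into the peer PE's copy of the
    // symmetric buffer.
    nvshmem_int_p(sym, 1234, peer);
}

int main() {
    nvshmem_init();
    // Bind this PE to a GPU on its node before using symmetric memory.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    // Symmetric allocation: every PE obtains a buffer of the same size at
    // the same symmetric address.
    int *sym = (int *) nvshmem_malloc(sizeof(int));

    // Each PE writes into its right neighbor's copy of the buffer.
    write_to_peer<<<1, 1>>>(sym, (mype + 1) % npes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();   // complete all puts and synchronize the PEs

    int value = 0;
    cudaMemcpy(&value, sym, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d of %d received %d\n", mype, npes, value);

    nvshmem_free(sym);
    nvshmem_finalize();
    return 0;
}
```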

The release notes describe the key features, software enhancements and improvements, and known issues for NVIDIA® NVSHMEM 3.3.9 and earlier releases.

Key Features and Enhancements#

This NVSHMEM release includes the following key features and enhancements:

  • Enabled general availability (GA) platform support for Blackwell B200/GB200 NVL72-based systems. Additionally, enabled SASS support for the Ada architecture.

  • Added official Python language bindings (nvshmem4py) that expose symmetric memory management, on-stream RMA, and collective APIs, enabling development of custom kernels that use symmetric memory and fine-grained communication in native Python. The nvshmem4py package is available through PyPI wheels and conda installers.

  • Added support for CUDA Templates for Linear Algebra Subroutines and Solvers (CUTLASS)-compliant, tile-granular NVLS device-side collectives to aid the development of fused, distributed GEMM kernels.

  • Added a flexible team initialization API (nvshmemx_team_init) that accepts an arbitrary set of PEs, enabling non-linear, non-contiguous PE indexing where desired.

  • Added support for symmetric user-buffer registration (nvshmemx_buffer_register_symmetric) to enable ML frameworks to “bring-your-own-buffer” (BYOB) for zero-copy communication kernels.

  • Added narrow-precision type support (float16, bfloat16) for the NVLS reducescatter collective, and an LL8 fcollect algorithm for low-latency collectives.

  • Added support for the device-side nvshmem_broadcastmem and nvshmem_fcollectmem APIs in the library (see the device-side collective sketch after this list).

  • Added support for CUDA module-independent loading using nvshmemx_culibrary_init.

  • Added support for leveraging multiple queue pairs (QPs) on LAG-bonded NICs for RDMA transports. You can use the NVSHMEM_IB_NUM_RC_PER_DEVICE environment variable to tune the number of QPs as desired.

  • Added support for randomizing QP assignment for multiple GPU endpoints when communicating over IBGDA transport.

  • Added CUDA graph capture capabilities to the on-stream collectives’ performance benchmarks via the --cudagraph command-line parameter.

  • Enabled host-side clang compilation support for the NVSHMEM host library.

  • Improved GPU thread occupancy by 30% for on-stream fcollect when using the NVLS and LL algorithms.

  • Improved multi-SM NVLS on-stream collectives to adapt gridDim as a function of the NVLink domain size.

  • Improved runtime detection of CUDA VMM support, falling back to legacy pinned memory allocation (cudaMalloc) when the platform does not support VMM.

  • Improved resiliency of querying the Global Identifier (GID) via sysfs for RoCE transports in containerized environments.

  • Improved the perftest presentation layer to provide an additional count column that captures the total number of elements per operation, independent of datatype size.

  • Improved point-to-point signaling latency by 20% by always leveraging the CE-centric cuStreamWriteValue/cuStreamWaitValue APIs (a usage sketch of the signaling path follows this list).
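
As referenced in the device-side collectives item above, the sketch below broadcasts a byte buffer from PE 0 to all PEs of NVSHMEM_TEAM_WORLD from within a kernel. The device-side signature is assumed to mirror the host-side (OpenSHMEM 1.5-style) nvshmem_broadcastmem; consult the NVSHMEM API reference for the exact prototype.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// One thread per PE issues the device-side broadcast. Assumed signature
// (mirroring the host API):
//   nvshmem_broadcastmem(team, dest, source, nelems_in_bytes, PE_root)
__global__ void bcast_kernel(void *dest, const void *source, size_t nbytes) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        nvshmem_broadcastmem(NVSHMEM_TEAM_WORLD, dest, source, nbytes, 0);
    }
}

// Device-side collectives synchronize across PEs, so launch the kernel with
// the collective launch API to ensure all PEs enter it together.
void broadcast_on_device(void *dest, const void *source, size_t nbytes,
                         cudaStream_t stream) {
    void *args[] = { &dest, &source, &nbytes };
    nvshmemx_collective_launch((const void *) bcast_kernel, dim3(1), dim3(1),
                               args, 0, stream);
    cudaStreamSynchronize(stream);
}
```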
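
For context on the point-to-point signaling path mentioned in the last item, the following hedged sketch shows a typical host-side put-with-signal pairing that exercises it; the buffer and stream arguments are illustrative, and the signal word is assumed to be a symmetric uint64_t.

```cuda
#include <cstdint>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Sender side: transfer a payload to `peer` and then set its signal word,
// with both operations ordered on the given CUDA stream.
void send_with_signal(void *dest, const void *src, size_t nbytes,
                      uint64_t *sig, int peer, cudaStream_t stream) {
    nvshmemx_putmem_signal_on_stream(dest, src, nbytes, sig, 1,
                                     NVSHMEM_SIGNAL_SET, peer, stream);
}

// Receiver side: block the stream until the local signal word is set to 1.
void wait_for_signal(uint64_t *sig, cudaStream_t stream) {
    nvshmemx_signal_wait_until_on_stream(sig, NVSHMEM_CMP_EQ, 1, stream);
}
```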

Compatibility#

NVSHMEM 3.3.9 has been tested with the following:

NVIDIA CUDA® Toolkit:

  • 12.2

  • 12.6

  • 12.9

CPUs

  • On x86 and NVIDIA Grace™ processors.

GPUs

  • NVIDIA Ampere A100

  • NVIDIA Hopper™

  • NVIDIA Blackwell

Limitations#

  • NVSHMEM is not compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library.

    • Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2, or directly by using the MPI or SHMEM bootstraps.

    • PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM.

  • The libfabric transport does not support VMM yet, so you must disable VMM by setting NVSHMEM_DISABLE_CUDA_VMM=1.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • Using NVSHMEM’s UCX transport, which falls back to sockets for atomics when InfiniBand is absent.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs; they do not ensure global ordering and visibility (see the sketch after this list).

  • When built with GDRCopy and using InfiniBand with earlier versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because the BAR1 space cannot be reused. This has been fixed in later CUDA 460 driver releases and in release 470 and later.

  • IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).

  • NVSHMEM is not supported on Grace + Ada L40 platforms.

  • NVSHMEM is not supported in virtualized environments (VMs).

  • User buffers registered with nvshmemx_buffer_register_symmetric are not supported by the libfabric transport for GPU-GPU communication over remote networks (for example, EFA and Slingshot).

  • When registering Extended GPU Memory (EGM) user buffers with nvshmemx_buffer_register_symmetric, the buffers on different PEs must belong to distinct CPU sockets within a node. This can be achieved by selecting GPUs on different NUMA domains using the CUDA_VISIBLE_DEVICES environment variable.
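
To illustrate the ordering limitation referenced above, the following hedged sketch shows the intended pattern: the producer completes its data put with nvshmem_quiet() before raising a flag, and the consumer’s nvshmem_int_wait_until() guarantees visibility only on that destination PE, not globally. Both buffers are assumed to be symmetric.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Producer PE: deliver a payload to `peer`, then raise its flag.
__global__ void producer(int *data, int *flag, int peer) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        nvshmem_int_p(data, 42, peer);  // payload put
        nvshmem_quiet();                // complete the put before signaling
        nvshmem_int_p(flag, 1, peer);   // flag put, ordered after the data
    }
}

// Consumer PE: wait for the flag; `data` is then valid on this PE only.
__global__ void consumer(int *data, int *flag, int *out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        nvshmem_int_wait_until(flag, NVSHMEM_CMP_EQ, 1);
        *out = *data;  // visibility is guaranteed here, not on other PEs
    }
}
```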

Fixed Issues#

  • Fixed a bug in perftest reporting when both datatype and reduceop are specified.

  • Fixed an application crash when nvshmemx_fcollect_on_stream attempts to use more CTAs than available NVSHMEM teams.

  • Fixed an application crash when NVSHMEM remote transports attempt to use more than 16 HCAs per node.

  • Fixed an application crash in the nvshmemx_mc_ptr API that occurred when it was executed on a platform without NVLS support.

  • Fixed a compile-time error with LLVM IR bitcode device library when compiling with clang-llvm > 19.

  • Fixed a compile-time error with IBGDA support when built without GDRCopy support.

  • Fixed a compile-time error with moe_shuffle.cu caused by a missing getopt header.

  • Fixed a data corruption bug in the device-side point-to-point get/put bandwidth test caused by not using non-symmetric memory buffers for bandwidth summarization.

  • Fixed a host clang compilation bug caused by a missing __CUDA_ARCH__ conditional check on the non-CUDA-device inline assembly code path.

  • Fixed a bug in the symmetric memory management layer that was caused by a missing override for NVSHMEM_CUMEM_GRANULARITY for static device memory heaps (cudaMalloc).

  • Fixed a data corruption bug in the on-stream NVLS 2-shot allreduce and 1-shot allgather collectives that was caused by a missing memory fence ordering the data and the barrier.

  • Fixed a stale sysfs file path used for nvidia-peermem discovery that caused regression when running with CUDA 12.8 or above.

  • Fixed a hang in MPI bootstrap allgather collective caused by incorrect usage of MPI_IN_PLACE operations.

  • Fixed a data correctness bug in the LLVM IR bitcode device library that caused incorrect results for 16-byte aligned put/get operations which had a size that was not a multiple of 16 bytes.

Breaking Changes#

There are no breaking changes in this release.

Deprecated Features#

  • Support for the Volta V100 platform is now deprecated.

Known Issues#

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • When enabling the libfabric transport with NVSHMEM_LIBFABRIC_PROVIDER=EFA, the following operations are experimental and might cause the application kernel to hang:

    • Device-side nvshmem_put/nvshmem_get with nvshmem_barrier.

    • Host-side nvshmem_put_on_stream/nvshmem_get_on_stream.

  • When enabling the UCX remote transport with NVSHMEM_REMOTE_TRANSPORT=UCX, a data mismatch might be observed when scaling to 32 PEs or more on the DGX-2 platform.