NVIDIA® NVSHMEM 3.1.7 Release Notes#

NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides an NVIDIA CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically distributed memory.

The release notes describe the key features, software enhancements and improvements, and known issues for NVIDIA® NVSHMEM 3.1.7 and earlier releases.

Key Features and Enhancements#

This NVSHMEM release includes the following key features and enhancements:

  • Added support for NVLINK SHARP (NVLS) based collective algorithms on single-node and multi-node NVLINK platforms based on the x86 + Hopper and Grace Hopper architectures, for the device and on-stream APIs of popular deep-learning collectives (ReduceScatter, Allgather, and Allreduce); see the on-stream sketch following this list. This feature provides a 2-3x latency improvement for small message sizes when compared with one-shot algorithms over NVLINK.

  • Added a low-level query API, nvshmemx_mc_ptr, with host and device variants, that allows GPU kernels and host code to obtain the NVLS-enabled symmetric memory mapping for a given target team (see the query sketch following this list).

  • Added support for a new low-latency protocol (LL128) for the Allgather collective communication device and on-stream APIs.

  • Enhanced the existing low-latency protocol (LL) warp-scoped collectives to provide a 2x speedup over traditional algorithms when scaling up to 32 GPUs.

  • Added support for the half-precision (FP16/BF16) formats in the collective communication (ReduceScatter, Allgather, and Allreduce) device and on-stream APIs.

  • Added support for distributing NVSHMEM as Python wheels through the PyPI repository and as rpm/deb packages.

  • Added support for dynamic RDMA Global Identifier (GID) discovery for RoCE transports. This feature enables automatic fallback to the discovered GID without requiring users to specify the GID using a runtime variable.

  • Added support for a heterogeneous library build system. This feature allows the NVSHMEM device (static) library to be built with a different CUDA version than the NVSHMEM host library. This enables new features, such as NVLS, in the host library and allows applications compiled against lower CUDA versions to link to the NVSHMEM device library, which makes the entire library portable across CUDA minor versions while remaining feature complete. Users can select the CUDA version for the device library by specifying NVSHMEM_DEVICELIB_CUDA_HOME=<PATH TO CUDA>; otherwise, the host CUDA version is used.

  • Enhanced the NVSHMEM on-stream signal APIs to use cuStreamWriteValue over P2P-connected GPUs when possible, which enables a zero-SM implementation of the on-stream signaling operation.

  • Added support for DMABuf-based registration of NIC control structures in IBGDA, leveraging the mainline DMABuf support in newer Linux kernels instead of the proprietary nvidia-peermem solution.

  • Added sample code for the NVSHMEM UniqueID (UID) socket-based bootstrap modality under the examples directory (see the bootstrap sketch following this list).

  • Added support for NVSHMEM performance benchmarks to the release binary packages.

  • Enhanced collectives performance reporting by adding the Algorithmic Bandwidth (algoBW) and Bus Bandwidth (BusBW) metrics to the NVSHMEM performance benchmarks.

  • Increased the symmetric memory scratch space for reduce-based collectives to 512 KB to accommodate the additional space required by the ReduceScatter-based collectives.

  • Fixed support for the Ninja build generator in our CMake build system.

  • Added a unit test framework for symmetric memory management.

  • Fixed several minor bugs and memory leaks.
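
The sketch below illustrates the on-stream collective usage referenced above. It is not taken from the release; it assumes the typed on-stream reduction entry point nvshmemx_float_sum_reduce_on_stream, and the half-precision variants follow the same pattern with the typenames provided by the installed headers. Error checking is omitted for brevity.

```c
/* Illustrative sketch: enqueue an Allreduce on a CUDA stream. */
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    /* Bind each PE to one GPU on its node. */
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const size_t nreduce = 1024;
    float *src = (float *) nvshmem_malloc(nreduce * sizeof(float));
    float *dst = (float *) nvshmem_malloc(nreduce * sizeof(float));

    /* ... fill src on the device ... */

    /* The reduction is ordered on the stream; the host only waits here. */
    nvshmemx_float_sum_reduce_on_stream(NVSHMEM_TEAM_WORLD, dst, src, nreduce, stream);
    cudaStreamSynchronize(stream);

    nvshmem_free(src);
    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```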
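
The query sketch below shows a minimal host-side use of nvshmemx_mc_ptr, assuming the (team, pointer) signature; a NULL return is treated as "no NVLS multicast mapping available". The same query is also callable from device code.

```c
/* Illustrative sketch: query the NVLS multicast mapping of a symmetric buffer. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    float *buf = (float *) nvshmem_malloc(1024 * sizeof(float));

    /* NULL indicates that no NVLS multicast mapping exists for this team. */
    void *mc = nvshmemx_mc_ptr(NVSHMEM_TEAM_WORLD, buf);
    printf("PE %d: multicast mapping %s\n", nvshmem_my_pe(),
           mc ? "available" : "not available");

    nvshmem_free(buf);
    nvshmem_finalize();
    return 0;
}
```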
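
The bootstrap sketch below outlines the UID initialization flow. It is not the shipped sample: it assumes an MPI broadcast as the out-of-band channel (the sample in the examples directory uses sockets) and the UID helper names nvshmemx_get_uniqueid, nvshmemx_set_attr_uniqueid_args, and the NVSHMEMX_INIT_WITH_UNIQUEID flag.

```c
/* Illustrative sketch: UniqueID (UID) based initialization.
 * One rank creates the ID; an out-of-band channel distributes it. */
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char **argv) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    nvshmemx_uniqueid_t id = NVSHMEMX_UNIQUEID_INITIALIZER;
    nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;

    if (rank == 0) nvshmemx_get_uniqueid(&id);                /* one PE generates the ID */
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);  /* share it out of band */

    nvshmemx_set_attr_uniqueid_args(rank, nranks, &id, &attr);
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr);

    /* ... NVSHMEM allocation and communication ... */

    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}
```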

Compatibility#

NVSHMEM 3.1.7 has been tested with the following:

NVIDIA CUDA® Toolkit:

  • 11.8

  • 12.2

  • 12.6

CPUs

  • x86 and NVIDIA Grace™ processors.

GPUs

  • Volta V100

  • Ampere A100

  • NVIDIA Hopper™

Limitations#

  • NVSHMEM is not compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library.

    • Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2, or directly by using the MPI or SHMEM bootstraps.

    • PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM.

  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • Using NVSHMEM’s UCX transport, which, if InfiniBand is absent, will use sockets for atomics.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs and do not ensure global ordering and visibility.

  • When built with GDRcopy, and when using InfiniBand on earlier versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in the latest NVIDIA CUDA® 460 driver and in release 470 and later.

  • IBGDA transport support on NVIDIA Grace Hopper™ systems is experimental in this release.

  • IBGDA does not work with DMABUF.

  • IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).

  • NVSHMEM is not supported on Grace + Ada L40 platforms.

Fixed Issues#

  • Fixed a bug that was related to the incorrect use of the NVSHMEM_DEVICE_TIMEOUT_POLLING build-time variable.

  • Fixed a performance bug in the on-stream collectives perftest related to issuing cudaMemcpyAsync on the same CUDA stream to which the cudaEvent records that profile the start and end times of the on-stream communication kernel are submitted.

  • Fixed a bug that was related to the virtual member functions of nvshmemi_symmetric_heap by forcing their access specifier to protected, limiting access to inherited child classes only.

  • Fixed a bug that was related to recursive destructor memory corruption and nullptr access to the static member function of the nvshmemi_mem_transport class.

  • Fixed a bug that was related to the incorrect compile-time definitions of the NVML_GPU_FABRIC_STATE_COMPLETED and NVML_GPU_FABRIC_UUID_LEN constants.

  • Fixed a bug related to nvshmemx_collective_launch_query_gridsize that could cause it to erroneously return a gridsize of 0.

  • Fixed a bug that was related to nvshmem_init, which might cause the application to crash during MNNVL discovery when the CUDA compat libraries for CUDA Toolkit 12.4 or later are used at runtime.

  • Fixed a bug that was related to nvshmemx_collective_launch, which might cause duplicate initialization of the NVSHMEM device state.

  • Fixed a bug that was related to uninitialized variables in the IBGDA device code.

  • Fixed a bug that was related to an out-of-bounds (OOB) access in the atomic BW performance test.

  • Fixed a bug that was related to missing C/C++ stdint headers on Ubuntu 24.04 and x86-based systems.

  • Fixed a bug that was related to the incorrect calculation of team-specific stride when creating a new team using nvshmem_team_split_strided.

Breaking Changes#

There are no breaking changes in this release.

Deprecated Features#

  • Removed the host API-based NVSHMEM collectives performance benchmarks.

Known Issues#

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • When enabling the libfabric transport with NVSHMEM_LIBFABRIC_PROVIDER=EFA, support is experimental, and the following operations might result in the application kernel hanging:

    • Device side nvshmem_put/nvshmem_get with nvshmem_barrier.

    • Host side nvshmem_put_on_stream/nvshmem_get_on_stream.

  • When enabling the UCX remote transport with NVSHMEM_REMOTE_TRANSPORT=UCX, a data mismatch might be observed when scaling to 32 or more PEs on the DGX-2 platform.