NVIDIA® NVSHMEM 3.2.5 Release Notes#

NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides an NVIDIA CUDA® kernel-side interface that allows NVIDIA CUDA® threads to access any location in the symmetrically-distributed memory.

These release notes describe the key features, software enhancements and improvements, and known issues for NVIDIA® NVSHMEM 3.2.5 and earlier releases.

Key Features and Enhancements#

This NVSHMEM release includes the following key features and enhancements:

  • Enabled platform support for the Blackwell SM100 architecture on NVLINK5-connected B200-based systems.

  • Added one-shot and two-shot NVLINK SHARP (NVLS) allreduce algorithms for half-precision (float16, bfloat16) and full-precision (float32) datatypes on NVLINK4- and NVLINK5-enabled platforms.

  • Added automatic multi-SM accelerated on-stream collectives (fcollect, reducescatter, reduce) that improve NVLINK bandwidth on NVLINK4- and NVLINK5-enabled platforms, achieving 8x/16x speedups for medium to large message sizes (>=1MB) compared to prior implementations (see the sketch after this list).

  • Added a new LLVM IR-compliant bitcode device library to support MLIR-compliant compiler toolchain integration for new and upcoming Python DSLs (Triton, Mosaic, Numba, etc.). Perftests have been enhanced to enable testing of the LLVM IR-compliant bitcode device library using the runtime configuration NVSHMEM_TEST_CUBIN_LIBRARY.

  • Enhanced testing of NVSHMEM host- and device-side collective and point-to-point operations by introducing a new command-line interface tool that improves runtime tunability of test parameters such as message size, datatype, reduction operation, and iteration count.

  • Improved heuristics for the automatic selection of on-stream NVLS collectives for fcollect, reducescatter, and reduce operations on NVLINK-connected GPU systems.

  • Eliminated the dynamic link-time dependency on MPI and SHMEM in perftests and examples, replacing it with dynamic load-time capability.

  • Added a new example of a ring-based allreduce operation for GPUs connected via remote interconnects (IB/RoCE/EFA, etc.).

  • Added a new example of a fused alltoall and allgather operation (common in Mixture of Experts models) for GPUs connected via P2P interconnects (NVLINK).

  • Fixed several minor bugs and memory leaks.
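
The following is a minimal sketch, not taken from the NVSHMEM documentation, of an on-stream allreduce of the kind the multi-SM and NVLS paths above can accelerate; the algorithm selection is automatic and requires no code changes, and the buffer size and omitted error handling are purely illustrative.

```c
/* Hedged sketch: a float sum-allreduce over NVSHMEM_TEAM_WORLD enqueued on a
 * CUDA stream. The 1M-float (4 MB) message falls in the >= 1 MB range cited
 * above; the size is illustrative only. */
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();

    const size_t count = 1 << 20;
    float *src = (float *) nvshmem_malloc(count * sizeof(float));
    float *dst = (float *) nvshmem_malloc(count * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* ... launch kernels that populate src on this PE ... */

    /* Sum-allreduce across all PEs; the NVLS/multi-SM implementation is
     * selected automatically at runtime where the platform supports it. */
    nvshmemx_float_sum_reduce_on_stream(NVSHMEM_TEAM_WORLD, dst, src, count, stream);
    cudaStreamSynchronize(stream);

    nvshmem_free(dst);
    nvshmem_free(src);
    nvshmem_finalize();
    return 0;
}
```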

Compatibility#

NVSHMEM 3.2.5 has been tested with the following:

NVIDIA CUDA® Toolkit:

  • 11.8

  • 12.2

  • 12.8

CPUs:

  • x86 and NVIDIA Grace™ processors

GPUs:

  • Volta V100

  • Ampere A100

  • NVIDIA Hopper™

  • Blackwell B200

Limitations#

  • NVSHMEM is not compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library.

    • Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2, or directly by using the MPI or SHMEM bootstraps (see the configuration sketch after this list).

    • PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM.

  • The libfabric transport does not support VMM yet, so disable VMM by setting NVSHMEM_DISABLE_CUDA_VMM=1.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • NVSHMEM’s UCX transport, which falls back to sockets for atomics when InfiniBand is absent.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs; they do not ensure global ordering and visibility.

  • When built with GDRCopy and using InfiniBand on earlier versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in the latest 460 driver and in release 470 and later.

  • IBGDA transport support on NVIDIA Grace Hopper™ systems is experimental in this release.

  • IBGDA does not work with DMABUF.

  • IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).

  • NVSHMEM is not supported on Grace + Ada L40 platforms.
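
The sketch below gathers the environment-based workarounds from this list into one place. It assumes that NVSHMEM reads these environment variables during nvshmem_init(), so setting them programmatically before initialization has the same effect as exporting them in the job script.

```c
/* Hedged sketch: applying the bootstrap and VMM workarounds described above
 * before nvshmem_init(). Exporting the same variables in the launch
 * environment (e.g. a Slurm job script) is equivalent. */
#include <stdlib.h>
#include <nvshmem.h>

int main(void) {
    /* Use the internal PMI-2 client, e.g. with "srun --mpi=pmi2 ...". */
    setenv("NVSHMEM_BOOTSTRAP_PMI", "PMI-2", 1);

    /* The libfabric transport does not support VMM yet, so disable VMM. */
    setenv("NVSHMEM_DISABLE_CUDA_VMM", "1", 1);

    nvshmem_init();

    /* ... application ... */

    nvshmem_finalize();
    return 0;
}
```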

Fixed Issues#

  • Fixed a bug that was related to incorrect bus bandwidth reporting in shmem_p_bw, shmem_g_bw, shmem_atomic_bw, shmem_put_bw, and shmem_get_bw perftests.

  • Fixed a bug that was related to rounding errors in the NVLS reducescatter min and max operations due to incorrect usage of vectorized float16 instead of uint32 datatypes.

  • Fixed a bug that was related to dynamic loading of an unversioned bootstrap library.

  • Fixed a bug that was related to linking CMake projects to system installer packages.

  • Fixed a bug that was related to building a heterogeneous version of the device library.

  • Fixed a bug that was related to establishing QP connection in IBGDA transport when using Dynamic Connection (DC) mode.

  • Fixed a bug that was related to building perftests for earlier CUDA versions (for example, 11.8) that do not support half-precision datatypes (for example, __nv_bfloat16).

  • Fixed a bug that was related to ABI compatibility breakage for allreduce maxloc op.

  • Fixed a bug that was related to a non-deterministic hang when mixing nvshmemx_team_split_strided and nvshmemx_barrier_all_on_stream operations back-to-back.

  • Fixed a bug that was related to out-of-memory (OOM) errors during dynamic device memory-based symmetric heap reservation on platforms with more than 8 NVLINK-connected GPUs.

  • Fixed a documentation bug that was related to incorrect usage of MPI_Bcast and an unversioned nvshmemx_init_attr_t structure when initializing NVSHMEM using a unique ID (see the sketch below).
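
For reference, the following is a minimal sketch, not the official documentation example, of unique-ID initialization that broadcasts the full ID structure with MPI_Bcast and uses the versioned nvshmemx_init_attr_t initializer; consult the NVSHMEM documentation for the authoritative sequence.

```c
/* Hedged sketch: unique-ID (UID) bootstrap using MPI only to distribute the
 * ID created by PE 0. API names follow the NVSHMEM UID bootstrap interface. */
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char **argv) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    nvshmemx_uniqueid_t id = NVSHMEMX_UNIQUEID_INITIALIZER;
    nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;

    /* Rank 0 creates the unique ID; every rank receives the full structure. */
    if (rank == 0) nvshmemx_get_uniqueid(&id);
    MPI_Bcast(&id, sizeof(nvshmemx_uniqueid_t), MPI_UINT8_T, 0, MPI_COMM_WORLD);

    nvshmemx_set_attr_uniqueid_args(rank, nranks, &id, &attr);
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr);

    /* ... application ... */

    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}
```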

Breaking Changes#

There are no breaking changes in this release.

Deprecated Features#

There are no deprecated features in this release.

Known Issues#

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • When enabling the libfabric transport with NVSHMEM_LIBFABRIC_PROVIDER=EFA, the following operations are experimental and might result in the application kernel hanging:

    • Device side nvshmem_put/nvshmem_get with nvshmem_barrier.

    • Host side nvshmem_put_on_stream/nvshmem_get_on_stream.

  • When enabling the UCX remote transport with NVSHMEM_REMOTE_TRANSPORT=UCX, a data mismatch might be observed when scaling to 32 PEs or more on the DGX-2 platform.