NVIDIA® NVSHMEM 2.9.0 Release Notes

NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides a CUDA kernel-side interface that allows CUDA threads to access any location in the symmetrically-distributed memory.
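As a minimal illustration of this model (not part of the release notes; it assumes one GPU per PE, selected through the node team, and a job launched with an NVSHMEM-aware launcher such as nvshmrun or mpirun), the sketch below allocates a symmetric integer and lets each PE write its ID into its neighbor's copy from inside a CUDA kernel:

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    /* Each PE writes its PE number into the symmetric buffer of the next PE. */
    __global__ void write_to_neighbor(int *dest) {
        int mype = nvshmem_my_pe();
        int npes = nvshmem_n_pes();
        int peer = (mype + 1) % npes;
        nvshmem_int_p(dest, mype, peer);   /* one-sided put into the peer's copy of dest */
    }

    int main(void) {
        nvshmem_init();
        /* Assumption: one GPU per PE, chosen by the PE's rank on its node. */
        cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

        /* Symmetric allocation: every PE gets a buffer at the same symmetric address. */
        int *dest = (int *)nvshmem_malloc(sizeof(int));

        write_to_neighbor<<<1, 1>>>(dest);
        cudaDeviceSynchronize();
        nvshmem_barrier_all();   /* puts from all PEs are complete and visible after this */

        int value;
        cudaMemcpy(&value, dest, sizeof(int), cudaMemcpyDeviceToHost);
        printf("PE %d received %d\n", nvshmem_my_pe(), value);

        nvshmem_free(dest);
        nvshmem_finalize();
        return 0;
    }

Device-side NVSHMEM calls such as nvshmem_int_p require relocatable device code and static linking of the NVSHMEM device library, which is consistent with the linking limitation listed under Limitations below.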

The release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 2.9.0 and earlier releases.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Improvements to the CMake build system. CMake is now the default build system and the Makefile build system is deprecated.

  • Added loadable network transport modules.

  • NVSHMEM device code can now be inlined to improve performance by enabling NVSHMEM_ENABLE_ALL_DEVICE_INLINING when building the NVSHMEM library.

  • Improvements to collective communication performance.

  • Updated libfabric transport to fragment messages larger than the maximum length supported by the provider.

  • Improvements to the IBGDA transport, including large message support, user buffer registration, blocking g/get/AMO performance, CUDA module support, and several bug fixes.

  • Introduced ABI compatibility for bootstrap modules. This release is backwards compatible with the ABI introduced in NVSHMEM 2.8.0.

  • Added NVSHMEM_BOOTSTRAP_*_PLUGIN environment variables that can be used to override the default filename used when opening each bootstrap plugin.

  • Improved error handling for GDRCopy.

  • Added a check to detect when the number of PEs launched is not the same on all nodes.

  • Added a check to detect the availability of the nvidia_peermem kernel module.

  • Reduced internal stream synchronizations to fix a compatibility bug with CUDA graph capture.

Compatibility

NVSHMEM 2.9.0 has been tested with the following:

CUDA Toolkit:

  • 11.0

  • 12.0

  • 12.1

CPUs:

  • x86 and Power 9 processors

GPUs:

  • Volta V100

  • Ampere A100

  • Hopper H100

Limitations

  • NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Jobs can also be launched directly by using the MPI or SHMEM bootstraps.

  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.

  • Libfabric support on Slingshot-11 networks requires setting the environment variable FI_CXI_OPTIMIZED_MRS=false.

  • VMM support is disabled by default on Power 9 systems because of a performance regression.

  • MPG support is not yet available on Power 9 systems.

  • Systems with PCIe peer-to-peer communication require one of the following:

      • InfiniBand to support NVSHMEM atomics APIs.

      • The use of NVSHMEM’s UCX transport, which, if InfiniBand is absent, will use sockets for atomics.

  • NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked.

      • This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.

      • They do not ensure global ordering and visibility (see the sketch at the end of this section).

  • When built with GDRCopy and when using InfiniBand on older versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in CUDA driver releases 470 and later and in the latest 460 driver.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

      • With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when the symmetric heap is mapped by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior in the CUDA documentation.
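The following hedged sketch (the kernel and buffer names are illustrative, not taken from the release notes) shows the explicit synchronization that becomes the user's responsibility when the symmetric heap is VMM-backed: a kernel writes into a symmetric buffer, and the host synchronizes before a synchronous copy reads it back.

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    /* Illustrative kernel that writes into a symmetric-heap buffer. */
    __global__ void fill(int *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = i;
    }

    int main(void) {
        nvshmem_init();
        cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

        const int n = 1024;
        int *buf = (int *)nvshmem_malloc(n * sizeof(int));   /* symmetric heap allocation */
        int host_buf[n];

        fill<<<(n + 255) / 256, 256>>>(buf, n);

        /* With a cudaMalloc-backed heap, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS would make the
         * synchronous cudaMemcpy below wait for the kernel automatically. With a
         * VMM-backed heap (CUDA 11.3 and later), that attribute is not applied, so
         * synchronize explicitly before the copy. */
        cudaDeviceSynchronize();
        cudaMemcpy(host_buf, buf, n * sizeof(int), cudaMemcpyDeviceToHost);

        printf("PE %d: buf[0]=%d buf[%d]=%d\n", nvshmem_my_pe(), host_buf[0], n - 1, host_buf[n - 1]);

        nvshmem_free(buf);
        nvshmem_finalize();
        return 0;
    }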

NOTE: IBGDA does not work with DMABUF
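To make the ordering limitation above concrete, here is a minimal sketch (assuming exactly two PEs and illustrative names; a regular kernel launch is used for brevity, although NVSHMEM also provides nvshmemx_collective_launch for kernels that synchronize across PEs): the source PE completes its put with nvshmem_quiet before setting a flag, and the destination PE waits on the flag with nvshmem_int_wait_until before reading the data.

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    /* PE 0 puts data to PE 1, completes the put with nvshmem_quiet, then sets a flag.
     * PE 1 waits on the flag; the earlier put is then guaranteed visible at PE 1. */
    __global__ void exchange(int *data, int *flag) {
        if (nvshmem_my_pe() == 0) {
            nvshmem_int_p(data, 42, 1);                       /* one-sided put to PE 1 */
            nvshmem_quiet();                                  /* complete the put before signaling */
            nvshmem_int_p(flag, 1, 1);                        /* signal PE 1 */
        } else if (nvshmem_my_pe() == 1) {
            nvshmem_int_wait_until(flag, NVSHMEM_CMP_EQ, 1);  /* wait for the signal */
            printf("PE 1 received %d\n", *data);              /* ordered after the data put */
        }
    }

    int main(void) {
        nvshmem_init();
        cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

        int *data = (int *)nvshmem_malloc(sizeof(int));
        int *flag = (int *)nvshmem_malloc(sizeof(int));
        cudaMemset(flag, 0, sizeof(int));   /* flag starts at 0 on every PE */
        cudaDeviceSynchronize();            /* ensure the flag is initialized locally... */
        nvshmem_barrier_all();              /* ...on every PE before any PE signals */

        exchange<<<1, 1>>>(data, flag);     /* assumes the job was launched with exactly 2 PEs */
        cudaDeviceSynchronize();
        nvshmem_barrier_all();

        nvshmem_free(data);
        nvshmem_free(flag);
        nvshmem_finalize();
        return 0;
    }

As the limitation states, this pattern only orders the transfer between the source and destination PEs; a third PE that observes the flag is not guaranteed to see the data.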

Fixed Issues

  • A data consistency issue with CUDA graph capture support.

  • An issue in IBGDA that prevented support for split buffers. Users no longer need to disable VMM or split buffers larger than 2 GiB.

  • An issue preventing local buffer registration with IBGDA.

  • An issue preventing CUmodule initialization with IBGDA.

Breaking Changes

Because the IBGDA transport was renamed, all IBGDA-related environment variables have changed. See the API documentation and installation guide for more information.

Deprecated Features

N/A

Known Issues

Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.