NVIDIA® NVSHMEM 2.10.1 Release Notes

NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides a CUDA kernel-side interface that allows CUDA threads to access any location in the symmetrically-distributed memory.
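
To illustrate the host- and device-side interfaces described above, the following is a minimal sketch: the host allocates a symmetric integer with nvshmem_malloc, a CUDA kernel writes its PE ID into the neighboring PE's copy with nvshmem_int_p, and the host then synchronizes. The kernel name and launch configuration are illustrative only; per-PE GPU selection, error handling, and bootstrap details are omitted.

    #include <cuda_runtime.h>
    #include <nvshmem.h>

    __global__ void write_to_neighbor(int *sym, int mype, int npes) {
        /* Device-side put: store this PE's ID into the symmetric buffer on the next PE. */
        nvshmem_int_p(sym, mype, (mype + 1) % npes);
    }

    int main(void) {
        nvshmem_init();
        int mype = nvshmem_my_pe();
        int npes = nvshmem_n_pes();

        /* Symmetric allocation: every PE gets a same-sized buffer at the same
         * symmetric address. */
        int *sym = (int *)nvshmem_malloc(sizeof(int));

        write_to_neighbor<<<1, 1>>>(sym, mype, npes);
        cudaDeviceSynchronize();   /* complete the local kernel */
        nvshmem_barrier_all();     /* complete and order remote updates across PEs */

        nvshmem_free(sym);
        nvshmem_finalize();
        return 0;
    }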

These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 2.10.1 and earlier releases.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Support for single and multi-node Grace Hopper systems

  • Support for the EFA provider using the libfabric transport, which can be enabled by setting NVSHMEM_LIBFABRIC_PROVIDER=EFA.

  • NVRTC support was added for the NVSHMEM device implementation headers.

  • Fixed memory leaks in nvshmem_finalize.

  • Added support for calling nvshmem_init and nvshmem_finalize in a loop with any bootstrap; previously this was supported only with the MPI bootstrap (see the sketch after this list).

  • Performance optimizations in the Alltoall collective API.

  • Implemented warp-level automated coalescing of nvshmem_<typename>_g operations to contiguous addresses in the IBGDA transport.

  • Removed redundant consistency operations in the IBGDA transport.

  • Added support for synchronized memory operations when using the VMM API for the NVSHMEM symmetric heap.

  • Code refactoring and bug fixes.
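
The following sketch illustrates the init/finalize-in-a-loop pattern referenced above. The iteration count, buffer size, and the omitted kernel work are placeholders, not part of the release notes.

    #include <nvshmem.h>

    int main(void) {
        for (int i = 0; i < 4; ++i) {
            nvshmem_init();

            /* Symmetric allocation and use within this init/finalize window. */
            int *buf = (int *)nvshmem_malloc(1024 * sizeof(int));
            /* ... launch kernels that use buf ... */
            nvshmem_free(buf);

            /* Tearing down and re-initializing in the same process previously
             * worked only with the MPI bootstrap. */
            nvshmem_finalize();
        }
        return 0;
    }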

Compatibility

NVSHMEM 2.10.1 has been tested with the following:

CUDA Toolkit:

  • 11.0

  • 12.0

  • 12.2

CPUs:

  • x86, Power 9, and Grace processors

GPUs:

  • Volta V100

  • Ampere A100

  • Hopper H100

Limitations

  • NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Jobs can also be launched directly by using the MPI or SHMEM bootstraps.

  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.

  • Libfabric support on Slingshot-11 networks requires setting the environment variable FI_CXI_OPTIMIZED_MRS=false.

  • VMM support is disabled by default on Power 9 systems because of a performance regression.

  • MPG (multiple processes per GPU) support is not yet available on Power 9 systems.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics when InfiniBand is absent.

  • NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked.

    • This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand. They do not ensure global ordering and visibility.

  • When built with GDRCopy and when using InfiniBand on older versions of the 460 driver and earlier branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in CUDA driver release 470 and later and in the latest 460 driver.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

  • With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when the symmetric heap is mapped by using the VMM APIs, CUDA does not support this attribute, and users are responsible for synchronization (see the sketch after this list). For additional information about synchronous CUDA memory operations, see API synchronization behavior.

  • IBGDA transport support on Grace Hopper systems is experimental in this release.

  • IBGDA does not work with DMABUF.
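
The following sketch shows one way to handle the synchronization responsibility described in the symmetric-heap bullets above when the heap is backed by the CUDA VMM APIs: because CU_POINTER_ATTRIBUTE_SYNC_MEMOPS is not applied in that case, the stream is synchronized explicitly before a synchronous copy out of the symmetric heap. The kernel update_heap, the buffer size, and the launch configuration are hypothetical.

    #include <cuda_runtime.h>
    #include <nvshmem.h>

    /* Hypothetical kernel that updates data in the symmetric heap. */
    __global__ void update_heap(int *sym, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) sym[i] += 1;
    }

    /* Copy a symmetric buffer back to the host. With a cudaMalloc-backed heap,
     * the SYNC_MEMOPS attribute orders the synchronous copy against the
     * in-flight kernel automatically; with a VMM-backed heap it does not, so
     * the stream is synchronized explicitly first. */
    void copy_out(int *host_dst, int *sym, size_t n, cudaStream_t stream) {
        update_heap<<<(unsigned)((n + 127) / 128), 128, 0, stream>>>(sym, n);
        cudaStreamSynchronize(stream);   /* explicit synchronization */
        cudaMemcpy(host_dst, sym, n * sizeof(int), cudaMemcpyDeviceToHost);
    }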

Fixed Issues

  • In release 2.8, CMake built NVSHMEM with a dynamic link to libcudart.so. This issue has been fixed in this release.

Breaking Changes

Deprecated Features

Known Issues

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.