NVIDIA® NVSHMEM 2.11.0 Release Notes

NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides a CUDA kernel-side interface that allows CUDA threads to access any location in the symmetrically-distributed memory.
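
As a quick, hedged illustration of this model (not part of the official examples), the following minimal sketch allocates one symmetric integer per PE with nvshmem_malloc and has a CUDA kernel write each PE's rank into its right neighbor's copy with a device-side put; the one-GPU-per-PE device selection and the ring pattern are illustrative assumptions.

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    /* Device-side put: write this PE's rank into the next PE's symmetric buffer. */
    __global__ void shift_kernel(int *dst, int mype, int npes) {
        nvshmem_int_p(dst, mype, (mype + 1) % npes);
    }

    int main(void) {
        nvshmem_init();
        int mype = nvshmem_my_pe();
        int npes = nvshmem_n_pes();

        /* Illustrative mapping: one GPU per PE on each node. */
        cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

        int *dst = (int *)nvshmem_malloc(sizeof(int)); /* symmetric allocation */

        shift_kernel<<<1, 1>>>(dst, mype, npes);
        cudaDeviceSynchronize();
        nvshmem_barrier_all(); /* ensure all puts have completed and are visible */

        int received;
        cudaMemcpy(&received, dst, sizeof(int), cudaMemcpyDeviceToHost);
        printf("PE %d received %d\n", mype, received);

        nvshmem_free(dst);
        nvshmem_finalize();
        return 0;
    }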

The release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 2.11.0 and earlier releases.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Added experimental support for Multi-node NVLink (MNNVL) systems when all PEs are connected using the same NVLink network.

  • Added support for multiple ports for the same (or different) NICs for each PE in the IBGDA transport. This feature can be enabled using the NVSHMEM_IBGDA_ENABLE_MULTI_PORT runtime environment variable.

  • Added support for the sockets-based bootstrapping of NVSHMEM jobs using the nvshmemx_get_uniqueid and nvshmemx_set_attr_uniqueid_args unique ID-based initialization routines together with nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr), as shown in the first sketch after this list.

  • Added the nvshmemx_hostlib_init_attr API, which allows initialization of only the NVSHMEM host library. This feature is useful for applications that use only the NVSHMEM host API and do not statically link the NVSHMEM device library.

  • Added support for dynamically linking the NVSHMEM host library using dlopen().

  • Introduced nvshmemx_vendor_get_version_info, a new API that queries the runtime library version and allows checks against the NVSHMEM_VENDOR_MAJOR_VERSION, NVSHMEM_VENDOR_MINOR_VERSION, and NVSHMEM_VENDOR_PATCH_VERSION compile-time constants, as shown in the second sketch after this list.

  • Added the NVSHMEM_IGNORE_CUDA_MPS_ACTIVE_THREAD_PERCENTAGE runtime environment variable to get full API support with Multi-Process per GPU (MPG) runs even when CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is not set to 1/PEs.

  • Improved the throughput and bandwidth performance of the IBGDA transport.

  • Improved the performance of the nvshmemx_quiet_on_stream API with the IBGDA transport by leveraging multiple CUDA threads to perform the IBGDA quiet operation.

  • Enabled relaxed ordering by default for InfiniBand transports.

  • Added the NVSHMEM_IB_ENABLE_RELAXED_ORDERING runtime environment variable that can be set to 0 to disable relaxed ordering.

  • Increased the number of threads that are launched to execute the nvshmemx_<typename>_<op>_reduce_on_stream API.

  • Added the NVSHMEM_DISABLE_DMABUF runtime environment variable to disable dmabuf usage.

  • Added a fix in the IBGDA transport that allows message transfers larger than the maximum size supported by a NIC work request.

  • Includes code refactoring and bug fixes.
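
Unique ID-based initialization (first sketch, see the bullet above): a minimal sketch, assuming MPI is available purely as an out-of-band channel to distribute the unique ID to all processes; any other exchange mechanism (for example, a plain socket) works as well, and the exact initializer macro names should be checked against nvshmemx.h.

    #include <mpi.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    int main(int argc, char **argv) {
        int rank, nranks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        nvshmemx_uniqueid_t id = NVSHMEMX_UNIQUEID_INITIALIZER;
        nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;

        /* One process creates the unique ID ... */
        if (rank == 0) nvshmemx_get_uniqueid(&id);
        /* ... and shares it out of band; MPI is only a convenient carrier here. */
        MPI_Bcast(&id, sizeof(nvshmemx_uniqueid_t), MPI_BYTE, 0, MPI_COMM_WORLD);

        /* Fill the attribute structure and initialize NVSHMEM with the unique ID. */
        nvshmemx_set_attr_uniqueid_args(rank, nranks, &id, &attr);
        nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr);

        /* ... NVSHMEM work goes here ... */

        nvshmem_finalize();
        MPI_Finalize();
        return 0;
    }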
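
Runtime version query (second sketch, see the bullet above): a sketch under the assumption that nvshmemx_vendor_get_version_info reports the major, minor, and patch numbers through three integer pointers; consult nvshmemx.h for the exact prototype.

    #include <stdio.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    int main(void) {
        /* Assumed prototype: three output integers for major, minor, and patch. */
        int major = 0, minor = 0, patch = 0;
        nvshmemx_vendor_get_version_info(&major, &minor, &patch);

        /* Compare the runtime library against the headers used at build time. */
        if (major != NVSHMEM_VENDOR_MAJOR_VERSION) {
            fprintf(stderr, "NVSHMEM runtime %d.%d.%d differs from build-time %d.%d.%d\n",
                    major, minor, patch,
                    NVSHMEM_VENDOR_MAJOR_VERSION, NVSHMEM_VENDOR_MINOR_VERSION,
                    NVSHMEM_VENDOR_PATCH_VERSION);
            return 1;
        }
        return 0;
    }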

Compatibility

NVSHMEM 2.11.0 has been tested with the following:

CUDA Toolkit:

  • 11.0

  • 11.8

  • 12.0

  • 12.2

  • 12.4

CPUs

  • x86, Power 9, and Grace processors

GPUs

  • Volta V100

  • Ampere A100

  • Hopper H100

Limitations

  • NVSHMEM is not compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library.

    • Jobs can be launched with the PMI bootstrap by passing --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2, or directly by using the MPI or SHMEM bootstraps.

    • PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM.

  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.

  • VMM support is disabled by default on Power 9 systems because of a performance regression.

  • MPG support is not available on Power 9 systems.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • The use of NVSHMEM’s UCX transport, which uses sockets for atomics if InfiniBand is absent.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs and do not ensure global ordering and visibility.

  • When built with GDRCopy and used with InfiniBand on earlier versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because the BAR1 space cannot be reused. This has been fixed in CUDA driver release 470 and later and in the latest 460 driver.

  • IBGDA transport support on NVIDIA Grace Hopper(TM) systems is experimental in this release.

  • IBGDA does not work with DMABUF.

Fixed Issues

  • In release 2.10.1, on DGX1V systems, a hang was introduced in the CUDA VMM path. This issue has been fixed in this release.

  • In earlier releases, a hang was introduced in the minimal proxy service for nvshmem_global_exit on Grace Hopper systems. This issue has been fixed in this release.

Breaking Changes

There are no breaking changes in this release.

Deprecated Features

No features were deprecated in this release.

Known Issues

Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported. For unsupported operations, refer to the best practice guide.