NVIDIA® NVSHMEM 3.0.6 Release Notes

NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides an NVIDIA CUDA® kernel-side interface that allows NVIDIA CUDA® threads to access any location in the symmetrically distributed memory.

These release notes describe the key features, software enhancements and improvements, and known issues for NVIDIA® NVSHMEM 3.0.6 and earlier releases.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Added support for multi-node systems that have RDMA networks (IB, RoCE, Slingshot, and so on) and NVIDIA NVLink® as multi-node interconnects.

  • Added support for ABI backward compatibility between the host and device libraries. Within the same NVSHMEM major version, a newer host library remains compatible with earlier device library versions. This work involved minimizing the ABI surface between the host and device libraries and versioning the structs and functions that are part of the new ABI surface.

  • Enhanced NVSHMEM’s memory management infrastructure with an object-oriented programming (OOP) framework that uses multi-level inheritance to manage support for the various device memory types (STATIC, DYNAMIC) and to enable support for newer memory types in the future.

  • Added support for PTX testing.

  • Added support for CPU-assisted IBGDA via the NIC handler, which manages the NIC doorbell. The NIC handler can be selected through the new environment variable NVSHMEM_IBGDA_NIC_HANDLER. This feature enables IBGDA adoption on systems that do not have the PeerMappingOverride=1 driver setting.

  • Improved IBGDA setup performance by 20-50% when scaling up the number of PEs by batching and minimizing the number of memory registration invocations for IB control structures.

  • Enhanced support to compose NVSHMEM_TEAM_SHARED on Multi-node NVLink (MNNVL)-based systems.

  • Improved the performance of block-scoped reductions by parallelizing send/recv of data for small message sizes. In addition, NVSHMEM device code compiled with NVIDIA CUDA® 11.0 and std=c++17 will automatically use cooperative groups reduction APIs to improve the performance of local reductions.

  • Added support in IBGDA to automatically prefer RC over DC connected QPs and updated the default values of NVSHMEM_IBGDA_NUM_RC_PER_PE/NVSHMEM_IBGDA_NUM_DCI to 1.

  • Added assertions in the DEVX and IBGDA transports to check for extended atomics support in the RDMA NICs.

  • Added support for skipping collective synchronization in nvshmem_malloc/calloc/align/free, in line with OpenSHMEM spec-compliant behavior, when the requested size is 0 or the buffer in the heap is NULL, respectively (see the allocation sketch following this list).

  • Added support for the nvshmemx_fcollectmem/broadcastmem device and on-stream interfaces (see the broadcast sketch following this list).

  • Improved performance tracing for the on-stream and host collectives performance benchmarks by using cudaEventElapsedTime instead of the gettimeofday API (see the timing sketch following this list).

  • Added the bootstrap_coll performance benchmark for the various bootstrap modalities.

  • Added support for “Include-What-You-Use” (IWYU) framework in the CMake build system.

  • Fixed several minor bugs and memory leaks.
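
Allocation sketch: the no-collective-synchronization behavior can be illustrated with a minimal host-side program. The expectations in the comments follow the OpenSHMEM specification; the code is illustrative only.

    /* Minimal sketch: a zero-size allocation and a NULL free perform no
     * collective synchronization across PEs (and, per the OpenSHMEM spec,
     * nvshmem_malloc(0) returns NULL). */
    #include <stdio.h>
    #include <nvshmem.h>

    int main(void) {
        nvshmem_init();

        void *buf = nvshmem_malloc(0);   /* size == 0: no collective sync, returns NULL */
        printf("PE %d: nvshmem_malloc(0) returned %p\n", nvshmem_my_pe(), buf);

        nvshmem_free(NULL);              /* NULL buffer: no-op, no collective sync */

        nvshmem_finalize();
        return 0;
    }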
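
Broadcast sketch: the new on-stream broadcastmem interface can be sketched as follows. The exact signature of nvshmemx_broadcastmem_on_stream is assumed here to mirror OpenSHMEM's shmem_broadcastmem (team, dest, source, nelems, PE_root) with a trailing cudaStream_t argument, which is the usual NVSHMEM on-stream convention; consult the NVSHMEM API reference for the authoritative form.

    /* Hedged sketch of an on-stream, untyped broadcast. 'dest' and 'source'
     * are assumed to be symmetric buffers allocated with nvshmem_malloc. */
    #include <stddef.h>
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    void broadcast_payload(void *dest, const void *source, size_t nbytes,
                           int root_pe, cudaStream_t stream) {
        /* Broadcast nbytes from root_pe to all PEs in the world team,
         * ordered with respect to other work enqueued on 'stream'. */
        nvshmemx_broadcastmem_on_stream(NVSHMEM_TEAM_WORLD, dest, source,
                                        nbytes, root_pe, stream);
        cudaStreamSynchronize(stream);  /* wait for local completion */
    }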
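
Timing sketch: the tracing change can be illustrated with a generic CUDA-event measurement pattern; this is not the benchmark code itself, only the approach it now uses in place of gettimeofday.

    /* Time work enqueued on a stream with CUDA events; cudaEventElapsedTime
     * reports the elapsed time between the two recorded events in milliseconds. */
    #include <cuda_runtime.h>

    float time_on_stream(void (*launch)(cudaStream_t), cudaStream_t stream) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, stream);   /* timestamp before the timed work */
        launch(stream);                   /* enqueue the operation being measured */
        cudaEventRecord(stop, stream);    /* timestamp after the timed work */
        cudaEventSynchronize(stop);       /* block until 'stop' has completed */

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }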

Compatibility

NVSHMEM 3.0.6 has been tested with the following:

NVIDIA CUDA® Toolkit:

  • 11.8

  • 12.2

  • 12.4

  • 12.5

CPUs

  • x86 and NVIDIA Grace™ processors.

GPUs

  • Volta V100

  • Ampere A100

  • NVIDIA Hopper™

Limitations

  • NVSHMEM is not compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library.

    • Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2, or directly by using the MPI or SHMEM bootstraps.

    • PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM.

  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • The use of NVSHMEM’s UCX transport, which, if IB is absent, will use sockets for atomics.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs; they do not ensure global ordering and visibility (see the device-side sketch following this list).

  • When built with GDRCopy and using InfiniBand on earlier versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed with NVIDIA CUDA® driver release 470 and later and in the latest 460 driver.

  • IBGDA transport support on NVIDIA Grace Hopper™ systems is experimental in this release.

  • IBGDA does not work with DMABUF.

  • IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).

  • NVSHMEM is not supported on Grace + Ada L40 platforms.
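
Device-side sketch: the scope of the ordering guarantee noted above can be illustrated with a minimal device-side example. The symmetric objects data and flag are hypothetical and assumed to have been allocated with nvshmem_malloc elsewhere.

    /* After nvshmem_quiet returns, the put issued by this PE is complete and
     * visible at the destination PE, but no ordering is implied for operations
     * issued by other PEs (no global ordering or visibility). */
    #include <nvshmem.h>
    #include <nvshmemx.h>

    __global__ void notify_peer(int *data, int *flag, int peer) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            nvshmem_int_p(data, 42, peer);   /* write the payload to 'peer' */
            nvshmem_quiet();                 /* complete and make it visible at 'peer' only */
            nvshmem_int_p(flag, 1, peer);    /* then signal 'peer' that the data is ready */
        }
    }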

Fixed Issues

  • In an earlier release, the implementation of system-scoped atomic memory operations such as nvshmem_fence/atomic_<ops> and the signal operations nvshmem_signal_<op> was incorrect when communicating over NVLink. This issue has been fixed in this release.

  • In an earlier release, a bug existed in the remote transports during memory registration and deregistration that was related to the memory handle management cache. This issue has been fixed in this release.

  • In an earlier release, a bug existed in the NVSHMEM_IBGDA_DCI_MAP_BY=warp or NVSHMEM_IBGDA_RC_MAP_BY=warp QP mapping options, which led to suboptimal mapping of QPs to warps/DCTs. This issue has been fixed in this release.

  • In an earlier release, a bug existed when dynamically loading non-versioned libcuda.so and libnvml.so. This issue has been fixed in this release.

  • In an earlier release, a bug existed in computing the NVSHMEM team symmetric heap memory requirements during runtime initialization. This issue has been fixed in this release.

  • In an earlier release, a bug existed that was related to stale filepaths when aborting an NVSHMEM runtime. This issue has been fixed in this release.

  • In an earlier release, a bug existed when building NVSHMEM remote transports with HAVE_IBV_ACCESS_RELAXED_ORDERING set. This issue has been fixed in this release.

  • In an earlier release, a bug existed that manifested as a GPU device hang when using the RC QP type with IBGDA. This issue has been fixed in this release.

  • In an earlier release, a bug existed with an incorrect value of the broadcast LL algorithm threshold. This issue has been fixed in this release.

  • In an earlier release, a bug existed in the IBDEVX transport that was related to an incorrect endianness check. This issue has been fixed in this release.

  • In an earlier release, a memory leak existed in nvshmem_team_destroy that was related to a missing teardown for two internal subteams for each user-created team. This issue has been fixed in this release.

Breaking Changes

There are no breaking changes in this release.

Deprecated Features

  • Removed support for deprecated Power-9 systems.

  • Removed support for deprecated makefile build system. NVSHMEM now supports only the CMake build system.

Known Issues

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • When enabling the libfabric transport with NVSHMEM_LIBFABRIC_PROVIDER=EFA, support is experimental, and the following operations may result in an application kernel hang:

    • Device-side nvshmem_put/nvshmem_get with nvshmem_barrier.

    • Host-side nvshmem_put_on_stream/nvshmem_get_on_stream.

  • When enabling the UCX remote transport with NVSHMEM_REMOTE_TRANSPORT=UCX, data mismatches may be observed when scaling to 32 or more PEs on the DGX-2 platform.