NVIDIA® NVSHMEM 2.8.0 Release Notes

Abstract

NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 2.8.0 and earlier releases.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • The transport formerly called GPU Initiated Communication (GIC) has been renamed to InfiniBand GPUDirect Async (IBGDA) to reflect the underlying technology used by that transport.

  • Improvements to the all-to-all algorithm were made for both the IBGDA and IBRC transports. These changes specifically focused on latency-bound all-to-all operations.

  • Support for RC connections was added to IBGDA to optimize workloads on small PE sets.

Compatibility

NVSHMEM 2.8.0 has been tested with the following:

Limitations

  • NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Jobs can also be launched directly by using the MPI or SHMEM bootstraps. (A minimal launch sketch appears after this list.)

  • The libfabric transport does not yet support VMM; VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.

  • Libfabric support on Slingshot-11 networks requires setting the following environment variable: FI_CXI_OPTIMIZED_MRS=false.

  • VMM support is disabled by default on Power 9 systems because of a performance regression.

  • MPG support is not yet available on Power 9 systems.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • The use of NVSHMEM’s UCX transport that, if IB is absent, will use sockets for atomics.

  • NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked.

    • This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand. (A point-to-point synchronization sketch appears after this list.)

    • They do not ensure global ordering and visibility.

  • When built with GDRCopy and when using InfiniBand on older versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed with CUDA driver releases 470 and later and in the latest 460 driver.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

    With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when the symmetric heap is mapped by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization (an explicit-synchronization sketch appears after this list). For additional information about synchronous CUDA memory operations, see API synchronization behavior in the CUDA documentation.

  • IBGDA does not work with DMABUF.
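
As a hedged illustration of the PMI-2 bootstrap limitation above, the following is a minimal sketch of a program launched through the internal PMI-2 client. The file name, PE count, and launch command are illustrative, not taken from the NVSHMEM documentation; the bootstrap can equally be selected entirely from the job script.

    /* pmi2_hello.cu (illustrative name)
     * Assumed launch: NVSHMEM_BOOTSTRAP_PMI=PMI-2 srun --mpi=pmi2 -n 2 ./pmi2_hello */
    #include <stdio.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    int main(void) {
        nvshmem_init();                  /* bootstrap selected via NVSHMEM_BOOTSTRAP_PMI */
        int mype = nvshmem_my_pe();
        int npes = nvshmem_n_pes();
        printf("Hello from PE %d of %d\n", mype, npes);
        nvshmem_finalize();
        return 0;
    }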
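
The point-to-point guarantee noted above (ordering and visibility between the source and destination PEs only) is typically used as in the following hedged device-side sketch. The kernel names and flag protocol are illustrative; nvshmem_int_p, nvshmem_fence, and nvshmem_int_wait_until are standard NVSHMEM calls, and data and flag are assumed to be symmetric buffers allocated with nvshmem_malloc.

    #include <nvshmem.h>

    /* The producer PE deposits a value on the peer PE and then raises a flag;
     * nvshmem_fence() orders the two puts to the same destination PE. */
    __global__ void producer(int *data, int *flag, int peer) {
        nvshmem_int_p(data, 42, peer);    /* payload */
        nvshmem_fence();                  /* payload is ordered before the flag */
        nvshmem_int_p(flag, 1, peer);     /* notification */
    }

    /* The consumer PE waits on the flag; once the wait returns, the payload
     * written by the producer is visible on this PE (pairwise, not globally). */
    __global__ void consumer(int *data, int *flag, int *out) {
        nvshmem_int_wait_until(flag, NVSHMEM_CMP_EQ, 1);
        *out = *data;
    }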
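
The synchronization responsibility described in the VMM note above can be met, for example, with an explicit cudaDeviceSynchronize() (or a stream synchronize) before a synchronous memory operation that touches the symmetric heap. The sketch below is illustrative only; the kernel, sizes, and device-selection pattern are assumptions, not part of these notes.

    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    __global__ void fill(int *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = i;
    }

    int main(void) {
        nvshmem_init();
        cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));  /* one GPU per PE */

        const int n = 1 << 20;
        int *sym  = (int *) nvshmem_malloc(n * sizeof(int));    /* symmetric heap */
        int *host = (int *) malloc(n * sizeof(int));

        fill<<<(n + 255) / 256, 256>>>(sym, n);
        cudaDeviceSynchronize();   /* user-provided synchronization, required when the
                                      heap is VMM-mapped; implicit with cudaMalloc-backed heaps */
        cudaMemcpy(host, sym, n * sizeof(int), cudaMemcpyDeviceToHost);

        free(host);
        nvshmem_free(sym);
        nvshmem_finalize();
        return 0;
    }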

Fixed Issues

  • Fixed an issue in the IBGDA transport that caused all GPUs on the same host to use the same NIC.

  • The DMA-BUF registration issue is fixed in this release. Users no longer need to limit their allocation granularity to work around that issue.

Breaking Changes

  • Due to the name change of the IBGDA transport, all IBGDA-related environment variables have changed. Please see the API docs and installation guide for more information.

Deprecated Features

  • n/a

Known Issues

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • NVSHMEM buffers that span multiple physical memory regions are not handled well in IBGDA (a workaround sketch appears after this list). To work around this issue, either

    • Set NVSHMEM_DISABLE_CUDA_VMM=1 and NVSHMEM_SYMMETRIC_SIZE=<size> where size is large enough to cover your NVSHMEM memory usage, or

    • Set NVSHMEM_CUMEM_GRANULARITY=<size> such that it covers your application’s NVSHMEM memory consumption.

  • When using IBGDA, nvshmem_put, nvshmem_put_signal, and nvshmem_get do not support transferring more than 2 GiB of data in one call; a chunking sketch appears after this list.
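
As one hedged way to apply the first IBGDA workaround above, the environment variables can be exported from the job script or, as sketched below, set programmatically before nvshmem_init(). The 4 GiB value is only an illustration and must be sized to the application's actual NVSHMEM memory usage.

    #include <stdlib.h>
    #include <nvshmem.h>

    int main(void) {
        /* Workaround for IBGDA with buffers spanning multiple physical regions:
         * disable VMM and reserve a single symmetric heap large enough up front. */
        setenv("NVSHMEM_DISABLE_CUDA_VMM", "1", 1);
        setenv("NVSHMEM_SYMMETRIC_SIZE", "4294967296", 1);   /* 4 GiB, illustrative */
        nvshmem_init();
        /* ... application ... */
        nvshmem_finalize();
        return 0;
    }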
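
Until the 2 GiB limitation is lifted, a large host-initiated transfer can be split into smaller calls. The helper below is an illustrative sketch, not an NVSHMEM API; nvshmem_putmem and nvshmem_quiet are standard calls, and the 1 GiB chunk size is arbitrary.

    #include <stddef.h>
    #include <nvshmem.h>

    #define CHUNK ((size_t)1 << 30)   /* 1 GiB, safely below the 2 GiB per-call limit */

    /* Copy `bytes` bytes from the local buffer `src` to the symmetric buffer `dest`
     * on PE `pe`, issuing puts of at most CHUNK bytes each. */
    static void put_large(void *dest, const void *src, size_t bytes, int pe) {
        size_t off = 0;
        while (off < bytes) {
            size_t len = (bytes - off < CHUNK) ? (bytes - off) : CHUNK;
            nvshmem_putmem((char *) dest + off, (const char *) src + off, len, pe);
            off += len;
        }
        nvshmem_quiet();   /* ensure all chunks have been delivered to the destination */
    }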
