NVSHMEM Release 2.8.0

Welcome to the NVIDIA® NVSHMEM™ 2.8.0 release notes.

Key Features And Enhancements

This NVSHMEM release includes the following key features and enhancements:
  • Provides cuFFT support
  • Enhanced compatibility

Compatibility

NVSHMEM 2.8.0 has been tested with the following:

Limitations

  • NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2 (a bootstrap sketch follows this list). PMI-2 can also be made the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building.
  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.
  • Libfabric support on Slingshot-11 networks requires setting the following environment variable:
    • FI_CXI_OPTIMIZED_MRS=false
  • VMM support is disabled by default on Power 9 systems because of a performance regression.
  • MPG support is not yet available on Power 9 systems.
  • Systems with PCIe peer-to-peer communication require one of the following:
    • InfiniBand to support NVSHMEM atomics APIs.
    • The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics when InfiniBand is absent.
  • NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked.
    • This is because the linking of CUDA device symbols does not work across shared libraries.
  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs on systems with NVLink and InfiniBand (see the ordering sketch after this list).
    • They do not ensure global ordering and visibility.
  • When built with GDRCopy, and when using InfiniBand on older versions of the 460 driver and earlier branches, NVSHMEM cannot allocate the complete device memory because BAR1 space cannot be reused. This has been fixed in CUDA driver release 470 and later and in the latest 460 driver.
  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

    With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when the symmetric heap is mapped by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior. A synchronization sketch follows this list.

  • IBGDA does not work with DMA-BUF.
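
The following is a minimal bootstrap sketch for the PMI-2 selection described in the first limitation above. It assumes the PMI bootstrap is in use and sets NVSHMEM_BOOTSTRAP_PMI through setenv() purely for illustration; exporting the variable in the job script and launching with srun --mpi=pmi2 has the same effect.

    // Minimal bootstrap sketch (illustrative): select the PMI-2 client before
    // nvshmem_init(). Launch with: srun --mpi=pmi2 <executable>
    #include <cstdio>
    #include <cstdlib>
    #include <nvshmem.h>

    int main(void) {
        // NVSHMEM reads its environment variables during nvshmem_init(), so
        // the setting must be in place before initialization.
        setenv("NVSHMEM_BOOTSTRAP_PMI", "PMI-2", 1);

        nvshmem_init();
        printf("PE %d of %d bootstrapped\n", nvshmem_my_pe(), nvshmem_n_pes());
        nvshmem_finalize();
        return 0;
    }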
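
The ordering sketch below illustrates the guarantee noted above for nvshmem_quiet and nvshmem_wait_until: a put that is completed with nvshmem_quiet() before a flag is raised is visible to the destination PE once the flag arrives, but no third PE is guaranteed to observe the data at that point. The kernel names and the payload and flag variables are illustrative; run with two PEs.

    // Ordering sketch (run with two PEs): PE 0 writes a payload on PE 1,
    // completes the put with nvshmem_quiet(), then raises a flag. Once PE 1
    // observes the flag, the payload is visible on PE 1. No global ordering
    // is implied for other PEs.
    #include <cstdio>
    #include <cstdint>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    __global__ void producer(int *payload, uint64_t *flag, int peer) {
        nvshmem_int_p(payload, 42, peer);                       // write on the peer
        nvshmem_quiet();                                        // complete the put first
        nvshmemx_signal_op(flag, 1, NVSHMEM_SIGNAL_SET, peer);  // then raise the flag
    }

    __global__ void consumer(int *payload, uint64_t *flag) {
        nvshmem_uint64_wait_until(flag, NVSHMEM_CMP_EQ, 1);     // wait for the flag
        printf("consumer read %d\n", *payload);                 // payload is visible here
    }

    int main(void) {
        nvshmem_init();
        cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

        int *payload   = (int *) nvshmem_malloc(sizeof(int));
        uint64_t *flag = (uint64_t *) nvshmem_malloc(sizeof(uint64_t));
        cudaMemset(flag, 0, sizeof(uint64_t));
        nvshmem_barrier_all();                                  // flag cleared on all PEs

        if (nvshmem_my_pe() == 0) producer<<<1, 1>>>(payload, flag, 1);
        if (nvshmem_my_pe() == 1) consumer<<<1, 1>>>(payload, flag);
        cudaDeviceSynchronize();

        nvshmem_barrier_all();
        nvshmem_free(flag);
        nvshmem_free(payload);
        nvshmem_finalize();
        return 0;
    }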
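
The synchronization sketch below illustrates the responsibility noted above for a VMM-mapped symmetric heap. The fill_buffer kernel is a hypothetical stand-in for any asynchronous work that writes a symmetric buffer; with a cudaMalloc-backed heap the SYNC_MEMOPS attribute would order the synchronous copy behind that work automatically, while with a VMM-backed heap the application synchronizes explicitly.

    // Synchronization sketch (illustrative): with a VMM-mapped symmetric heap,
    // a synchronous cudaMemcpy from the heap is not implicitly ordered behind
    // in-flight device work, so the stream is synchronized explicitly first.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    __global__ void fill_buffer(int *buf, int n, int value) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = value;
    }

    int main(void) {
        nvshmem_init();
        cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

        const int n = 1024;
        int *sym  = (int *) nvshmem_malloc(n * sizeof(int));   // symmetric buffer
        int *host = (int *) malloc(n * sizeof(int));

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        fill_buffer<<<(n + 255) / 256, 256, 0, stream>>>(sym, n, nvshmem_my_pe());

        // cudaMalloc-backed heap: CU_POINTER_ATTRIBUTE_SYNC_MEMOPS would make
        // the synchronous copy below wait for the kernel automatically.
        // VMM-backed heap (CUDA 11.3+): the attribute is unavailable, so the
        // ordering must be established explicitly.
        cudaStreamSynchronize(stream);

        cudaMemcpy(host, sym, n * sizeof(int), cudaMemcpyDeviceToHost);
        printf("PE %d: first element = %d\n", nvshmem_my_pe(), host[0]);

        nvshmem_barrier_all();
        nvshmem_free(sym);
        free(host);
        cudaStreamDestroy(stream);
        nvshmem_finalize();
        return 0;
    }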

Fixed Issues

  • Fixed an issue in the IBGDA transport that caused all GPUs on the same host to use the same NIC.
  • Fixed the DMA-BUF registration issue; users no longer need to limit their allocation granularity to work around it.

Breaking Changes

Due to the name change of the IBGDA transport, all IBGDA-related environment variables have changed. See the API documentation and the installation guide for more information.

Known Issues

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
  • DMA-BUF registrations are only supported with buffers up to 4 GiB. For heaps or registrations larger than 4 GiB, nvidia_peermem or nv_peer_mem must be used.