NVSHMEM Release 2.9.0
Welcome to the NVIDIA® NVSHMEM™ 2.9.0 release notes.
Key Features And Enhancements
- Improvements to the CMake build system. CMake is now the default build system, and the Makefile build system is deprecated.
- Added loadable network transport modules.
- NVSHMEM device code can now be inlined to improve performance by enabling NVSHMEM_ENABLE_ALL_DEVICE_INLINING when building the NVSHMEM library.
- Improvements to collective communication performance.
- Updated the libfabric transport to fragment messages larger than the maximum length supported by the provider.
- Improvements to the IBGDA transport, including large message support, user buffer registration, blocking g/get/amo performance, CUDA module support, and several bug fixes.
- Introduced ABI compatibility for bootstrap modules. This release is backwards compatible with the ABI introduced in NVSHMEM 2.8.0.
- Added NVSHMEM_BOOTSTRAP_*_PLUGIN environment variables that can be used to override the default filename used when opening each bootstrap plugin.
- Improved error handling for GDRCopy.
- Added a check to detect when the same number of PEs is not run on all nodes.
- Added a check to detect availability of the nvidia_peermem kernel module.
- Reduced internal stream synchronizations to fix a compatibility bug with CUDA graph capture.
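As a minimal sketch, the build-time inlining option and the new bootstrap plugin override might be exercised as follows; the PMI instantiation of NVSHMEM_BOOTSTRAP_*_PLUGIN, the plugin filename, and the source path are illustrative, not prescribed by this release:

```shell
# Build time: enable device-code inlining via the option named above.
# (Source path is a placeholder; the command is echoed here as a sketch.)
CMAKE_FLAGS="-DNVSHMEM_ENABLE_ALL_DEVICE_INLINING=1"
echo "cmake $CMAKE_FLAGS /path/to/nvshmem/source"

# Run time: override the filename opened for one bootstrap plugin.
# (The PMI variable name and .so filename are illustrative.)
export NVSHMEM_BOOTSTRAP_PMI_PLUGIN="nvshmem_bootstrap_pmi2.so"
echo "plugin override: $NVSHMEM_BOOTSTRAP_PMI_PLUGIN"
```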
Compatibility
NVSHMEM 2.9.0 has been tested with the following:
Limitations
- NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be made the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building.
- The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.
- Libfabric support on Slingshot-11 networks requires setting the following environment variable:
- FI_CXI_OPTIMIZED_MRS=false
- VMM support is disabled by default on Power 9 systems because of a performance regression.
- MPG support is not yet available on Power 9 systems.
- Systems with PCIe peer-to-peer communication require one of the following:
- InfiniBand to support NVSHMEM atomics APIs.
- The use of NVSHMEM’s UCX transport, which, if InfiniBand is absent, will use sockets for atomics.
- NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked.
- This is because the linking of CUDA device symbols does not work across shared libraries.
- nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.
- When built with GDRCopy and using InfiniBand on older versions of the 460 driver and earlier branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in CUDA driver releases 470 and later and in the latest 460 driver.
- When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.
- With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
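Pulling the workarounds above together, a hedged sketch of the environment for a Slurm launch might look like the following; the application name and task counts are placeholders, and the srun line is echoed rather than executed:

```shell
# Sketch: environment settings drawn from the limitations listed above.
export NVSHMEM_BOOTSTRAP_PMI=PMI-2   # use the internal PMI-2 client on Cray systems
export NVSHMEM_DISABLE_CUDA_VMM=1    # libfabric transport does not yet support VMM
export FI_CXI_OPTIMIZED_MRS=false    # required on Slingshot-11 networks

# Launch with the PMI-2 bootstrap (placeholder binary and counts).
echo "srun --mpi=pmi2 -N 2 --ntasks-per-node=4 ./nvshmem_app"
```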
Fixed Issues
- A data consistency issue with CUDA graph capture support.
- An issue in IBGDA that prevented support for split buffers. Users no longer need to disable VMM or split buffers larger than 2 GiB.
- An issue preventing local buffer registration with IBGDA.
- An issue preventing CUDA module initialization with IBGDA.