NVSHMEM Release 2.7.0
Welcome to the NVIDIA® NVSHMEM™ 2.7.0 release notes.
Key Features And Enhancements
- Default Hopper support (i.e., sm_90 and compute_90).
- A new, experimental CMake build system.
- Performance improvements to the GPU Initiated Communication (GIC) transport. Specifically, the synchronization and concurrency paths in GIC were optimized to increase the overall message rate of the transport.
- Support for CUDA minor version compatibility in the NVSHMEM library and headers.
- Compatibility checks for the built-in bootstrap plugins.
- Limited DMA-BUF memory registration support. This enables using NVSHMEM core functionality without the nv_peer_mem or nvidia_peermem modules. DMA-BUF registrations are only supported up to 4 GiB in NVSHMEM 2.7.
- SO versioning for both the nvshmem_host shared library and the precompiled bootstrap modules.
- NVSHMEM now links statically against libcudart_static.a instead of libcudart.so. This increases the NVSHMEM library size but frees applications from having to provide the CUDA runtime dependency for NVSHMEM.
Limitations
- NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be made the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Alternatively, jobs can be launched directly by using the MPI or SHMEM bootstraps.
- The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.
- Libfabric support on Slingshot-11 networks requires setting the environment variable FI_CXI_OPTIMIZED_MRS=false.
- VMM support is disabled by default on Power 9 systems because of a performance regression.
- MPG (multiple processes per GPU) support is not yet available on Power 9 systems.
- Systems with PCIe peer-to-peer communication require one of the following:
  - InfiniBand to support NVSHMEM atomics APIs.
  - The use of NVSHMEM's UCX transport which, if InfiniBand is absent, will use sockets for atomics.
- NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked, because the linking of CUDA device symbols does not work across shared libraries (see the first sketch after this list).
- nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs on systems with NVLink and InfiniBand; they do not ensure global ordering and visibility (see the second sketch after this list).
- When built with GDRCopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because BAR1 space cannot be reused. This has been fixed in CUDA driver release 470 and later, and in the 460 driver branch.
- When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap. With CUDA 11.3 and later, NVSHMEM can instead map the symmetric heap by using the CUDA VMM APIs. However, CUDA does not support this attribute on VMM-mapped memory, so users are responsible for the synchronization (see the third sketch after this list). For additional information about synchronous CUDA memory operations, see API synchronization behavior.
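To make the linking split concrete, the first sketch below is a minimal, illustrative ring put; the kernel name, launch shape, and variables are ours, not from the release notes. The host calls such as nvshmem_init and nvshmem_malloc can resolve against the nvshmem_host shared library, while the device call nvshmem_int_p must be linked statically against the NVSHMEM device library, typically by compiling with relocatable device code (nvcc -rdc=true) and device-linking.

```
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>

// Device-side call: nvshmem_int_p must resolve against the static
// NVSHMEM device library; it cannot come from a shared library.
__global__ void put_rank(int *dst, int mype, int npes) {
    nvshmem_int_p(dst, mype, (mype + 1) % npes);  // write my rank to my right neighbor
}

int main(void) {
    nvshmem_init();                    // host API: may resolve from the nvshmem_host .so
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int *dst = (int *)nvshmem_malloc(sizeof(int));

    put_rank<<<1, 1>>>(dst, mype, npes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();             // complete all puts before reading

    int got;
    cudaMemcpy(&got, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", mype, got);

    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```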
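The second sketch illustrates the pairwise (not global) guarantee described above, assuming at least two PEs; the flag protocol and names are illustrative. After nvshmem_quiet on PE 0, the put is complete with respect to the destination PE 1, but no other PE may be assumed to observe the update without further synchronization.

```
#include <stdint.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void pairwise_visibility(int *data, uint64_t *flag) {
    int mype = nvshmem_my_pe();
    if (mype == 0) {
        nvshmem_int_p(data, 42, 1);    // put the payload to PE 1
        nvshmem_quiet();               // complete the put with respect to PE 1 ...
        nvshmem_uint64_p(flag, 1, 1);  // ... then raise the flag on PE 1
    } else if (mype == 1) {
        nvshmem_uint64_wait_until(flag, NVSHMEM_CMP_EQ, 1);
        // Safe: *data is 42 here; quiet ordered the put before the flag
        // between the source (PE 0) and this destination (PE 1).
        // Not guaranteed: that any third PE already observes the update;
        // the visibility guarantee is pairwise, not global.
    }
}

int main(void) {
    nvshmem_init();  // run with at least 2 PEs
    int *data = (int *)nvshmem_malloc(sizeof(int));
    uint64_t *flag = (uint64_t *)nvshmem_calloc(1, sizeof(uint64_t));

    void *args[] = {&data, &flag};
    // Collective launch is the recommended way to start kernels that block
    // on NVSHMEM point-to-point synchronization such as wait_until.
    nvshmemx_collective_launch((const void *)pairwise_visibility, 1, 1, args, 0, 0);
    cudaDeviceSynchronize();

    nvshmem_free(data);
    nvshmem_free(flag);
    nvshmem_finalize();
    return 0;
}
```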
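The third sketch shows the synchronization users take on when the symmetric heap is VMM-mapped (CUDA 11.3 and later, NVSHMEM_DISABLE_CUDA_VMM not set); the kernel and buffer names are illustrative. Because CU_POINTER_ATTRIBUTE_SYNC_MEMOPS is unsupported on VMM mappings, a synchronous cudaMemcpy touching the symmetric heap no longer implicitly waits for in-flight work, so the application synchronizes explicitly.

```
#include <cuda_runtime.h>
#include <nvshmem.h>

__global__ void fill(int *buf, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        buf[i] = i;
}

int main(void) {
    nvshmem_init();
    const int n = 1024;
    int *sym = (int *)nvshmem_malloc(n * sizeof(int));  // symmetric heap buffer
    int host[n];

    fill<<<1, 128>>>(sym, n);

    // With a cudaMalloc-backed heap, SYNC_MEMOPS would make the cudaMemcpy
    // below wait for the kernel implicitly. On a VMM-mapped heap the
    // attribute is unsupported, so synchronize explicitly first:
    cudaDeviceSynchronize();

    cudaMemcpy(host, sym, n * sizeof(int), cudaMemcpyDeviceToHost);

    nvshmem_free(sym);
    nvshmem_finalize();
    return 0;
}
```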
Fixed Issues
- An issue in the local buffer registration path `nvshmemx_buffer_register` where collisions between overlapping memory regions were not properly handled (see the sketch after this list).
- An issue causing validation errors in collective operations when all GPUs in a job are connected via PCIe without a remote transport using the proxy thread.
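For context, here is a minimal sketch of the local buffer registration path mentioned above, assuming the nvshmemx_buffer_register(addr, length) and nvshmemx_buffer_unregister(addr) entry points; the buffer size and usage are illustrative.

```
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();

    // Register a local (non-symmetric) device buffer so NVSHMEM can use it
    // as the local operand of communication calls.
    void *buf = NULL;
    size_t len = 1 << 20;
    cudaMalloc(&buf, len);
    nvshmemx_buffer_register(buf, len);

    // ... use buf as a local source/destination in put/get operations ...

    nvshmemx_buffer_unregister(buf);
    cudaFree(buf);
    nvshmem_finalize();
    return 0;
}
```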