NVSHMEM Release 2.7.0

Welcome to the NVIDIA® NVSHMEM™ 2.7.0 release notes.

Key Features And Enhancements

This NVSHMEM release includes the following key features and enhancements:
  • Default Hopper support (i.e., sm_90 and compute_90)

  • A new (Experimental) CMake build system

  • Performance improvements to the GPU Initiated Communication (GIC) transport, specifically to its synchronization and concurrency paths, which increase the transport's overall message rate.

  • Support for CUDA minor version compatibility in the NVSHMEM library and headers.

  • Compatibility checks for the built-in bootstrap plugins.
  • Limited DMA-BUF memory registration support. This enables using NVSHMEM core functionality without the nv_peer_mem or nvidia_peermem modules. DMA-BUF registrations are only supported up to 4 GiB in NVSHMEM 2.7.
  • SO Versioning for both the nvshmem_host shared library and the precompiled bootstrap modules.
  • NVSHMEM now links statically to libcudart_static.a instead of libcudart.so. This increases the NVSHMEM library size but removes the requirement that applications provide the CUDA runtime dependency for NVSHMEM.

Compatibility

NVSHMEM 2.7.0 has been tested with the following:

Limitations

  • NVSHMEM is not yet compatible with the PMI client library on Cray systems and must instead use its internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by passing --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be made the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Alternatively, jobs can be launched directly by using the MPI or SHMEM bootstraps.
  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.
  • Libfabric support on Slingshot-11 networks requires setting the environment variable FI_CXI_OPTIMIZED_MRS=false.
  • VMM support is disabled by default on Power 9 systems because of a performance regression.
  • MPG support is not yet available on Power 9 systems.

  • Systems with PCIe peer-to-peer communication require one of the following:
    • InfiniBand to support NVSHMEM atomics APIs.

    • NVSHMEM’s UCX transport, which uses sockets for atomics when InfiniBand is absent.

  • NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked.
    • This is because the linking of CUDA device symbols does not work across shared libraries.
  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs on systems with NVLink and InfiniBand.
    • They do not ensure global ordering and visibility; see the first sketch after this list.
  • When built with GDRcopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space.

    This has been fixed in CUDA driver release 470 and later and in the 460 branch.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

    With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when the symmetric heap is mapped with the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior in the CUDA Runtime API documentation. The second sketch after this list illustrates the explicit synchronization that is required.
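    The following host-side sketch illustrates the scope of the ordering guarantees noted above for nvshmem_quiet and the barrier routines. It is not taken from the release notes: the two-PE pattern, the values, and the flag variable are illustrative, and a standard NVSHMEM bootstrap launch is assumed.

    ```
    /* Sketch: nvshmem_quiet completes the put with ordering and visibility
     * guaranteed only between the initiating PE and the destination PE.
     * Global visibility is obtained with a barrier. */
    #include <nvshmem.h>
    #include <nvshmemx.h>
    #include <cuda_runtime.h>

    int main(void) {
        nvshmem_init();
        int mype = nvshmem_my_pe();
        int npes = nvshmem_n_pes();
        cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

        /* Symmetric allocations on every PE. */
        int *data = (int *)nvshmem_malloc(sizeof(int));
        int *flag = (int *)nvshmem_malloc(sizeof(int));

        if (mype == 0 && npes > 1) {
            nvshmem_int_p(data, 42, 1); /* put to PE 1 */
            nvshmem_quiet();            /* orders the put with respect to PE 1 only */
            nvshmem_int_p(flag, 1, 1);  /* safe to signal PE 1 after the quiet */
        }

        /* Other PEs cannot assume they see the update until a collective
         * synchronization such as this barrier has completed. */
        nvshmem_barrier_all();

        nvshmem_free(flag);
        nvshmem_free(data);
        nvshmem_finalize();
        return 0;
    }
    ```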
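    The next sketch shows the explicit synchronization that becomes the user's responsibility when the symmetric heap is mapped with the VMM APIs. The producer kernel, stream usage, and buffer sizes are illustrative assumptions, not part of the release notes.

    ```
    /* Sketch: with a cudaMalloc-backed heap, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS makes
     * the synchronous cudaMemcpy below wait for outstanding device work on sym_buf,
     * including work issued in other streams. With a VMM-backed heap that attribute
     * is not applied, so the application must synchronize explicitly first. */
    #include <nvshmem.h>
    #include <cuda_runtime.h>

    __global__ void producer(int *sym_buf, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) sym_buf[i] = (int)i; /* stand-in for real device work */
    }

    void read_back(int *sym_buf, int *host_dst, size_t n, cudaStream_t stream) {
        producer<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(sym_buf, n);

        /* Required with the VMM-backed heap: wait for the producer before the
         * synchronous copy, since SYNC_MEMOPS will not do it automatically. */
        cudaStreamSynchronize(stream);

        cudaMemcpy(host_dst, sym_buf, n * sizeof(int), cudaMemcpyDeviceToHost);
    }
    ```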

Fixed Issues

  • An issue in the local buffer registration path (`nvshmemx_buffer_register`) where collisions between overlapping memory regions were not handled correctly. See the sketch after this list for basic usage of this API.
  • An issue causing validation errors in collective operations when all GPUs in a job are connected via PCIe and no remote transport that uses the proxy thread is present.
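
    For context on the first fix, the sketch below shows basic use of the local buffer registration API. The buffer size, the put, and the error handling are illustrative assumptions; the fix itself concerns how overlapping registrations are tracked internally.

    ```
    /* Sketch: register an ordinary (non-symmetric) device buffer so it can be used
     * as a local source for NVSHMEM communication, then unregister it before freeing. */
    #include <nvshmem.h>
    #include <nvshmemx.h>
    #include <cuda_runtime.h>

    void send_from_local(int *sym_dest, size_t n, int peer) {
        int *local_src = NULL;
        cudaMalloc((void **)&local_src, n * sizeof(int));

        if (nvshmemx_buffer_register(local_src, n * sizeof(int)) != 0) {
            /* registration failed; handle the error */
            cudaFree(local_src);
            return;
        }

        nvshmem_int_put(sym_dest, local_src, n, peer); /* local_src as the local source */
        nvshmem_quiet();

        nvshmemx_buffer_unregister(local_src);
        cudaFree(local_src);
    }
    ```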

Breaking Changes

  • Support for Pascal devices was removed.
  • Users may still compile NVSHMEM from source and run it on Pascal GPUs by setting the NVCC_GENCODE options accordingly, but no further bug fixes or support for Pascal devices will be added to NVSHMEM.

Known Issues

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
  • DMA-BUF registrations are only supported with buffers up to 4 GiB. For heaps or registrations larger than 4 GiB, nvidia_peermem or nv_peer_mem must be used.