NVSHMEM Release 2.6.0

Welcome to the NVIDIA® NVSHMEM™ 2.6.0 release notes.

Key Features And Enhancements

This NVSHMEM release includes the following key features and enhancements:
  • Added a new GPU-initiated communication transport that allows kernel-initiated communication to be issued directly to the NIC, bypassing the CPU proxy thread. This transport is currently provided in experimental mode and is disabled by default. Please refer to the installation guide for how to enable it.

  • Updated the libfabric transport with initial support for Slingshot-11 networks. Performance tuning for the libfabric transport is ongoing.

  • Added collective algorithms for bcast/fcollect/reduce that use a low latency (LL) optimization, sending data and synchronization together, resulting in significant performance improvements.

  • Added warp- and block-scope implementations of the recursive exchange algorithm for reduce collectives (a device-side sketch follows this list).

  • Fixed a bug in the host/on-stream RMA APIs for very large data transfers.

  • Fixed a bug in the implementation of the nvshmem_fence and nvshmemx_quiet_on_stream APIs (an ordering sketch follows this list).
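
As a rough illustration of the device-side reduction entry points mentioned above, the sketch below launches a single block that performs an integer sum reduction across all PEs. The kernel name, element count, and the choice of the block-scope variant nvshmemx_int_sum_reduce_block are illustrative, not part of this release note; consult the NVSHMEM API reference for the supported types, operations, and scopes.

    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>
    #include <cstdio>
    #include <vector>

    // Block-scope sum reduction: every thread of the launching block cooperates
    // in the reduction across all PEs of NVSHMEM_TEAM_WORLD.
    __global__ void sum_kernel(int *dst, const int *src, size_t nelems) {
        nvshmemx_int_sum_reduce_block(NVSHMEM_TEAM_WORLD, dst, src, nelems);
    }

    int main() {
        nvshmem_init();
        int mype = nvshmem_my_pe();
        size_t nelems = 128;

        int *src = (int *)nvshmem_malloc(nelems * sizeof(int));
        int *dst = (int *)nvshmem_malloc(nelems * sizeof(int));

        // Each PE contributes the value (mype + 1) for every element.
        std::vector<int> h_src(nelems, mype + 1);
        cudaMemcpy(src, h_src.data(), nelems * sizeof(int), cudaMemcpyHostToDevice);

        // Kernels that call synchronizing device APIs (such as reductions) should
        // be launched collectively so that all PEs enter the kernel together.
        void *args[] = {&dst, &src, &nelems};
        nvshmemx_collective_launch((const void *)sum_kernel, 1, 256, args, 0, 0);
        cudaDeviceSynchronize();

        std::vector<int> h_dst(nelems);
        cudaMemcpy(h_dst.data(), dst, nelems * sizeof(int), cudaMemcpyDeviceToHost);
        printf("PE %d: dst[0] = %d\n", mype, h_dst[0]);

        nvshmem_free(src);
        nvshmem_free(dst);
        nvshmem_finalize();
        return 0;
    }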
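
The next sketch shows the usual pattern in which the two APIs named above appear: a device-side nvshmem_fence() orders a payload put before a flag put to the same PE, and a host-side nvshmemx_quiet_on_stream() is enqueued after the kernel so that work placed on the stream afterwards observes completion of the preceding NVSHMEM operations. Buffer names, the payload value, and the peer selection are illustrative.

    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    // nvshmem_fence() orders the two puts to the same destination PE, so the
    // flag cannot become visible at the target before the payload.
    __global__ void ordered_put(int *data, int *flag, int value, int peer) {
        nvshmem_int_p(data, value, peer);
        nvshmem_fence();                  // payload ordered before flag (per destination PE)
        nvshmem_int_p(flag, 1, peer);
    }

    int main() {
        nvshmem_init();
        int mype = nvshmem_my_pe();
        int peer = (mype + 1) % nvshmem_n_pes();

        int *data = (int *)nvshmem_malloc(sizeof(int));
        int *flag = (int *)nvshmem_calloc(1, sizeof(int));

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        ordered_put<<<1, 1, 0, stream>>>(data, flag, 42, peer);

        // Enqueue a quiet on the stream so that anything queued after this point
        // observes completion of the NVSHMEM operations issued above.
        nvshmemx_quiet_on_stream(stream);
        cudaStreamSynchronize(stream);

        nvshmem_barrier_all();
        nvshmem_free(data);
        nvshmem_free(flag);
        nvshmem_finalize();
        return 0;
    }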

Compatibility

NVSHMEM 2.6.0 has been tested with the following:

Limitations

  • NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be made the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building (a launch sketch follows this list).
  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.
  • Libfabric support on Slingshot-11 networks requires setting the following environment variable:
    • FI_CXI_OPTIMIZED_MRS=false
  • VMM support is disabled by default on Power 9 systems because of a performance regression.
  • MPG support is not yet available on Power 9 systems.
  • Systems with PCIe peer-to-peer communication require one of the following:
    • InfiniBand to support NVSHMEM atomics APIs.
    • The use of NVSHMEM’s UCX transport, which, if InfiniBand is absent, will use sockets for atomics.
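
For illustration, the sketch below combines the settings listed above into one possible launch on a Cray Slingshot-11 system: the PMI-2 bootstrap for Slurm, VMM disabled for the libfabric transport, and the Slingshot-11 variable. The binary name, PE count, and the practice of setting the variables on the srun command line are placeholders; adapt them to the actual system and job launcher.

    // Illustrative launch (all names and counts are placeholders):
    //
    //   NVSHMEM_BOOTSTRAP_PMI=PMI-2 NVSHMEM_DISABLE_CUDA_VMM=1 \
    //   FI_CXI_OPTIMIZED_MRS=false \
    //   srun --mpi=pmi2 -n 2 ./hello_nvshmem
    //
    #include <cstdio>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    int main() {
        nvshmem_init();   // bootstraps through PMI-2 when configured as above
        printf("Hello from PE %d of %d\n", nvshmem_my_pe(), nvshmem_n_pes());
        nvshmem_finalize();
        return 0;
    }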

Fixed Issues

There are no fixed issues in this release.

Breaking Changes

There are no breaking changes in this release.

Known Issues

  • NVSHMEM device APIs can only be statically linked.

    This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand (a pairwise-ordering sketch appears at the end of this list).

    They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
  • When built with GDRcopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space.

    This has been fixed in the 470 CUDA driver branch and in later releases of the 460 driver branch.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

    With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when the symmetric heap is mapped with the VMM APIs, CUDA does not support this attribute, and users are responsible for synchronization (a synchronization sketch appears at the end of this list). For additional information about synchronous CUDA memory operations, see API synchronization behavior.
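
The sketch below illustrates the pairwise guarantee described in the ordering item above: PE 0 delivers a payload and then a ready flag to PE 1, and PE 1 waits on the flag before reading the payload. The ordering and visibility obtained this way hold between PE 0 and PE 1 only; a third PE must synchronize separately before it can rely on seeing the payload. The kernel name and values are illustrative, and the example assumes a run with at least two PEs.

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    // Pairwise ordering: the fence on PE 0 and the wait on PE 1 make the payload
    // visible to PE 1 once the flag is observed. No third PE gains visibility.
    __global__ void pairwise(int *data, int *flag, int mype) {
        if (mype == 0) {
            nvshmem_int_p(data, 123, 1);                      // payload to PE 1
            nvshmem_fence();                                  // payload ordered before flag
            nvshmem_int_p(flag, 1, 1);                        // ready flag to PE 1
        } else if (mype == 1) {
            nvshmem_int_wait_until(flag, NVSHMEM_CMP_EQ, 1);  // implies payload is visible on PE 1
            printf("PE 1 sees %d\n", *data);
        }
    }

    int main() {
        nvshmem_init();
        int mype = nvshmem_my_pe();
        int *data = (int *)nvshmem_calloc(1, sizeof(int));
        int *flag = (int *)nvshmem_calloc(1, sizeof(int));

        pairwise<<<1, 1>>>(data, flag, mype);
        cudaDeviceSynchronize();

        nvshmem_barrier_all();
        nvshmem_free(data);
        nvshmem_free(flag);
        nvshmem_finalize();
        return 0;
    }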
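
The final sketch illustrates the synchronization responsibility described above when the symmetric heap is mapped with the VMM APIs: the stream is synchronized explicitly before a synchronous cudaMemcpy reads data that a kernel produced on that stream. The kernel, buffer size, and stream usage are illustrative; with the cudaMalloc-based heap, the SYNC_MEMOPS attribute provides this synchronization implicitly.

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    // Fill a buffer on the symmetric heap from a kernel running on a stream.
    __global__ void fill(int *buf, int n, int value) {
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            buf[i] = value;
    }

    int main() {
        nvshmem_init();
        const int n = 1024;
        int *sym = (int *)nvshmem_malloc(n * sizeof(int));

        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
        fill<<<1, 256, 0, stream>>>(sym, n, nvshmem_my_pe());

        // With a VMM-mapped heap, the synchronous copy below is not implicitly
        // ordered after the kernel, so synchronize the stream explicitly first.
        cudaStreamSynchronize(stream);

        int host[4];
        cudaMemcpy(host, sym, sizeof(host), cudaMemcpyDeviceToHost);
        printf("PE %d: sym[0] = %d\n", nvshmem_my_pe(), host[0]);

        nvshmem_free(sym);
        nvshmem_finalize();
        return 0;
    }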