NVSHMEM Release 2.1.2

These are the release notes for NVIDIA® NVSHMEM™ 2.1.2.

Key Features And Enhancements

This NVSHMEM release includes the following key features and enhancements:
  • Added a new UCX internode communication transport layer.

    Note: UCX is experimental for this release.
  • Added support for the automatic warp-level coalescing of nvshmem_g operations.

  • Added support for put-with-signal operations on CUDA streams.

  • Added support to map the symmetric heap by using the cuMem APIs.

  • Improved the performance of the single-threaded NVSHMEM put/get device API.

  • Added the NVSHMEM_MAX_TEAMS environment variable to specify the maximum number of teams that can be created.

  • Improved the host and on-stream Alltoall performance by using NCCL.

  • Fixed a bug in the compare-and-swap operation that caused several bytes of the compare operand to be lost.

  • Improved support for single-node environments without InfiniBand.

  • Added CPU core affinity to debugging output.

  • Added support for the CUDA 11.3 cudaDeviceFlushGPUDirectRDMAWrites API for consistency.

  • Improved support for the NVIDIA Tools Extension (NVTX) to enable performance analysis through NVIDIA Nsight.

  • Removed the NVSHMEM_IS_P2P_RUN environment variable because the runtime now determines this setting automatically.

  • Made improvements to NVSHMEM example codes.

  • Added the NVSHMEM_REMOTE_TRANSPORT environment variable to select the networking layer that is used for communication between nodes.

  • Set maxrregcount to 32 for non-inlined device functions so that calling these NVSHMEM functions does not negatively affect kernel occupancy.
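As a small illustration of the device-side put/get API mentioned above, the following sketch issues a single-element put from a kernel. It is a minimal example, not code from this release: the buffer and kernel names are hypothetical, and it assumes NVSHMEM has been launched with at least two PEs (for example via nvshmrun or an MPI bootstrap).

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Sketch: each PE writes its own PE id into its right neighbor's
// symmetric buffer. "dest" must be symmetric (nvshmem_malloc);
// all names here are illustrative.
__global__ void ping(int *dest, int mype, int npes) {
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dest, mype, peer);  // single-element device-side put
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int *dest = (int *)nvshmem_malloc(sizeof(int));

    ping<<<1, 1>>>(dest, mype, npes);
    nvshmemx_barrier_all_on_stream(0);  // order the put before host access
    cudaDeviceSynchronize();

    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}
```

The same pattern applies to the bulk nvshmem_<typename>_put/get routines; the single-threaded variants of those calls are among the paths whose performance this release improves.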


NVSHMEM 2.1.2 has been tested with the following:


Systems with PCIe peer-to-peer communication require InfiniBand to support NVSHMEM atomics APIs.

Fixed Issues

There are no fixed issues in this release.

Breaking Changes

  • Removed the following deprecated constants:



  • Removed support for the deprecated nvshmem_wait API.

Known Issues

  • NVSHMEM can only be linked statically.

    This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand.

    They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
  • In some cases, nvshmem_<typename>_g over InfiniBand and RoCE has been reported to return stale data.

    We are continuing to investigate this issue. In the meantime, you can use nvshmem_<typename>_atomic_fetch as a workaround for nvshmem_<typename>_g; note, however, that the two APIs have different performance characteristics.

  • When built with GDRCopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because the BAR1 space cannot be reused.

    This will be fixed in future CUDA driver releases in the 470 (or later) and 460 branches.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

    With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
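For the stale-data issue described above, the suggested workaround can be sketched as follows. This is an illustrative fragment only: the kernel and variable names are hypothetical, and it assumes "flag" is a symmetric int object owned by the remote PE.

```cuda
#include <nvshmem.h>

// Workaround sketch: read a remote value with an atomic fetch
// instead of nvshmem_int_g over InfiniBand/RoCE.
__global__ void read_remote(int *flag, int *out, int peer) {
    // int v = nvshmem_int_g(flag, peer);          // may return stale data
    int v = nvshmem_int_atomic_fetch(flag, peer);  // workaround: atomic fetch
    *out = v;
}
```

As noted above, the atomic-fetch path has different performance characteristics from nvshmem_<typename>_g, so measure before adopting it broadly.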