NVSHMEM Release 2.4.1
These are the release notes for NVIDIA® NVSHMEM™ 2.4.1.
Key Features And Enhancements
- Added limited support for Multiple Processes per GPU (MPG) on x86 platforms.
- Added a local buffer registration API that allows non-symmetric buffers to be used as local buffers in the NVSHMEM API.
- Added support for dynamic symmetric heap allocation, which eliminates the need to specify NVSHMEM_SYMMETRIC_SIZE.
- Support for large RMA messages.
- To build NVSHMEM without ibrc support, set NVSHMEM_IBRC_SUPPORT=0 in the environment before you build. This allows you to build and run NVSHMEM without the GDRCopy and OFED dependencies.
- Support for calling nvshmem_init/finalize multiple times with an MPI bootstrap.
- Improved testing coverage (large messages, exercising full GPU memory, and so on).
- Improved the default PE to NIC assignment for NVIDIA DGX-2™ systems.
- Optimized channel request processing by using the CPU proxy thread.
- Added support for the shmem_global_exit API.
- Removed redundant barriers to improve the collectives’ performance.
- Significant code refactoring to use templates instead of macros for internal functions.
- Improved performance for device-side blocking RMA and strided RMA APIs.
- Bug fix for buffers with large offsets into the NVSHMEM symmetric heap.
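The local buffer registration feature can be sketched as below. This assumes the nvshmemx_buffer_register/nvshmemx_buffer_unregister entry points described in the NVSHMEM API documentation; error handling is omitted, and an NVSHMEM-capable multi-GPU system is required to run it.

```c
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();

    size_t len = 1 << 20;
    void *local;                      /* ordinary, non-symmetric allocation */
    cudaMalloc(&local, len);

    /* Symmetric destination buffer, allocated on every PE. */
    void *sym = nvshmem_malloc(len);

    /* Register the non-symmetric buffer so it may be used as the
     * local side of NVSHMEM RMA operations. */
    nvshmemx_buffer_register(local, len);

    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();
    nvshmem_putmem(sym, local, len, peer);   /* local is the source */
    nvshmem_quiet();

    nvshmemx_buffer_unregister(local);
    nvshmem_free(sym);
    cudaFree(local);
    nvshmem_finalize();
    return 0;
}
```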
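The ibrc-free build described above amounts to the following; the install prefix and make invocation are illustrative, so consult the installation guide for your packaging.

```shell
# Disable the ibrc transport so GDRCopy and OFED are not required.
export NVSHMEM_IBRC_SUPPORT=0
# NVSHMEM_PREFIX (install destination) is illustrative.
make -j"$(nproc)" NVSHMEM_PREFIX="$HOME/nvshmem" install
```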
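The repeated init/finalize support can be exercised with a sketch like the one below. The nvshmemx_init_attr/NVSHMEMX_INIT_WITH_MPI_COMM pattern follows the NVSHMEM bootstrap documentation; an MPI launcher and an NVSHMEM build with MPI support are assumed.

```c
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm comm = MPI_COMM_WORLD;
    nvshmemx_init_attr_t attr;
    attr.mpi_comm = &comm;

    /* First init/finalize cycle on top of the MPI bootstrap. */
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
    nvshmem_finalize();

    /* Re-initialization in the same process is now supported. */
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
    nvshmem_finalize();

    MPI_Finalize();
    return 0;
}
```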
Limitations
- VMM support is disabled by default on Power 9 systems because of a performance regression.
- MPG support is not yet available on Power 9 systems.
- Systems with PCIe peer-to-peer communication require one of the following:
  - InfiniBand to support NVSHMEM atomics APIs.
  - The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics when InfiniBand is absent.
Known Issues
- NVSHMEM can only be linked statically. This is because the linking of CUDA device symbols does not work across shared libraries.
- nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand. They do not ensure global ordering and visibility.
- Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
- When built with GDRCopy, and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This will be fixed in future CUDA driver releases in the 470 (or later) and 460 branches.
- When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap. With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when the symmetric heap is mapped by using the VMM APIs, CUDA does not support this attribute, and users are responsible for synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
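When the heap is mapped through the VMM path, the synchronization the attribute used to provide must be done by hand before synchronous CUDA memory operations. A minimal sketch, where the helper name and stream argument are illustrative rather than NVSHMEM API:

```c
#include <cuda_runtime.h>

/* heap_src points into a VMM-mapped NVSHMEM symmetric heap. Because
 * CU_POINTER_ATTRIBUTE_SYNC_MEMOPS is not set on VMM allocations, a
 * synchronous cudaMemcpy no longer waits for in-flight device work on
 * that memory; synchronize explicitly first. */
void copy_from_heap(void *host_dst, const void *heap_src, size_t len,
                    cudaStream_t stream) {
    cudaStreamSynchronize(stream);  /* drain kernels/RMA touching heap_src */
    cudaMemcpy(host_dst, heap_src, len, cudaMemcpyDeviceToHost);
}
```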