NVSHMEM Release 2.4.1
These are the release notes for NVIDIA® NVSHMEM™ 2.4.1.
Key Features And Enhancements
- Added limited support for Multiple Processes per GPU (MPG) on x86 platforms.
- Added a local buffer registration API that allows non-symmetric buffers to be used as local buffers in the NVSHMEM API.
- Added support for dynamic symmetric heap allocation, which eliminates the need to specify NVSHMEM_SYMMETRIC_SIZE.
- Support for large RMA messages.
- To build NVSHMEM without ibrc support, set NVSHMEM_IBRC_SUPPORT=0 in the environment before you build. This allows you to build and run NVSHMEM without the GDRCopy and OFED dependencies.
- Support for calling nvshmem_init/finalize multiple times with an MPI bootstrap.
- Improved testing coverage (large messages, exercising full GPU memory, and so on).
- Improved the default PE to NIC assignment for NVIDIA DGX-2™ systems.
- Optimized channel request processing by using the CPU proxy thread.
- Added support for the shmem_global_exit API.
- Removed redundant barriers to improve the collectives’ performance.
- Significant code refactoring to use templates instead of macros for internal functions.
- Improved performance for device-side blocking RMA and strided RMA APIs.
- Bug fix for buffers with large offsets into the NVSHMEM symmetric heap.
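The local buffer registration feature can be sketched as below. This assumes the nvshmemx_buffer_register/nvshmemx_buffer_unregister entry points described in the NVSHMEM API documentation; error handling is omitted, and an NVSHMEM-capable multi-GPU system is required to run it.

```c
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();

    size_t len = 1 << 20;
    void *local;                      /* ordinary, non-symmetric allocation */
    cudaMalloc(&local, len);

    /* Symmetric destination buffer, allocated on every PE. */
    void *sym = nvshmem_malloc(len);

    /* Register the non-symmetric buffer so it may be used as the
     * local side of NVSHMEM RMA operations. */
    nvshmemx_buffer_register(local, len);

    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();
    nvshmem_putmem(sym, local, len, peer);   /* local is the source */
    nvshmem_quiet();

    nvshmemx_buffer_unregister(local);
    nvshmem_free(sym);
    cudaFree(local);
    nvshmem_finalize();
    return 0;
}
```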
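The ibrc-free build described above amounts to the following; the install prefix and make invocation are illustrative, so consult the installation guide for your packaging.

```shell
# Disable the ibrc transport so GDRCopy and OFED are not required.
export NVSHMEM_IBRC_SUPPORT=0
# NVSHMEM_PREFIX (install destination) is illustrative.
make -j"$(nproc)" NVSHMEM_PREFIX="$HOME/nvshmem" install
```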
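The repeated init/finalize support can be exercised with a sketch like the one below. The nvshmemx_init_attr/NVSHMEMX_INIT_WITH_MPI_COMM pattern follows the NVSHMEM bootstrap documentation; an MPI launcher and an NVSHMEM build with MPI support are assumed.

```c
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm comm = MPI_COMM_WORLD;
    nvshmemx_init_attr_t attr;
    attr.mpi_comm = &comm;

    /* First init/finalize cycle on top of the MPI bootstrap. */
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
    nvshmem_finalize();

    /* Re-initialization in the same process is now supported. */
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
    nvshmem_finalize();

    MPI_Finalize();
    return 0;
}
```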
Limitations
- VMM support is disabled by default on Power 9 systems because of a performance regression.
- MPG support is not yet available on Power 9 systems.
- Systems with PCIe peer-to-peer communication require one of the following:
  - InfiniBand to support NVSHMEM atomics APIs.
  - The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics when InfiniBand is absent.
Known Issues
- NVSHMEM can only be linked statically. This is because the linking of CUDA device symbols does not work across shared libraries.
- nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand. They do not ensure global ordering and visibility.
- Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
- When built with GDRCopy, and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This will be fixed in future CUDA driver releases in the 470 (or later) and 460 branches.
- When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap. With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when the symmetric heap is mapped by using the VMM APIs, CUDA does not support this attribute, and users are responsible for synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
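When the heap is mapped through the VMM path, the synchronization the attribute used to provide must be done by hand before synchronous CUDA memory operations. A minimal sketch, where the helper name and stream argument are illustrative rather than NVSHMEM API:

```c
#include <cuda_runtime.h>

/* heap_src points into a VMM-mapped NVSHMEM symmetric heap. Because
 * CU_POINTER_ATTRIBUTE_SYNC_MEMOPS is not set on VMM allocations, a
 * synchronous cudaMemcpy no longer waits for in-flight device work on
 * that memory; synchronize explicitly first. */
void copy_from_heap(void *host_dst, const void *heap_src, size_t len,
                    cudaStream_t stream) {
    cudaStreamSynchronize(stream);  /* drain kernels/RMA touching heap_src */
    cudaMemcpy(host_dst, heap_src, len, cudaMemcpyDeviceToHost);
}
```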