NVSHMEM Release 2.0.2 EA
These are the release notes for NVIDIA® NVSHMEM™ 2.0.2 EA.
Key Features And Enhancements
- Added the teams and team-based collectives APIs from OpenSHMEM 1.5.
- Added support to use the NVIDIA® Collective Communication Library (NCCL) for optimized NVSHMEM host and on-stream collectives.
  Note: This feature is not yet supported on Power 9 systems.
- Added support for RDMA over Converged Ethernet (RoCE) networks.
- Added support for PMI-2 to enable an NVSHMEM job launch with srun/SLURM.
- Added support for PMIx to enable an NVSHMEM job launch with PMIx-compatible launchers, such as Slurm and Open MPI.
- Uniformly reformatted the perftest benchmark output.
- Added support for the putmem_signal and signal_wait_until APIs.
- Improved support for single-node environments without InfiniBand.
- Fixed a bug that occurred when large numbers of fetch atomic operations were performed on InfiniBand.
- Improved topology awareness in NIC-to-GPU assignments for DGX A100 systems.
Fixed Issues
- Concurrent NVSHMEM collective operations with active sets are not supported.
- Concurrent NVSHMEM memory allocation operations and collective operations are not supported.
  The OpenSHMEM specification has clarified that only memory management routines that operate on NVSHMEM_TEAM_WORLD, and no other collectives on that team, are permitted concurrently.
Known Issues
- NVSHMEM and libraries that use NVSHMEM can only be built as static libraries and not as shared libraries.
  This is because the linking of CUDA device symbols does not work across shared libraries.
- nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand.
  They do not ensure global ordering and visibility.
- Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
- In some cases, nvshmem_<typename>_g over InfiniBand and RoCE has been reported to return stale data.
  We are continuing to investigate this issue. In the meantime, you can use nvshmem_<typename>_atomic_fetch as a workaround for nvshmem_<typename>_g, although the two operations have different performance characteristics.
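
The stale-data workaround above might look like the following minimal sketch for the int type. The kernel name read_neighbor, the variable names, and the ring-neighbor access pattern are illustrative and do not come from these notes. Selecting the GPU with mype % ndev is also an assumption: it presumes PEs on a node are numbered consecutively and that there is at least one GPU per PE on the node. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Hypothetical kernel: read one int from a remote PE's symmetric memory.
__global__ void read_neighbor(int *sym_val, int *result, int peer) {
    // Workaround: use an atomic fetch where nvshmem_int_g(sym_val, peer)
    // would normally be used.
    *result = nvshmem_int_atomic_fetch(sym_val, peer);
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    // Assumption: consecutive PE numbering per node, one GPU per PE.
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(mype % ndev);

    // Symmetric allocation: every PE allocates the same-sized object.
    int *sym_val = (int *)nvshmem_malloc(sizeof(int));
    int *result = nullptr;
    cudaMalloc(&result, sizeof(int));

    // Each PE stores its own rank in its symmetric variable.
    cudaMemcpy(sym_val, &mype, sizeof(int), cudaMemcpyHostToDevice);
    nvshmem_barrier_all();  // all PEs have written before anyone reads

    int peer = (mype + 1) % npes;  // right-hand neighbor in a ring
    read_neighbor<<<1, 1>>>(sym_val, result, peer);
    cudaDeviceSynchronize();

    int host_result = -1;
    cudaMemcpy(&host_result, result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d read %d from PE %d\n", mype, host_result, peer);

    nvshmem_barrier_all();  // no PE frees memory while others may still read
    cudaFree(result);
    nvshmem_free(sym_val);
    nvshmem_finalize();
    return 0;
}
```

Because the kernel calls NVSHMEM device functions, a program like this would need to be compiled with relocatable device code enabled and linked against the static NVSHMEM library, consistent with the static-library limitation listed above. The atomic fetch goes through the atomic path rather than the plain get path, which is why its performance can differ from nvshmem_int_g.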