NVIDIA® NVSHMEM 3.1.7 Release Notes
NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides an NVIDIA CUDA® kernel-side interface that allows NVIDIA CUDA® threads to access any location in the symmetrically-distributed memory.
The release notes describe the key features, software enhancements and improvements, and known issues for NVIDIA® NVSHMEM 3.1.7 and earlier releases.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
- Added support for NVLINK SHARP (NVLS) based collective algorithms on x86 + Hopper and Grace Hopper architecture based single-node and multi-node NVLINK platforms for popular deep-learning collective communications (ReduceScatter, Allgather, and Allreduce) device and on-stream APIs. This feature improves small-message latency by 2-3x compared with one-shot algorithms over NVLINK. (See the on-stream Allreduce sketch after this list.)
- Added support for GPU kernels that need a low-level query API to NVLS-enabled symmetric memory using the nvshmemx_mc_ptr host and device API for a given target team. (See the nvshmemx_mc_ptr sketch after this list.)
- Added support for a new low-latency protocol (LL128) for the Allgather collective communication device and on-stream APIs.
- Enhanced support for the existing low-latency protocol (LL) warp-scoped collectives to provide a 2x speedup over traditional algorithms when scaling up to 32 GPUs.
- Added support for the half-precision (FP16/BF16) formats in the collective communication (ReduceScatter, Allgather, and Allreduce) device and on-stream APIs.
- Added support for Python wheels using the PyPI repository and for rpm/deb package distribution.
- Added support for dynamic RDMA Global Identifier (GID) discovery for RoCE transports. This feature enables automatic fallback to the discovered GID without requiring users to specify the GID using a runtime variable.
- Added support for a heterogeneous library build system. This feature allows the NVSHMEM static library to be built with a CUDA version separate from the NVSHMEM host library, which enables new features such as NVLS in the host library and allows applications compiled against lower versions of CUDA to link to the NVSHMEM device library. The entire library thus remains portable across CUDA minor versions and feature complete. Users can specify a CUDA version for the device library by setting NVSHMEM_DEVICELIB_CUDA_HOME=<PATH TO CUDA>; otherwise, the host CUDA version is used.
- Enhanced the NVSHMEM on-stream signal APIs to use cuStreamWriteValue over P2P-connected GPUs when possible, enabling a zero-SM implementation of the on-stream signaling operation. (See the on-stream signal sketch after this list.)
- Added support for DMABuf-based registration of NIC control structures in IBGDA to leverage the DMABuf mainline support in newer Linux kernels over the proprietary nvidia-peermem solution.
- Added sample code for the NVSHMEM UniqueID (UID) socket-based bootstrap modality under the examples directory. (See the UID bootstrap sketch after this list.)
- Added the NVSHMEM performance benchmarks to the release binary packages.
- Enhanced collectives performance reporting by adding the Algorithmic Bandwidth (algoBW) and Bus Bandwidth (BusBW) metrics to the NVSHMEM performance benchmarks.
- Increased the reduce-based collective symmetric memory scratch space to 512 KB to accommodate the additional space needed by the reducescatter based collectives.
- Fixed support for the Ninja build generator in the CMake build system.
- Added support for the symmetric memory management unit test framework.
- Fixed several minor bugs and memory leaks.
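As an illustration of the device and on-stream collective APIs that the NVLS algorithms back, the following is a minimal on-stream Allreduce sketch. It assumes the standard nvshmemx_float_sum_reduce_on_stream API; algorithm selection (including NVLS, where the platform supports it) happens inside the library, so no NVLS-specific calls are needed.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    // Bind each PE on a node to its own GPU (common NVSHMEM pattern).
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    const size_t nelems = 1 << 20;
    float *src = (float *)nvshmem_malloc(nelems * sizeof(float));
    float *dst = (float *)nvshmem_malloc(nelems * sizeof(float));
    cudaMemset(src, 0, nelems * sizeof(float));  // real code would fill src with data

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // On-stream sum Allreduce across all PEs; the collective algorithm
    // (including NVLS where supported) is chosen internally by the library.
    nvshmemx_float_sum_reduce_on_stream(NVSHMEM_TEAM_WORLD, dst, src, nelems, stream);
    cudaStreamSynchronize(stream);

    nvshmem_free(dst);
    nvshmem_free(src);
    nvshmem_finalize();
    return 0;
}
```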
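A minimal sketch of the low-level NVLS query from device code. It assumes that nvshmemx_mc_ptr(team, ptr) returns NULL when no multicast mapping exists for the given team and symmetric address; consult the API documentation for the exact semantics.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Query the NVLS multicast mapping of a symmetric buffer from inside a kernel.
// A store through the multicast pointer is replicated to the PEs in the team
// that share the NVLS group.
__global__ void mc_store(float *sym_buf, float value) {
    float *mc_buf = (float *)nvshmemx_mc_ptr(NVSHMEM_TEAM_WORLD, sym_buf);
    if (mc_buf != NULL && threadIdx.x == 0 && blockIdx.x == 0) {
        mc_buf[0] = value;
    }
}

int main(void) {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    float *sym_buf = (float *)nvshmem_malloc(sizeof(float));
    mc_store<<<1, 32>>>(sym_buf, 42.0f);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // synchronize PEs before any reader inspects the buffer

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}
```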
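A minimal sketch of the on-stream signal API that benefits from the cuStreamWriteValue fast path. The fast path is an internal optimization over P2P-connected GPUs, so no application change is required; the neighbor-exchange pattern below is only illustrative.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;

    uint64_t *sig = (uint64_t *)nvshmem_calloc(1, sizeof(uint64_t));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Set the peer's signal word from the stream. Over P2P-connected GPUs the
    // library can back this with cuStreamWriteValue, consuming no SM.
    nvshmemx_signal_op_on_stream(sig, 1, NVSHMEM_SIGNAL_SET, peer, stream);
    cudaStreamSynchronize(stream);

    // Wait until our own signal word has been set by the neighboring PE.
    nvshmem_signal_wait_until(sig, NVSHMEM_CMP_EQ, 1);
    nvshmem_barrier_all();

    nvshmem_free(sig);
    nvshmem_finalize();
    return 0;
}
```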
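An abridged sketch of the UID bootstrap flow that the new sample demonstrates; the shipped code under the examples directory is authoritative. my_oob_bcast and the MY_RANK/MY_NRANKS environment variables are hypothetical stand-ins for whatever out-of-band mechanism (socket, MPI, or similar) the launcher provides.

```cuda
#include <cstdlib>
#include <nvshmem.h>
#include <nvshmemx.h>

// Hypothetical out-of-band broadcast supplied by the application or launcher;
// NVSHMEM only consumes the resulting UID.
extern void my_oob_bcast(void *data, size_t bytes, int root, int rank, int nranks);

int main(void) {
    // Rank and world size come from the launcher; these variable names are
    // placeholders, not something NVSHMEM defines.
    int rank   = std::atoi(std::getenv("MY_RANK"));
    int nranks = std::atoi(std::getenv("MY_NRANKS"));

    nvshmemx_uniqueid_t id = NVSHMEMX_UNIQUEID_INITIALIZER;
    if (rank == 0) nvshmemx_get_uniqueid(&id);        // one process creates the UID
    my_oob_bcast(&id, sizeof(id), 0, rank, nranks);   // every process receives the same UID

    nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;
    nvshmemx_set_attr_uniqueid_args(rank, nranks, &id, &attr);
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr);

    /* ... symmetric allocations and communication ... */

    nvshmem_finalize();
    return 0;
}
```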
Compatibility
NVSHMEM 3.1.7 has been tested with the following:
NVIDIA CUDA® Toolkit:
- 11.8
- 12.2
- 12.6

CPUs:
- x86 and NVIDIA Grace™ processors

GPUs:
- Volta V100
- Ampere A100
- NVIDIA Hopper™
Limitations
- NVSHMEM is not compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and NVSHMEM_BOOTSTRAP_PMI=PMI-2, or directly by using the MPI or SHMEM bootstraps. PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM.
- The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.
- Systems with PCIe peer-to-peer communication require one of the following:
  - InfiniBand to support NVSHMEM atomics APIs.
  - The NVSHMEM UCX transport, which falls back to sockets for atomics if InfiniBand is absent.
- nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs; they do not ensure global ordering and visibility.
- When built with GDRcopy, and when using InfiniBand on earlier versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in the latest NVIDIA CUDA® 460 driver and in release 470 and later.
- IBGDA transport support on NVIDIA Grace Hopper™ systems is experimental in this release.
- IBGDA does not work with DMABUF.
- IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).
- NVSHMEM is not supported on Grace + Ada L40 platforms.
Fixed Issues
- Fixed a bug related to incorrect use of the NVSHMEM_DEVICE_TIMEOUT_POLLLING build-time variable.
- Fixed a performance bug in the on-stream collectives perftest related to using cudaMemcpyAsync on the same CUDA stream to which the cudaEvent records for profiling the start and end time of the on-stream communication kernel are submitted.
- Fixed a bug related to the virtual member functions of nvshmemi_symmetric_heap by forcing their access specifier to protected, limiting access to inherited child classes only.
- Fixed a bug related to recursive destructor memory corruption and nullptr access to the static member function of the nvshmemi_mem_transport class.
- Fixed a bug related to the incorrect compile-time definition of the NVML_GPU_FABRIC_STATE_COMPLETED and NVML_GPU_FABRIC_UUID_LEN constants.
- Fixed a bug related to nvshmemx_collective_launch_query_gridsize that could cause it to erroneously return a gridsize of 0.
- Fixed a bug related to nvshmem_init that might cause the application to crash during MNNVL discovery when used with the CUDA compat libraries at runtime for CUDA Toolkit 12.4 or later.
- Fixed a bug related to nvshmemx_collective_launch that might cause duplicate initialization of the NVSHMEM device state.
- Fixed a bug related to uninitialized variables in the IBGDA device code.
- Fixed a bug related to out-of-bounds (OOB) access in the atomic BW performance test.
- Fixed a bug related to missing C/C++ stdint headers on Ubuntu 24.04 and x86-based systems.
- Fixed a bug related to the incorrect calculation of the team-specific stride when creating a new team using nvshmem_team_split_strided.
Breaking Changes
There are no breaking changes in this release.
Deprecated Features
Removed the host API-based NVSHMEM collectives performance benchmarks.
Known Issues
- Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
- When enabling the libfabric transport with NVSHMEM_LIBFABRIC_PROVIDER=EFA, certain operations are experimental and might result in the application kernel hanging with the following operations:
  - Device-side nvshmem_put/nvshmem_get with nvshmem_barrier.
  - Host-side nvshmem_put_on_stream/nvshmem_get_on_stream.
- When enabling the UCX remote transport with NVSHMEM_REMOTE_TRANSPORT=UCX, a data mismatch might be observed when scaling to 32 PEs or more on the DGX-2 platform.