NVIDIA® NVSHMEM 3.3.9 Release Notes#
NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides an NVIDIA CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically distributed memory.
The release notes describe the key features, software enhancements and improvements, and known issues for NVIDIA® NVSHMEM 3.3.9 and earlier releases.
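For orientation, the following is a minimal sketch of this programming model: a CUDA program that allocates symmetric memory and performs a device-side put to a neighboring PE. It uses only long-standing NVSHMEM APIs, is illustrative rather than taken from this release, and omits error checking for brevity.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#include <stdio.h>

// Each PE writes its rank into the symmetric buffer of the next PE in a ring.
__global__ void ring_put(int *dst) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dst, mype, peer);  // device-side put to the peer's symmetric memory
}

int main(void) {
    nvshmem_init();
    // Bind this PE to a GPU based on its rank within the node.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    // Symmetric allocation: a buffer of the same size exists on every PE.
    int *dst = (int *)nvshmem_malloc(sizeof(int));

    ring_put<<<1, 1>>>(dst);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // ensure all PEs' puts are complete and visible

    int received;
    cudaMemcpy(&received, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", nvshmem_my_pe(), received);

    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```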
Key Features and Enhancements#
This NVSHMEM release includes the following key features and enhancements:
Enabled GA platform support for Blackwell B200/GB200 NVL72-based systems. Additionally, enabled SASS support for the Ada architecture.
Added official Python language bindings (nvshmem4py) enabling symmetric memory management, on-stream RMA, and collective APIs to aid development of custom kernels using symmetric memory and to enable fine-grained communication in native Python. The nvshmem4py package is available via PyPI wheels and conda installers.
Added support for CUDA Templates for Linear Algebra Subroutines and Solvers (CUTLASS)-compliant tile-granular NVLS device-side collectives to aid development of fused distributed GEMM kernels.
Added support for a flexible team initialization API (nvshmemx_team_init) that uses an arbitrary set of PEs to enable non-linear, non-contiguous PE indexing, if desired.
Added support for symmetric user-buffer registration (nvshmemx_buffer_register_symmetric) to enable ML frameworks to “bring-your-own-buffer” (BYOB) for zero-copy communication kernels.
Added support for narrow types (float16, bfloat16), precision support for the NVLS reducescatter collective, and an LL8 fcollect algorithm for low-latency collectives.
Added support for device-side nvshmem_broadcastmem and nvshmem_fcollectmem APIs in the library (a hedged usage sketch follows this list).
Added support for CUDA module-independent loading using nvshmemx_culibrary_init.
Added support for leveraging multiple Queue Pairs (QPs) on LAG-bonded NICs for RDMA transports. You can use the NVSHMEM_IB_NUM_RC_PER_DEVICE environment variable to tune this value as desired.
Added support for randomizing QP assignment for multiple GPU endpoints when communicating over the IBGDA transport.
Added CUDA graph capture capabilities to the on-stream collectives’ performance benchmarks using the --cudagraph command-line parameter.
Enabled host-side clang compilation support for the NVSHMEM host library.
Improved GPU thread occupancy by 30% for on-stream fcollect when utilizing the NVLS and LL algorithms.
Improved multi-SM NVLS on-stream collectives to adapt gridDim as a function of the NVLink domain size.
Improved runtime detection of CUDA VMM support, falling back to legacy pinned memory allocation (cudaMalloc) when platform support for VMM is not available.
Improved resiliency of querying the Global Identifier (GID) via sysfs for RoCE transports in containerized environments.
Improved the perftest presentation layer to provide an additional count column capturing the total number of elements per operation, independent of datatype size.
Improved point-to-point signaling latency by 20% by always leveraging the CE-centric cuStreamWriteValue/cuStreamWaitValue APIs.
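As a hedged illustration of the new device-side nvshmem_broadcastmem support listed above, the sketch below assumes the device-side signature mirrors the OpenSHMEM broadcastmem convention (team, destination, source, size in bytes, root PE); consult the NVSHMEM API reference for the exact prototype. Both buffers are assumed to be symmetric, and the kernel is assumed to be launched on every PE in the team.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Hedged sketch: device-side broadcast of an opaque byte buffer from PE 0 to
// every PE in NVSHMEM_TEAM_WORLD. The (team, dest, source, nbytes, PE_root)
// ordering is assumed from the OpenSHMEM broadcastmem host API; dst and src
// must be symmetric objects, and every PE in the team must call into this
// collective (here, via a single-thread kernel launch per PE).
__global__ void broadcast_bytes(void *dst, const void *src, size_t nbytes) {
    nvshmem_broadcastmem(NVSHMEM_TEAM_WORLD, dst, src, nbytes, 0);
}
```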
Compatibility#
NVSHMEM 3.3.9 has been tested with the following:
NVIDIA CUDA® Toolkit:
12.2
12.6
12.9
CPUs
On x86 and NVIDIA Grace™ processors.
GPUs
NVIDIA Ampere A100
NVIDIA Hopper™
NVIDIA Blackwell
Limitations#
NVSHMEM is not compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library.
Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and NVSHMEM_BOOTSTRAP_PMI=PMI-2, or directly by using the MPI or SHMEM bootstraps.
PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM.
The libfabric transport does not support VMM yet, so you must disable VMM by setting NVSHMEM_DISABLE_CUDA_VMM=1.
Systems with PCIe peer-to-peer communication require one of the following:
InfiniBand to support NVSHMEM atomics APIs.
Using NVSHMEM’s UCX transport, which falls back to sockets for atomics if InfiniBand is absent.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs; they do not ensure global ordering and visibility (see the sketch after this list).
When built with GDRCopy and when using InfiniBand with earlier versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in the CUDA 460 driver release and in release 470 and later.
IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).
NVSHMEM is not supported on Grace + Ada L40 platforms.
NVSHMEM is not supported in virtualized environments (VMs).
User buffers registered with nvshmemx_buffer_register_symmetric lack support for the libfabric transport to perform GPU-GPU communication over remote networks (EFA, Slingshot, etc.).
When registering Extended GPU Memory (EGM) user buffers with nvshmemx_buffer_register_symmetric, the buffers on different PEs must belong to distinct CPU sockets within a node. This can be achieved by selecting GPUs on different NUMA domains using the CUDA_VISIBLE_DEVICES environment variable.
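To make the point-to-point (rather than global) ordering guarantee above concrete, here is a hedged sketch of the common put + quiet + flag pattern: once the destination PE observes the flag, the payload is guaranteed to be visible on that PE, but no other PE gains any visibility guarantee from this sequence. Both payload and flag are assumed to be symmetric allocations.

```cuda
#include <nvshmem.h>

// Producer: write the payload, complete it with nvshmem_quiet, then raise a
// flag on the same destination PE. nvshmem_quiet orders the payload put ahead
// of the flag put with respect to the destination PE only.
__global__ void producer(int *payload, int *flag, int dest_pe) {
    nvshmem_int_p(payload, 42, dest_pe);  // payload put
    nvshmem_quiet();                      // complete/order prior puts
    nvshmem_int_p(flag, 1, dest_pe);      // flag put, observed after the payload
}

// Consumer (runs on dest_pe): wait locally for the flag, then read the payload.
// Other PEs must not assume the payload is visible to them at this point.
__global__ void consumer(const int *payload, int *flag, int *out) {
    nvshmem_int_wait_until(flag, NVSHMEM_CMP_EQ, 1);
    *out = *payload;  // safe: the payload is visible on this PE
}
```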
Fixed Issues#
Fixed a bug in perftest reporting when both datatype and reduceop are specified.
Fixed an application crash when nvshmemx_fcollect_on_stream attempts to use more CTAs than available NVSHMEM teams.
Fixed an application crash when NVSHMEM remote transports attempt to use more than 16 HCAs per node.
Fixed an application crash in the nvshmemx_mc_ptr API that occurred when it was executed on a platform without NVLS support.
Fixed a compile-time error with the LLVM IR bitcode device library when compiling with clang-llvm > 19.
Fixed a compile-time error with IBGDA support when built without GDRCopy support.
Fixed a compile-time error with moe_shuffle.cu caused by a missing getopt header.
Fixed a data corruption bug in the device-side point-to-point get/put bandwidth test due to missing usage of non-symmetric memory buffers for bandwidth summarization.
Fixed a host clang compilation bug due to a missing __CUDA_ARCH__ conditional check for the non-CUDA device inline assembly code path.
Fixed a bug in the symmetric memory management layer that was caused by a missing override for NVSHMEM_CUMEM_GRANULARITY for static device memory heaps (cudaMalloc).
Fixed a data corruption bug in the on-stream NVLS 2-shot allreduce and 1-shot allgather collectives that was caused by a missing memory fence to order data and barrier operations.
Fixed a stale sysfs file path used for nvidia-peermem discovery that caused a regression when running with CUDA 12.8 or later.
Fixed a hang in the MPI bootstrap allgather collective caused by incorrect usage of MPI_IN_PLACE operations.
Fixed a data correctness bug in the LLVM IR bitcode device library that caused incorrect results for 16-byte-aligned put/get operations whose size was not a multiple of 16 bytes.
Breaking Changes#
There are no breaking changes in this release.
Deprecated Features#
Support for the Volta V100 platform is now deprecated.
Known Issues#
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
When enabling the libfabric transport with NVSHMEM_LIBFABRIC_PROVIDER=EFA, certain operations are experimental and might result in the application kernel hanging with the following operations:
Device-side nvshmem_put/nvshmem_get with nvshmem_barrier.
Host-side nvshmem_put_on_stream/nvshmem_get_on_stream.
When enabling the UCX remote transport with NVSHMEM_REMOTE_TRANSPORT=UCX, a data mismatch might be observed when scaling to 32 PEs or more on the DGX-2 platform.