NVSHMEM Installation Guide#
This NVIDIA NVSHMEM Installation Guide provides step-by-step instructions for downloading and installing the latest NVSHMEM release.
Overview#
NVIDIA® NVSHMEM™ is a programming interface that implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides a CUDA kernel-side interface that allows NVIDIA CUDA® threads to access any location in the symmetrically-distributed memory.
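For orientation only (the rest of this guide covers installation), the following minimal sketch illustrates both interfaces: the host side allocates a symmetric integer with nvshmem_malloc, and a CUDA kernel on each PE writes its ID into the next PE's copy of that buffer. The kernel name simple_shift is illustrative; the sketch assumes a working NVSHMEM install, compilation with nvcc, linking against the NVSHMEM libraries, and launch through a PMI-compatible launcher.
```
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its ID into the symmetric buffer of the next PE (ring shift).
__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(destination, mype, peer);   // one-sided put issued from device code
}

int main(void) {
    nvshmem_init();                                          // bootstrap through the job launcher
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));   // one GPU per PE on this node

    // Symmetric allocation: every PE gets a buffer at the same symmetric address.
    int *destination = (int *) nvshmem_malloc(sizeof(int));

    simple_shift<<<1, 1>>>(destination);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();                                   // wait for all puts to arrive

    int msg = -1;
    cudaMemcpy(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
```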
Hardware Requirements#
NVSHMEM requires the following hardware:
- The x86_64 or aarch64 CPU architectures.
- NVIDIA Data Center GPU of the NVIDIA Volta™ GPU architecture or later.
  - For a complete list, refer to https://developer.nvidia.com/cuda-gpus.
- All GPUs must be P2P-connected via NVLink/PCIe or via GPUDirect RDMA. The following networks are supported:
  - InfiniBand/RoCE with a Mellanox adapter (CX-4 or later)
  - Slingshot-11 (Libfabric CXI provider)
  - Amazon EFA (Libfabric EFA provider)
- Support for atomics requires an NVLink connection, or a GPUDirect RDMA connection together with GDRCopy (see the example after this list). Refer to Software Requirements for more information.
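As a sketch of why the atomics requirement above matters, the following hypothetical fragment issues remote atomic updates from device code; over an InfiniBand/GPUDirect RDMA path these need GDRCopy, while over NVLink they do not. The kernel and variable names are illustrative, and the program is built and launched like any other NVSHMEM program.
```
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Every thread on every PE atomically adds 1 to a symmetric counter on PE 0.
__global__ void count_on_root(int *sym_counter) {
    nvshmem_int_atomic_add(sym_counter, 1, 0);   // remote atomic, target PE 0
}

int main(void) {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    int *counter = (int *) nvshmem_malloc(sizeof(int));
    cudaMemset(counter, 0, sizeof(int));
    nvshmem_barrier_all();              // make sure every PE's counter starts at zero

    count_on_root<<<1, 32>>>(counter);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();              // wait for all remote atomics to complete

    nvshmem_free(counter);
    nvshmem_finalize();
    return 0;
}
```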
Software Requirements#
NVSHMEM requires the following software:
- 64-bit Linux.
  - For a complete compatibility matrix, see the NVIDIA CUDA Installation Guide for Linux.
- A C++ compiler with C++11 support.
- CUDA 11 or later. 
- CMake version 3.19 or later.
- NVLink SHARP
  - Requires CUDA 12.1 or later.
  - Requires nvidia.ko 530.30.02 or later. NVLink SHARP is available in third-generation NVSwitch systems (NVLink4) with Hopper and later GPU architectures, and allows collectives such as nvshmem_<TYPENAME>_reduce to be offloaded to the NVSwitch domain (a reduction example follows this list).
 
- (Optional) InfiniBand GPUDirect Async (IBGDA) transport
  - Requires Mellanox OFED >= 5.0.
  - Requires nvidia.ko >= 510.40.3. Two operational modes are supported: default and CPU-assisted.
    - In the default mode, nvidia.ko must be loaded with PeerMappingOverride=1 by adding the following option to the /etc/modprobe.d/nvidia.conf file: options nvidia NVreg_RegistryDwords="PeerMappingOverride=1;"
    - In the CPU-assisted mode, PeerMappingOverride is not required.
  - For dma-buf support, requires rdma-core >= 44.0 and an upstream kernel >= 6.1.
  - In the absence of dma-buf, requires nvidia-peermem >= 510.40.3.
  - For more information, see GPUDirect Async.
 
- (Optional) Mellanox OFED.
  - This software is required to build the IBRC transport. If OFED is unavailable, NVSHMEM can be built with NVSHMEM_IBRC_SUPPORT=0 set in the environment.
 
- (Optional) nvidia-peermem for GPUDirect RDMA.
  - This software is used by the IBRC and UCX transports and is required when NVSHMEM_IBRC_SUPPORT=0 and NVSHMEM_UCX_SUPPORT=0 are not set at compile time.
  - Note: Both the IBRC and UCX transports use GDRCopy to perform atomic operations. Users of either of these transports who intend to perform atomic operations MUST enable GDRCopy support. All other transports do not depend on GDRCopy, and it is not needed in those cases.
- A PMI-1 (for example, Hydra), PMI-2 (for example, Slurm), or a PMIx (for example, Open MPI) compatible launcher. 
 
- (Optional) GDRCopy v2.0 or later.
  - This software is required for atomics support on non-NVLink connections.
  - Additionally, it is required if CPU-assisted IBGDA is enabled on the system.
  - It is required when NVSHMEM_IBRC_SUPPORT=0 and NVSHMEM_UCX_SUPPORT=0 are not set at compile time.
 
- (Optional) UCX version 1.10.0 or later.
  - This software is required to build the UCX transport.
  - Note: UCX must be configured with --enable-mt and --with-dm.
- (Optional) libfabric 1.15.0.0 or later.
- (Optional) NCCL 2.0 or later. 
- (Optional) PMIx 3.1.5 or later. 
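As a sketch of the reduction collectives mentioned in the NVLink SHARP item above (the offload to the NVSwitch domain, when available, is transparent to the caller), the following fragment sum-reduces one integer per PE across the world team. Buffer names are illustrative, and the same build and launch setup as any NVSHMEM program is assumed.
```
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    // Symmetric source and destination buffers, one int per PE.
    int *src = (int *) nvshmem_malloc(sizeof(int));
    int *dst = (int *) nvshmem_malloc(sizeof(int));

    int mype = nvshmem_my_pe();
    cudaMemcpy(src, &mype, sizeof(int), cudaMemcpyHostToDevice);
    nvshmem_barrier_all();                              // source data ready on every PE

    // Sum of all PE IDs, delivered to every PE in NVSHMEM_TEAM_WORLD.
    nvshmem_int_sum_reduce(NVSHMEM_TEAM_WORLD, dst, src, 1);

    int sum = -1;
    cudaMemcpy(&sum, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d: sum of PE IDs = %d\n", mype, sum);

    nvshmem_free(src);
    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```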
System Requirements#
The CUDA MPS Service is optional. When using multiple processes per GPU, the CUDA MPS server must be configured on the system to support the complete NVSHMEM API. To avoid deadlocks, the total GPU utilization shared between the processes must be capped at 100% or lower; for example, two processes that share one GPU can each be limited to an active thread percentage of 50%.
Refer to Multi-Process Service for more information about how to configure the MPS server.