NVSHMEM Installation Guide#
This NVIDIA NVSHMEM Installation Guide provides step-by-step instructions for downloading and installing the latest NVSHMEM release.
Overview#
NVIDIA® NVSHMEM™ is a programming interface that implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides a CUDA kernel-side interface that allows NVIDIA CUDA® threads to access any location in the symmetrically-distributed memory.
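For orientation only (the rest of this guide covers installation), the following minimal sketch illustrates both interfaces: the host side allocates a symmetric integer with nvshmem_malloc, and a CUDA kernel on each PE writes its ID into the next PE's copy of that buffer. The kernel name simple_shift is illustrative; the sketch assumes a working NVSHMEM install, compilation with nvcc, linking against the NVSHMEM libraries, and launch through a PMI-compatible launcher.
```
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its ID into the symmetric buffer of the next PE (ring shift).
__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(destination, mype, peer);   // one-sided put issued from device code
}

int main(void) {
    nvshmem_init();                                          // bootstrap through the job launcher
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));   // one GPU per PE on this node

    // Symmetric allocation: every PE gets a buffer at the same symmetric address.
    int *destination = (int *) nvshmem_malloc(sizeof(int));

    simple_shift<<<1, 1>>>(destination);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();                                   // wait for all puts to arrive

    int msg = -1;
    cudaMemcpy(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
```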
Hardware Requirements#
NVSHMEM requires the following hardware:
- The x86_64 or aarch64 CPU architectures.
- NVIDIA Data Center GPU of the NVIDIA Volta™ GPU architecture or later.
  - For a complete list, refer to https://developer.nvidia.com/cuda-gpus.
- All GPUs must be P2P-connected via NVLink/PCIe or via GPUDirect RDMA. The following networks are supported:
  - InfiniBand/RoCE with a Mellanox adapter (CX-4 or later)
  - Slingshot-11 (Libfabric CXI provider)
  - Amazon EFA (Libfabric EFA provider)
- Support for atomics requires an NVLink connection, or a GPUDirect RDMA connection together with GDRCopy (see the example after this list). Refer to Software Requirements for more information.
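As a sketch of why the atomics requirement above matters, the following hypothetical fragment issues remote atomic updates from device code; over an InfiniBand/GPUDirect RDMA path these need GDRCopy, while over NVLink they do not. The kernel and variable names are illustrative, and the program is built and launched like any other NVSHMEM program.
```
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Every thread on every PE atomically adds 1 to a symmetric counter on PE 0.
__global__ void count_on_root(int *sym_counter) {
    nvshmem_int_atomic_add(sym_counter, 1, 0);   // remote atomic, target PE 0
}

int main(void) {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    int *counter = (int *) nvshmem_malloc(sizeof(int));
    cudaMemset(counter, 0, sizeof(int));
    nvshmem_barrier_all();              // make sure every PE's counter starts at zero

    count_on_root<<<1, 32>>>(counter);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();              // wait for all remote atomics to complete

    nvshmem_free(counter);
    nvshmem_finalize();
    return 0;
}
```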
Software Requirements#
NVSHMEM requires the following software:
- 64-bit Linux.
  - For a complete compatibility matrix, see the NVIDIA CUDA Installation Guide for Linux.
- A C++ compiler with C++11 support.
- CUDA 11 or later. 
- CMake version 3.19 or later.
- NVLink SHARP
  - Requires CUDA 12.1 or later.
  - Requires nvidia.ko 530.30.02 or later. NVLink SHARP is available in third-generation NVSwitch systems (NVLink4) with Hopper and later GPU architectures, and allows collectives such as nvshmem_<TYPENAME>_reduce to be offloaded to the NVSwitch domain (a reduction example follows this list).
 
- (Optional) InfiniBand GPUDirect Async (IBGDA) transport
  - Requires Mellanox OFED >= 5.0.
  - Requires nvidia.ko >= 510.40.3. Two operational modes are supported: default and CPU-assisted.
    - In the default mode, nvidia.ko must be loaded with PeerMappingOverride=1 by adding the following option to the /etc/modprobe.d/nvidia.conf file: options nvidia NVreg_RegistryDwords="PeerMappingOverride=1;"
    - In the CPU-assisted mode, PeerMappingOverride is not required.
  - For dma-buf support, requires rdma-core >= 44.0 and an upstream kernel >= 6.1.
  - In the absence of dma-buf, requires nvidia-peermem >= 510.40.3.
  - For more information, see GPUDirect Async.
 
- (Optional) Mellanox OFED.
  - This software is required to build the IBRC transport. If OFED is unavailable, NVSHMEM can be built with NVSHMEM_IBRC_SUPPORT=0 set in the environment.
 
- (Optional) nvidia-peermem for GPUDirect RDMA.
  - This software is used by the IBRC and UCX transports and is required when NVSHMEM_IBRC_SUPPORT=0 and NVSHMEM_UCX_SUPPORT=0 are not set at compile time.
  - Note: Both the IBRC and UCX transports use GDRCopy to perform atomic operations. Users of either of these transports who intend to perform atomic operations MUST enable GDRCopy support. All other transports do not depend on GDRCopy, and it is not needed in those cases.
- A PMI-1 (for example, Hydra), PMI-2 (for example, Slurm), or a PMIx (for example, Open MPI) compatible launcher. 
 
- (Optional) GDRCopy v2.0 or later.
  - This software is required for atomics support on non-NVLink connections.
  - Additionally, it is required if CPU-assisted IBGDA is enabled on the system.
  - It is required when NVSHMEM_IBRC_SUPPORT=0 and NVSHMEM_UCX_SUPPORT=0 are not set at compile time.
 
- (Optional) UCX version 1.10.0 or later.
  - This software is required to build the UCX transport.
  - Note: UCX must be configured with --enable-mt and --with-dm.
- (Optional) libfabric 1.15.0.0 or later.
- (Optional) NCCL 2.0 or later. 
- (Optional) PMIx 3.1.5 or later. 
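As a sketch of the reduction collectives mentioned in the NVLink SHARP item above (the offload to the NVSwitch domain, when available, is transparent to the caller), the following fragment sum-reduces one integer per PE across the world team. Buffer names are illustrative, and the same build and launch setup as any NVSHMEM program is assumed.
```
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    // Symmetric source and destination buffers, one int per PE.
    int *src = (int *) nvshmem_malloc(sizeof(int));
    int *dst = (int *) nvshmem_malloc(sizeof(int));

    int mype = nvshmem_my_pe();
    cudaMemcpy(src, &mype, sizeof(int), cudaMemcpyHostToDevice);
    nvshmem_barrier_all();                              // source data ready on every PE

    // Sum of all PE IDs, delivered to every PE in NVSHMEM_TEAM_WORLD.
    nvshmem_int_sum_reduce(NVSHMEM_TEAM_WORLD, dst, src, 1);

    int sum = -1;
    cudaMemcpy(&sum, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d: sum of PE IDs = %d\n", mype, sum);

    nvshmem_free(src);
    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```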
System Requirements#
The CUDA MPS Service is optional. When using multiple processes per GPU, the CUDA MPS server must be configured on the system to support the complete NVSHMEM API. To avoid deadlocks, the total GPU utilization shared between the processes must be capped at 100% or lower; for example, two processes that share one GPU can each be limited to an active thread percentage of 50%.
Refer to Multi-Process Service for more information about how to configure the MPS server.