NVSHMEM Installation Guide

This NVIDIA NVSHMEM Installation Guide provides step-by-step instructions to download and install NVSHMEM 3.0.6.


NVIDIA® NVSHMEM™ is a programming interface that implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM also provides a CUDA kernel-side interface that allows NVIDIA CUDA® threads to access any location in the symmetrically-distributed memory.

Hardware Requirements

NVSHMEM requires the following hardware:

  • An x86_64 or aarch64 CPU architecture.

  • NVIDIA Data Center GPU of the NVIDIA Volta™ GPU architecture or later.

    For a complete list, refer to https://developer.nvidia.com/cuda-gpus.

  • All GPUs must be P2P-connected via NVLink/PCIe or via GPUDirect RDMA. The following networks are supported:
    • InfiniBand/RoCE with a Mellanox adapter (CX-4 or later)

    • Slingshot-11 (Libfabric CXI provider)

    • Amazon EFA (Libfabric EFA provider)

    Support for atomics requires an NVLink connection, or a GPUDirect RDMA connection and GDRCopy. Refer to Software Requirements for more information.
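To check how the GPUs in a node are connected, the driver's topology matrix can be inspected (assuming nvidia-smi is available on the system):

```shell
# Print the GPU interconnect topology matrix; NV# entries indicate
# NVLink connections, while PIX/PXB/PHB/SYS indicate PCIe paths.
nvidia-smi topo -m
```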

Software Requirements

NVSHMEM requires the following software:

  • 64-bit Linux.

    For a complete compatibility matrix, see the NVIDIA CUDA Installation Guide for Linux.

  • A C++ Compiler with C++11 support.

  • CUDA 10.2 or later.

  • CMake version 3.19 or later.

  • (Optional) InfiniBand GPUDirect Async (IBGDA) transport

    • Requires Mellanox OFED >= 5.0

    • Requires nvidia.ko >= 510.40.3. Two operational modes are supported: default and CPU-assisted.

      • In the default mode, nvidia.ko must be loaded with PeerMappingOverride=1 by adding the following line to the /etc/modprobe.d/nvidia.conf file:

        options nvidia NVreg_RegistryDwords="PeerMappingOverride=1;"

      • In the CPU-assisted mode, PeerMappingOverride is not required.

    • Requires nvidia-peermem >= 510.40.3

      For more information, see: GPUDirect Async.

  • (Optional) Mellanox OFED.

    • This software is required to build the IBRC transport. If OFED is unavailable, NVSHMEM can be built with NVSHMEM_IBRC_SUPPORT=0 set in the environment.

  • (Optional) nvidia-peermem for GPUDirect RDMA.

    • This software is used by the IBRC and UCX transports and is required unless NVSHMEM_IBRC_SUPPORT=0 and NVSHMEM_UCX_SUPPORT=0 are set at compile time.


      Both the IBRC and UCX transports use GDRCopy to perform atomic operations. Users of either transport who intend to perform atomic operations MUST enable GDRCopy support. No other transport depends on GDRCopy, so it is not needed in those cases.

  • A PMI-1 (for example, Hydra), PMI-2 (for example, Slurm), or PMIx (for example, Open MPI) compatible launcher.

  • (Optional) GDRCopy v2.0 or later.

    • This software is required for atomics support on non-NVLink connections.

    • It is required unless NVSHMEM_IBRC_SUPPORT=0 and NVSHMEM_UCX_SUPPORT=0 are set at compile time.

  • (Optional) UCX version 1.10.0 or later.

    • This software is required to build the UCX transport.


    UCX must be configured with --enable-mt and --with-dm.

  • (Optional) libfabric.

    • This software is required to build the Libfabric transport (Slingshot-11 and Amazon EFA).

  • (Optional) NCCL 2.0 or later.

  • (Optional) PMIx 3.1.5 or later.
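Several of the optional components above are selected at build time through environment variables. As an illustrative sketch (the paths are assumptions; consult the build documentation for your NVSHMEM version), a build on a system without Mellanox OFED or UCX might be configured as follows:

```shell
# Illustrative build configuration; paths are assumptions.
export CUDA_HOME=/usr/local/cuda   # CUDA 10.2 or later
export NVSHMEM_IBRC_SUPPORT=0      # no Mellanox OFED: skip the IBRC transport
export NVSHMEM_UCX_SUPPORT=0       # no UCX: skip the UCX transport
cmake -S . -B build                # requires CMake 3.19 or later
cmake --build build
```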

System Requirements

The CUDA Multi-Process Service (MPS) is optional. When using multiple processes per GPU, the CUDA MPS server must be configured on the system to support the complete NVSHMEM API. To avoid deadlocks, the total GPU utilization shared between the processes must be capped at 100% or lower.

Refer to Multi-Process Service for more information about how to configure the MPS server.
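As a sketch (the directory paths and the per-process percentage are illustrative), the MPS server can be started and each client's utilization capped so that the shared total stays at or below 100%:

```shell
# Illustrative MPS setup for 4 processes per GPU (4 x 25% = 100%).
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps      # assumed path
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log   # assumed path
nvidia-cuda-mps-control -d                          # start the MPS control daemon
# Each client process sets its utilization cap before initializing CUDA:
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25
```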