Overview#

Today’s mainstream AI models contain billions of parameters, which require tremendous amounts of compute, memory, and system connectivity for model training and deployment. Moore’s Law has not kept pace with this exponential increase in computing demand. To overcome this gap, enterprises adopt AI the NVIDIA way, scaling their diverse AI and data analytics applications up and out with NVIDIA GPU computing. At the core of every NVIDIA® DGX™ and NVIDIA HGX™ system are NVIDIA NVLink™-connected GPUs that access each other’s memory at NVLink speed.

Over the years, NVLink has become the de facto high-speed, direct GPU-to-GPU interconnect for single-node multi-GPU systems. Many of these systems are now interconnected with high-speed networking, such as InfiniBand, to form supercomputers. With an NVLink Network (also known as NVLink Multi-Node), the systems themselves are interconnected over NVLink, and the resulting cluster behaves like one gigantic accelerator with shared memory.

Definitions, Acronyms, and Abbreviations#

| Abbreviations | Definitions |
| --- | --- |
| Node or Compute Node | An OS instance with at least one GPU. |
| NVLink Domain or cluster | A set of nodes that can communicate over NVLink. |
| L1 NVSwitch Tray | First level of NVIDIA NVSwitch™, for example, the NVSwitches to which the GPU NVLinks connect. |
| FM | Fabric Manager. The FM service provides the NVLink Network control plane. |
| GFM | Global Fabric Manager. An instance of FM with a specific set of features enabled. There is one GFM per NVLink domain (cluster). |
| LFM | Local Fabric Manager. An instance of FM with a specific set of features enabled that runs on each L1 NVSwitch tray or compute tray. |
| NVLSM | NVLink Subnet Manager. A service that originates from NVIDIA InfiniBand switches and has the necessary modifications to effectively manage NVSwitches. |
| Access NVLink | An NVLink between a GPU and an NVSwitch. |
| Trunk NVLink | An NVLink between NVSwitches. |
| NVOS | NVIDIA Networking OS, previously known as MLNX-OS. NVOS is used as the Switch OS for L1 NVSwitch Trays. |
| NVLink ALI | Autonomous Link Initialization. A feature, introduced in NVL4, that enables NVLink training to be performed asynchronously by both sides of the link without additional software stack coordination. |
| IMEX Domain | A set of compute nodes connected by NVLink on which the nvidia-imex service has been installed and configured to communicate with each other via the nodes_config.cfg file. |
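
For reference, the peer list is a plain-text file. The sketch below assumes the commonly used layout of one compute-node IP address per line and the configuration directory /etc/nvidia-imex/; both are assumptions, so check your installation for the authoritative path and format.

```
# Assumed path: /etc/nvidia-imex/nodes_config.cfg
# One IMEX peer (compute node reachable over the compute network) per line.
10.0.0.1
10.0.0.2
10.0.0.3
10.0.0.4
```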

The NVIDIA Import/Export Service for Internode Memory Sharing#

The IMEX service supports GPU memory export and import (NVLink P2P) and shared memory operations across OS domains in an NVLink multi-node deployment.

Multi-Node Memory Sharing Model#

This section provides information about the multi-node memory sharing model.

Figure: Multi-Node Memory Sharing Model

At a high level, a multi-node job coordinates CUDA processes that run on each compute node in its own OS domain. On one node, a CUDA process allocates GPU memory and obtains the corresponding sharable memory handle. This allocation and handle creation establish a Virtual Address (VA) to Physical Address (PA) to Fabric Address (FA) mapping on the exporting node. Here is an overview of the process (see the sketches after this list):

  1. To import the memory, the exporting process shares the memory handle with the coordinating processes on the other nodes using MPI/NCCL.

  2. On these nodes, the CUDA processes prompt the GPU driver to import the relevant memory using the received handle.

  3. The GPU driver on the importing node creates the requisite memory objects and establishes the VA-to-FA mapping.

  4. When the importing process accesses this VA range, the GPU memory system on the importing node resolves the access through the corresponding page tables, recognizes that the physical pages reside on a foreign GPU, and generates NVLink packets to access the pertinent memory.
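
The export side of this flow can be sketched with the CUDA driver API's virtual memory management calls. The listing below is a minimal illustration, assuming a CUDA release with fabric handle support (CU_MEM_HANDLE_TYPE_FABRIC), a single GPU, and an allocation of one granule; it is not the authoritative IMEX workflow, and production code would add complete error handling:

```c
// Minimal export-side sketch: allocate GPU memory that can be shared across
// nodes and export a fabric handle for it. Error handling is reduced to a
// single CHECK macro for brevity.
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { CUresult _r = (call); if (_r != CUDA_SUCCESS) { \
    fprintf(stderr, "CUDA error %d at %s:%d\n", _r, __FILE__, __LINE__); \
    exit(1); } } while (0)

int main(void) {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    // Describe a physical allocation on this GPU that supports fabric handles.
    CUmemAllocationProp prop = {0};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    size_t granularity = 0;
    CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    size_t size = granularity;  // one allocation granule, for illustration

    // Create the physical allocation and export a sharable fabric handle.
    CUmemGenericAllocationHandle handle;
    CHECK(cuMemCreate(&handle, size, &prop, 0));

    CUmemFabricHandle fabricHandle;
    CHECK(cuMemExportToShareableHandle(&fabricHandle, handle,
                                       CU_MEM_HANDLE_TYPE_FABRIC, 0));

    // fabricHandle is an opaque blob that the job would now send to the
    // coordinating processes on other nodes (for example, over MPI), where
    // it is imported as sketched in the next listing.
    return 0;
}
```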

To enable the importing node to establish the VA-to-FA mapping, a privileged entity that can communicate across OS/node domains is needed to retrieve the necessary memory-mapping information from the exporting node. The IMEX service serves this function by acting as an orchestrator for memory export and import across compute nodes, as the import-side sketch below illustrates.
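
On the importing node, the corresponding driver-API calls look roughly like the following sketch, under the same assumptions as the previous listing. The peerHandle parameter stands for the fabric handle received from the exporting node; the GPU driver and the IMEX service perform the cross-node mapping work underneath these calls. Error checking is omitted for brevity:

```c
// Minimal import-side sketch: map memory exported by a peer node into this
// process's address space. peerHandle is the fabric handle received from the
// exporting node (for example, via MPI); size must match the exported size.
#include <cuda.h>

CUdeviceptr import_peer_memory(const CUmemFabricHandle *peerHandle,
                               size_t size, CUdevice dev) {
    // Turn the received fabric handle back into a local allocation handle.
    CUmemGenericAllocationHandle handle;
    cuMemImportFromShareableHandle(&handle, (void *)peerHandle,
                                   CU_MEM_HANDLE_TYPE_FABRIC);

    // Reserve a VA range and map the imported (remote) physical memory into it.
    CUdeviceptr va = 0;
    cuMemAddressReserve(&va, size, 0, 0, 0);
    cuMemMap(va, size, 0, handle, 0);

    // Grant this device read/write access; subsequent loads and stores to
    // [va, va + size) are carried over NVLink to the exporting GPU's memory.
    CUmemAccessDesc access = {0};
    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access.location.id = dev;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(va, size, &access, 1);

    return va;  // usable in kernels launched on this node
}
```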

Here are some key features of the IMEX service:

  • Facilitates memory sharing between compute nodes.

  • Manages the life cycle of the shared memory.

  • Registers for memory import/unimport events with the GPU Driver.

  • Does not directly communicate with CUDA or user applications.

  • Communicates across nodes over the compute nodes’ network using TCP/IP and gRPC connections.

  • Runs exclusively on compute nodes.

This guide provides an overview of various IMEX features and is intended for multi-node system administrators.

Note: In an NVLink multi-node cluster, start the IMEX service before jobs are launched. NVLink multi-node jobs will fail if the IMEX service is not properly initialized.
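
On a systemd-based distribution, this typically means enabling and starting the service on every compute node before the job scheduler launches work. The unit name nvidia-imex below reflects common packaging and is an assumption; confirm the name used by your installation:

```
# Run on each compute node before launching NVLink multi-node jobs.
# Assumes the packaged systemd unit is named nvidia-imex.
sudo systemctl enable --now nvidia-imex
systemctl status nvidia-imex    # verify the service is active
```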