NVIDIA DGX SuperPOD - Run:ai Overview#

System Design#

The following describes the components in the DGX SuperPOD with Run:ai.

DGX Nodes / Compute Trays

The compute trays are where user workloads run on the system. Each compute tray is considered an individual DGX node.

Management nodes

The management nodes provide the services necessary to support operation and monitoring of the DGX SuperPOD. Services are configured in high-availability (HA) mode where needed to maximize system availability. See the Management Servers section below for details of each node and its function.

High-speed storage

High-speed storage provides shared storage to all nodes in the DGX SuperPOD and is where datasets, checkpoints, and other large files should be stored. It typically holds the large datasets being actively operated on by DGX SuperPOD jobs; this data is a subset of all data housed in a data lake outside of the DGX SuperPOD.

Home & High Speed Storage

Shared storage on a network file system (NFS) is allocated for user home directories as well as for cluster services.

InfiniBand fabric compute

The Compute InfiniBand Fabric is the high-speed network fabric connecting all compute nodes together to allow high-bandwidth and low-latency communication between GB200 racks.

InfiniBand fabric storage

The Storage InfiniBand Fabric is the high-speed network fabric dedicated to storage traffic. Storage traffic gets its own fabric to prevent interference with node-to-node application traffic, which could otherwise degrade overall performance.

In-band network fabric

The In-band Network Fabric provides fast Ethernet connectivity between all nodes in the DGX SuperPOD. It is used for TCP/IP-based communication and services for provisioning and in-band management.

Out-of-band network fabric

The out-of-band Ethernet network is used for system management through the baseboard management controllers (BMCs) and provides connectivity to manage all networking equipment.

NVLink

NVIDIA NVLink is a high-speed interconnect that allows multiple GPUs to communicate directly. Multi-Node NVLink is a capability, enabled over an NVLink Switch network, in which multiple systems are interconnected to form a large GPU memory fabric, also known as an NVLink Domain.
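
As a quick way to verify NVLink connectivity from an individual compute node, the standard NVIDIA driver tooling can report per-GPU link state and topology. This is a minimal sketch; exact output fields vary by driver version.

```
# Show NVLink link status for each GPU on this node
nvidia-smi nvlink --status

# Show the GPU-to-GPU interconnect topology, including NVLink paths
nvidia-smi topo -m
```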

Management Servers#

The following details the function of each management server and the services running on it.

Head Node

Head nodes serve various functions:

  • Provisioning: centrally store and deploy OS images of the compute, control, and management nodes, along with various other services. This ensures there is a single authoritative source defining what should be on each node, and a way to re-provision a node if it needs to be reimaged.

  • Workload management: resource management and orchestration services that organize the resources and coordinate the scheduling of user jobs across the cluster.

  • Metrics: system monitoring and reporting that gathers telemetry from each of the nodes. The data can be explored and analyzed through web services to gain better insight into the system.

Run:ai Control Plane

The control plane for Run:ai runs on Kubernetes and powers the CLI and UI used to submit workloads into the cluster. It provides an interface for administrators to manage quotas and for users to submit and track jobs and datasets.

Run:ai Cluster

The set of components in the Kubernetes cluster that intelligently schedules and manages all workloads submitted into the Run:ai Control Plane.

UFM Appliance

NVIDIA Unified Fabric Manager (UFM) for both the storage and compute InfiniBand fabrics.

NVLink Management Software

NVLink Management Software (NMX) is an integrated platform for management and monitoring of NVLink connections.

Using Run:ai on the DGX#

Run:ai provides an interface for scheduling workloads into a DGX cluster. Users interact with Run:ai by logging into the UI or CLI from their local system.

Run:ai is capable of deploying many workload types, including interactive development jobs, batch training jobs, and ongoing inference jobs.
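
As a minimal sketch of the CLI flow (the project name, image, and job name below are placeholders, and flag syntax differs between Run:ai CLI versions, so the official docs should be treated as authoritative), submitting a one-GPU training job might look like this:

```
# Authenticate against the Run:ai control plane
runai login

# Set the project the workload will be submitted under (placeholder name)
runai config project team-a

# Submit a training job requesting one GPU; image and command are placeholders
runai submit train-demo \
  --image nvcr.io/nvidia/pytorch:24.01-py3 \
  --gpu 1 \
  -- python train.py

# List submitted jobs and check their status
runai list jobs
```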

For detailed usage information on Run:ai, refer to the official Run:ai usage docs.

Using Kubernetes#

The Run:ai platform runs on top of Kubernetes. This Kubernetes cluster spans the Run:ai control nodes and the DGX compute nodes that have been allocated to Run:ai.

The supported path for deploying workloads is through Run:ai.

Users may gain access to the cluster through NVIDIA Base Command Manager (BCM) to run kubectl or helm commands. However, workloads or custom components deployed into the cluster through kubectl or helm are not supported and may break the existing Run:ai installation or lead to cluster instability; this is not advised.
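
Where such access exists, read-only inspection can still be useful for seeing how Run:ai workloads surface as ordinary Kubernetes objects. The following is a sketch only; the namespace names are common Run:ai defaults and may differ in a given installation.

```
# List the DGX compute nodes that have joined the Kubernetes cluster
kubectl get nodes

# Run:ai cluster components typically run in the runai namespace
kubectl get pods -n runai

# Workloads submitted through Run:ai usually appear as pods in per-project
# namespaces, conventionally named runai-<project>
kubectl get pods -n runai-team-a
```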