Overview#

The NVIDIA DGX SuperPOD® with GB200 and B200 systems is a multi-user system designed to run large AI and HPC applications efficiently. Although a DGX SuperPOD is composed of many different components, it should be thought of as an entity that can manage simultaneous use by many users, provide advanced access controls for queuing, and schedule resources fairly to ensure maximum performance. It also provides the tools for collaboration between users and security controls to protect data and limit interaction between users where necessary. The management tools are designed to treat the multiple components as a single system. For more details about the physical architecture, refer to the NVIDIA DGX SuperPOD Reference Architecture.

This document discusses the range of features and tasks that are supported on the DGX SuperPOD. The constituent elements that make up a DGX SuperPOD, both in hardware and software, support a superset of features compared to the DGX SuperPOD solution. Contact the NVIDIA Technical Account Manager (TAM) if clarification is needed on what functionality is supported by the DGX SuperPOD product.

Note

NVIDIA DGX SuperPOD supports Slurm and Run:AI for scheduling workloads.

System Design#

The following diagram shows the logical design of the DGX SuperPOD:

The components shown in the diagram are described below:

Table 1 Component descriptions#
DGX SuperPOD Component	Description
User Jumphost	The User Jumphost is the gateway into the DGX SuperPOD intended to provide a single entry-point into the cluster and additional security when required. It is not actually a part of the DGX SuperPOD, but of the corporate IT environment. This function is defined and provided by local IT requirements.
DGX Nodes / Compute Trays	The compute trays are where the user work gets done on the system. Each compute tray is considered an individual DGX node.
Management nodes	The management nodes provide the services necessary to support operation and monitoring of the DGX SuperPOD. Services, configured in high availability (HA) mode where needed, provide the highest system availability. See the Management Servers section below for details of each node and its function.
High-speed storage	High-speed storage provides shared storage to all nodes in the DGX SuperPOD. This is where datasets, checkpoints, and other large files should be stored. High-speed storage typically holds large datasets that are being actively operated on by the DGX SuperPOD jobs. Data on the high-speed storage is a subset of all data housed in a data lake outside of the DGX SuperPOD.
Home & High Speed Storage	Shared storage on a network file system (NFS) is allocated for user home directories as well for cluster services.
InfiniBand fabric compute	The Compute InfiniBand Fabric is the high-speed network fabric connecting all compute nodes together to allow high-bandwidth and low-latency communication between GB200 racks.
InfiniBand fabric storage	The Storage InfiniBand Fabric is the high-speed network fabric dedicated for storage traffic. Storage traffic is dedicated to its own fabric to remove interference with the node-to-node application traffic that can degrade overall performance.
In-band network fabric	The In-band Network Fabric provides fast Ethernet connectivity between all nodes in the DGX SuperPOD. The In-band fabric is used for TCP/IP-based communication and services for provisioning and inband management.
Out-of-band network fabric	The out-of-band Ethernet network is used for system management using the BMC and provides connectivity to manage all networking equipment.
NVLink	NVIDIA NVLinK is a high-speed interconnect that allows multiple GPUs to communicate directly. Multi-Node NVLink is a capability enabled over an NVLink Switch network where multiple systems are interconnected to form a large GPU memory fabric also known as an NVLink Domain.

Management Servers#

The following describes the function and services running on the management servers:

Table 2 DGX SuperPOD management servers#
Server Function	Services
Head Node	Head nodes serve various functions: Provisioning: centrally store and deploy OS images of the compute, management nodes, and other various services. This ensures that there is a single authoritative source defining what should be on each node, and a way to re-provision if the node needs to be reimaged. Workload management: resource management and orchestration services that organize the resources and coordinate the scheduling of user jobs across the cluster. Metrics: system monitoring and reporting that gather all telemetry from each of the nodes. The data can be explored and analyzed through web services so better insight to the system can be studied and reported.
Login	Entry point to the DGX SuperPOD for users. CPU only node that is a Slurm client and has filesystems mounted to support development, job submission, job monitoring, and file management. Multiple nodes are included for redundancy and supporting user workloads. These hosts can also be used for container caching.
UFM Appliance	NVIDIA Unified Fabric Manager (UFM) for both storage and compute infiniBand fabric.
NVLink Management Software	NVLink Management Software (NMX) is an integrated platform for management and monitoring of NVLink connections.

Control Plane Services Distribution#

To ensure that services that are responsible for administration of the cluster (admin-space, privileged users) and non-privileged user access (user space) are separated, NVIDIA has two sets of control plane nodes.

The following diagram shows the the service distribution for Mission Control services:

The following control plane nodes are required:

2 Control Plane Nodes for Base Command Manager (BCM)
3 Control Plane Nodes for HA cluster of a Admin Space Kubernetes Control Plane
3 Control Plane Nodes for HA cluster of a User Space Kubernetes Control Plane
2 Control Plane Nodes for SLURM Login Nodes

UFM appliances are sold separately for controlling the E-W fabric and not part of the 5-tuple admin space control plane nodes.

Depending on the selected service model, a set of dedicated, user-space control plane nodes is required as well:

For Customers running SLURM: 2 Nodes to host a HA set of SLURM login nodes. SLURM login nodes are typically configured as bare-metal services. In this setup, SLURM Controller services shall be installed as bare-metal service either on the BCM head nodes or, on the admin space Kubernetes control plane nodes.
For customers running Kubernetes and Run:AI: 3 nodes to host a HA set of Run:AI control plane nodes in a Kubernetes cluster. A dedicated set of Kubernetes control plane is used to host any user-space Kubernetes services, which serves as a user-space only Kubernetes cluster. In this setup, the compute nodes are associated with the user-space Kubernetes cluster.

For detailed reference, see DGX Product Architecture and Management, Version 3.2, July 2025.