Abstract#

The NVIDIA DGX SuperPOD™ architecture is designed to power next-generation AI factories with unparalleled performance, scalability, and innovation, supporting customers across enterprise, higher education, research, and the public sector. It is a physical twin of the main NVIDIA research and development system, meaning the company’s infrastructure software, applications, and support are first tested and vetted on the same architecture.

This DGX SuperPOD Reference Architecture (RA) is based on DGX Rubin NVL8 systems powered by NVIDIA Rubin GPUs. The RA discusses the components that define the scalable and modular architecture of DGX SuperPOD. DGX SuperPOD is built on the concept of the Scalable Unit (SU); each SU contains 72 DGX Rubin NVL8 systems, which enables rapid deployment of a DGX SuperPOD of any size. The RA also includes details regarding the SU design and specifics of the InfiniBand, NVLink, and Ethernet fabric topologies, storage system specifications, recommended rack layouts, and wiring guidelines.
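To make the SU sizing concrete, the following is a minimal back-of-the-envelope sketch. It assumes eight Rubin GPUs per NVL8 system, as the NVL8 designation implies; the function name is illustrative and not part of this RA.

```python
# SU sizing sketch: 72 DGX Rubin NVL8 systems per Scalable Unit,
# assuming 8 Rubin GPUs per system (per the NVL8 designation).
SYSTEMS_PER_SU = 72
GPUS_PER_SYSTEM = 8

def superpod_size(num_sus: int) -> dict:
    """Return system and GPU counts for a DGX SuperPOD of num_sus SUs."""
    systems = num_sus * SYSTEMS_PER_SU
    return {"systems": systems, "gpus": systems * GPUS_PER_SYSTEM}

print(superpod_size(1))  # one SU: 72 systems, 576 GPUs
print(superpod_size(4))  # four SUs: 288 systems, 2304 GPUs
```

This is only arithmetic over the counts stated in this RA; actual deployment sizes are governed by the bill of materials.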


Figure 1 NVIDIA DGX SuperPOD Scalable Unit#

This RA combines the latest NVIDIA technologies to help companies and industries develop their own AI factories. To achieve maximum scalability, DGX SuperPOD is powered by several key NVIDIA technologies and solutions, including:

  • NVIDIA DGX Rubin NVL8 system, which provides one of the most powerful computational building blocks for AI and High-Performance Computing (HPC).

  • NVIDIA Quantum-X800 (XDR, 800 Gbps) InfiniBand provides a high-performance, low-latency, and scalable network interconnect.

  • NVIDIA® NVLink™ technology forms a high-bandwidth NVLink fabric connecting the eight Rubin GPUs within each NVL8 configuration, delivering unprecedented performance for the most demanding GPU-to-GPU communication patterns.

  • NVIDIA ConnectX®‑9 SuperNIC™ delivers up to 800 Gb/s ultra‑low‑latency InfiniBand and Ethernet connectivity, supercharging GPU, CPU, and storage interconnects to sustain large‑scale AI training and inference workloads.

  • NVIDIA BlueField®-4 Data Processing Unit (DPU) delivers up to 800 Gb/s of connectivity, while offloading networking, storage, and security services from host CPUs, enabling programmable, low‑latency data movement and zero‑trust isolation.

  • NVIDIA DOCA serves as the infrastructure software framework for NVIDIA BlueField DPUs and NVIDIA ConnectX SuperNICs across AI factories and DGX SuperPOD deployments.

  • NVIDIA Mission Control is a unified operations and orchestration software stack for managing AI factories.

The DGX SuperPOD architecture integrates NVIDIA software solutions including NVIDIA Mission Control, NVIDIA DOCA, NVIDIA AI Enterprise, CUDA, and NVIDIA Magnum IO™. These technologies help keep the system running at the highest levels of availability and performance and, with NVIDIA Enterprise Support (NVEX), keep all components and applications running smoothly.


The NVIDIA DGX™ Rubin NVL8 is offered in a liquid-cooled, DC-busbar-powered configuration only. NVIDIA MGX™-compatible racks with a supported busbar and liquid cooling manifold are required for the installation of DGX Rubin NVL8 systems.

Scope of DGX SuperPOD#

NVIDIA DGX SuperPOD is a product defined by this RA, with a specific bill of materials for both hardware and software. It is designed for a single-tenant, multi-user enterprise environment.

The designed mode of operation for DGX SuperPOD is as follows:

  • NVIDIA Infrastructure Services (NVIS)-led installation and expert service for initial bring-up and commissioning.

  • Customer-owned infrastructure-as-a-product, where the customer is responsible for day-to-day operation with premium support from an NVIDIA Premium Technical Account Manager (TAM) and optional Continuous Bring-Up (CBU) service.

  • Functional and performance updates during the product lifecycle.

  • Administration of the DGX SuperPOD with NVIDIA Mission Control software.

  • User and application access using Slurm or NVIDIA Run:ai.
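As a hypothetical illustration of the Slurm-based user access mentioned above, a batch job on DGX systems might look like the following. The job name, time limit, and script name are assumptions for illustration only; actual partitions and resource policies are site-specific and not defined by this RA.

```shell
#!/bin/bash
# Illustrative Slurm batch script (not part of this RA).
#SBATCH --job-name=train-demo
#SBATCH --nodes=2                 # two DGX Rubin NVL8 systems
#SBATCH --gpus-per-node=8         # all eight GPUs in each system
#SBATCH --ntasks-per-node=8       # one task per GPU
#SBATCH --time=01:00:00

# train.py is a placeholder for the user's application.
srun python train.py
```

This is a configuration fragment: it only runs under a Slurm-managed cluster, and the `#SBATCH` directives shown are standard Slurm options for node, GPU, and task allocation.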

For customers who want to modify their DGX SuperPOD after delivery for additional capabilities, such as general Kubernetes cluster access or reconfiguration into a multi-tenant environment for enterprise private cloud or public cloud services, modifications are performed at the customer’s own discretion and risk. NVIDIA does not support these capabilities or their performance as part of the DGX SuperPOD product. However, enterprise-level support remains available for the NVIDIA components (both software and hardware) included in the DGX SuperPOD.