Abstract
The NVIDIA DGX SuperPOD architecture is designed to power next-generation AI factories with unparalleled performance, scalability, and innovation, supporting customers across enterprise, higher education, research, and the public sector. It is a physical twin of the main NVIDIA research and development system, meaning the company’s infrastructure software, applications, and support are first tested and vetted on the same architecture.
This DGX SuperPOD Reference Architecture (RA) is based on DGX GB200 systems powered by Grace CPUs and Blackwell GPUs. The RA describes the components that define the scalable and modular architecture of DGX SuperPOD. DGX SuperPOD is built on the concept of scalable units (SUs); each SU contains 8 DGX GB200 systems, enabling rapid deployment of a DGX SuperPOD of any size. The RA also includes details of the SU design, the InfiniBand, NVLink, and Ethernet fabric topologies, storage system specifications, recommended rack layouts, and wiring guidelines.
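The scalable-unit model means capacity grows linearly with the number of SUs. A minimal sketch of that arithmetic, using only the 8-systems-per-SU figure stated in this RA (the helper function name is illustrative, not part of any NVIDIA tooling):

```python
# Sketch: DGX SuperPOD capacity scales linearly with scalable units (SUs).
# Per this RA, each SU contains 8 DGX GB200 systems; larger deployments
# are built simply by adding SUs.

DGX_SYSTEMS_PER_SU = 8  # stated in this RA

def superpod_systems(num_sus: int) -> int:
    """Total DGX GB200 systems in a DGX SuperPOD built from num_sus SUs."""
    if num_sus < 1:
        raise ValueError("a DGX SuperPOD needs at least one SU")
    return num_sus * DGX_SYSTEMS_PER_SU

# Example: a four-SU deployment
print(superpod_systems(4))  # → 32
```

This linearity is what the modular design buys: planning a larger deployment is a matter of replicating the SU (compute, fabric, and rack layout) rather than redesigning the cluster.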
This RA combines the latest NVIDIA technologies to help companies and industries build their own AI factories. To achieve maximum scalability, DGX SuperPOD is powered by several key NVIDIA technologies and solutions, including:
NVIDIA DGX GB200 systems: Powerful computational building block for AI and HPC.
NVIDIA NDR (400 Gbps) InfiniBand: High-performance, low-latency, and scalable network interconnect.
NVIDIA Spectrum-4 (800 Gbps) Ethernet: High-performance, low-latency, and scalable Ethernet connectivity for storage.
NVIDIA fifth-generation NVLink® technology: A high-speed interconnect between CPUs and GPUs in accelerated computing systems, providing unprecedented performance for the most demanding communication patterns.
NVIDIA Mission Control: A unified operations and orchestration software stack for managing AI factories.