Abstract
The NVIDIA DGX SuperPOD architecture is designed to power next-generation AI factories with unparalleled performance, scalability, and innovation for customers in enterprise, higher education, research, and the public sector. It is a physical twin of the main NVIDIA research and development system, meaning the company’s infrastructure software, applications, and support are first tested and vetted on the same architecture.
This DGX SuperPOD Reference Architecture (RA) is based on DGX B300 systems powered by Blackwell GPUs. The RA discusses the components that define the scalable and modular architecture of DGX SuperPOD. DGX SuperPOD is built on the concept of scalable units (SU); each SU contains 64 DGX B300 systems, which enables rapid deployment of a DGX SuperPOD of any size. The RA also includes details regarding the SU design and specifics of the InfiniBand and Ethernet fabric topologies, storage system specifications, recommended rack layouts, and wiring guidelines.
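The SU arithmetic can be illustrated with a short sketch. The figure of 64 systems per SU comes from this RA. The default of 8 GPUs per DGX B300 system is an assumption that is not stated in this section, and the function and class names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

# Stated in this RA: each scalable unit (SU) contains 64 DGX B300 systems.
SYSTEMS_PER_SU = 64


@dataclass(frozen=True)
class SuperPodSize:
    """Hypothetical container for the size of a deployment."""
    scalable_units: int
    systems: int
    gpus: int


def superpod_size(scalable_units: int, gpus_per_system: int = 8) -> SuperPodSize:
    """Return system and GPU counts for a given number of SUs.

    gpus_per_system defaults to 8, which is an assumption and not a value
    taken from this section; override it if your systems differ.
    """
    if scalable_units < 1:
        raise ValueError("scalable_units must be at least 1")
    if gpus_per_system < 1:
        raise ValueError("gpus_per_system must be at least 1")
    systems = scalable_units * SYSTEMS_PER_SU
    return SuperPodSize(scalable_units, systems, systems * gpus_per_system)


def scalable_units_needed(target_systems: int) -> int:
    """Return the smallest number of whole SUs covering target_systems."""
    if target_systems < 1:
        raise ValueError("target_systems must be at least 1")
    # Ceiling division without floating point.
    return -(-target_systems // SYSTEMS_PER_SU)


if __name__ == "__main__":
    size = superpod_size(4)
    print(f"{size.scalable_units} SUs: {size.systems} systems, {size.gpus} GPUs")
    print(f"SUs needed for 100 systems: {scalable_units_needed(100)}")
```

Because deployments grow in whole SUs, a target system count that is not a multiple of 64 rounds up to the next SU in this sketch.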
This RA combines the latest NVIDIA technologies to help companies and industries develop their own AI factories. To maximize scalability, DGX SuperPOD is powered by several key NVIDIA technologies and solutions, including:
NVIDIA DGX B300 system: The most powerful computational building block for AI and HPC.
NVIDIA XDR (800 Gbps) InfiniBand: High-performance, low-latency, and scalable network interconnect.
NVIDIA Spectrum-X (800 Gbps) Ethernet: High-performance, low-latency, and scalable Ethernet connectivity for compute interconnect.
NVIDIA NVLink® technology: Networking technologies that connect GPUs at the NVLink layer to provide unprecedented performance for the most demanding communication patterns.
NVIDIA Mission Control: A unified operations and orchestration software stack for managing AI factories.
The DGX SuperPOD architecture integrates NVIDIA software solutions including NVIDIA Mission Control, NVIDIA AI Enterprise, CUDA, and NVIDIA Magnum IO™. These technologies help keep the system running at the highest levels of availability and performance, and NVIDIA Enterprise Support (NVEX) helps keep all components and applications running smoothly.
This reference architecture (RA) discusses the components that define the scalable and modular architecture of DGX SuperPOD. The system is built on the concept of scalable units (SU), each containing 64 DGX B300 systems, which provides for rapid deployment of systems of multiple sizes. This RA includes details regarding the SU design and specifics of InfiniBand, NVLink network, Ethernet fabric topologies, storage system specifications, recommended rack layouts, and wiring guides.
The NVIDIA DGX B300, like many of the fabric and other components, is offered in a DC Busbar version as well as the more traditional AC power supply version. This RA focuses specifically on the DC Busbar version for modern, power-efficient data centers. The centralized power shelves used for DC power supply provide better power efficiency while maintaining the performance and redundancy required for AI factories.
More information on the NVIDIA DGX B300 power supply version and other AC-powered components can be found at: https://www.nvidia.com/en-us/data-center/dgx-superpod/.
Scope of DGX SuperPOD
NVIDIA DGX SuperPOD is a product that is defined by this RA with a specific bill of materials for both hardware and software. It is designed for a single-tenant, multi-user enterprise environment.
The designed mode of operation for DGX SuperPOD is as follows:
NVIS-led installation and white-glove services for initial bring-up and commissioning.
Customer-owned infrastructure-as-a-product, where the customer is responsible for day-to-day operation with premium support from an NVIDIA Technical Account Manager and continuous bring-up service.
Functional and performance updates during the product lifecycle.
Administration of the DGX SuperPOD with NVIDIA Mission Control software.
User and application access using SLURM or NVIDIA Run:AI.
For customers who wish to modify their SuperPOD after delivery to add capabilities such as general Kubernetes cluster access, or to reconfigure it into a multi-tenant environment for enterprise private cloud or public cloud services, such modifications are performed at the customer’s own discretion and risk. NVIDIA does not support these capabilities, or their performance, as part of the DGX SuperPOD product. However, enterprise-level support remains available for the individual hardware and software components included in the SuperPOD.