Key Components of the DGX SuperPOD

The DGX SuperPOD architecture has been designed to maximize performance for state-of-the-art model training, scale to exaflops of performance, provide the highest-performance access to storage, and support customers across enterprise, higher education, research, and the public sector. It is a digital twin of the main NVIDIA research and development system, meaning the company’s software, applications, and support structure are first tested and vetted on the same architecture. Using SUs, system deployment times are reduced from months to weeks. Leveraging the DGX SuperPOD design reduces time-to-solution and time-to-market for next-generation models and applications.

DGX SuperPOD is the integration of key NVIDIA components, as well as storage solutions from partners certified to work in a DGX SuperPOD environment.

NVIDIA DGX B200 System

The NVIDIA DGX B200 system (Figure 1) is an AI powerhouse that enables enterprises to expand the frontiers of business innovation and optimization. The DGX B200 system delivers breakthrough AI performance with the most powerful chips ever built, in an eight-GPU configuration. The NVIDIA Blackwell GPU architecture provides the latest technologies that bring months of computational effort down to days and hours on some of the largest AI/ML workloads.

Figure 1. DGX B200 system

Some of the key highlights of the DGX B200 system over the DGX H100 system include:

  • 72 petaFLOPS FP8 training and 144 petaFLOPS FP4 inference

  • 2x dual-port QSFP112 NVIDIA BlueField-3 DPU

  • Fifth-generation NVIDIA NVLink

  • 1,440 GB of aggregated HBM3e memory

NVIDIA InfiniBand Technology

InfiniBand is a high-performance, low-latency, RDMA-capable networking technology, proven over 20 years in the harshest compute environments to provide the best inter-node network performance. InfiniBand continues to evolve and lead data center network performance.

The latest generation of InfiniBand, NDR, has a peak speed of 400 Gbps per direction with extremely low port-to-port latency, and it is backwards compatible with previous generations of the InfiniBand specification. InfiniBand is more than just peak bandwidth and low latency: it provides additional features to optimize performance, including adaptive routing (AR), collective communication with SHARP™, and dynamic network healing with SHIELD™, and it supports several network topologies, including fat-tree, Dragonfly, and multi-dimensional torus, to build the largest fabrics and compute systems possible.
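
As a minimal operational illustration, the sketch below shows one way to confirm that each InfiniBand adapter on a node reports an active port at the 400 Gbps NDR rate described above. It assumes the ibstat utility from the infiniband-diags package is available; the parsing is illustrative and not part of the DGX SuperPOD tooling.

```python
#!/usr/bin/env python3
"""Minimal sketch: report InfiniBand port state and rate via ibstat.

Assumes the ibstat utility (infiniband-diags) is installed; the output
parsing below is illustrative and may need adjustment per driver version.
"""
import re
import subprocess

EXPECTED_RATE_GBPS = 400  # NDR per-direction rate cited in this document

def check_ports() -> None:
    # ibstat prints one block per adapter with "State:" and "Rate:" lines per port.
    output = subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
    ca = None
    state = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("CA '"):
            ca = line.split("'")[1]
        elif line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Rate:"):
            rate = int(re.sub(r"\D", "", line))
            ok = state == "Active" and rate >= EXPECTED_RATE_GBPS
            print(f"{ca}: state={state}, rate={rate} Gbps, {'OK' if ok else 'CHECK'}")

if __name__ == "__main__":
    check_ports()
```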

Runtime and System Management

The DGX SuperPOD RA represents the best practices for building high-performance data centers. There is flexibility in how these systems can be presented to customers and users. NVIDIA Base Command Manager software is used to manage all DGX SuperPOD deployments.

DGX SuperPOD can be deployed on-premises, meaning the customer owns and manages the hardware as a traditional system. This can be within a customer’s data center or co-located at a commercial data center, but the customer owns the hardware.

Components

The hardware components of DGX SuperPOD are described in Table 1. The software components are shown in Table 2.

Table 1. DGX SuperPOD / 4 SU hardware components

| Component | Technology | Description |
|---|---|---|
| Compute nodes | NVIDIA DGX B200 system with eight B200 GPUs | The world’s premier purpose-built AI systems featuring NVIDIA B200 Tensor Core GPUs, fifth-generation NVIDIA NVLink, and fourth-generation NVIDIA NVSwitch™ technologies |
| Compute fabric | NVIDIA Quantum QM9700 NDR 400 Gbps InfiniBand | Rail-optimized, non-blocking, full fat-tree network with eight NDR400 connections per system |
| Storage fabric | NVIDIA Quantum QM9700 NDR 400 Gbps InfiniBand | The fabric is optimized to match the peak performance of the configured storage array |
| Compute/storage fabric management | NVIDIA Unified Fabric Manager Appliance, Enterprise Edition | NVIDIA UFM combines enhanced, real-time network telemetry with AI-powered cyber intelligence and analytics to manage scale-out InfiniBand data centers |
| In-band management network | NVIDIA SN4600C switch | 64-port 100 Gbps Ethernet switch providing high port density with high performance |
| Out-of-band (OOB) management network | NVIDIA SN2201 switch | 48-port 1 Gbps Ethernet switch leveraging copper ports to minimize complexity |
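
To make the compute fabric entry in Table 1 concrete, the short sketch below converts the eight NDR400 connections per DGX B200 system into aggregate per-system injection bandwidth. The link count and 400 Gbps rate come from this document; the script itself is illustrative arithmetic, not part of the reference architecture.

```python
# Illustrative arithmetic: aggregate compute-fabric bandwidth per DGX B200 system,
# based on the eight NDR400 (400 Gbps) links per system listed in Table 1.
LINKS_PER_SYSTEM = 8
LINK_RATE_GBPS = 400          # NDR, per direction

total_gbps = LINKS_PER_SYSTEM * LINK_RATE_GBPS
total_gbytes_per_s = total_gbps / 8  # bits to bytes

print(f"Per-system injection bandwidth: {total_gbps} Gbps "
      f"(~{total_gbytes_per_s:.0f} GB/s per direction)")
# Expected output: 3200 Gbps (~400 GB/s per direction)
```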

Table 2. DGX SuperPOD software components

| Component | Description |
|---|---|
| NVIDIA Base Command Manager | Comprehensive AI infrastructure management for AI clusters. It automates provisioning and administration and supports cluster sizes into the thousands of nodes |
| NVIDIA AI Enterprise | Best-in-class development tools and frameworks for the AI practitioner and reliable management and orchestration for IT professionals |
| Magnum IO | Enables increased performance for AI and HPC |
| NVIDIA NGC | The NGC catalog provides a collection of GPU-optimized containers for AI and HPC |
| Slurm | A classic workload manager used to manage complex workloads in a multi-node, batch-style compute environment |
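
Because Slurm is the workload manager listed in Table 2, the sketch below shows one way a multi-node batch job might be submitted on a cluster like this. It is a minimal, hypothetical example: the job name, node count, and train.py entry point are placeholders, and the file is written in Python with #SBATCH directives, which sbatch reads from leading comment lines regardless of the interpreter.

```python
#!/usr/bin/env python3
#SBATCH --job-name=example-train     # hypothetical job name
#SBATCH --nodes=4                    # four DGX B200 systems
#SBATCH --ntasks-per-node=8          # one task per GPU
#SBATCH --gpus-per-node=8            # all eight GPUs in each system
#SBATCH --exclusive                  # reserve whole nodes
#SBATCH --time=04:00:00
"""Minimal sketch of a Slurm batch job for a multi-node training run.

Submit with: sbatch this_file.py
Everything here is illustrative; adapt partitions, accounts, and the
training entry point (train.py is a placeholder) to the actual cluster.
"""
import subprocess

# srun launches one copy of the placeholder training script per task,
# spread across all allocated nodes and GPUs.
subprocess.run(["srun", "python3", "train.py"], check=True)
```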

Design Requirements

DGX SuperPOD is designed to minimize system bottlenecks throughout the tightly coupled configuration to provide the best performance and application scalability. Each subsystem has been thoughtfully designed to meet this goal. In addition, the overall design remains flexible so that data center requirements can be tailored to better integrate into existing data centers.

System Design

DGX SuperPOD is optimized for a customer’s particular workload of multi-node AI and HPC applications:

  • A modular architecture based on SUs of 32 DGX B200 systems each (see the sizing sketch after this list).

  • A fully tested system scales to four SUs, but larger deployments can be built based on customer requirements.

  • Rack design supports two DGX B200 systems per rack, and the rack layout can be modified to accommodate different data center requirements.

  • Storage partner equipment that has been certified to work in DGX SuperPOD environments.

  • Full system support—including compute, storage, network, and software—is provided by NVIDIA Enterprise Support (NVEX).
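
As a quick sizing sketch for the modular SU architecture described above, the snippet below scales the node and GPU counts given in this document (32 DGX B200 systems per SU, eight GPUs per system) by the number of SUs. It is illustrative arithmetic only.

```python
# Illustrative sizing arithmetic for the modular SU design described above.
NODES_PER_SU = 32   # DGX B200 systems per scalable unit (SU)
GPUS_PER_NODE = 8   # B200 GPUs per DGX B200 system

def size(num_sus: int) -> dict:
    """Return node and GPU counts for a deployment of num_sus scalable units."""
    nodes = num_sus * NODES_PER_SU
    return {"scalable_units": num_sus, "nodes": nodes, "gpus": nodes * GPUS_PER_NODE}

# The fully tested configuration scales to four SUs; larger builds are possible.
for sus in (1, 2, 4):
    print(size(sus))
# e.g. 4 SUs -> 128 DGX B200 systems and 1,024 B200 GPUs
```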

Compute Fabric

  • The compute fabric is rail-optimized to the top layer of the fabric (see the rail-to-leaf mapping sketch after this list).

  • The compute fabric is a balanced, full fat-tree topology.

  • Managed NDR switches are used throughout the design to provide better management of the fabric.

  • The fabric is designed to support the latest SHARPv3 features.
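
To illustrate what rail optimization means in practice, the hypothetical sketch below assigns each of the eight compute fabric links per DGX B200 system to a per-rail leaf switch within its SU, so that same-rail traffic between nodes of an SU stays on one leaf. The eight-links-per-system and 32-systems-per-SU figures come from this document; the specific node-to-leaf mapping and switch naming are assumptions for illustration only.

```python
# Illustrative sketch of a rail-optimized leaf assignment (not the official cabling plan).
# Assumption: within each SU, rail i of every node connects to that SU's leaf switch i,
# so same-rail traffic between nodes in an SU stays on a single leaf switch.
NODES_PER_SU = 32
RAILS_PER_NODE = 8  # eight NDR400 connections per DGX B200 system (Table 1)

def leaf_for(node_index: int, rail: int) -> str:
    """Return a hypothetical leaf-switch label for a given node and rail."""
    if not 0 <= rail < RAILS_PER_NODE:
        raise ValueError("rail must be 0-7")
    su = node_index // NODES_PER_SU
    return f"su{su:02d}-leaf{rail:02d}"

# Nodes 0 and 31 share leaf switches rail-for-rail; node 32 is in the next SU.
print(leaf_for(0, 3), leaf_for(31, 3), leaf_for(32, 3))
# -> su00-leaf03 su00-leaf03 su01-leaf03
```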

Storage Fabric

The storage fabric provides high bandwidth to shared storage. It also has the following characteristics:

  • It is independent of the compute fabric to maximize both storage and application performance.

  • It provides single-node bandwidth of at least 40 GB/s to each DGX B200 system (see the aggregate-bandwidth sketch after this list).

  • Storage is provided over InfiniBand and leverages RDMA to provide maximum performance and minimize CPU overhead.

  • It is flexible and can scale to meet specific capacity and bandwidth requirements.

  • User-accessible management nodes provide access to shared storage.
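
As a rough planning sketch, the snippet below multiplies the per-node bandwidth floor stated above (at least 40 GB/s per DGX B200 system) by the node count of a deployment to estimate the aggregate bandwidth the storage fabric would carry if every node drove its floor concurrently. Actual storage array sizing depends on the configured array; only the 40 GB/s and 32-systems-per-SU values come from this document.

```python
# Illustrative aggregate storage-bandwidth estimate based on the per-node
# floor stated above (at least 40 GB/s to each DGX B200 system).
MIN_NODE_STORAGE_GBPS = 40   # GB/s per DGX B200 system
NODES_PER_SU = 32

for sus in (1, 2, 4):
    nodes = sus * NODES_PER_SU
    print(f"{sus} SU(s): {nodes} nodes -> >= {nodes * MIN_NODE_STORAGE_GBPS} GB/s aggregate")
# e.g. 4 SUs -> 128 nodes -> at least 5,120 GB/s of aggregate storage bandwidth
```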

In-Band Management Network

  • The in-band management network fabric is Ethernet-based and is used for node provisioning, data movement, Internet access, and other services that must be accessible by the users.

  • The in-band management network connections for compute and management servers operate at 100 Gbps and are bonded for resiliency, as sketched below.
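
As a minimal illustration of the bonded in-band links, the sketch below reads the Linux bonding driver's status file to report the state and speed of each member link. The bond0 interface name is an assumption, not part of the reference architecture, and the parsing is illustrative.

```python
#!/usr/bin/env python3
"""Minimal sketch: report the members of a bonded management interface.

Assumes the Linux bonding driver and a hypothetical bond0 interface; reads
the driver's status file under /proc/net/bonding.
"""
from pathlib import Path

BOND = "bond0"  # hypothetical bond interface name

def report(bond: str = BOND) -> None:
    text = Path(f"/proc/net/bonding/{bond}").read_text()
    slaves = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Slave Interface:"):
            current = {"name": line.split(":", 1)[1].strip()}
            slaves.append(current)
        elif current is not None and line.startswith("MII Status:"):
            current["status"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Speed:"):
            current["speed"] = line.split(":", 1)[1].strip()
    for s in slaves:
        print(f"{bond}/{s['name']}: {s.get('status')} at {s.get('speed')}")

if __name__ == "__main__":
    report()
```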

Out-of-Band Management Network

The OOB management network connects all the baseboard management controller (BMC) ports, as well as other devices that should be physically isolated from system users.

Storage Requirements

The DGX SuperPOD compute architecture must be paired with a high-performance, balanced storage system to maximize overall system performance. DGX SuperPOD is designed to use two separate storage systems: high-performance storage (HPS), optimized for throughput and parallel I/O, and user storage, optimized for higher IOPS and metadata workloads.

High-Performance Storage

HPS must provide:

  • A high-performance, resilient, POSIX-style file system optimized for multi-threaded read and write operations across multiple nodes.

  • Native InfiniBand support.

  • Transparent caching of data in local system RAM.

  • Transparent use of local disk for read and write caching.

User Storage

User storage must:

  • Be designed for high metadata performance, IOPS, and key enterprise features such as checkpointing. This differs from the HPS, which is optimized for parallel I/O and large capacity.

  • Communicate over Ethernet to provide a secondary path to storage so that, in the event of a failure of the storage fabric or HPS, nodes can still be accessed and managed by administrators.