Key Components of the DGX SuperPOD

The DGX SuperPOD architecture is designed to maximize performance for state-of-the-art model training, scale to exaflops of performance, deliver the highest performance to and from storage, and support all customers in the enterprise, higher education, research, and the public sector. It is a digital twin of the main NVIDIA research and development system, meaning the company’s software, applications, and support structure are first tested and vetted on the same architecture. Using scalable units (SUs), system deployment times are reduced from months to weeks. Leveraging the DGX SuperPOD design reduces time-to-solution and time-to-market for next-generation models and applications.

The DGX SuperPOD is the integration of key NVIDIA components, as well as storage solutions from partners certified to work in a DGX SuperPOD environment.

NVIDIA DGX H100 System

The NVIDIA DGX H100 system (Figure 1) is an AI powerhouse that enables enterprises to expand the frontiers of business innovation and optimization. The fourth generation of the NVIDIA DGX system, it delivers AI excellence in an eight-GPU configuration. The NVIDIA Hopper GPU architecture provides the latest technologies, such as the Transformer Engine and fourth-generation NVLink, that bring months of computational effort down to days or hours on some of the largest AI/ML workloads.

Figure 1. DGX H100 system


Some of the key highlights of the DGX H100 system over the DGX A100 system include:

  • Up to 9X more performance with 32 petaFLOPS at FP8 precision.

  • Dual 56-core 4th Gen Intel® Xeon® Scalable processors with PCIe 5.0 support and DDR5 memory.

  • 2X faster networking and storage @ 400 Gbps InfiniBand/Ethernet with NVIDIA ConnectX®-7 smart network interface cards (SmartNICs).

  • 1.5X higher bandwidth per GPU @ 900 GBps with fourth-generation NVIDIA NVLink.

  • 640 GB of aggregate HBM3 memory with 24 TB/s of aggregate memory bandwidth, 1.5X higher than the DGX A100 system (these aggregate figures are worked through in the sketch after this list).
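
The aggregate figures above follow directly from the per-GPU numbers. A minimal sketch, assuming eight GPUs per system and 80 GB of HBM3 per GPU; the per-GPU FP8 and HBM3 bandwidth values below are implied by the quoted totals, not stated in this document:

```python
# Sketch: derive the DGX H100 aggregate figures quoted above from per-GPU values.
# Assumptions: eight H100 GPUs per DGX H100 system, 80 GB of HBM3 per GPU.
GPUS_PER_SYSTEM = 8
HBM3_PER_GPU_GB = 80             # GB per H100 GPU (see Table 1)
FP8_PER_SYSTEM_PFLOPS = 32       # quoted system total at FP8 precision
HBM3_BW_PER_SYSTEM_TBPS = 24     # quoted aggregate memory bandwidth (TB/s)
NVLINK_BW_PER_GPU_GBPS = 900     # quoted per-GPU NVLink bandwidth (GB/s)

print(f"Aggregate HBM3 capacity:   {GPUS_PER_SYSTEM * HBM3_PER_GPU_GB} GB")              # 640 GB
print(f"FP8 per GPU (implied):     {FP8_PER_SYSTEM_PFLOPS / GPUS_PER_SYSTEM} petaFLOPS")
print(f"HBM3 bandwidth per GPU:    {HBM3_BW_PER_SYSTEM_TBPS / GPUS_PER_SYSTEM} TB/s (implied)")
print(f"NVLink bandwidth (system): {GPUS_PER_SYSTEM * NVLINK_BW_PER_GPU_GBPS / 1000} TB/s")
```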

NVIDIA InfiniBand Technology

InfiniBand is a high-performance, low latency, RDMA capable networking technology, proven over 20 years in the harshest compute environments to provide the best inter-node network performance. Driven by the InfiniBand Trade Association (IBTA), it continues to evolve and lead data center network performance.

The latest generation of InfiniBand, NDR, has a peak speed of 400 Gbps per direction and is backward compatible with previous generations of the InfiniBand specification. InfiniBand is more than just peak performance: it provides additional features to optimize performance, including adaptive routing (AR), collective communication with SHARP™, and dynamic network healing with SHIELD™, and it supports several network topologies, including fat-tree, Dragonfly, and multi-dimensional torus, to build the largest fabrics and compute systems possible.
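
For a rough sense of what NDR means per node, here is a minimal conversion sketch, assuming 400 Gbps per direction per port (as stated above), eight compute-fabric connections per DGX H100 system (per Table 1), and ignoring encoding and protocol overhead:

```python
# Sketch: rough per-node compute-fabric bandwidth from the NDR link speed.
# Assumptions: 400 Gbps per direction per port, eight connections per DGX H100 system,
# and no allowance for encoding or protocol overhead.
NDR_GBPS_PER_DIRECTION = 400
CONNECTIONS_PER_SYSTEM = 8
BITS_PER_BYTE = 8

per_port_GBps = NDR_GBPS_PER_DIRECTION / BITS_PER_BYTE   # ~50 GB/s per port, per direction
per_node_GBps = per_port_GBps * CONNECTIONS_PER_SYSTEM   # ~400 GB/s injection bandwidth per node

print(f"Per port: ~{per_port_GBps:.0f} GB/s per direction")
print(f"Per node: ~{per_node_GBps:.0f} GB/s per direction across eight rails")
```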

Runtime and System Management

The DGX SuperPOD RA represents the best practices for building high-performance data centers. There is flexibility in how these systems can be presented to customers and users. NVIDIA Base Command software is used to manage all DGX SuperPOD deployments.

DGX SuperPOD can be deployed on-premises, meaning the customer owns and manages the hardware as a traditional system. This can be within a customer’s data center or co-located at a commercial data center, but the customer owns the hardware. For on-premises solutions, the customer has the option to operate the system with a secure, cloud-native interface through NVIDIA NGC™.

Components

The hardware components of the DGX SuperPOD are described in Table 1. The software components are shown in Table 2.

Table 1. DGX SuperPOD / 4 SU configuration components

Component | Technology | Description
Compute nodes | NVIDIA DGX H100 system with eight 80 GB H100 GPUs | Fourth generation of the world’s premier purpose-built AI systems, featuring NVIDIA H100 Tensor Core GPUs, fourth-generation NVIDIA NVLink, and third-generation NVIDIA NVSwitch™ technologies
Compute fabric | NVIDIA Quantum QM9700 NDR 400 Gbps InfiniBand | Rail-optimized, full fat-tree network with eight NDR400 connections per system
Storage fabric | NVIDIA Quantum QM9700 NDR 400 Gbps InfiniBand | Fabric optimized to match the peak performance of the configured storage array
Compute/storage fabric management | NVIDIA Unified Fabric Manager, Enterprise Edition | NVIDIA UFM combines enhanced, real-time network telemetry with AI-powered cyber intelligence and analytics to manage scale-out InfiniBand data centers
In-band management network | NVIDIA SN4600C switch | 64-port 100 Gbps Ethernet switch providing high port density with high performance
Out-of-band (OOB) management network | NVIDIA SN2201 switch | 48-port 1 Gbps Ethernet switch leveraging copper ports to minimize complexity

Table 2. DGX SuperPOD software components

Component | Description
NVIDIA Base Command Manager <https://docs.nvidia.com/base-command-manager/index.html> | Offers a comprehensive cluster management solution for heterogeneous high-performance computing (HPC) and AI server clusters. It automates provisioning and administration and supports clusters scaling into the thousands of nodes
NVIDIA Base Command Platform <https://www.nvidia.com/en-us/data-center/base-command-platform/> | A software service for enterprise-class AI training that enables businesses and their data scientists to accelerate AI development
NVIDIA AI Enterprise | Best-in-class development tools and frameworks for the AI practitioner and reliable management and orchestration for IT professionals
Magnum IO | Enables increased performance for AI and HPC
NVIDIA NGC | The NGC catalog provides a collection of GPU-optimized containers for AI and HPC
Slurm | A classic workload manager used to orchestrate complex workloads in a multi-node, batch-style compute environment

Design Requirements

The DGX SuperPOD is designed to minimize system bottlenecks throughout the tightly coupled configuration and to provide the best performance and application scalability. Each subsystem has been thoughtfully designed to meet this goal. In addition, the overall design remains flexible so that it can be tailored to the requirements of existing data centers.

System Design

The DGX SuperPOD is optimized for a customer’s particular workload of multi-node AI, HPC, and hybrid applications:

  • A modular architecture based on SUs of 32 DGX H100 systems each.

  • The fully tested configuration scales to four SUs, but larger deployments can be built based on customer requirements (see the sizing sketch after this list).

  • Rack design can support one, two, or four DGX H100 systems per rack, so that the rack layout can be modified to accommodate different data center requirements.

  • Storage partner equipment that has been certified to work in DGX SuperPOD environments.

  • Full system support—including compute, storage, network, and software—is provided by NVIDIA Enterprise Support (NVES).
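
The bullets above imply straightforward sizing arithmetic. A minimal sketch, assuming 32 DGX H100 systems per SU and eight GPUs per system as described in this document; real deployments may reserve rack space or node slots for management equipment:

```python
# Sketch: back-of-the-envelope sizing for a DGX SuperPOD built from scalable units (SUs).
# Assumptions: 32 DGX H100 systems per SU and eight GPUs per system, as described above;
# real deployments may reserve rack space or node slots for management equipment.
SYSTEMS_PER_SU = 32
GPUS_PER_SYSTEM = 8

def superpod_size(num_sus: int, systems_per_rack: int) -> dict:
    """Return system, GPU, and compute-rack counts for a given number of SUs."""
    systems = num_sus * SYSTEMS_PER_SU
    return {
        "systems": systems,
        "gpus": systems * GPUS_PER_SYSTEM,
        # The rack layout supports one, two, or four DGX H100 systems per rack.
        "compute_racks": -(-systems // systems_per_rack),  # ceiling division
    }

for sus in (1, 2, 4):
    print(f"{sus} SU(s): {superpod_size(sus, systems_per_rack=4)}")
```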

Compute Fabric

  • The compute fabric is rail-optimized to the top layer of the fabric.

  • The compute fabric is a balanced, full fat-tree.

  • Managed NDR switches are used throughout the design to provide better management of the fabric.

  • The fabric is designed to support the latest SHARPv3 features (a port-count sketch for the rail-optimized fat-tree follows this list).
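
A minimal port-count sketch for the rail-optimized leaf layer, assuming 32 systems per SU, eight NDR400 connections per system (one per rail, per Table 1), and 64 NDR 400 Gbps ports per QM9700 switch; the switch port count is an assumption, not stated in this document:

```python
# Sketch: leaf-layer port accounting for the rail-optimized compute fabric.
# From this document: each DGX H100 system has eight NDR400 compute connections (one per
# rail) and each SU contains 32 systems. Assumed (not stated here): 64 NDR 400 Gbps ports
# per NVIDIA Quantum QM9700 switch.
SYSTEMS_PER_SU = 32
RAILS_PER_SYSTEM = 8
QM9700_NDR_PORTS = 64  # assumed port count per switch

# In a rail-optimized layout, rail N of every system in an SU lands on the same leaf switch.
downlinks_per_leaf = SYSTEMS_PER_SU                        # node-facing ports on one leaf
uplinks_per_leaf = QM9700_NDR_PORTS - downlinks_per_leaf   # ports left for the spine layer

print(f"Leaf switches per SU (one per rail): {RAILS_PER_SYSTEM}")
print(f"Node-facing ports per leaf:          {downlinks_per_leaf}")
print(f"Spine-facing ports per leaf:         {uplinks_per_leaf} (balanced, full fat-tree)")
```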

Storage Fabric

The storage fabric provides high bandwidth to shared storage. It also has these characteristics:

  • It is independent of the compute fabric to maximize both storage and application performance.

  • Provides single-node bandwidth of at least 40 GBps to each DGX H100 system (see the bandwidth sketch after this list).

  • Storage is provided over InfiniBand and leverages RDMA to provide maximum performance and minimize CPU overhead.

  • It is flexible and can scale to meet specific capacity and bandwidth requirements.

  • User-accessible management nodes provide access to shared storage.
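
A minimal sketch of the aggregate HPS bandwidth implied by the per-node requirement above, assuming all systems drive storage concurrently and 32 systems per SU:

```python
# Sketch: minimum aggregate high-performance storage (HPS) bandwidth implied by the
# per-node requirement above. Assumptions: every DGX H100 system drives storage
# concurrently; 32 systems per SU as described earlier.
MIN_BW_PER_NODE_GBPS = 40   # GB/s delivered to each DGX H100 system
SYSTEMS_PER_SU = 32

for sus in (1, 2, 4):
    systems = sus * SYSTEMS_PER_SU
    aggregate_tbps = systems * MIN_BW_PER_NODE_GBPS / 1000
    print(f"{sus} SU(s): {systems} systems -> at least {aggregate_tbps:.2f} TB/s aggregate")
```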

In-Band Management Network

  • The in-band management network fabric is Ethernet-based and is used for node provisioning, data movement, Internet access, and other services that must be accessible by the users.

  • The in-band management network connections for compute and management servers operate at 100 Gbps and are bonded for resiliency.

Out-of-Band Management Network

The OOB management network connects all the baseboard management controller (BMC) ports, as well as other devices that should be physically isolated from system users.

Storage Requirements

The DGX SuperPOD compute architecture must be paired with a high-performance, balanced storage system to maximize overall system performance. The DGX SuperPOD is designed to use two separate storage systems: high-performance storage (HPS), optimized for throughput and parallel I/O, and user storage, optimized for higher IOPS and metadata workloads.

High-Performance Storage

HPS must provide:

  • A high-performance, resilient, POSIX-style file system optimized for multi-threaded read and write operations across multiple nodes.

  • Native InfiniBand support.

  • Local system RAM for transparent caching of data.

  • Transparent use of local disk for caching of larger datasets.

User Storage

User storage must:

  • Be designed for high metadata performance, IOPS, and key enterprise features such as checkpointing. This is different from the HPS, which is optimized for parallel I/O and large capacity.

  • Communicate over Ethernet to provide a secondary, parallel path to storage so that, in the event of a failure of the storage fabric or HPS, nodes can still be accessed and managed by administrators.