NVIDIA DGX SuperPOD Overview

The NVIDIA DGX SuperPOD™ is a multi-user system designed to run large artificial intelligence (AI) and high-performance computing (HPC) applications efficiently. While the system is composed of many different components, it should be thought of as a single system that supports simultaneous use by many users and provides advanced access controls for queuing and scheduling resources. This ensures maximum performance, provides tools for collaboration between users, and enforces security controls to protect data and limit user interaction where necessary.

This document does not cover information about the DGX SuperPOD that is specific to local policies or general Unix/Linux topics such as access, queuing, quotas, compiling, and editing and manipulating files and data.

Logical System Diagram

Figure 1 provides a logical depiction of the DGX SuperPOD and all the components that enable it to work as a single multi-user system.

Figure 1. Logical depiction of the DGX SuperPOD


Some boxes and connections in Figure 1 represent components that are not a part of the user experience. Dotted lines indicate that there is some connectivity between two resources, but not necessarily that every sub-component is connected. The jump box is an optional component outside of the DGX SuperPOD that enables remote access into it.

Components from Figure 1 are further described in Table 1.

Table 1. DGX SuperPOD Components

Jump Box/Entry Point

The Jump Box/Entry Point is the gateway into the DGX SuperPOD, intended to provide a single entry point into the cluster and additional security when required. It is not actually a part of the DGX SuperPOD itself, but of the corporate IT environment; its function is defined by local IT requirements.

Management Nodes

The management nodes are the entry point for users into the DGX SuperPOD. A login node is a CPU-only node for lightweight tasks, where users can develop code, submit and monitor jobs, and manage their data.

Compute Nodes

The compute nodes are where user work gets done on the system. Each compute node is an individual server, but with the high-speed fabric, applications can be efficiently spread across multiple nodes.

High-Speed Storage

High-speed storage is optimized for efficient reading and writing of large data files. The high-speed storage is often treated as scratch, as it is difficult or impossible to back up all data stored on the system. This is where datasets, checkpoints, and other large files should be stored.

Home File System

The home file system is a traditional, highly reliable network file system that trades performance for stability and enterprise management features. The assigned space is generally smaller than what is available on high-speed storage. Users should store scripts, code, Dockerfiles, and other small but important files here.

Compute Fabric

The compute fabric is the high-speed network fabric that connects all the compute nodes, enabling high-bandwidth, low-latency communication between them.

Storage Fabric

The storage fabric is the high-speed network fabric dedicated to storage traffic. Storage traffic is isolated on its own fabric to prevent interference with node-to-node application traffic, which can degrade overall performance.

In-band Management Network

The in-band management network provides fast Ethernet connectivity between all the nodes in the DGX SuperPOD. While its use should be transparent to users, it carries important traffic for node management and home file system access.
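
The login-node workflow described in Table 1 (develop code, then submit and monitor jobs) is typically driven by a workload manager; DGX SuperPOD deployments commonly use Slurm, though the scheduler and its configuration are site-specific. The sketch below writes a minimal batch script and notes the usual submit/monitor commands; the job name, resource counts, and `train.py` are hypothetical placeholders, not values from this document.

```shell
# Minimal sketch of a Slurm batch job, assuming a Slurm-managed cluster.
# All directive values below are hypothetical placeholders; check your
# site's documentation for real partition names and resource limits.
cat > train.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=train-example     # hypothetical job name
#SBATCH --nodes=2                    # spread work across two compute nodes
#SBATCH --ntasks-per-node=8          # e.g. one task per GPU
#SBATCH --time=01:00:00              # wall-clock limit

srun python train.py                 # train.py is a placeholder workload
EOF

cat train.sbatch
```

From a login node, the job would then typically be submitted with `sbatch train.sbatch`, monitored with `squeue -u $USER`, and canceled, if needed, with `scancel <jobid>`.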
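
The split described above between the home file system (small, important files) and high-speed storage (large datasets and checkpoints) suggests a directory layout like the following sketch. Actual mount points are site-specific and not given in this document, so the sketch uses temporary directories as stand-ins for the two file systems.

```shell
# Sketch of the recommended home vs. scratch split. Real mount points are
# site-specific; temporary directories stand in for them here.
HOME_FS=$(mktemp -d)      # stand-in for the home file system (small, reliable)
SCRATCH=$(mktemp -d)      # stand-in for high-speed scratch storage (large, not backed up)

# Small but important files (scripts, code, Dockerfiles) go on the home file system:
mkdir -p "$HOME_FS/projects/demo"
printf '#!/bin/bash\necho "training run"\n' > "$HOME_FS/projects/demo/run.sh"

# Datasets, checkpoints, and other large files go on high-speed storage:
mkdir -p "$SCRATCH/datasets/demo" "$SCRATCH/checkpoints/demo"

ls "$SCRATCH"
```

Keeping only reproducible or re-downloadable data on scratch means that losing it costs compute time, not work, which is why the small, hard-to-recreate files belong on the home file system.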