NVIDIA DGX SuperPOD Overview

The NVIDIA DGX SuperPOD™ is a multi-user system designed to run large artificial intelligence (AI) and high-performance computing (HPC) applications efficiently. While the system is composed of many different components, it should be thought of as a single system that supports simultaneous use by many users and provides advanced access controls for queuing and scheduling resources. This ensures maximum performance, provides tools for collaboration between users, and enforces security controls to protect data and limit user interaction where necessary.

This document does not cover information about the DGX SuperPOD that is specific to local policies or general Unix/Linux topics such as access, queuing, quotas, compiling, and editing and manipulating files and data.

Logical System Diagram

Figure 1 provides a logical depiction of the DGX SuperPOD and all the components that enable it to work as a single multi-user system.

Figure 1. Logical depiction of the DGX SuperPOD


Some boxes and connections in Figure 1 represent components that are not a part of the user experience. Dotted lines indicate that there is some connectivity between two resources, but not necessarily that every sub-component is connected. The jump box is an optional component outside of the DGX SuperPOD that enables remote access into it.

Components from Figure 1 are further described in Table 1.

Table 1. DGX SuperPOD Components

Jump Box/Entry Point

The Jump Box/Entry Point is the gateway into the DGX SuperPOD, intended to provide a single entry point into the cluster and additional security when required. It is not actually a part of the DGX SuperPOD itself, but of the corporate IT environment; its function is defined by local IT requirements.

Management Nodes

The management nodes are the entry point for users into the DGX SuperPOD. A login node is a CPU-only node for lightweight tasks, where users can develop code, submit and monitor jobs, and manage their data.

Compute Nodes

The compute nodes are where user work gets done on the system. Each compute node is an individual server, but with the high-speed fabric, applications can be efficiently spread across multiple nodes.

High-Speed Storage

High-speed storage is optimized for efficient reading and writing of large data files. The high-speed storage is often treated as scratch, as it is difficult or impossible to back up all data stored on the system. This is where datasets, checkpoints, and other large files should be stored.

Home File System

The home file system is a traditional, highly reliable network file system that trades performance for stability and enterprise management features. The assigned space is generally smaller than what is available on high-speed storage. Users should store scripts, code, Dockerfiles, and other small but important files here.

Compute Fabric

The compute fabric is the high-speed network fabric that connects all the compute nodes, enabling high-bandwidth, low-latency communication between them.

Storage Fabric

The storage fabric is the high-speed network fabric dedicated to storage traffic. Storage traffic is isolated on its own fabric to prevent interference with node-to-node application traffic, which can degrade overall performance.

In-band Management Network

The in-band management network provides fast Ethernet connectivity between all the nodes in the DGX SuperPOD. While its use should be transparent to users, it carries important traffic for node management and home file system access.
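
The login-node workflow described in Table 1 (develop code, then submit and monitor jobs) is typically driven by a workload manager; DGX SuperPOD deployments commonly use Slurm, though the scheduler and its configuration are site-specific. The sketch below writes a minimal batch script and notes the usual submit/monitor commands; the job name, resource counts, and `train.py` are hypothetical placeholders, not values from this document.

```shell
# Minimal sketch of a Slurm batch job, assuming a Slurm-managed cluster.
# All directive values below are hypothetical placeholders; check your
# site's documentation for real partition names and resource limits.
cat > train.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=train-example     # hypothetical job name
#SBATCH --nodes=2                    # spread work across two compute nodes
#SBATCH --ntasks-per-node=8          # e.g. one task per GPU
#SBATCH --time=01:00:00              # wall-clock limit

srun python train.py                 # train.py is a placeholder workload
EOF

cat train.sbatch
```

From a login node, the job would then typically be submitted with `sbatch train.sbatch`, monitored with `squeue -u $USER`, and canceled, if needed, with `scancel <jobid>`.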
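
The split described above between the home file system (small, important files) and high-speed storage (large datasets and checkpoints) suggests a directory layout like the following sketch. Actual mount points are site-specific and not given in this document, so the sketch uses temporary directories as stand-ins for the two file systems.

```shell
# Sketch of the recommended home vs. scratch split. Real mount points are
# site-specific; temporary directories stand in for them here.
HOME_FS=$(mktemp -d)      # stand-in for the home file system (small, reliable)
SCRATCH=$(mktemp -d)      # stand-in for high-speed scratch storage (large, not backed up)

# Small but important files (scripts, code, Dockerfiles) go on the home file system:
mkdir -p "$HOME_FS/projects/demo"
printf '#!/bin/bash\necho "training run"\n' > "$HOME_FS/projects/demo/run.sh"

# Datasets, checkpoints, and other large files go on high-speed storage:
mkdir -p "$SCRATCH/datasets/demo" "$SCRATCH/checkpoints/demo"

ls "$SCRATCH"
```

Keeping only reproducible or re-downloadable data on scratch means that losing it costs compute time, not work, which is why the small, hard-to-recreate files belong on the home file system.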