Architecture#

This documentation is part of NVIDIA DGX BasePOD: Deployment Guide Featuring NVIDIA DGX A100 Systems.

Hardware Overview#

The DGX BasePOD consists of compute nodes, five control plane servers (two for cluster management and three for the Kubernetes (K8s) control plane), and the associated storage and networking infrastructure.

An overview of the hardware is in Table 1. Details about the hardware that can be used and how it should be cabled are given in the NVIDIA DGX BasePOD Reference Architecture.

This deployment guide describes the steps necessary to configure and test a four-node DGX BasePOD after the physical installation has taken place. DGX BasePOD deployments of other sizes, or deployments tailored to specific customer environments, will need minor adjustments to some configurations, but the overall procedure described in this document is largely applicable to any DGX A100-based DGX BasePOD deployment.

Table 1. DGX BasePOD components#

  • Compute nodes: DGX A100 systems

  • Compute fabric: NVIDIA Quantum QM8700 HDR 200 Gbps InfiniBand

  • Management fabric: NVIDIA SN4600 switches

  • Storage fabric: NVIDIA SN4600 switches for Ethernet-attached storage; NVIDIA Quantum QM8700 HDR 200 Gbps InfiniBand for InfiniBand-attached storage

  • Out-of-band management fabric: NVIDIA SN2201 switches

  • Control plane: minimum requirements for each server:

      • 64-bit x86 processor, AMD EPYC 7272 or equivalent

      • 256 GB memory

      • 1 TB SSD

      • Two 100 Gbps network ports
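
As a quick check during bring-up, the following minimal Python sketch reports a server's memory, root disk capacity, and 100 Gbps-capable ports so they can be compared against these minimums. It assumes a Linux host exposing the standard /proc/meminfo and /sys/class/net interfaces; the thresholds are illustrative and not part of any BasePOD tooling.

    import os
    import shutil

    MIN_MEM_GB = 256
    MIN_DISK_TB = 1.0
    MIN_NIC_SPEED_MBPS = 100_000   # 100 Gbps; sysfs reports link speed in Mb/s

    def mem_gb():
        # MemTotal is reported in kB in /proc/meminfo.
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) / (1024 * 1024)
        return 0.0

    def nic_speeds():
        # Down links and virtual devices raise errors or report -1; skip them.
        speeds = {}
        for dev in os.listdir("/sys/class/net"):
            try:
                with open(f"/sys/class/net/{dev}/speed") as f:
                    speeds[dev] = int(f.read().strip())
            except (OSError, ValueError):
                continue
        return speeds

    disk_tb = shutil.disk_usage("/").total / 1e12
    fast_ports = [dev for dev, speed in nic_speeds().items() if speed >= MIN_NIC_SPEED_MBPS]

    print(f"CPU threads: {os.cpu_count()}")
    print(f"Memory: {mem_gb():.0f} GB (minimum {MIN_MEM_GB} GB)")
    print(f"Root disk: {disk_tb:.2f} TB (minimum {MIN_DISK_TB} TB)")
    print(f"100 Gbps-capable ports: {fast_ports} (minimum two)")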

Networking#

This section covers the DGX system network ports and an overview of the networks used by DGX BasePOD.

DGX A100 System Network Ports#

Figure 1 shows the rear of the DGX A100 system with the network port configuration used in this solution guide.

Figure 1. DGX A100 system rear#

The following ports are selected for DGX BasePOD networking:

  • Four single-port ConnectX-6 cards are used for the InfiniBand compute fabric, two on either side of the chassis (marked in red).

  • Two ports of the dual-port ConnectX-6 cards are configured as a bonded Ethernet interface for in-band management and storage networks. These are the bottom port from slot 4 and the right port from slot 5 (marked in blue).

  • BMC network access is provided through the out-of-band network (marked in gray).

The networking ports and their mapping are described in the Network Ports section of the NVIDIA DGX A100 System User Guide.
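
To confirm that the expected ConnectX-6 adapters are visible from the DGX OS, a minimal Python sketch such as the following can list each adapter and its port state. It assumes NVIDIA/Mellanox OFED is installed and exposes the standard /sys/class/infiniband hierarchy; once cabled, the four compute-fabric HCAs should report a link layer of InfiniBand.

    import os

    IB_SYSFS = "/sys/class/infiniband"

    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return "unknown"

    for dev in sorted(os.listdir(IB_SYSFS)):
        ports_dir = os.path.join(IB_SYSFS, dev, "ports")
        for port in sorted(os.listdir(ports_dir)):
            state = read(os.path.join(ports_dir, port, "state"))      # e.g. "4: ACTIVE"
            link = read(os.path.join(ports_dir, port, "link_layer"))  # "InfiniBand" or "Ethernet"
            print(f"{dev} port {port}: {link}, {state}")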

DGX BasePOD Network Overview#

There are four networks in a DGX BasePOD configuration:

  • internalnet — Network used exclusively within the cluster, for storage and in-band management.

  • externalnet — Network connecting the DGX BasePOD to an external network, such as a corporate or campus network.

  • ipminet — Network for out-of-band management, connecting BMCs.

  • ibnet — InfiniBand network connecting all DGX systems’ ConnectX-6 Compute Fabric HCAs.

These are shown in Figure 3.

Figure 3. Network design and topology diagram#

internalnet and externalnet#

internalnet uses VLAN 122 and externalnet uses VLAN 121. Both VLANs are configured on the SN4600 switches, which are the backbone of the DGX BasePOD Ethernet networking. Each DGX system connects to the SN4600 switches with a bonded interface that consists of two physical interfaces: the slot 4 bottom port (storage 4-2) and the slot 5 right port (storage 5-2), as described in the Network Ports section of the NVIDIA DGX A100 System User Guide.

The K8s control plane nodes and the NFS storage device have a similar bonded interface configuration connected to the SN4600 switches. Two SN4600 switches configured with Multi-chassis Link Aggregation (MLAG) provide redundancy for the DGX systems, K8s control plane nodes, and other devices with bonded interfaces. The bonded interfaces are configured in trunk mode with VLAN 122 as the native VLAN, while the port connected to the BCM head node uses access mode. The BGP configurations used between switches are described in Table 2. All connected subnets are redistributed into BGP.
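
From any host with a bonded in-band interface, a minimal Python sketch like the one below can confirm that both member ports of the bond are up. It assumes the standard Linux /proc/net/bonding report and uses bond0 as a placeholder for the bond name configured on the node.

    # "bond0" is a placeholder; use the bond name BCM configures on the host.
    BOND = "/proc/net/bonding/bond0"

    slaves, current = [], None
    with open(BOND) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Bonding Mode:"):
                print(line)                                   # e.g. 802.3ad or active-backup
            elif line.startswith("Slave Interface:"):
                current = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and current:
                slaves.append((current, line.split(":", 1)[1].strip()))
                current = None

    for name, status in slaves:
        print(f"slave {name}: {status}")
    if len(slaves) < 2 or any(status != "up" for _, status in slaves):
        print("WARNING: bond is not fully redundant")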

Table 2. BGP protocols#

  • BGP: Used as required for routing between switches.

  • iBGP: Configured between the two SN4600 switches using the MLAG peerlink.4094 interface.

  • eBGP: Configured between the uplink (SN4600) and IPMI (SN2201) switches.
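
To spot-check the BGP sessions from an SN4600 switch, a small Python sketch such as the following can query the FRR routing stack through vtysh and flag any neighbor that is not Established. The JSON field names can vary between FRR releases, so the parsing below is deliberately defensive and should be treated as a starting point rather than a reference implementation.

    import json
    import subprocess

    # Query the FRR routing daemon on the switch for a BGP session summary.
    out = subprocess.run(
        ["vtysh", "-c", "show bgp summary json"],
        capture_output=True, text=True, check=True,
    ).stdout
    summary = json.loads(out)

    def walk_peers(node):
        # Yield (peer, attributes) pairs from any "peers" mapping in the output,
        # regardless of which address family it sits under.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "peers" and isinstance(value, dict):
                    yield from value.items()
                else:
                    yield from walk_peers(value)

    for peer, attrs in walk_peers(summary):
        state = attrs.get("state", "unknown")
        flag = "" if state == "Established" else "  <-- check this session"
        print(f"{peer}: {state}{flag}")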

ipminet#

On the ipminet switches, the gateway for VLAN 111 is configured, and all ports connected to end hosts are configured as access ports in VLAN 111. Each BCM head node requires two interfaces connected to the IPMI switch: the first is for the IPMI interface of the host, and the second gives the host OS direct access to the IPMI subnet. Uplinks are connected to TOR-01 and TOR-02 using unnumbered eBGP, and all connected subnets are redistributed into BGP. The IPMI switches can also be uplinked to a separate management network instead of the TOR switches if required; in that case, the IPMI subnet route must still be advertised to the in-band network so that BCM can control hosts over the IPMI network.
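
A minimal Python sketch like the following can verify BMC reachability over ipminet from a head node using ipmitool. The addresses and credentials shown are placeholders; in a real deployment, BCM holds the BMC credentials it uses for power control.

    import subprocess

    BMC_ADDRESSES = ["10.0.111.11", "10.0.111.12"]   # placeholder ipminet addresses
    BMC_USER = "admin"                               # placeholder credentials
    BMC_PASS = "changeme"

    for bmc in BMC_ADDRESSES:
        # Query chassis power state over the lanplus interface.
        result = subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", bmc, "-U", BMC_USER, "-P", BMC_PASS,
             "chassis", "power", "status"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            print(f"{bmc}: {result.stdout.strip()}")
        else:
            print(f"{bmc}: unreachable ({result.stderr.strip()})")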

ibnet#

For the ibnet, the NICs in DGX physical slots 0 and 2 are connected to the QM8700-1 InfiniBand switch, and the NICs in physical slots 6 and 8 are connected to the QM8700-2 InfiniBand switch. A subnet manager is required to manage the InfiniBand fabric; one of the QM8700 switches must be configured as the subnet manager.
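
To confirm from a DGX node that a subnet manager is active and that the local InfiniBand ports are up, the standard InfiniBand diagnostics can be run as in the minimal Python sketch below. It assumes the infiniband-diags tools (sminfo and ibstatus) provided with the DGX OS are installed.

    import subprocess

    def run(cmd):
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.returncode, (result.stdout.strip() or result.stderr.strip())

    # sminfo queries the active subnet manager through the first active port.
    code, output = run(["sminfo"])
    print("subnet manager:", output if code == 0 else f"not found ({output})")

    # ibstatus summarizes link state and rate for every local InfiniBand port.
    code, output = run(["ibstatus"])
    print(output if code == 0 else f"ibstatus failed: {output}")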

The networking ports and their mapping are described in the Network Ports section of the NVIDIA DGX A100 System User Guide.

Software#

Base Command Manager (BCM) is a key software component of DGX BasePOD. BCM is used to provision the OS on all hosts, deploy K8s, optionally deploy Jupyter, and provide monitoring and visibility of the cluster health.

An instance of BCM runs on a pair of head nodes in a High Availability (HA) configuration and is connected to all other nodes in the DGX BasePOD.

DGX systems within a DGX BasePOD have a DGX OS image installed by BCM. Similarly, the K8s control plane nodes are imaged by BCM with an Ubuntu LTS release matching that of the DGX OS and the head nodes themselves.
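
Once nodes are provisioned, their status can be reviewed from the active head node with the cmsh shell. The minimal Python sketch below wraps one such query; the exact cmsh invocation ("device; status") is an assumption and should be checked against the BCM manual for the installed version.

    import subprocess

    # Assumed invocation: run the "status" command in cmsh device mode and
    # print the resulting node status table.
    result = subprocess.run(
        ["cmsh", "-c", "device; status"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise SystemExit(f"cmsh failed: {result.stderr.strip()}")
    print(result.stdout)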

Kubernetes#

Kubernetes (K8s) is an open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts. With K8s, it is possible to:

  • Scale applications on the fly.

  • Seamlessly update running services.

  • Optimize hardware availability by using only the needed resources.

BCM provides the administrator with the required packages, sets up K8s, and manages and monitors the K8s deployment.
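
As a basic health check after the K8s deployment, the minimal Python sketch below lists the cluster nodes and their Ready condition. It assumes the official kubernetes Python client is installed and that a kubeconfig for the cluster is available to the caller.

    from kubernetes import client, config

    # Load credentials from the caller's kubeconfig (use load_incluster_config()
    # when running inside a pod instead).
    config.load_kube_config()
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        ready = next(
            (cond.status for cond in node.status.conditions if cond.type == "Ready"),
            "Unknown",
        )
        print(f"{node.metadata.name}: Ready={ready}")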

(Optional) Jupyter#

BCM can optionally deploy and manage Jupyter, which consists of four major components and several extensions. The four major components are Jupyter Notebook, JupyterLab, JupyterHub, and Jupyter Enterprise Gateway.

These are the Jupyter extensions that BCM deploys:

  • Template specialization extension—create a custom Jupyter kernel without editing text files.

  • Job management extension—manage jobs from within the Jupyter interface.

  • VNC extension—interact with the X display of the execution server (including the desktop) from within the Jupyter interface.

  • K8s operators—Jupyter kernel, PostgreSQL, and Spark operators.

  • Jupyter dev server—Proxy server that enables developing applications in alternative editors while the computational workload is proxied to the user's Jupyter notebook running on the cluster.

Storage#

An NFS solution is required for a highly available (HA) BCM installation; the required export path is described later in this document. A DGX BasePOD typically also includes dedicated high-performance storage, but its configuration is outside the scope of this document. Contact the vendor of the storage solution being used for instructions on configuring the high-performance storage portion of a DGX BasePOD.
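
As a pre-installation check, a minimal Python sketch like the following can confirm that the NFS server is advertising an export the head nodes can reach. The server address and expected export path are placeholders; substitute the values planned for the deployment.

    import subprocess

    NFS_SERVER = "10.0.122.50"             # placeholder internalnet address of the NFS device
    EXPECTED_EXPORT = "/var/nfs/general"   # placeholder export path

    # showmount (from the NFS client tools) lists the exports advertised by the server.
    result = subprocess.run(
        ["showmount", "-e", NFS_SERVER],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise SystemExit(f"showmount failed: {result.stderr.strip()}")

    print(result.stdout)
    if EXPECTED_EXPORT not in result.stdout:
        print(f"WARNING: expected export {EXPECTED_EXPORT} not advertised")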