Management Servers

For DGX SuperPOD deployments with DGX RUBIN NVL8 systems, two means of cluster access are provided for end users (AI practitioners), illustrated after the list below:

  • Using Slurm workload manager

  • Using Kubernetes and NVIDIA Run:ai
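
As a rough sketch of these two access paths, the snippet below wraps the standard `sbatch` and `kubectl` CLIs from Python. The file names `train.sbatch` and `train-job.yaml` are hypothetical placeholders, and Run:ai specifics (projects, queues) are omitted.

```python
# Minimal sketch of the two user-facing access paths, assuming the standard
# Slurm and Kubernetes CLIs are available on a login node. The file names
# train.sbatch and train-job.yaml are hypothetical placeholders.
import subprocess

def submit_via_slurm(batch_script: str) -> str:
    """Submit a batch job through the Slurm workload manager."""
    result = subprocess.run(
        ["sbatch", batch_script],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()  # e.g. "Submitted batch job 12345"

def submit_via_kubernetes(manifest: str) -> str:
    """Apply a job manifest to the Kubernetes API; Run:ai handles scheduling."""
    result = subprocess.run(
        ["kubectl", "apply", "-f", manifest],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit_via_slurm("train.sbatch"))
    print(submit_via_kubernetes("train-job.yaml"))
```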

To support the operation, monitoring, and installation of DGX SuperPOD, the following control plane nodes are required:

  • Two nodes for Base Command Manager (BCM) in HA

  • Three nodes for Kubernetes services designated for administrator access

  • Three nodes for Kubernetes services designated for non-privileged user access (such as Run:ai)

  • Two nodes used as Slurm login nodes

These nodes are connected as follows:

  • BCM in HA: two nodes, each connected to the in-band and OOB networks; the two nodes are also connected back-to-back with a Precision Time Protocol (PTP) cable for the HA heartbeat.

  • Admin services K8s management servers: three nodes, each connected to the in-band and OOB networks.

  • Slurm login nodes: two nodes, each connected to the in-band and storage networks.

  • Run:ai management servers: three nodes, each connected to the in-band and storage networks.

  • All devices also connect to the OOB network over 1 GbE for IPMI and Redfish (see the sketch after this list).
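
As a rough illustration of OOB management traffic, the sketch below queries a node's power state through the DMTF Redfish REST API. The BMC address and credentials are hypothetical placeholders; only the `/redfish/v1/Systems` collection path is part of the Redfish standard.

```python
# Minimal sketch of querying a node's power state over the OOB network via
# the DMTF Redfish REST API. The BMC address and credentials below are
# hypothetical placeholders; /redfish/v1/Systems is the standard collection.
import requests

BMC = "https://10.0.0.10"      # hypothetical OOB address of a node's BMC
AUTH = ("admin", "password")   # placeholder credentials

def get_power_state(bmc: str) -> str:
    """Return the PowerState of the first system behind the given BMC."""
    session = requests.Session()
    session.auth = AUTH
    session.verify = False  # many BMCs ship with self-signed certificates

    # Walk the standard Systems collection to its first member.
    systems = session.get(f"{bmc}/redfish/v1/Systems").json()
    member = systems["Members"][0]["@odata.id"]

    # Read its PowerState property (e.g. "On" or "Off").
    return session.get(f"{bmc}{member}").json()["PowerState"]

if __name__ == "__main__":
    print(get_power_state(BMC))
```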

The control plane nodes shall have the following minimum configuration (a simple validation sketch follows the list):

  • 2x x86-64 (amd64) CPUs, 32 cores each minimum

  • 1 TB of RAM

  • 2x 7.68 TB NVMe drives for local storage

  • 2x 960 GB SSDs (RAID 1) for the OS

  • 1x dual-port BlueField-3 or BlueField-4 DPU

  • 2x 1 GbE RJ45 ports for PTP or additional low-speed networking connectivity

  • 1x 1 GbE RJ45 BMC port (IPMI/Redfish), supporting remote media
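
A minimal sketch of validating a Linux node against two of the thresholds above (core count and RAM), assuming the check runs locally on the node. Note that `os.cpu_count()` reports logical CPUs, so the core check is a simplification.

```python
# Minimal sketch of checking a Linux node against two of the thresholds
# above (core count and RAM). os.cpu_count() reports logical CPUs, so the
# core check is a simplification; RAM gets ~10% slack for kernel reservations.
import os

MIN_CORES = 2 * 32              # 2 CPUs x 32 cores each
MIN_RAM_BYTES = 0.9 * 1024**4   # nominal 1 TB, minus ~10% slack

def meets_minimum() -> bool:
    cores = os.cpu_count() or 0

    # MemTotal in /proc/meminfo is reported in kB.
    with open("/proc/meminfo") as f:
        mem_kb = next(
            int(line.split()[1]) for line in f if line.startswith("MemTotal:")
        )

    return cores >= MIN_CORES and mem_kb * 1024 >= MIN_RAM_BYTES

if __name__ == "__main__":
    print("minimum configuration met:", meets_minimum())
```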