Management Servers#
For DGX SuperPOD deployments with DGX Rubin NVL8 systems, two means of cluster access are provided for end users (AI practitioners):
Using Slurm workload manager
Using Kubernetes and NVIDIA Run:ai
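The two access paths above can be sketched as follows. This is a hypothetical illustration only: the partition name, container image, job name, and CLI flags are placeholders and vary by site configuration and Run:ai CLI version.

```shell
# Path 1 - Slurm: request an interactive single-node session with 8 GPUs
# (partition name "batch" is a placeholder)
srun --partition=batch --gres=gpu:8 --pty bash

# Path 2 - Kubernetes / NVIDIA Run:ai: submit a GPU workload and inspect it
# (job name, image, and flags are placeholders; syntax varies by CLI version)
runai submit my-job --image nvcr.io/nvidia/pytorch:25.01-py3 --gpu 8
kubectl get pods
```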
To support the operation, monitoring, and installation of the DGX SuperPOD, the following set of control plane nodes is required:
Two nodes for NVIDIA Base Command Manager (BCM) in a high-availability (HA) configuration
Three nodes for Kubernetes services designated for administrator access
Three nodes for Kubernetes services designated for non-privileged user access (such as Run:ai)
Two Slurm login nodes
These nodes are connected as follows:
BCM in HA: two nodes, connect to the in-band and out-of-band (OOB) networks; the pair is also connected back-to-back with a Precision Time Protocol (PTP) cable for the HA heartbeat.
Admin services Kubernetes management servers: three nodes, connect to the in-band and OOB networks.
Slurm login nodes: two nodes, connect to the in-band and storage networks.
Run:ai management servers: three nodes, connect to the in-band and storage networks.
In addition, all nodes connect to the OOB network over 1 GbE for IPMI and Redfish management.
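Both OOB management interfaces mentioned above can be exercised from the command line. The sketch below is hypothetical: the BMC address and credentials are placeholders, not values from this reference architecture.

```shell
# IPMI: query a node's power state over the OOB network
# (address and credentials are placeholders)
ipmitool -I lanplus -H 10.0.0.10 -U admin -P '<password>' chassis power status

# Redfish: read the standard Systems collection from the same BMC over HTTPS
curl -sk -u admin:'<password>' https://10.0.0.10/redfish/v1/Systems/
```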
Each control plane node must meet the following minimum configuration requirements:
2x x86-64 (amd64) CPUs, each with a minimum of 32 cores
1 TB of RAM
2x 7.68 TB NVMe drives for local storage
2x 960 GB SSDs (RAID 1) for the OS
1x dual-port BlueField-3 or BlueField-4 DPU
2x 1 GbE RJ45 ports for PTP or additional low-speed network connectivity
1x 1 GbE RJ45 BMC port for IPMI/Redfish, supporting remote media
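A minimal sketch for checking a provisioned node against the minimums above, assuming a standard Linux environment; the expected values in the comments restate the requirements list, and exact output formats vary by distribution.

```shell
# CPU: expect 2 sockets with at least 32 cores per socket
lscpu | grep -E '^(Socket|Core)'

# RAM: expect roughly 1024 GB (1 TB)
free -g | awk '/^Mem:/ {print $2}'

# Local storage: expect 2x 7.68 TB NVMe data drives
nvme list

# OS drives: expect a RAID 1 array across the 2x 960 GB SSDs
cat /proc/mdstat
```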