Management Servers

For DGX SuperPOD deployments with DGX RUBIN NVL8 systems, two means of cluster access are provided for end users (AI practitioners), illustrated after the list below:

  • Using Slurm workload manager

  • Using Kubernetes and NVIDIA Run:ai
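
As a rough sketch of these two access paths, the snippet below wraps the standard `sbatch` and `kubectl` CLIs from Python. The file names `train.sbatch` and `train-job.yaml` are hypothetical placeholders, and Run:ai specifics (projects, queues) are omitted.

```python
# Minimal sketch of the two user-facing access paths, assuming the standard
# Slurm and Kubernetes CLIs are available on a login node. The file names
# train.sbatch and train-job.yaml are hypothetical placeholders.
import subprocess

def submit_via_slurm(batch_script: str) -> str:
    """Submit a batch job through the Slurm workload manager."""
    result = subprocess.run(
        ["sbatch", batch_script],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()  # e.g. "Submitted batch job 12345"

def submit_via_kubernetes(manifest: str) -> str:
    """Apply a job manifest to the Kubernetes API; Run:ai handles scheduling."""
    result = subprocess.run(
        ["kubectl", "apply", "-f", manifest],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit_via_slurm("train.sbatch"))
    print(submit_via_kubernetes("train-job.yaml"))
```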

To support the operation, monitoring, and installation of DGX SuperPOD, the following control plane nodes are required:

  • Two nodes for Base Command Manager (BCM) in HA

  • Three nodes for Kubernetes services designated for administrator access

  • Three nodes for Kubernetes services designated for non-privileged user access (such as Run:ai)

  • Two nodes used as Slurm login nodes

These nodes are connected as follows:

  • BCM in HA: two nodes, each connected to the in-band and OOB networks; the two nodes are also connected back-to-back with a Precision Time Protocol (PTP) cable for the HA heartbeat.

  • Admin services K8s management servers: three nodes, each connected to the in-band and OOB networks.

  • Slurm login nodes: two nodes, each connected to the in-band and storage networks.

  • Run:ai management servers: three nodes, each connected to the in-band and storage networks.

  • All devices also connect to the OOB network over 1 GbE for IPMI and Redfish (see the sketch after this list).
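
As a rough illustration of OOB management traffic, the sketch below queries a node's power state through the DMTF Redfish REST API. The BMC address and credentials are hypothetical placeholders; only the `/redfish/v1/Systems` collection path is part of the Redfish standard.

```python
# Minimal sketch of querying a node's power state over the OOB network via
# the DMTF Redfish REST API. The BMC address and credentials below are
# hypothetical placeholders; /redfish/v1/Systems is the standard collection.
import requests

BMC = "https://10.0.0.10"      # hypothetical OOB address of a node's BMC
AUTH = ("admin", "password")   # placeholder credentials

def get_power_state(bmc: str) -> str:
    """Return the PowerState of the first system behind the given BMC."""
    session = requests.Session()
    session.auth = AUTH
    session.verify = False  # many BMCs ship with self-signed certificates

    # Walk the standard Systems collection to its first member.
    systems = session.get(f"{bmc}/redfish/v1/Systems").json()
    member = systems["Members"][0]["@odata.id"]

    # Read its PowerState property (e.g. "On" or "Off").
    return session.get(f"{bmc}{member}").json()["PowerState"]

if __name__ == "__main__":
    print(get_power_state(BMC))
```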

The control plane nodes shall have the following minimum configuration (a simple validation sketch follows the list):

  • 2x x86-64 (amd64) CPUs, 32 cores each minimum

  • 1 TB of RAM

  • 2x 7.68 TB NVMe drives for local storage

  • 2x 960 GB SSDs (RAID 1) for the OS

  • 1x dual-port BlueField-3 or BlueField-4 DPU

  • 2x 1 GbE RJ45 ports for PTP or additional low-speed networking connectivity

  • 1x 1 GbE RJ45 BMC port (IPMI/Redfish), supporting remote media
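
A minimal sketch of validating a Linux node against two of the thresholds above (core count and RAM), assuming the check runs locally on the node. Note that `os.cpu_count()` reports logical CPUs, so the core check is a simplification.

```python
# Minimal sketch of checking a Linux node against two of the thresholds
# above (core count and RAM). os.cpu_count() reports logical CPUs, so the
# core check is a simplification; RAM gets ~10% slack for kernel reservations.
import os

MIN_CORES = 2 * 32              # 2 CPUs x 32 cores each
MIN_RAM_BYTES = 0.9 * 1024**4   # nominal 1 TB, minus ~10% slack

def meets_minimum() -> bool:
    cores = os.cpu_count() or 0

    # MemTotal in /proc/meminfo is reported in kB.
    with open("/proc/meminfo") as f:
        mem_kb = next(
            int(line.split()[1]) for line in f if line.startswith("MemTotal:")
        )

    return cores >= MIN_CORES and mem_kb * 1024 >= MIN_RAM_BYTES

if __name__ == "__main__":
    print("minimum configuration met:", meets_minimum())
```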