Management Servers

DGX SuperPOD with B300 provides two means of cluster access for end users (AI practitioners):

  • With SLURM – best fit for pretraining-type workflows, with bare-metal or containerized access to the compute infrastructure.

  • With Kubernetes and NVIDIA Run:AI.

To support the operations, monitoring, and installation of the DGX SuperPOD, a set of management servers is required.

Management Server Quantities and Connectivity

These nodes are used for the following purposes:

  • Base Command Manager in High Availability (HA): 2 nodes, connected to Inband and OOB

  • K8s Management Server: 3 nodes, connected to Inband and storage

  • SLURM Login Nodes: 2 nodes, connected to Inband and storage

  • In addition, all devices connect to OOB over 1GbE for IPMI/Redfish.
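The quantities and connectivity listed above can be captured in a small inventory structure, which makes it easy to sanity-check node and port counts during planning. The following is a minimal sketch only: the class name, field names, and network labels ("inband", "oob", "storage") are illustrative assumptions and do not correspond to any Base Command Manager schema. It also assumes that both the SLURM and Kubernetes access paths are deployed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagementRole:
    """One management server role (hypothetical structure for planning)."""
    name: str
    nodes: int
    networks: frozenset  # networks explicitly listed for this role


# Quantities and networks as listed in this section.
ROLES = [
    ManagementRole("Base Command Manager (HA)", 2, frozenset({"inband", "oob"})),
    ManagementRole("K8s Management Server", 3, frozenset({"inband", "storage"})),
    ManagementRole("SLURM Login Node", 2, frozenset({"inband", "storage"})),
]


def effective_networks(role):
    """All devices also connect to OOB over 1GbE for IPMI/Redfish."""
    return role.networks | {"oob"}


def total_nodes(roles):
    """Total number of management servers across all roles."""
    return sum(role.nodes for role in roles)


def nodes_on(roles, network):
    """Number of management servers attached to the given network."""
    return sum(r.nodes for r in roles if network in effective_networks(r))


if __name__ == "__main__":
    print("total management nodes:", total_nodes(ROLES))
    for net in ("inband", "oob", "storage"):
        print(f"nodes on {net}:", nodes_on(ROLES, net))
```

Under these assumptions, the sketch counts seven management servers in total, all of them on the Inband and OOB networks and five of them on the storage network.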

Table 6 Control Plane Node Form Factor Requirement

System        Power Supply, Form Factor
B300 PDU      AC, EIA
B300 Busbar   DC, MGX