Hardware Overview
The DGX BasePOD deployment in this example consists of compute nodes, five control plane servers (two for cluster management and three Kubernetes (K8s) control plane nodes), and the associated storage and networking infrastructure.
An overview of the hardware is in Table 1. Details about the hardware that can be used and how it should be cabled are given in the NVIDIA DGX BasePOD Reference Architecture.
This deployment guide describes the steps necessary for configuring and testing a four-node DGX BasePOD after the physical installation has taken place. Deployments of other sizes, or in different customer environments, will need minor adjustments to specific configurations, but the overall procedure described in this document should be largely applicable to any DGX deployment.
Table 1. DGX BasePOD components
Component | Technology |
---|---|
Compute nodes | DGX H200/H100 system |
Compute fabric | NVIDIA Quantum QM9700 InfiniBand switches |
Management fabric | NVIDIA SN4600C switches |
Storage fabric | Option 1: NVIDIA SN4600C switches for Ethernet-attached storage; Option 2: NVIDIA Quantum QM9700 switches for InfiniBand-attached storage |
Out-of-band management fabric | NVIDIA SN2201 switches |
Control plane and workload management nodes | Minimum requirements (each server): 64-bit x86 processor, AMD EPYC 7272 or equivalent; 256 GB memory; 1 TB SSD; two 100 Gbps network ports |
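The per-server minimums in Table 1 can be checked against a hardware inventory before installation begins. The following is a minimal sketch, not part of the NVIDIA tooling: the `ServerSpec` class, the `check_minimums` function, and the host names are hypothetical, and the sketch assumes the inventory values have already been collected by hand or by another tool. The "AMD EPYC 7272 or equivalent" CPU requirement is a judgment call, so only the architecture is tested here.

```python
from dataclasses import dataclass
from typing import List

# Minimums taken from Table 1 (per control plane / workload management server).
MIN_MEMORY_GB = 256
MIN_SSD_TB = 1.0
MIN_100G_PORTS = 2
REQUIRED_ARCH = "x86_64"  # assumption: "64-bit x86" is reported as x86_64


@dataclass
class ServerSpec:
    """Hypothetical inventory record for one control plane server."""
    name: str
    arch: str
    memory_gb: int
    ssd_tb: float
    ports_100g: int


def check_minimums(spec: ServerSpec) -> List[str]:
    """Return a list of shortfalls; an empty list means the spec passes."""
    problems = []
    if spec.arch != REQUIRED_ARCH:
        problems.append(f"{spec.name}: architecture is {spec.arch}, expected {REQUIRED_ARCH}")
    if spec.memory_gb < MIN_MEMORY_GB:
        problems.append(f"{spec.name}: {spec.memory_gb} GB memory, need {MIN_MEMORY_GB} GB")
    if spec.ssd_tb < MIN_SSD_TB:
        problems.append(f"{spec.name}: {spec.ssd_tb} TB SSD, need {MIN_SSD_TB} TB")
    if spec.ports_100g < MIN_100G_PORTS:
        problems.append(f"{spec.name}: {spec.ports_100g} x 100 Gbps ports, need {MIN_100G_PORTS}")
    return problems


if __name__ == "__main__":
    # Example inventory with made-up host names and values.
    inventory = [
        ServerSpec("head-01", "x86_64", 512, 1.92, 2),
        ServerSpec("k8s-cp-01", "x86_64", 128, 1.0, 1),
    ]
    for server in inventory:
        issues = check_minimums(server)
        if issues:
            for issue in issues:
                print(issue)
        else:
            print(f"{server.name}: meets the Table 1 minimums")
```

Returning a list of shortfalls rather than a single pass/fail flag lets all five control plane servers be reviewed in one pass, with every gap reported at once.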