Reference Architectures
DGX BasePOD is a flexible solution that offers multiple prescriptive architectures. These architectures are adaptable to support the evolving demands of AI workloads.
DGX BasePOD with NDR200 Compute Fabric
The components of the DGX BasePOD are described in Table 5.
| Component | Technology |
| --- | --- |
| Compute nodes (2-16) | NVIDIA DGX B200 system with eight 180 GB B200 GPUs and NDR200 InfiniBand networking, or NVIDIA DGX H100 system with eight 80 GB H100 GPUs and NDR200 InfiniBand networking, or NVIDIA DGX H200 system with eight 141 GB H200 GPUs and NDR200 InfiniBand networking |
| Compute fabric | NVIDIA Quantum QM9700 NDR 400 Gbps InfiniBand switch |
| Management and storage fabric | NVIDIA SN4600C switches |
| OOB management fabric | NVIDIA SN2201 switches |
| Control plane | See Control Plane |
System Architecture
Figure 20 depicts the architecture for the DGX BasePOD for up to 16 DGX nodes with NDR InfiniBand. BasePOD configurations with DGX B200, H200, and H100 systems use eight compute connections from each node, each running at NDR200. The complete architecture has three networks: an InfiniBand-based compute network, an Ethernet fabric for system management and storage, and an OOB management network.
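As a rough illustration of what eight NDR200 connections per node imply, the sketch below multiplies out the nominal link rate. It assumes 200 Gb/s per NDR200 link and reports line rate only, not measured throughput; the constant and function names are hypothetical and do not belong to any NVIDIA tool.

```python
# Illustrative arithmetic only. Names are hypothetical.
NDR200_GBPS = 200      # assumed nominal line rate of one NDR200 link, in Gb/s
LINKS_PER_NODE = 8     # compute connections per DGX node, as stated in the text


def compute_fabric_bandwidth_gbps(num_nodes: int) -> int:
    """Nominal aggregate bandwidth from all DGX nodes into the compute fabric."""
    # The component table lists 2-16 compute nodes for this architecture.
    if not 2 <= num_nodes <= 16:
        raise ValueError("this reference architecture covers 2 to 16 DGX nodes")
    return num_nodes * LINKS_PER_NODE * NDR200_GBPS


per_node_gbps = LINKS_PER_NODE * NDR200_GBPS
print(per_node_gbps)                      # 1600
print(compute_fabric_bandwidth_gbps(16))  # 25600
```

These figures are upper bounds on per-direction injection bandwidth; achievable application bandwidth depends on the workload and fabric configuration.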
Included in the reference architecture are five dual-socket x86 servers for system management. Two nodes serve as the head nodes for Base Command Manager. The three additional nodes provide the platform to house specific services for the deployment, such as login nodes for a Slurm-based deployment or Kubernetes for MLOps-based partner solutions. Any OEM server that meets the minimum requirements for each node described in Table 5 can be used. All management servers are configured in a high-availability (HA) pair (or triple), so the failure of a single node does not cause an outage of the BasePOD service.
Switches and Cables
Table 6 shows the number of cables and switches required for various deployments of DGX BasePOD. These designs are built with active optical cables (AOC) or direct-attach copper (DAC) cables. Alternatively, DGX BasePOD may be deployed with transceivers and fiber cables.
| Components | Part Number | 4 DGX Systems | 8 DGX Systems | 16 DGX Systems |
| --- | --- | --- | --- | --- |
| QM9700 InfiniBand switches | QM9700 | 2 | 2 | 2 |
| NDR200 MPO InfiniBand cable from DGX H200 and H100 systems to leaf switch | MFP7E20-N0xx | 16 | 32 | 64 |
| Dual Port twin-OSFP transceiver for DGX H200 and H100 system | MMA4Z00-NS-FLT | 16 | 32 | 64 |
| Dual Port OSFP transceiver for switch | MMA4Z00-NS | 8 | 16 | 32 |
| NDR InfiniBand DAC from leaf to leaf | MCP4Y10-Nxxx | 4 | 8 | 16 |
| SN2201 switches | MSN2201-CB2FC | 1 | 2 | 2 |
| SN4600C switches | 920-9N302-00FA-0C0 | 2 | 2 | 2 |
| 1 GbE Cat 6 cables | No specific requirement | 29 | 45 | 77 |
| 200 GbE AOC for DGX H200 and H100 systems | MFS1S00-HxxxV | 8 | 16 | 32 |
| 200 GbE DAC for ISL | MCP1650-VxxxE26 | 2 | 2 | 2 |
| 100 GbE cables OOB to in-band | MFA1A00-Cxxx | 2 | 4 | 4 |
| 100 GbE AOC for management servers | MFA1A00-Cxxx | 10 | 10 | 10 |
| BCM management servers | Varies | 5 | 5 | 5 |
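For planning scripts, the Table 6 quantities can be held in a simple lookup structure. The sketch below transcribes the table and adds a helper that reports which components scale at a fixed count per DGX system across the three listed sizes. The dictionary layout and function names are hypothetical, and the per-system ratios are only observations about the three tabulated deployments, not a sizing rule for other node counts.

```python
# Quantities transcribed from Table 6. Structure and names are hypothetical.
DGX_COUNTS = (4, 8, 16)

# component -> (part number, quantities for 4, 8, and 16 DGX systems)
TABLE_6 = {
    "QM9700 InfiniBand switches": ("QM9700", (2, 2, 2)),
    "NDR200 MPO InfiniBand cable, DGX to leaf switch": ("MFP7E20-N0xx", (16, 32, 64)),
    "Dual Port twin-OSFP transceiver for DGX system": ("MMA4Z00-NS-FLT", (16, 32, 64)),
    "Dual Port OSFP transceiver for switch": ("MMA4Z00-NS", (8, 16, 32)),
    "NDR InfiniBand DAC from leaf to leaf": ("MCP4Y10-Nxxx", (4, 8, 16)),
    "SN2201 switches": ("MSN2201-CB2FC", (1, 2, 2)),
    "SN4600C switches": ("920-9N302-00FA-0C0", (2, 2, 2)),
    "1 GbE Cat 6 cables": ("No specific requirement", (29, 45, 77)),
    "200 GbE AOC for DGX systems": ("MFS1S00-HxxxV", (8, 16, 32)),
    "200 GbE DAC for ISL": ("MCP1650-VxxxE26", (2, 2, 2)),
    "100 GbE cables OOB to in-band": ("MFA1A00-Cxxx", (2, 4, 4)),
    "100 GbE AOC for management servers": ("MFA1A00-Cxxx", (10, 10, 10)),
    "BCM management servers": ("Varies", (5, 5, 5)),
}


def quantities_for(num_dgx: int) -> dict:
    """Return {component: quantity} for one of the tabulated deployment sizes."""
    if num_dgx not in DGX_COUNTS:
        raise ValueError(f"Table 6 only lists deployments of {DGX_COUNTS} DGX systems")
    column = DGX_COUNTS.index(num_dgx)
    return {name: qty[column] for name, (_, qty) in TABLE_6.items()}


def per_system_quantity(component: str):
    """Quantity per DGX system if constant across all three columns, else None."""
    _, quantities = TABLE_6[component]
    ratios = {qty / n for qty, n in zip(quantities, DGX_COUNTS)}
    return ratios.pop() if len(ratios) == 1 else None


for name in TABLE_6:
    ratio = per_system_quantity(name)
    if ratio is not None:
        print(f"{name}: {ratio:g} per DGX system")
```

Components whose helper result is `None`, such as the switches and the Cat 6 cables, do not scale at a fixed per-system count in the table and should be read directly from the matching column.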