Reference Architectures#
DGX BasePOD is a flexible solution that offers multiple prescriptive architectures. These architectures are adaptable to support the evolving demands of AI workloads.
DGX BasePOD with NDR400 Compute Fabric#
The components of the DGX BasePOD are described in Table 1.
| Component | Technology |
|---|---|
| Compute nodes (2-8) | NVIDIA DGX B200 system with eight 180 GB B200 GPUs, NDR400 InfiniBand networking, and two NVIDIA BlueField-3 DPUs; or NVIDIA DGX H100 system with eight 80 GB H100 GPUs, NDR400 InfiniBand networking, and two NVIDIA ConnectX-7 NICs; or NVIDIA DGX H200 system with eight 141 GB H200 GPUs, NDR400 InfiniBand networking, and two NVIDIA ConnectX-7 NICs |
| Compute fabric | NVIDIA Quantum QM9700 NDR400 Gbps InfiniBand switch |
| Management and storage fabric | NVIDIA SN4600C switches |
| OOB management fabric | NVIDIA SN2201 switches |
| Control plane | See Control Plane |
System Architecture#
Figure 13 depicts the architecture of the DGX BasePOD for up to eight DGX nodes with NDR InfiniBand. BasePOD deployments with DGX B200, H200, or H100 systems use eight compute connections from each node, each running at NDR400. The complete architecture comprises three networks: an InfiniBand-based compute network, an Ethernet fabric for system management and storage, and an OOB management network.
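As a back-of-the-envelope illustration of the compute fabric implied by these numbers (assuming every link runs at the full NDR400 line rate and ignoring encoding and protocol overhead), the following sketch tallies per-node and aggregate link bandwidth:

```python
# Back-of-the-envelope compute-fabric bandwidth for a DGX BasePOD.
# Illustrative only; assumes every compute link runs at the full NDR400 rate.

LINKS_PER_NODE = 8     # eight NDR400 compute connections per DGX system
LINK_RATE_GBPS = 400   # NDR400 InfiniBand, per direction
MAX_NODES = 8          # largest BasePOD configuration in this reference architecture

per_node_gbps = LINKS_PER_NODE * LINK_RATE_GBPS
pod_gbps = per_node_gbps * MAX_NODES

print(f"Per-node compute-fabric bandwidth: {per_node_gbps} Gb/s "
      f"({per_node_gbps / 1000:.1f} Tb/s)")
print(f"Aggregate for {MAX_NODES} nodes: {pod_gbps} Gb/s "
      f"({pod_gbps / 1000:.1f} Tb/s)")
```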
Included in the reference architecture are five dual-socket x86 servers for system management. Two nodes serve as the head nodes for Base Command Manager. The three additional nodes provide the platform for deployment-specific services, such as login nodes for a Slurm-based deployment or Kubernetes nodes for MLOps-based partner solutions. Any OEM server that meets the minimum requirements for each node described in Table 5 can be used. All management servers are configured in a high-availability (HA) pair (or triple), so the failure of a single node does not cause an outage of the BasePOD service.
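A minimal sketch of this layout, using hypothetical host names and role labels that are not prescribed by the reference architecture or by Base Command Manager, illustrates why the loss of any single management server still leaves every role served:

```python
# Illustrative layout of the five management servers described above.
# Host names and role labels are hypothetical, not BCM or Slurm configuration.
MGMT_SERVERS = {
    "head-01": "bcm-head",   # Base Command Manager head node (HA pair)
    "head-02": "bcm-head",
    "svc-01": "service",     # e.g. Slurm login node or Kubernetes node
    "svc-02": "service",
    "svc-03": "service",
}

def survives_single_failure(servers: dict[str, str]) -> bool:
    """Check that every role is still served if any one server fails."""
    roles = set(servers.values())
    for failed in servers:
        remaining = {role for name, role in servers.items() if name != failed}
        if roles - remaining:
            return False
    return True

assert survives_single_failure(MGMT_SERVERS)
```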
Switches and Cables#
Table 2 shows the number of cables and switches required for various deployments of DGX BasePOD. These designs are built with active optical cables (AOC) or direct-attach copper (DAC) cables. Alternatively, DGX BasePOD can be deployed with transceivers and fiber cables.
| Components | Part Number [1] | 4 DGX systems | 8 DGX systems |
|---|---|---|---|
| NVIDIA Quantum QM9700 switch | 920-9B210-00FN-0M0 | 2 | 2 |
| NDR Fiber Cables, 400 Gbps, DGX to IB Switches | 980-9I570-00N030 | 16 | 32 |
| System 2x400G OSFP Flat-top Multimode Transceivers on DGX Systems | 980-9I51A-00NS00 | 16 | 32 |
| Switch 2x400G OSFP Finned-top Multimode Transceivers | 980-9I510-00NS00 | 8 | 16 |
| NDR InfiniBand DAC for Switch ISL | 980-9IA0J-00N002 | 16 | 32 |
| NVIDIA SN2201 switch with Cumulus Linux, 48 RJ45 ports, P2C | 920-9N110-00F1-0C0 | 1 | 2 |
| NVIDIA SN4600C switch with Cumulus Linux, 64 QSFP28 ports, P2C | 920-9N302-00F7-0C2 | 2 | 2 |
| 1 GbE Cat 6 Cables | N/A | 29 | 45 |
| NVIDIA active fiber cable, ETH 100GbE, 100Gb/s, QSFP, LSZH, 30 m, DGX to Inband | 980-9I13N-00C030 | 8 | 16 |
| 100 Gbps QSFP Passive Cable for Inband Switch ISL | 980-9I54C-00V001 | 2 | 2 |
| NVIDIA active fiber cable, ETH 100GbE, 100Gb/s, QSFP, LSZH, 10 m, OOB to Inband | 980-9I13N-00C010 | 2 | 4 |
| BCM management servers | Varies | 5 | 5 |
| NVIDIA active fiber cable, ETH 100GbE, 100Gb/s, QSFP, LSZH, 10 m, Management Servers to Inband | 980-9I13N-00C010 | 10 | 10 |
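For planning purposes, the quantities in Table 2 can be transcribed directly into code. The sketch below simply encodes the table above for the two documented deployment sizes; the shortened part names are illustrative, and no counts are derived for node counts other than 4 and 8:

```python
# Bill of materials from Table 2, keyed by deployment size (number of DGX systems).
# Quantities are transcribed from the table; only the 4- and 8-node configurations
# are defined in this reference architecture.
BOM = {
    "QM9700 compute switch":              {4: 2,  8: 2},
    "NDR fiber cable, DGX to IB switch":  {4: 16, 8: 32},
    "2x400G OSFP flat-top transceiver":   {4: 16, 8: 32},
    "2x400G OSFP finned-top transceiver": {4: 8,  8: 16},
    "NDR DAC, switch ISL":                {4: 16, 8: 32},
    "SN2201 OOB switch":                  {4: 1,  8: 2},
    "SN4600C in-band switch":             {4: 2,  8: 2},
    "1 GbE Cat 6 cable":                  {4: 29, 8: 45},
    "100GbE AOC, 30 m, DGX to in-band":   {4: 8,  8: 16},
    "100G QSFP passive cable, ISL":       {4: 2,  8: 2},
    "100GbE AOC, 10 m, OOB to in-band":   {4: 2,  8: 4},
    "BCM management server":              {4: 5,  8: 5},
    "100GbE AOC, 10 m, mgmt to in-band":  {4: 10, 8: 10},
}

def bom_for(num_dgx: int) -> dict[str, int]:
    """Return the component quantities for a documented deployment size."""
    if num_dgx not in (4, 8):
        raise ValueError("Table 2 only defines 4- and 8-node deployments")
    return {part: counts[num_dgx] for part, counts in BOM.items()}

for part, qty in bom_for(8).items():
    print(f"{qty:>3}  {part}")
```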
Footnotes