Reference Architectures#

DGX BasePOD is a flexible solution that offers multiple prescriptive architectures. These architectures are adaptable to support the evolving demands of AI workloads.

DGX BasePOD with NDR400 Compute Fabric#

The components of the DGX BasePOD are described in Table 1.

Table 1. DGX BasePOD Components#

| Component | Technology |
|---|---|
| Compute nodes (2-8) | NVIDIA DGX B200 system with eight 180 GB B200 GPUs, NDR400 InfiniBand networking, and two NVIDIA BlueField-3 DPUs; or NVIDIA DGX H100 system with eight 80 GB H100 GPUs, NDR400 InfiniBand networking, and two NVIDIA ConnectX-7 NICs; or NVIDIA DGX H200 system with eight 141 GB H200 GPUs, NDR400 InfiniBand networking, and two NVIDIA ConnectX-7 NICs |
| Compute fabric | NVIDIA Quantum QM9700 NDR 400 Gbps InfiniBand switch |
| Management and storage fabric | NVIDIA SN4600C switches |
| OOB management fabric | NVIDIA SN2201 switches |
| Control plane | See Control Plane |

System Architecture#

Figure 13 depicts the DGX BasePOD architecture for up to eight DGX nodes with NDR InfiniBand. BasePOD configurations with DGX B200, H200, or H100 systems use eight compute connections from each node, each running at NDR400. The complete architecture comprises three networks: an InfiniBand-based compute network, an Ethernet fabric for system management and storage, and an OOB management network.

Figure 13. DGX BasePOD with up to eight systems with NDR400#

Included in the reference architecture are five dual-socket x86 servers for system management. Two nodes serve as the head nodes for Base Command Manager. The three additional nodes provide the platform for deployment-specific services, such as login nodes for a Slurm-based deployment or Kubernetes nodes for MLOps-based partner solutions. Any OEM server that meets the minimum requirements for each node described in Table 5 can be used. All management servers are configured for high availability (HA) in pairs (or triples), so the failure of a single node will not cause an outage of the BasePOD service.
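For a rough sense of scale, the sketch below estimates the aggregate injection bandwidth of the compute fabric as the node count grows. It is illustrative only; the connection count and NDR400 line rate are taken from the description above, and the function and constant names are arbitrary.

```python
# Illustrative sketch: estimate aggregate compute-fabric injection bandwidth
# for a DGX BasePOD, assuming eight NDR400 (400 Gbps) compute connections per
# DGX node as described above. Not an official sizing tool.

NDR400_GBPS = 400          # line rate per compute connection
CONNECTIONS_PER_NODE = 8   # compute connections per DGX system

def compute_fabric_bandwidth_tbps(num_nodes: int) -> float:
    """Return the total injection bandwidth of the compute fabric in Tbps."""
    return num_nodes * CONNECTIONS_PER_NODE * NDR400_GBPS / 1000

for nodes in (2, 4, 8):
    print(f"{nodes} DGX nodes: {compute_fabric_bandwidth_tbps(nodes):.1f} Tbps")
```

For example, an eight-node BasePOD built this way provides 25.6 Tbps of injection bandwidth into the compute fabric.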

Switches and Cables#

Table 2 lists the switches and cables required for four-node and eight-node deployments of DGX BasePOD. These designs are built with active optical cables (AOCs) or direct attach copper (DAC) cables; alternatively, DGX BasePOD can be deployed with transceivers and fiber cables.

Table 2. Switches and Cables#

| Components | Part Number [1] | 4 DGX systems | 8 DGX systems |
|---|---|---|---|
| NVIDIA Quantum QM9700 switch | 920-9B210-00FN-0M0 | 2 | 2 |
| NDR Fiber Cables, 400 Gbps, DGX to IB Switches | 980-9I570-00N030 | 16 | 32 |
| System 2x400G OSFP Flat-top Multimode Transceivers on DGX Systems | 980-9I51A-00NS00 | 16 | 32 |
| Switch 2x400G OSFP Finned-top Multimode Transceivers | 980-9I510-00NS00 | 8 | 16 |
| NDR InfiniBand DAC for Switch ISL | 980-9IA0J-00N002 | 16 | 32 |
| NVIDIA SN2201 switch with Cumulus Linux, 48 RJ45 ports, P2C | 920-9N110-00F1-0C0 | 1 | 2 |
| NVIDIA SN4600C switch with Cumulus Linux, 64 QSFP28 ports, P2C | 920-9N302-00F7-0C2 | 2 | 2 |
| 1 GbE Cat 6 Cables | N/A | 29 | 45 |
| NVIDIA active fiber cable, ETH 100GbE, 100Gb/s, QSFP, LSZH, 30m, DGX to Inband | 980-9I13N-00C030 | 8 | 16 |
| 100 Gbps QSFP Passive Cable for Inband Switch ISL | 980-9I54C-00V001 | 2 | 2 |
| NVIDIA active fiber cable, ETH 100GbE, 100Gb/s, QSFP, LSZH, 10m, OOB to Inband | 980-9I13N-00C010 | 2 | 4 |
| BCM management servers | Varies | 5 | 5 |
| NVIDIA active fiber cable, ETH 100GbE, 100Gb/s, QSFP, LSZH, 10m, Management Servers to Inband | 980-9I13N-00C010 | 10 | 10 |
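For planning purposes, the same information can be treated as a small bill-of-materials lookup. The sketch below is a minimal illustration; the component names are abbreviated, the quantities are copied from the 4-node and 8-node columns of Table 2, and counts for other deployment sizes are deliberately not inferred.

```python
# Illustrative sketch: Table 2 expressed as a bill-of-materials lookup.
# Quantities are copied from the 4-node and 8-node columns above; nothing
# is extrapolated to other deployment sizes.

TABLE_2_BOM = {
    # component (abbreviated): {number of DGX systems: quantity}
    "NVIDIA Quantum QM9700 switch":                      {4: 2,  8: 2},
    "NDR fiber cables, DGX to IB switches":              {4: 16, 8: 32},
    "System 2x400G OSFP flat-top transceivers":          {4: 16, 8: 32},
    "Switch 2x400G OSFP finned-top transceivers":        {4: 8,  8: 16},
    "NDR InfiniBand DAC for switch ISL":                 {4: 16, 8: 32},
    "NVIDIA SN2201 switch":                              {4: 1,  8: 2},
    "NVIDIA SN4600C switch":                             {4: 2,  8: 2},
    "1 GbE Cat 6 cables":                                {4: 29, 8: 45},
    "100 GbE AOC, 30 m, DGX to in-band":                 {4: 8,  8: 16},
    "100 GbE passive cable, in-band switch ISL":         {4: 2,  8: 2},
    "100 GbE AOC, 10 m, OOB to in-band":                 {4: 2,  8: 4},
    "BCM management servers":                            {4: 5,  8: 5},
    "100 GbE AOC, 10 m, management servers to in-band":  {4: 10, 8: 10},
}

def bill_of_materials(num_dgx: int) -> dict:
    """Return component -> quantity for a supported deployment size (4 or 8)."""
    if num_dgx not in (4, 8):
        raise ValueError("Table 2 covers 4-node and 8-node deployments only")
    return {component: counts[num_dgx] for component, counts in TABLE_2_BOM.items()}

for component, qty in bill_of_materials(8).items():
    print(f"{qty:>3}  {component}")
```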

Footnotes