Reference Architectures

DGX BasePOD is a flexible solution that offers multiple prescriptive architectures. These architectures are adaptable to support the evolving demands of AI workloads.

DGX BasePOD with NDR200 Compute Fabric

The components of the DGX BasePOD are described in Table 5.

Table 5. DGX BasePOD Components

| Component | Technology |
|---|---|
| Compute nodes (2-16) | NVIDIA DGX B200 system with eight 180 GB B200 GPUs and NDR200 InfiniBand networking, or NVIDIA DGX H100 system with eight 80 GB H100 GPUs and NDR200 InfiniBand networking, or NVIDIA DGX H200 system with eight 141 GB H200 GPUs and NDR200 InfiniBand networking |
| Compute fabric | NVIDIA Quantum QM9700 NDR 400 Gbps InfiniBand switch |
| Management and storage fabric | NVIDIA SN4600C switches |
| OOB management fabric | NVIDIA SN2201 switches |
| Control plane | See Control Plane |

System Architecture

Figure 20 depicts the architecture of DGX BasePOD for up to 16 DGX nodes with NDR InfiniBand. BasePOD with DGX B200, H200, or H100 systems uses eight compute connections from each node, each running at NDR200. The complete architecture has three networks: an InfiniBand-based compute network, an Ethernet fabric for system management and storage, and an OOB management network.

_images/image16.png

Figure 20. DGX BasePOD with up to 16 systems with NDR200
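The per-node connection count above translates directly into link counts and nominal bandwidth for the compute fabric. The following is a minimal sketch of that arithmetic, assuming the nominal NDR200 line rate of 200 Gb/s per link; the function and constant names are hypothetical and are not part of any NVIDIA tool.

```python
# Illustrative arithmetic only; all names here are hypothetical.
NDR200_GBPS = 200      # nominal line rate of one NDR200 link, in Gb/s (assumption)
LINKS_PER_NODE = 8     # compute connections per DGX system, as stated in the text


def compute_fabric_summary(num_nodes: int) -> dict:
    """Return link count and nominal bandwidth for a BasePOD compute fabric."""
    # Table 5 lists 2 to 16 compute nodes for this architecture.
    if not 2 <= num_nodes <= 16:
        raise ValueError("this architecture covers 2 to 16 DGX systems")
    per_node_gbps = LINKS_PER_NODE * NDR200_GBPS
    return {
        "nodes": num_nodes,
        "compute_links": num_nodes * LINKS_PER_NODE,
        "per_node_gbps": per_node_gbps,
        "aggregate_gbps": num_nodes * per_node_gbps,
    }


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(compute_fabric_summary(n))
```

These figures are nominal line rates, not measured throughput; achievable bandwidth depends on the workload and fabric configuration.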

Included in the reference architecture are five dual-socket x86 servers for system management. Two nodes are used as the head nodes for Base Command Manager. The three additional nodes provide the platform to house specific services for the deployment. These could be login nodes for a Slurm-based deployment, or Kubernetes for MLOps-based partner solutions. Any OEM server that meets the minimum requirements for each node described in Table 5 can be used. All management servers are configured in a high-availability (HA) pair (or triple), so the failure of a single node does not cause an outage of the BasePOD service.

Switches and Cables

Table 6 shows the number of cables and switches required for various deployments of DGX BasePOD. These designs are built with active optical cables (AOC) or direct-attach copper (DAC) cables. Alternatively, DGX BasePOD may be deployed with transceivers and fiber cables.

Table 6. Switches and Cables

| Components | Part Number | 4 DGX Systems | 8 DGX Systems | 16 DGX Systems |
|---|---|---|---|---|
| QM9700 InfiniBand switches | QM9700 | 2 | 2 | 2 |
| NDR200 MPO InfiniBand cable from DGX H200 and H100 systems to leaf switch | MFP7E20-N0xx | 16 | 32 | 64 |
| Dual Port twin-OSFP transceiver for DGX H200 and H100 system | MMA4Z00-NS-FLT | 16 | 32 | 64 |
| Dual Port OSFP transceiver for switch | MMA4Z00-NS | 8 | 16 | 32 |
| NDR InfiniBand DAC from leaf to leaf | MCP4Y10-Nxxx | 4 | 8 | 16 |
| SN2201 switches | MSN2201-CB2FC | 1 | 2 | 2 |
| SN4600C switches | 920-9N302-00FA-0C0 | 2 | 2 | 2 |
| 1 GbE Cat 6 cables | No specific requirement | 29 | 45 | 77 |
| 200 GbE AOC for DGX H200 and H100 systems | MFS1S00-HxxxV | 8 | 16 | 32 |
| 200 GbE DAC for ISL | MCP1650-VxxxE26 | 2 | 2 | 2 |
| 100 GbE cables OOB to in-band | MFA1A00-Cxxx | 2 | 4 | 4 |
| BCM management servers | Varies | 5 | 5 | 5 |
| 100 GbE AOC for management servers | MFA1A00-Cxxx | 10 | 10 | 10 |
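For planning purposes, the quantities in Table 6 can be held in a small lookup structure. The sketch below transcribes the table and adds two hypothetical helpers: one returns the bill of materials for a listed deployment size, and the other identifies which items scale in direct proportion to the number of DGX systems. The item names are shortened for readability, and the helper names are illustrative assumptions, not part of any NVIDIA tooling.

```python
# Quantities transcribed from Table 6, in column order: 4, 8, and 16 DGX systems.
SIZES = (4, 8, 16)

TABLE_6 = {
    "QM9700 InfiniBand switches": (2, 2, 2),
    "NDR200 MPO InfiniBand cable (DGX to leaf)": (16, 32, 64),
    "Dual-port twin-OSFP transceiver (DGX)": (16, 32, 64),
    "Dual-port OSFP transceiver (switch)": (8, 16, 32),
    "NDR InfiniBand DAC (leaf to leaf)": (4, 8, 16),
    "SN2201 switches": (1, 2, 2),
    "SN4600C switches": (2, 2, 2),
    "1 GbE Cat 6 cables": (29, 45, 77),
    "200 GbE AOC (DGX)": (8, 16, 32),
    "200 GbE DAC (ISL)": (2, 2, 2),
    "100 GbE cables (OOB to in-band)": (2, 4, 4),
    "BCM management servers": (5, 5, 5),
    "100 GbE AOC (management servers)": (10, 10, 10),
}


def bill_of_materials(num_systems: int) -> dict:
    """Return the Table 6 quantities for a listed deployment size."""
    if num_systems not in SIZES:
        # Table 6 only lists these sizes; other sizes are not interpolated here.
        raise ValueError(f"Table 6 lists only {SIZES} DGX systems")
    column = SIZES.index(num_systems)
    return {item: qty[column] for item, qty in TABLE_6.items()}


def per_system_items() -> dict:
    """Return items whose quantity is a constant multiple of the system count."""
    scaling = {}
    for item, qty in TABLE_6.items():
        ratios = {q / n for q, n in zip(qty, SIZES)}
        if len(ratios) == 1:  # same ratio at every listed size
            scaling[item] = ratios.pop()
    return scaling


if __name__ == "__main__":
    print(bill_of_materials(8))
    print(per_system_items())
```

Items that do not appear in the result of `per_system_items()` are either fixed across the listed sizes, such as the switch counts and management servers, or grow in a way the table does not reduce to a single per-system ratio, such as the 1 GbE cables.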