Core Components#
The compute nodes with HCAs and switch resources form the foundation of the DGX BasePOD. The specific components used in the DGX BasePOD Reference Architectures are described in this section.
NVIDIA DGX Systems#
NVIDIA DGX BasePOD configurations use DGX B200, DGX H200, and DGX H100 systems. The systems are described in the following sections.
NVIDIA DGX B200 System#
The NVIDIA DGX B200 system (Figure 4) offers unprecedented compute density, performance, and flexibility.
Key specifications of the DGX B200 system are:
Eight NVIDIA B200 GPUs
1.4 TB of GPU memory space (see the arithmetic sketch after this list)
4x OSFP ports serving 8x single-port NVIDIA ConnectX-7 VPI, up to 400 Gb/s InfiniBand/Ethernet
2x dual-port QSFP112 NVIDIA BlueField-3 DPUs, up to 400 Gb/s InfiniBand/Ethernet
Dual 5th generation Intel® Xeon® Scalable Processors
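As a quick sanity check on these figures, the short Python sketch below derives the approximate per-GPU memory and the peak aggregate compute-fabric bandwidth from the values listed above. Both inputs come from this list; treating all eight ConnectX-7 ports as running at 400 Gb/s simultaneously is an illustrative assumption.

```python
# Illustrative arithmetic only; input values are taken from the DGX B200
# specifications listed above.
NUM_GPUS = 8               # eight NVIDIA B200 GPUs per system
TOTAL_GPU_MEMORY_TB = 1.4  # total GPU memory space (rounded figure from the list)
COMPUTE_LINKS = 8          # 8x single-port ConnectX-7 served by 4x OSFP ports
LINK_RATE_GBPS = 400       # up to 400 Gb/s per link (InfiniBand/Ethernet)

per_gpu_memory_gb = TOTAL_GPU_MEMORY_TB * 1000 / NUM_GPUS
aggregate_fabric_gbps = COMPUTE_LINKS * LINK_RATE_GBPS

print(f"~{per_gpu_memory_gb:.0f} GB of GPU memory per GPU (from the rounded 1.4 TB total)")
print(f"{aggregate_fabric_gbps} Gb/s peak compute-fabric bandwidth per system")
```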
The rear ports of the DGX B200 CPU tray are shown in Figure 5.
Four of the OSFP ports serve the eight ConnectX-7 HCAs used for the compute fabric. The pair of dual-port BlueField-3 adapters (in NIC mode) provides parallel pathways to the storage and management fabrics. The out-of-band (OOB) port is used for BMC access.
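A minimal sketch of this per-system port plan is shown below. The adapter and port counts match the description above, but the exact split of BlueField-3 ports between the storage and in-band management fabrics is an illustrative assumption rather than a prescribed cabling, and the port names are hypothetical.

```python
# Hypothetical per-system port plan for a DGX B200, assuming each dual-port
# BlueField-3 contributes one port to the storage fabric and one to the
# in-band management fabric (adapter/port counts match the text above;
# the exact split is an assumption).
port_plan = {
    "compute":    [f"cx7-{i}" for i in range(8)],  # 8x single-port ConnectX-7
    "storage":    ["bf3-0-p0", "bf3-1-p0"],        # one port from each BlueField-3
    "management": ["bf3-0-p1", "bf3-1-p1"],        # second port from each BlueField-3
    "oob":        ["bmc"],                          # dedicated OOB/BMC port
}

# Each non-compute fabric should have parallel pathways, i.e. ports drawn
# from two different BlueField-3 adapters.
for fabric in ("storage", "management"):
    adapters = {port.split("-p")[0] for port in port_plan[fabric]}
    assert len(adapters) == 2, f"{fabric} fabric lacks redundant pathways"

print("storage and management fabrics each use both BlueField-3 adapters")
```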
NVIDIA DGX H200 and H100 Systems#
The DGX H200 system (Figure 6) is an AI powerhouse accelerated by the groundbreaking performance of the NVIDIA Hopper GPU.
Key specifications of the DGX H200 and H100 systems are:
Eight NVIDIA Hopper GPUs.
1,128 GB total GPU memory for H200 (per-GPU arithmetic for both models is sketched after this list).
640 GB total GPU memory for H100.
Four NVIDIA NVSwitch™ chips.
Dual Intel® Xeon® Platinum 8480C processors, 112 cores total, 2.00 GHz (Base), 3.80 GHz (Max Boost) with PCIe 5.0 support.
2 TB of DDR5 system memory.
Four OSFP ports serving eight single-port NVIDIA ConnectX-7 VPI and two dual-port QSFP112 NVIDIA ConnectX-7 VPI, up to 400 Gb/s InfiniBand/Ethernet.
10 Gb/s onboard NIC with RJ45, 100 Gb/s Ethernet NIC, BMC with RJ45.
Two 1.92 TB M.2 NVMe drives for DGX OS, eight 3.84 TB U.2 NVMe drives for storage/cache.
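The per-GPU memory implied by these totals can be checked with the small sketch below; the GPU count and memory totals come directly from the list above.

```python
# Per-GPU memory derived from the totals listed above (8 Hopper GPUs per system).
NUM_GPUS = 8
totals_gb = {"DGX H200": 1128, "DGX H100": 640}

for system, total_gb in totals_gb.items():
    print(f"{system}: {total_gb} GB total -> {total_gb / NUM_GPUS:.0f} GB per GPU")
# DGX H200: 1128 GB total -> 141 GB per GPU
# DGX H100: 640 GB total -> 80 GB per GPU
```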
The rear ports of the DGX H200 and H100 CPU tray are shown in Figure 7.
Four of the OSFP ports serve eight ConnectX-7 HCAs for the compute fabric. Each pair of dual-port ConnectX-7 HCAs provide parallel pathways to the storage and management fabrics. The OOB port is used for BMC access.
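As with the DGX B200, the per-system fabric attachment can be summarized in a short sketch. The link counts below follow the port descriptions above (eight compute links, four storage/management links, one OOB link); the onboard Ethernet NICs are left out, and the grouping itself is only illustrative.

```python
# Per-system fabric links for a DGX H200/H100, counting only the ports
# discussed above (onboard NICs excluded); grouping is illustrative.
links_per_system = {
    "compute (8x single-port ConnectX-7 via OSFP)": 8,
    "storage + in-band mgmt (2x dual-port ConnectX-7)": 4,
    "OOB management (BMC)": 1,
}

for fabric, count in links_per_system.items():
    print(f"{fabric}: {count} link(s)")
print(f"total fabric links per system: {sum(links_per_system.values())}")
```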
NVIDIA Networking Adapters#
NVIDIA DGX B200, DGX H200, and DGX H100 systems are equipped with NVIDIA® ConnectX®-7 network adapters; the DGX B200 also includes NVIDIA BlueField-3 network adapters. The network adapters are described in this section.
Note
Going forward, HCA will refer to network adapter cards configured for InfiniBand, and NIC to those configured for Ethernet.
NVIDIA ConnectX-7#
The ConnectX-7 VPI adapter (Figure 8) is the latest generation of the ConnectX adapter line and can provide 25/50/100/200/400 Gb/s of throughput. NVIDIA DGX systems use ConnectX-7 and BlueField-3 (NIC mode) adapters to provide flexibility in DGX BasePOD deployments with NDR400 and RoCE. Specifications are available here.
NVIDIA Networking Switches#
DGX BasePOD configurations can be equipped with four types of NVIDIA networking switches. The switches are described in this section; how they are deployed is covered in the Reference Architectures section.
NVIDIA QM9700 Switch#
NVIDIA QM9700 switches (Figure 10) with NDR InfiniBand connectivity power the compute fabric in NDR BasePOD configurations. ConnectX-7 single-port adapters are used for the InfiniBand compute fabric. Each NVIDIA DGX system has dual connections to each QM9700 switch, providing multiple high-bandwidth, low-latency paths between the systems.
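This connection rule lends itself to a quick port-count sketch. Only the two-links-per-system-per-switch rule comes from the text; the 64 NDR ports per QM9700 and the example system counts are assumptions for illustration.

```python
# Rough compute-fabric port budget, assuming 2 links from every DGX system to
# every QM9700 switch (per the text) and 64 NDR ports per QM9700 (assumed).
QM9700_PORTS = 64
LINKS_PER_SYSTEM_PER_SWITCH = 2

def ports_used_per_switch(num_dgx_systems: int) -> int:
    """Compute-fabric ports consumed on each QM9700 by the DGX systems."""
    return num_dgx_systems * LINKS_PER_SYSTEM_PER_SWITCH

for num_systems in (4, 8, 16):  # hypothetical BasePOD sizes
    used = ports_used_per_switch(num_systems)
    print(f"{num_systems} DGX systems -> {used}/{QM9700_PORTS} ports per QM9700 switch")
```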
NVIDIA SN4600C Switch#
NVIDIA SN4600C switches (Figure 11) offer 128 total ports (64 per switch) to provide redundant connectivity for in-band management of the DGX BasePOD. The NVIDIA SN4600C switch supports speeds from 1 GbE to 100 GbE.
The SN4600C switches are also used for storage appliances connected over Ethernet. The ports on the NVIDIA DGX dual-port network adapters are used for both in-band management and storage connectivity.
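A rough port budget for the SN4600C pair can be sketched as follows. Only the 64-ports-per-switch figure comes from the text above; the per-device counts (two DGX ports per switch from the dual-port adapters, plus hypothetical control-plane, storage, and uplink ports) are illustrative assumptions.

```python
# Hypothetical in-band port budget per SN4600C switch (64 ports each, per the
# text). Per-device port counts below are illustrative assumptions.
SN4600C_PORTS = 64

ports_per_switch = {
    "DGX systems":        8 * 2,  # e.g. 8 systems, 2 dual-port-adapter ports each per switch
    "control plane":      5 * 1,  # e.g. 5 control-plane servers, 1 port each per switch
    "storage appliances": 4,      # hypothetical Ethernet-attached storage ports
    "uplinks/other":      4,      # hypothetical uplinks to the data-center network
}

used = sum(ports_per_switch.values())
print(f"{used}/{SN4600C_PORTS} ports used per SN4600C switch "
      f"({SN4600C_PORTS - used} free)")
```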
NVIDIA SN2201 Switch#
NVIDIA SN2201 switches (Figure 12) offer 48 ports to provide connectivity for OOB management. OOB management provides consolidated management connectivity for all components of the DGX BasePOD.
Control Plane#
The minimum requirements for each server in the control plane are as follows (see the sketch after this list):
2 × Intel x86 Xeon Gold or better
512 GB memory
1 × 6.4 TB NVMe for storage
2 × 480 GB M.2 RAID for OS
4 × 200 Gbps network
2 × 100 GbE network
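For automated inventory checks, the minimums above can be captured in a small structure like the sketch below; the field names and the check itself are illustrative and not part of any NVIDIA tooling.

```python
from dataclasses import dataclass

@dataclass
class ControlPlaneMinimums:
    """Minimum per-server requirements from the list above (illustrative field names)."""
    cpus: int = 2                  # 2 x Intel x86 Xeon Gold or better
    memory_gb: int = 512           # 512 GB memory
    nvme_storage_tb: float = 6.4   # 1 x 6.4 TB NVMe for storage
    m2_raid_gb: int = 480          # 2 x 480 GB M.2 RAID for OS
    fabric_ports_200g: int = 4     # 4 x 200 Gbps network
    eth_ports_100g: int = 2        # 2 x 100 GbE network

def meets_minimums(candidate: dict,
                   minimums: ControlPlaneMinimums = ControlPlaneMinimums()) -> bool:
    """Return True if every field of the candidate meets or exceeds the minimum."""
    return all(candidate.get(field, 0) >= value
               for field, value in vars(minimums).items())

# Example: a hypothetical server that meets the minimums.
print(meets_minimums({"cpus": 2, "memory_gb": 1024, "nvme_storage_tb": 7.68,
                      "m2_raid_gb": 480, "fabric_ports_200g": 4,
                      "eth_ports_100g": 2}))  # True
```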