Compute Node Hardware#

The Software Reference Architecture is composed of individually optimized NVIDIA-Certified System servers that follow a prescriptive design pattern to ensure optimal performance when deployed in a cluster environment. Enterprise RAs are currently designed for three types of server configurations: PCIe Optimized 2-4-3, PCIe Optimized 2-8-5, and HGX systems. For the PCIe Optimized configurations (for example, 2-8-5), the digits refer, in order, to the number of CPU sockets, the number of GPUs, and the number of network adapters. For further details on Enterprise RA designs, refer to the NVIDIA Enterprise Reference Architecture Overview Whitepaper.

H200 NVL Systems#

The NVIDIA AI Enterprise RA leverages an H200 NVL PCIe Optimized 2-8-5 reference configuration. Additional detailed reference configurations for 2-4-3 and 2-8-9 HGX systems are available upon request. Other system types with L40S, L40, L4, or H100/H200 GPUs can also be used and will have different configurations.

Note

The NVIDIA H200 NVL includes an NVIDIA AI Enterprise License, which can be activated through NGC. Not all supported GPUs include an NVIDIA AI Enterprise License.

Figure: PCIe Optimized 2-8-5 configuration with NVIDIA H200 NVL (_images/computer-node-hardware-01.png).

Components for the H200 NVL NVIDIA-Certified system are listed below; each parameter is followed by its system configuration.

GPU configuration

GPUs are balanced across CPU sockets and root ports.

  • Inference servers: 2x, 4x, or 8x GPUs per server

  • Training and DL servers: Minimum 8 GPUs per server

See the topology diagram above for details.

NVLink Interconnect

H200 NVL supports NVL4 and NVL2 bridges. Pairing GPU cards under the same CPU socket is best; pairing GPU cards under different CPU sockets is acceptable but not recommended. See the topology diagram above for NVLink bridging recommendations; a programmatic check is sketched at the end of this section.

CPU

Intel Emerald Rapids, Intel Sapphire Rapids, Intel Granite Rapids, or Intel Sierra Forest; AMD Genoa or AMD Turin

CPU sockets

Two CPU sockets minimum

CPU speed

2.0 GHz minimum CPU clock

CPU cores

Minimum 7 physical CPU cores per GPU

  • For configurations using MIG, 2 CPU cores are required per MIG instance

  • For the OS kernel or virtualization, an additional two cores are required per GPU

A worked sizing example is sketched at the end of this section.

System memory (total across all CPU sockets)

Minimum 128 GB of system memory per GPU

DPU

One NVIDIA® BlueField®-3 DPU per server

PCI Express

One Gen5 x16 link for at most two GPUs; one Gen5 x16 link per GPU is recommended.

PCIe topology

Balanced PCIe topology with GPUs spread evenly across CPU sockets and PCIe root ports. NICs and NVMe drives should be under the same PCIe switch or PCIe root complex as the GPUs. Note that a PCIe switch may not be needed for low-cost inference servers; direct attach to the CPU is best if possible. See the topology diagram above for details.

PCIe switches

Gen5 PCIe switches as needed (where additional link fanout is not required, direct attach is best).

Compute (E-W) NIC

Four NVIDIA® BlueField®-3 SuperNICs per server, each up to 400 Gbps

Local storage

Local storage recommendations are as follows:

  • Inference servers: Minimum 1 TB NVMe drive per CPU socket.

  • Training / DL servers: Minimum 2 TB NVMe drive per CPU socket.

  • HPC servers: Minimum 1 TB NVMe drive per CPU socket.

Remote systems management

SMBPBI over SMBus (OOB) protocol to the BMC; PLDM T5-enabled; SPDM-enabled

Security

TPM 2.0 module (secure boot)
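
A node can be sanity-checked against the GPU count, device memory, and NVLink expectations above from Python. The following is a minimal sketch using the nvidia-ml-py (pynvml) bindings; it assumes that package and the NVIDIA driver are installed, and the expected GPU count of eight is taken from the 2-8-5 configuration rather than queried from the system.

    import pynvml

    EXPECTED_GPUS = 8  # PCIe Optimized 2-8-5: eight H200 NVL GPUs per server

    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        print(f"GPUs found: {count} (expected {EXPECTED_GPUS})")

        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU {i}: {name}, {mem.total / 2**30:.0f} GiB device memory")

            # Count active NVLink links (NVL4/NVL2 bridges); links that are
            # not present or not supported raise an NVML error.
            active = 0
            for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
                try:
                    state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
                    if state == pynvml.NVML_FEATURE_ENABLED:
                        active += 1
                except pynvml.NVMLError:
                    break
            print(f"  active NVLink links: {active}")
    finally:
        pynvml.nvmlShutdown()

The same information is available from nvidia-smi (for example, nvidia-smi topo -m shows the PCIe and NVLink topology), so a script like this is only a convenience for automated checks.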
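
The per-GPU CPU core and system memory rules in the table translate directly into per-server minimums. The sketch below is one reading of those rules; the eight-GPU and four-MIG-instance values in the example call are illustrative assumptions, not requirements.

    def min_cpu_cores(num_gpus: int, mig_instances: int = 0) -> int:
        """Minimum physical CPU cores per server, per the table above."""
        per_gpu = 7       # minimum 7 physical cores per GPU
        os_overhead = 2   # additional 2 cores per GPU for the OS kernel or virtualization
        per_mig = 2       # 2 cores per MIG instance when MIG is configured
        return num_gpus * (per_gpu + os_overhead) + mig_instances * per_mig

    def min_system_memory_gb(num_gpus: int) -> int:
        """Minimum total system memory in GB across all CPU sockets."""
        return num_gpus * 128  # minimum 128 GB of system memory per GPU

    # Hypothetical 2-8-5 node running four MIG instances:
    # 8 * (7 + 2) + 4 * 2 = 80 cores, 8 * 128 = 1024 GB.
    print(min_cpu_cores(8, mig_instances=4), min_system_memory_gb(8))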