Compute Node Hardware#

The Software Reference Architecture is composed of individually optimized NVIDIA-Certified System servers that follow a prescriptive design pattern to ensure optimal performance when deployed in a cluster environment. Enterprise RAs are currently designed for three types of server configurations: PCIe Optimized 2-4-3, PCIe Optimized 2-8-5, and HGX systems. For the PCIe Optimized configurations (for example, 2-8-5), the digits refer, in order, to the number of CPU sockets, the number of GPUs, and the number of network adapters. For further details on Enterprise RA designs, refer to the NVIDIA Enterprise Reference Architecture Overview Whitepaper.

H200 NVL Systems#

The NVIDIA AI Enterprise RA leverages an H200 NVL PCIe Optimized 2-8-5 reference configuration. Additional detailed reference configurations for 2-4-3 and 2-8-9 HGX systems are available upon request. Other system types with L40S, L40, L4, or H100/H200 GPUs can also be used and will have different configurations.

Note

The NVIDIA H200 NVL includes an NVIDIA AI Enterprise License, which can be activated through NGC. Not all supported GPUs include an NVIDIA AI Enterprise License.

Figure: PCIe Optimized 2-8-5 configuration with NVIDIA H200 NVL (_images/computer-node-hardware-01.png).

Components for the H200 NVL NVIDIA-Certified system are listed below; each parameter is followed by its system configuration.

GPU configuration

GPUs are balanced across CPU sockets and root ports.

  • Inference servers: 2x, 4x, or 8x GPUs per server

  • Training and DL servers: Minimum 8 GPUs per server

See the topology diagram above for details.

NVLink Interconnect

H200 NVL supports NVL4 and NVL2 bridges. Pairing GPU cards under the same CPU socket is best; pairing GPU cards under different CPU sockets is acceptable but not recommended. See the topology diagram above for NVLink bridging recommendations; a programmatic check is sketched at the end of this section.

CPU

Intel Emerald Rapids, Intel Sapphire Rapids, Intel Granite Rapids, or Intel Sierra Forest; AMD Genoa or AMD Turin

CPU sockets

Two CPU sockets minimum

CPU speed

2.0 GHz minimum CPU clock

CPU cores

Minimum 7 physical CPU cores per GPU

  • For configurations using MIG, 2 CPU cores are required per MIG instance

  • For the OS kernel or virtualization, an additional two cores are required per GPU

A worked sizing example is sketched at the end of this section.

System memory (total across all CPU sockets)

Minimum 128 GB of system memory per GPU

DPU

One NVIDIA® BlueField®-3 DPU per server

PCI Express

One Gen5 x16 link for at most two GPUs; one Gen5 x16 link per GPU is recommended.

PCIe topology

Balanced PCIe topology with GPUs spread evenly across CPU sockets and PCIe root ports. NICs and NVMe drives should be under the same PCIe switch or PCIe root complex as the GPUs. Note that a PCIe switch may not be needed for low-cost inference servers; direct attach to the CPU is best if possible. See the topology diagram above for details.

PCIe switches

Gen5 PCIe switches as needed (where additional link fanout is not required, direct attach is best).

Compute (E-W) NIC

Four NVIDIA® BlueField®-3 SuperNICs per server, each up to 400 Gbps

Local storage

Local storage recommendations are as follows:

  • Inference servers: Minimum 1 TB NVMe drive per CPU socket.

  • Training / DL servers: Minimum 2 TB NVMe drive per CPU socket.

  • HPC servers: Minimum 1 TB NVMe drive per CPU socket.

Remote systems management

SMBPBI over SMBus (OOB) protocol to the BMC; PLDM T5-enabled; SPDM-enabled

Security

TPM 2.0 module (secure boot)
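
A node can be sanity-checked against the GPU count, device memory, and NVLink expectations above from Python. The following is a minimal sketch using the nvidia-ml-py (pynvml) bindings; it assumes that package and the NVIDIA driver are installed, and the expected GPU count of eight is taken from the 2-8-5 configuration rather than queried from the system.

    import pynvml

    EXPECTED_GPUS = 8  # PCIe Optimized 2-8-5: eight H200 NVL GPUs per server

    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        print(f"GPUs found: {count} (expected {EXPECTED_GPUS})")

        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU {i}: {name}, {mem.total / 2**30:.0f} GiB device memory")

            # Count active NVLink links (NVL4/NVL2 bridges); links that are
            # not present or not supported raise an NVML error.
            active = 0
            for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
                try:
                    state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
                    if state == pynvml.NVML_FEATURE_ENABLED:
                        active += 1
                except pynvml.NVMLError:
                    break
            print(f"  active NVLink links: {active}")
    finally:
        pynvml.nvmlShutdown()

The same information is available from nvidia-smi (for example, nvidia-smi topo -m shows the PCIe and NVLink topology), so a script like this is only a convenience for automated checks.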
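
The per-GPU CPU core and system memory rules in the table translate directly into per-server minimums. The sketch below is one reading of those rules; the eight-GPU and four-MIG-instance values in the example call are illustrative assumptions, not requirements.

    def min_cpu_cores(num_gpus: int, mig_instances: int = 0) -> int:
        """Minimum physical CPU cores per server, per the table above."""
        per_gpu = 7       # minimum 7 physical cores per GPU
        os_overhead = 2   # additional 2 cores per GPU for the OS kernel or virtualization
        per_mig = 2       # 2 cores per MIG instance when MIG is configured
        return num_gpus * (per_gpu + os_overhead) + mig_instances * per_mig

    def min_system_memory_gb(num_gpus: int) -> int:
        """Minimum total system memory in GB across all CPU sockets."""
        return num_gpus * 128  # minimum 128 GB of system memory per GPU

    # Hypothetical 2-8-5 node running four MIG instances:
    # 8 * (7 + 2) + 4 * 2 = 80 cores, 8 * 128 = 1024 GB.
    print(min_cpu_cores(8, mig_instances=4), min_system_memory_gb(8))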