Compute Node Hardware
The Software Reference Architecture comprises individually optimized NVIDIA-Certified Systems servers that follow a prescriptive design pattern to ensure optimal performance when deployed in a cluster environment. Enterprise RAs are currently designed for three types of server configurations: PCIe Optimized 2-4-3, PCIe Optimized 2-8-5, and HGX systems. For the PCIe Optimized configurations (for example, 2-8-5), the three digits refer to the number of CPU sockets, the number of GPUs, and the number of network adapters, respectively. For further details on Enterprise RA designs, refer to the NVIDIA Enterprise Reference Architecture Overview Whitepaper.
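As a quick illustration of that naming scheme, the hypothetical helper below splits a PCIe Optimized designation into its three counts. The function name and return format are our own for illustration and are not part of any NVIDIA tooling.

```python
# Minimal sketch: decode a PCIe Optimized designation such as "2-8-5"
# into its component counts (CPU sockets, GPUs, network adapters).
def decode_ra_designation(designation: str) -> dict:
    sockets, gpus, nics = (int(part) for part in designation.split("-"))
    return {"cpu_sockets": sockets, "gpus": gpus, "network_adapters": nics}

print(decode_ra_designation("2-8-5"))
# {'cpu_sockets': 2, 'gpus': 8, 'network_adapters': 5}
```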
H200 NVL Systems
The NVIDIA AI Enterprise RA leverages an H200 NVL PCIe Optimized 2-8-5 reference configuration. Additional detailed reference configurations for 2-4-3 PCIe Optimized and 2-8-9 HGX systems can be made available upon request. Other system types with L40S, L40, L4, H100, or H200 GPUs can also be used and will have different configurations.
Note
The NVIDIA H200 NVL includes an NVIDIA AI Enterprise License, which can be activated through NGC. Not all supported GPUs include an NVIDIA AI Enterprise License.
Diagram of PCIe Optimized 2-8-5 configuration with NVIDIA H200 NVL.
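On a deployed system, the actual GPU and NIC placement can be compared against this diagram using the standard `nvidia-smi topo -m` utility, which prints the interconnect matrix between devices. The short Python wrapper below is just one illustrative way to capture that output.

```python
# Sketch: dump the GPU/NIC topology matrix so it can be checked against
# the balanced 2-8-5 layout shown in the diagram above.
# "nvidia-smi topo -m" is a standard NVIDIA driver utility; the wrapper
# itself is illustrative.
import subprocess

result = subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # rows/columns show PCIe and NVLink paths between devices
```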
Components for the H200 NVL NVIDIA-Certified system are listed in the table below.
| Parameter | System Configuration |
|---|---|
| GPU configuration | GPUs are balanced across CPU sockets and root ports. See the topology diagram above for details. |
| NVLink Interconnect | H200 NVL supports NVL4 and NVL2 bridges. Pairing GPU cards under the same CPU socket is best; pairing GPU cards under different CPU sockets is acceptable but not recommended. See the topology diagram above for NVLink bridging recommendations. |
| CPU | Intel Emerald Rapids, Sapphire Rapids, Granite Rapids, or Sierra Forest; AMD Genoa or Turin |
| CPU sockets | Two CPU sockets minimum |
| CPU speed | 2.0 GHz minimum CPU clock |
| CPU cores | Minimum 7 physical CPU cores per GPU |
| System memory (total across all CPU sockets) | Minimum 128 GB of system memory per GPU |
| DPU | One NVIDIA® BlueField®-3 DPU per server |
| PCI Express | One Gen5 x16 link per two GPUs at most; one Gen5 x16 link per GPU is recommended |
| PCIe topology | Balanced PCIe topology with GPUs spread evenly across CPU sockets and PCIe root ports. NICs and NVMe drives should be under the same PCIe switch or PCIe root complex as the GPUs. A PCIe switch may not be needed for low-cost inference servers; direct attach to the CPU is best where possible. See the topology diagram above for details. |
| PCIe switches | Gen5 PCIe switches as needed (where additional link fanout is not required, direct attach is best) |
| Compute (E-W) NIC | Four NVIDIA® BlueField®-3 SuperNICs per server, up to 400 Gbps each |
| Local storage | Local storage recommendations are as follows: |
| Remote systems management | SMBPBI over SMBus (OOB) protocol to BMC; PLDM T5-enabled; SPDM-enabled |
| Security | TPM 2.0 module (secure boot) |
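The per-GPU minimums in the table lend themselves to a quick automated sanity check. The sketch below validates a candidate configuration against the socket, clock, core, and memory requirements above; the `ServerConfig` fields and function name are illustrative assumptions, not NVIDIA tooling.

```python
# Minimal sketch: check a candidate server configuration against the
# per-GPU minimums from the table above (two CPU sockets, 2.0 GHz clock,
# 7 physical cores per GPU, 128 GB of system memory per GPU).
from dataclasses import dataclass

@dataclass
class ServerConfig:
    cpu_sockets: int
    cpu_base_ghz: float
    physical_cores: int     # total across all sockets
    system_memory_gb: int   # total across all sockets
    gpus: int

def meets_minimums(cfg: ServerConfig) -> list[str]:
    """Return a list of violated requirements (empty means compliant)."""
    problems = []
    if cfg.cpu_sockets < 2:
        problems.append("at least two CPU sockets required")
    if cfg.cpu_base_ghz < 2.0:
        problems.append("minimum 2.0 GHz CPU clock")
    if cfg.physical_cores < 7 * cfg.gpus:
        problems.append("minimum 7 physical CPU cores per GPU")
    if cfg.system_memory_gb < 128 * cfg.gpus:
        problems.append("minimum 128 GB of system memory per GPU")
    return problems

# Example: a 2-8-5 style system with eight H200 NVL GPUs.
cfg = ServerConfig(cpu_sockets=2, cpu_base_ghz=2.4,
                   physical_cores=96, system_memory_gb=2048, gpus=8)
print(meets_minimums(cfg) or "configuration meets the table minimums")
```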