This sizing guide is intended for customers who want to deploy NVIDIA AI Enterprise on NVIDIA-Certified Systems at scale. For NVIDIA AI Enterprise, a cluster of at least four NVIDIA-Certified Systems nodes is recommended. Four nodes is the minimum viable cluster size because it offers a balanced combination of NVIDIA GPUs and NVIDIA ConnectX-6 networking for a variety of workloads. The cluster can be expanded with additional nodes as needed.
This guide covers general rack-level configuration and sizing for power, networking, and storage. These topics focus on NVIDIA-Certified Systems specifications at three configuration levels: Entry, Mainstream, and Best Performance. Each level builds on the previous one, with server configurations improving from Entry to Best Performance.
The benchmarks used in this sizing guide are not all-encompassing; they provide a representative workflow and a starting point that you can build upon for your environment. The rack density and power analysis in this guide focuses primarily on NVIDIA AI Enterprise use cases. Sizing concentrates on a multi-node training workload because this use case fully saturates GPU resources, maximizes power draw, and demonstrates linear scale-out performance. Each configuration (Entry, Mainstream, and Best Performance) uses a four-node cluster to run the following deep learning training workflow:
TensorFlow ResNet-50 v1.5 training using Horovod, FP16, batch size 512
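Linear scale-out is usually judged by scaling efficiency: the measured multi-node throughput divided by the single-node throughput multiplied by the node count. The sketch below computes that ratio. The function name is hypothetical, and the throughput figures in the usage block are illustrative placeholders, not results from this guide; substitute the images-per-second values from your own benchmark runs.

```python
def scaling_efficiency(single_node_ips: float, multi_node_ips: float, nodes: int) -> float:
    """Return measured multi-node throughput as a fraction of perfect linear scaling.

    1.0 means the cluster delivers exactly `nodes` times the single-node
    images/sec; values below 1.0 indicate communication or I/O overhead.
    """
    if nodes < 1 or single_node_ips <= 0:
        raise ValueError("nodes must be >= 1 and single_node_ips must be positive")
    ideal_ips = single_node_ips * nodes  # throughput if scaling were perfectly linear
    return multi_node_ips / ideal_ips


if __name__ == "__main__":
    # Hypothetical measurements for illustration only (not results from this guide).
    eff = scaling_efficiency(single_node_ips=1000.0, multi_node_ips=3900.0, nodes=4)
    print(f"scaling efficiency: {eff:.1%}")
```

A value close to 1.0 for the four-node run indicates that adding nodes increases throughput almost proportionally.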
Each node in the cluster is a virtual machine (VM) configured with NVIDIA vGPU technology using a full 1:1 vGPU profile, which assigns an entire GPU to the VM. Refer to the Sizing Guide Appendix for additional VM and server configuration details.
Rack density is calculated using server manufacturers' infrastructure planning tools. These tools are available online and help IT professionals determine actual, real-world power requirements.
These sizing calculations and recommendations are a starting point that you can build upon for your environment. Their primary goal is not to raise a data center's power requirements but to work within the typical power capacity of a rack in a modern data center. Sizing calculations assume a 14 kW redundant PDU per rack and dual 1600 W PSUs per server, since these are common in enterprise data centers. Because of these power limits, the Mainstream and Best Performance configurations fit fewer GPU nodes per rack than a CPU-only configuration; however, the GPU-accelerated nodes delivered greater performance (images per second) on the ResNet-50 training benchmark.
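The trade-off described above can be sketched as two small calculations: how many whole servers fit within a rack's power budget, and how a rack of fewer GPU nodes compares with a rack of more CPU-only nodes. This is a minimal sketch under stated assumptions. It treats the 14 kW PDU rating as the usable rack budget, and it takes one PSU's 1600 W rating as the worst-case per-server draw, since in a redundant pair a single PSU must be able to carry the full load. Real sizing should use the measured draw reported by the manufacturer's planning tool. Both function names are hypothetical.

```python
import math


def nodes_per_rack(rack_budget_w: float, node_max_draw_w: float) -> int:
    """Number of whole servers whose combined draw fits within the rack budget."""
    if node_max_draw_w <= 0:
        raise ValueError("node_max_draw_w must be positive")
    return math.floor(rack_budget_w / node_max_draw_w)


def rack_speedup(gpu_nodes: int, gpu_node_ips: float,
                 cpu_nodes: int, cpu_node_ips: float) -> float:
    """Per-rack throughput (images/sec) of GPU nodes relative to a CPU-only rack."""
    return (gpu_nodes * gpu_node_ips) / (cpu_nodes * cpu_node_ips)


if __name__ == "__main__":
    RACK_BUDGET_W = 14_000  # 14 kW redundant PDU per rack (assumed fully usable)
    PSU_RATING_W = 1_600    # assumption: one PSU of the redundant pair carries the whole load
    worst_case = nodes_per_rack(RACK_BUDGET_W, PSU_RATING_W)
    print(f"worst-case nodes per rack: {worst_case}")  # 14000 / 1600 = 8.75, floored to 8
```

Pass per-node images-per-second figures from your own benchmark runs to `rack_speedup`; a GPU rack with fewer nodes still wins whenever each GPU node's throughput advantage outweighs the reduction in node count.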
The following paragraphs give an overview of each configuration; the remaining sections of this document discuss each one in more detail.
The Entry configuration for an NVIDIA AI Enterprise cluster can be deployed quickly into an existing data center without significant changes to the environment. It keeps the same footprint as current 2U server nodes and leverages existing networking infrastructure and storage. The Entry configuration balances performance and cost and provides up to a 20x per-rack performance improvement for AI training workloads compared to CPU-only nodes.
The Mainstream configuration builds on the Entry configuration by adding storage, networking, and GPU resources, which produces faster results for multi-node training and inference jobs. In general, the Mainstream configuration is the most suitable for enterprises because its higher-end server specifications deliver optimized performance for mixed workloads, with up to a 30x performance improvement at the rack level compared to CPU only. An example Mainstream deployment is beyond the scope of this sizing guide; see the reference architecture for NVIDIA AI Enterprise on VMware vSphere for details.
The Best Performance configuration builds on the Mainstream configuration and further increases GPU capability and networking density. It provides additional scale-up and scale-out capacity, which increases throughput for training and inference workflows and delivers a per-rack performance improvement of 44x compared to CPU only.