Reference Build#
For NVIDIA AI Enterprise, a cluster of NVIDIA-Certified Systems with a minimum of four nodes is recommended. Four nodes is the minimum viable cluster size: it provides a balanced combination of NVIDIA GPUs and NVIDIA ConnectX-6 networking for a variety of workloads, and the cluster can be expanded with additional nodes as needed.
The following sections outline an example deployment and the specifications used for each of the four nodes within the cluster. Overall, each node has identical hardware and software specifications.
Note
For guidance on specific hardware recommendations, such as rack configurations, sizing for power, networking, and storage for your deployment, please refer to the NVIDIA-Certified Sizing Guide.
Node Hardware#
An NVIDIA-Certified System combines powerful NVIDIA GPUs with high-performance networking to maximize per-node performance. Adding high-performance NVIDIA Mellanox Networking delivers further gains when executing multi-node NVIDIA AI Enterprise workloads. The following table describes the hardware configuration of each node within the cluster.
| EGX Node Configuration | |
|---|---|
| Server Model | 2U NVIDIA-Certified System |
| CPU | Dual Intel® Xeon® Gold 6240R 2.4G, 24C/48T |
| RAM | 12 x 64GB RDIMM, 3200MT/s, Dual Rank |
| Storage | 1 x 446GB SSD SATA Mix Use 6Gbps |
| Storage | 1 x 6TB Enterprise NVMe |
| Network | Onboard networking |
| Power | Dual, Hot-plug, Redundant Power Supply (1+1), 1600W |
| Network | 1 x NVIDIA® Mellanox® ConnectX®-6 Dx 100G |
| GPU | 1 x NVIDIA A100 for PCIe |
Note
The NVIDIA GPU and the NVIDIA Mellanox NIC are installed in each node such that they share the same NUMA domain and PCIe root complex. For more information on this configuration, please see the Multi-Node Training Deployment Guide.
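As a quick sanity check, the NUMA placement of the PCI devices can be inspected from the ESXi shell. This is only a sketch, assuming shell access to the host and that the output field names match your ESXi build; confirm that the A100 and the ConnectX-6 Dx entries report the same NUMA node.

```
# List PCI devices on the ESXi host together with their NUMA placement.
# Locate the NVIDIA A100 and the Mellanox ConnectX-6 Dx entries in the output
# and verify that both report the same "NUMA Node" value.
esxcli hardware pci list | grep -E "Device Name|NUMA Node"
```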
Network Topology Overview#
The figure below illustrates the network topology used for the cluster. As shown, there are two network switches. The Management switch carries all management and VM traffic and serves as the infrastructure top-of-rack switch. The gpu-leaf-01 switch is the high-performance 100G NVIDIA Mellanox Networking switch; it provides additional throughput between the nodes, resulting in performance gains when executing multi-node AI workloads.
Each VM has dual network connections. One connection is used for management access via the vSwitch. The second connection is the Workload Network, provided through PCI passthrough of the NVIDIA ConnectX-6 Dx. The following graphic illustrates the VM network configuration.
The following table describes the network hardware configuration further:
| Network Hardware Configuration | Workload Switch (gpu-leaf-01) | Management Switch | vSwitch0 |
|---|---|---|---|
| Make/Model | NVIDIA MSN3700-CS2F | Generic Vendor | VMware |
| Switch OS | Cumulus Linux 4.4 | N/A | Standard |
| Port Speed | 100G | 10G | N/A |
| Port MTU | 9216 | 1500 | 1500 |
| Port Mode | Access | Access | Management Network - Access |
| Port VLAN | 111 | 805 | 805 |
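For reference, a minimal sketch of the corresponding workload-port configuration on the Cumulus Linux switch is shown below using NCLU. The swp1-4 interface range and the single VLAN-aware bridge are assumptions for illustration; adapt them to your cabling and bridge layout.

```
# Illustrative Cumulus Linux (NCLU) configuration for the 100G workload ports.
# The swp1-4 range and bridge layout are assumptions, not values from this document.
net add bridge bridge ports swp1-4
net add interface swp1-4 bridge access 111   # untagged access ports on VLAN 111
net add interface swp1-4 mtu 9216            # jumbo frames on the workload network
net pending
net commit
```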
Remote Direct Memory Access (RDMA) With Address Translation Services (ATS)#
Deep Learning training workflows can benefit from RDMA with ATS for executing high-performance multi-node training. RDMA allows direct memory access from the memory of one computer to the memory of another computer without involving the operating system or CPU. NVIDIA ConnectX network adapters are certified for RDMA on VMware vSphere.
The switch and NICs are configured to support RDMA over Converged Ethernet (RoCE) in this deployment.
Note
Please refer to Reference Architecture appendix for more details.
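As an illustration only, lossless RoCE can be enabled on the Cumulus Linux workload switch with NCLU, and the RDMA device can then be checked from inside a guest VM once the ConnectX-6 Dx is passed through and MLNX_OFED is installed. The exact procedure for this deployment is covered in the referenced appendix.

```
# Illustrative: enable lossless RoCE on the Cumulus Linux workload switch.
net add roce mode lossless
net pending
net commit

# Inside a guest VM with the ConnectX-6 Dx passed through and MLNX_OFED installed,
# confirm that the RDMA device is visible.
ibv_devinfo
```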
VMware ESXi Deployment Topology#
The following table provides an overview of the ESXi configuration:
| ESXi Configuration | |
|---|---|
| Version | 7.0 U2 (VMware ESXi, 7.0.2, 17538105) |
| vGPU Host Driver | 13.0 |
| Device Configuration | NVIDIA NICs passthrough, A100s vGPU |
| ATS | Enabled |
ESXi 7.0.2 was installed on each of the four nodes in the cluster, followed by the NVIDIA vGPU software on each node. Using vCenter, a cluster was created and the four nodes were added to it. ATS was then enabled on all four servers, and a local datastore was created on the NVMe drive of each node.
Note
The installation of VMware ESXi and the NVIDIA vGPU Host and Guest Driver Software is out of the scope of this document. Please refer to the NVIDIA AI Enterprise Deployment Guide for detailed instructions.
Refer to the NVIDIA AI Enterprise Deployment Guide for steps to enable ATS.
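For reference, enabling ATS is a host-level VMkernel setting; a minimal sketch of the commands typically used on each ESXi host is shown below (a host reboot is required for the change to take effect). Treat the deployment guide as the authoritative procedure.

```
# Enable ATS support in the VMkernel on each ESXi host, then reboot the host.
esxcli system settings kernel set -s atsSupport -v TRUE

# After the reboot, confirm that the configured and runtime values are TRUE.
esxcli system settings kernel list -o atsSupport
```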
Virtual Machine and Workload Topology#
Many data scientists are looking to scale compute resources to reduce the time it takes to complete the training of a neural network and produce results in real time. A multi-node approach brings scientists closer to a breakthrough because they can experiment more rapidly with different neural networks and algorithms. To maximize throughput, the workload topology for our cluster uses a multi-node configuration for Deep Learning Training workloads: one VM per node, with each VM assigned a full GPU using the A100-40C vGPU profile.
Compute workloads can also benefit from using separate GPU partitions. The flexibility of GPU partitioning allows a single GPU to be shared by small, medium, and large workloads. To lower the Total Cost of Ownership (TCO) and improve overall efficiency, we used GPU partitioning for Deep Learning Inferencing workloads. GPUs can be partitioned using either NVIDIA vGPU software temporal partitioning or Multi-Instance GPU (MIG) spatial partitioning. Use cases that require a high quality of service with low-latency response and error isolation are key workloads for MIG spatial partitioning. With this in mind, we reconfigured the cluster for Deep Learning Inferencing workloads to leverage MIG mode.
Note
Please refer to the GPU Partitioning Technical Brief to understand the differences between types of GPU partitioning. For more information on how to deploy MIG partitions, please refer to the NVIDIA AI Enterprise Deployment Guide.
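As a brief illustration, MIG mode is enabled per GPU from the ESXi host using nvidia-smi; the GPU index 0 below is an assumption, and a GPU reset or host reboot may be required before the change takes effect.

```
# Enable MIG mode on GPU 0 of the ESXi host (GPU index 0 is an assumption).
nvidia-smi -i 0 -mig 1

# Verify that MIG mode is reported as Enabled.
nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv
```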
Deep Learning Training#
In this example, multi-node Deep Learning Training workflows use a high-performance multi-node cluster of VMs. A single VM is deployed on each server in the four-node cluster, resulting in four VMs. The following tables describe the VM configuration and software used for all four VMs.
| VM Configuration for Multi-Node Training | |
|---|---|
| vCPUs | 16 cores (see the Multi-Node Training Deployment Guide) |
| Memory | 64 GB |
| Storage | 800 GB thin-provisioned virtual disk on local NVMe datastore |
| GPU Profile | grid_a100-40c |
| Management Network | VMXNet3 NIC connected to the management network |
| Workload Network | 1 x NVIDIA® Mellanox® ConnectX®-6 Dx PCI device (passthrough) |
| **Advanced Configuration** | |
| 64-bit MMIO | pciPassthru.use64bitMMIO="TRUE" |
| MMIO Space | pciPassthru.64bitMMIOSizeGB = "128" |
| P2P | pciPassthru.allowP2P=true |
| relaxACS | pciPassthru.RelaxACSforP2P=true |
| NUMA Affinity | numa.nodeAffinity (see the Multi-Node Training Deployment Guide) |
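Collected into VM advanced-settings form, these parameters look roughly as follows. The numa.nodeAffinity value of 1 is only a placeholder; it must be set to the NUMA node that hosts the GPU and ConnectX-6 Dx, as described in the Multi-Node Training Deployment Guide.

```
# VM advanced configuration parameters (vSphere Client: VM Options > Advanced >
# Configuration Parameters). The NUMA node value is a placeholder.
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "128"
pciPassthru.allowP2P = true
pciPassthru.RelaxACSforP2P = true
numa.nodeAffinity = 1
```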
| Software Configurations for Multi-Node Training | |
|---|---|
| OS | Ubuntu 20.04.2 LTS |
| vGPU Driver | 460.32.03 |
| OFED | MLNX_OFED_LINUX-5.0-2.1.8.0 |
| Docker | 20.10 |
| NVIDIA Container Toolkit | nvidia-docker2 |
| Container | nvcr.io/nvaie/tensorflow:21.07-tf1-py3 |
| TensorFlow | 1.15.5 |
| Model Version | ResNet-50 |
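Inside each training VM, the NVIDIA AI Enterprise TensorFlow container listed above is pulled from NGC and launched with Docker. The sketch below assumes you have already logged in to nvcr.io; the host networking, shared-memory size, and the /data volume mount are illustrative choices, not requirements from this document.

```
# Launch the NVIDIA AI Enterprise TensorFlow container in a training VM.
# --network host and --shm-size are common choices for multi-node training;
# the /data mount is a hypothetical dataset location.
sudo docker run --gpus all --rm -it \
    --network host --ipc=host --shm-size=16g \
    -v /data:/data \
    nvcr.io/nvaie/tensorflow:21.07-tf1-py3
```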
The following graph illustrates the performance of a multi-node training run on this cluster for an object detection workload using Horovod, which can leverage features of high-performance networks such as RDMA over Converged Ethernet (RoCE). Training an AI model can be incredibly data-intensive and requires scale-out performance across multiple GPUs within the cluster.
For details on setting up and running multi-node training workloads, refer to the Multi-Node Training Deployment Guide.
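As a minimal sketch, a Horovod job spanning the four training VMs (one GPU each) can be launched with horovodrun; the VM hostnames and the training script name below are hypothetical placeholders.

```
# Illustrative Horovod launch across the four training VMs (one GPU per VM).
# Hostnames and the training script are hypothetical placeholders.
horovodrun -np 4 -H train-vm1:1,train-vm2:1,train-vm3:1,train-vm4:1 \
    python resnet50_training.py
```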
Deep Learning Inferencing#
In this example, GPU MIG partitioning is used for Deep Learning Inferencing workloads. Multiple VMs are deployed on each of the servers in the four-node cluster. MIG mode is enabled on each of the A100 GPUs, and a single GPU is shared between two VMs using the A100-3-20C MIG-backed vGPU profile.
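Before the A100-3-20C vGPU profiles can be assigned, the corresponding GPU instances are created on each host with nvidia-smi. The sketch below assumes MIG mode is already enabled and uses GPU index 0, which is an assumption.

```
# On each ESXi host, create two 3g.20gb GPU instances on GPU 0 to back the
# A100-3-20C vGPU profiles (GPU index 0 is an assumption).
nvidia-smi mig -i 0 -cgi 3g.20gb,3g.20gb

# List the resulting GPU instances.
nvidia-smi mig -i 0 -lgi
```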
The following tables describe the VM configuration and software used for Deep Learning Inferencing VMs.
| VM Configuration for Inference | |
|---|---|
| vCPUs | 16 cores |
| Memory | 64 GB |
| Storage | 800 GB thin-provisioned virtual disk on local NVMe datastore |
| GPU Profile | grid_a100-3-20c (grid_a100-7-40c for benchmark purposes) |
| Management Network | VMXNet3 NIC connected to the management network |
| **Advanced Configuration** | |
| 64-bit MMIO | pciPassthru.use64bitMMIO="TRUE" |
| MMIO Space | pciPassthru.64bitMMIOSizeGB = "128" |
| Software Configurations for Inference | |
|---|---|
| OS | Ubuntu 20.04.2 LTS |
| vGPU Driver | 460.63.01 |
| NCCL Version | 2.7.8 |
| TensorRT | 7.2.2 (built from source) |
| Model | ResNet-50 |
The graph below illustrates the performance of a ResNet-50 object detection inference workload running on a VM without a GPU, VMs with vGPU using MIG, and a Bare Metal server with an A100. This workload uses NVIDIA Triton Inference Server, one of the enterprise-grade AI tools and frameworks optimized, certified, and supported by NVIDIA to run on VMware vSphere.
For details on how to set up and run Inferencing workloads using Triton Inference Server, refer to the Triton Inference Server deployment guide.
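For illustration, a Triton Inference Server container can be started inside an inference VM roughly as follows. The image tag shown is the public NGC 21.07 release and the /opt/models model repository is a placeholder; substitute the Triton container and model repository specified in the deployment guide.

```
# Illustrative Triton Inference Server launch in an inference VM.
# The image tag and model repository path are placeholders.
sudo docker run --gpus all --rm \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v /opt/models:/models \
    nvcr.io/nvidia/tritonserver:21.07-py3 \
    tritonserver --model-repository=/models
```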