Reference Build

For NVIDIA AI Enterprise, a cluster of NVIDIA-Certified Systems with a minimum of four nodes is recommended. Four nodes is the minimum viable cluster size because it offers a balanced combination of NVIDIA GPUs and NVIDIA ConnectX-6 networking for a variety of workloads. The cluster can be expanded with additional nodes as needed.

The following sections outline an example deployment and the specifications used for each of the four nodes within the cluster. Overall, each node has identical hardware and software specifications.

Note

For guidance on specific hardware recommendations, such as rack configurations, sizing for power, networking, and storage for your deployment, please refer to the NVIDIA-Certified Sizing Guide.

An NVIDIA-Certified System combines powerful NVIDIA GPUs with networking for maximum performance per node. Adding high-performance NVIDIA Mellanox networking yields further performance gains when executing multi-node NVIDIA AI Enterprise workloads. The following table describes the hardware configuration of each node within the cluster.

EGX Node Configuration

Server Model 2U NVIDIA-Certified System
CPU Dual Intel® Xeon® Gold 6240R 2.4G, 24C/48T
RAM 12 x 64GB RDIMM, 3200MT/s, Dual Rank
Storage 1 x 446GB SSD SATA Mix Use 6Gbps
Storage 1 x 6TB Enterprise NVMe
Network Onboard networking
Power Dual, Hot-plug, Redundant Power Supply (1+1), 1600W
Network 1 x NVIDIA® Mellanox® ConnectX®-6 Dx 100G
GPU 1 x NVIDIA A100 for PCIe
Note

The NVIDIA GPU and the NVIDIA Mellanox NIC are deployed in each node, ensuring that they are on the same NUMA Domain and PCIe Root Complex. For more information on this configuration, please see the Multi-Node Training Deployment Guide.
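
One way to spot-check this placement from the ESXi shell is to inspect the NUMA node reported for each PCI device. The following is a minimal sketch; the grep pattern and context length are illustrative, and the exact output format depends on the ESXi build.

    # On each ESXi host, list PCI devices and locate the GPU and the ConnectX-6 Dx NIC.
    esxcli hardware pci list | grep -iE -A 8 'nvidia|mellanox'
    # In the output, compare the "NUMA Node" value reported for the A100 GPU with the
    # value reported for the ConnectX-6 Dx NIC; both should report the same NUMA node.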

The figure below illustrates the network topology used for the cluster. As shown, there are two network switches. The Management switch (mgmt-leaf01) carries all management and VM traffic and serves as the infrastructure Top of Rack switch. The Workload switch (gpu-leaf01) is the high-performance 100G NVIDIA Mellanox Networking switch; it provides greater throughput between the nodes, resulting in performance gains when executing multi-node AI workloads.

Figure: Cluster network topology (referencebuild-01.png)

Each VM has two network connections. The first is used for management access via the vSwitch; the second is used for the Workload Network via PCI passthrough of the NVIDIA ConnectX-6 Dx. The following graphic illustrates the VM Network Configuration.

Figure: VM network configuration (referencebuild-02.png)

The following table describes the network hardware configuration further:

Network Hardware Configuration

                  Workload Switch (gpu-leaf01)    Management Switch (mgmt-leaf01)    vSwitch0
Make/Model        NVIDIA MSN3700-CS2F             Generic Vendor                     VMware
Switch OS         Cumulus Linux 4.4               N/A                                Standard
Port Speed        100G                            10G                                N/A
Port MTU          9216                            1500                               1500
Port Mode         Access                          Access                             Management Network - Access
Port VLAN         111                             805                                805
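
For reference, the workload switch settings above (MTU 9216, access VLAN 111) could be applied on Cumulus Linux 4.4 with NCLU along the following lines. This is a sketch, not the exact switch configuration: swp1 stands in for each node-facing port, and the VLAN-aware bridge layout is an assumption.

    # On gpu-leaf01 (Cumulus Linux 4.4). Repeat for each node-facing port.
    net add bridge bridge ports swp1
    net add bridge bridge vids 111
    net add interface swp1 mtu 9216
    net add interface swp1 bridge access 111
    net pending
    net commit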

Remote Direct Memory Access (RDMA) With Address Translation Services (ATS)

Deep Learning training workflows can benefit from RDMA with ATS for executing high-performance multi-node training. RDMA allows direct memory access from the memory of one computer to the memory of another computer without involving the operating system or CPU. NVIDIA ConnectX network adapters are certified for RDMA on VMware vSphere.

The switch and NICs are configured to support RDMA over Converged Ethernet (RoCE) in this deployment.
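
Once MLNX_OFED is installed in the guest VMs, RDMA/RoCE connectivity can be sanity-checked with standard OFED utilities. The device name and peer address below are illustrative; refer to the appendix for the full configuration.

    # Inside a training VM with MLNX_OFED installed.
    ibdev2netdev        # map the RDMA device (e.g. mlx5_0) to its network interface
    ibv_devinfo         # confirm the passed-through ConnectX-6 Dx is present and active

    # Optional point-to-point RoCE bandwidth test between two VMs (perftest utilities):
    ib_write_bw -d mlx5_0                # on the first VM (server side)
    ib_write_bw -d mlx5_0 <server-ip>    # on the second VM (client side)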

Note

Please refer to the Reference Architecture appendix for more details.

The following table provides an overview of the ESXi configuration:

ESXi Configuration

Version 7.0 U2 (VMware ESXi, 7.0.2, 17538105)
vGPU Host Driver 13.0
Device Configuration NVIDIA NICs in passthrough mode; NVIDIA A100 GPUs in vGPU mode
ATS Enabled

ESXi 7.0.2 was installed on each of the four nodes in our cluster, followed by the NVIDIA vGPU software on each node. Using vCenter, a cluster was created and the four nodes were added to it. ATS was then enabled on all four servers, and a local datastore was created on the NVMe drive of each node.
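
The host-side steps might look roughly as follows; the vGPU manager bundle filename and datastore path are placeholders, and the authoritative procedure is in the NVIDIA AI Enterprise documentation.

    # On each ESXi host (in maintenance mode); the bundle name below is a placeholder.
    esxcli software vib install -d /vmfs/volumes/datastore1/NVIDIA-AI-Enterprise-Host-Driver.zip

    # Enable Address Translation Services (ATS), then reboot the host.
    esxcli system settings kernel set -s atsSupport -v TRUE
    esxcli system settings kernel list -o atsSupport    # verify the setting reads TRUE

    # After reboot, confirm the vGPU manager can see the A100.
    nvidia-smi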

Note

Many data scientists are looking to scale compute resources to reduce the time it takes to train a neural network and deliver results faster. A multi-node approach brings scientists closer to a breakthrough because they can experiment more rapidly with different neural networks and algorithms. To maximize throughput, the workload topology for our cluster takes a multi-node approach for Deep Learning Training workloads, using one VM per node, with each VM assigned a full GPU via the A100-40C vGPU profile.

Compute workloads can also benefit from using separate GPU partitions. The flexibility of GPU partitioning allows a single GPU to be shared by small, medium, and large workloads. To lower the Total Cost of Ownership (TCO) and improve overall efficiency, we used GPU partitioning for Deep Learning Inferencing workloads. GPUs can be partitioned using either NVIDIA vGPU software temporal partitioning or Multi-Instance GPU (MIG) spatial partitioning. Use cases that require a high quality of service with low-latency response and error isolation are key workloads for MIG spatial partitioning. With this in mind, we reconfigured the cluster for Deep Learning Inferencing workloads to leverage MIG mode.

Note

Please refer to the GPU Partitioning Technical Brief to understand the differences between types of GPU partitioning. For more information on how to deploy MIG partitions, please refer to the NVIDIA AI Enterprise Deployment Guide.

Deep Learning Training

In this example, multi-node Deep Learning Training workflows use a high-performance multi-node cluster of VMs. A single VM is deployed on each server in the four-node cluster, resulting in four VMs. The following tables describe the VM configuration and software used for all four VMs.

VM Configuration for Multi-node Training

vCPUs 16 cores (Multi-Node Training Deployment Guide)
Memory 64 GB
Storage 800 GB thin provisioned virtual disk on local NVMe datastore
GPU Profile grid_a100-40c
Management Network VMXNET3 NIC connected to the Management network
Workload Network 1 x NVIDIA® Mellanox® ConnectX®-6 Dx PCI device (passthrough)
Advanced Configuration
64-bit MMIO pciPassthru.use64bitMMIO="TRUE"
MMIO Space pciPassthru.64bitMMIOSizeGB = "128"
P2P pciPassthru.allowP2P=true
relaxACS pciPassthru.RelaxACSforP2P=true
NUMA Affinity numa.nodeAffinity (Multi-Node Training Deployment Guide)
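
Collected together, the advanced parameters above would appear in the VM's .vmx file (or the vSphere Advanced Configuration settings) roughly as follows. The numa.nodeAffinity value of "1" is a placeholder; set it to the NUMA node that hosts the GPU/NIC pair on each host, as described in the Multi-Node Training Deployment Guide.

    # VM Advanced Configuration entries (.vmx); numa.nodeAffinity is a per-host placeholder.
    pciPassthru.use64bitMMIO = "TRUE"
    pciPassthru.64bitMMIOSizeGB = "128"
    pciPassthru.allowP2P = "true"
    pciPassthru.RelaxACSforP2P = "true"
    numa.nodeAffinity = "1"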

Software Configurations for Multi-Node Training

OS Ubuntu 20.04.2 LTS
vGPU Driver 460.32.03
OFED MLNX_OFED_LINUX-5.0-2.1.8.0
Docker 20.10
NVIDIA Container Toolkit nvidia-docker2
Container nvcr.io/nvaie/tensorflow:21.07-tf1-py3
TensorFlow 1.15.5
Model Version ResNet-50
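
A container launch on one of the training VMs might look like the following sketch; the data mount path is an example, and the ulimit settings follow common NGC container practice.

    # On each training VM (Docker 20.10 with nvidia-docker2 installed).
    sudo docker run --gpus all -it --rm \
        --net=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
        -v /data:/data \
        nvcr.io/nvaie/tensorflow:21.07-tf1-py3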

The following graph illustrates the performance of a multi-node training run on this cluster for an object detection workload using Horovod, which can leverage features of high-performance networks such as RDMA over Converged Ethernet (RoCE). Training an AI model can be incredibly data-intensive and requires scale-out performance across multiple GPUs within the cluster.

Figure: Multi-node training performance (referencebuild-03.png)

For details on setting up and running multi-node training workloads, refer to the Multi-Node Training Deployment Guide.
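
As a rough sketch, a four-process Horovod run (one A100-40C vGPU per VM) could be launched with mpirun from inside the container as shown below. The hostnames, interface name, NCCL variables, and training script name are illustrative; the deployment guide remains the authoritative reference.

    # Launched from one training VM; node-01..node-04 and the script path are placeholders.
    mpirun -np 4 -H node-01:1,node-02:1,node-03:1,node-04:1 \
        --allow-run-as-root -bind-to none -map-by slot \
        -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME=ens192 \
        -x NCCL_DEBUG=INFO \
        python train_resnet50_horovod.py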

Deep Learning Inferencing

In this example, GPU MIG partitioning is used for Deep Learning Inferencing workloads. Multiple VMs are deployed on each of the servers in the four-node cluster. MIG mode is enabled on each A100 GPU, and a single GPU is shared between two VMs using the A100-3-20C MIG-backed vGPU profile.
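
On each host, MIG mode is enabled per GPU with nvidia-smi before the MIG-backed vGPU profiles are assigned in vCenter. The commands below are a minimal sketch using GPU index 0 as an example; see the NVIDIA AI Enterprise Deployment Guide for the complete procedure, including GPU instance creation.

    # On each ESXi host: enable MIG mode on the A100 (index 0 shown as an example).
    nvidia-smi -i 0 -mig 1
    # A GPU reset or host reboot may be required for the change to take effect.
    nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv
    # List the GPU instance profiles (e.g. 3g.20gb, which backs the A100-3-20C vGPU profile).
    nvidia-smi mig -lgip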

The following tables describe the VM configuration and software used for Deep Learning Inferencing VMs.

VM Configuration for Inference

vCPUs 16 cores
Memory 64 GB
Storage 800 GB thin provisioned virtual disk on local NVMe datastore
GPU Profile grid_a100-3-20c, grid_a100-7-40c for benchmark purposes
Management Network VMXNET3 NIC connected to the Management network
Advanced Configuration
64-bit MMIO pciPassthru.use64bitMMIO="TRUE"
MMIO Space pciPassthru.64bitMMIOSizeGB = "128"

Software configurations for Inference

OS Ubuntu 20.04.2 LTS
vGPU Driver 460.63.01
NCCL Version 2.7.8
TensorRT 7.2.2 (Built from source)
Model ResNet-50
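
For a quick local check of the TensorRT build, trtexec can time a ResNet-50 engine inside an inference VM; the ONNX model file below is a placeholder.

    # Inside an inference VM with TensorRT 7.2.2 installed.
    trtexec --onnx=resnet50.onnx --fp16 --workspace=2048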

The graph below illustrates the performance of a ResNet-50 object detection inference workload running on a VM without a GPU, VMs with vGPU using MIG, and a Bare Metal server with an A100. This workload uses NVIDIA Triton Inference Server, one of the enterprise-grade AI tools and frameworks optimized, certified, and supported by NVIDIA to run on VMware vSphere.

Figure: ResNet-50 inference performance comparison (referencebuild-04.png)

For details on how to set up and run Inferencing workloads using Triton Inference Server, refer to the Triton Inference Server deployment guide.
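
A minimal sketch of serving and benchmarking a model with Triton is shown below; the container image tag, model repository path, and model name are placeholders, and the deployment guide covers the supported configuration.

    # On an inference VM: start Triton Inference Server with a local model repository.
    sudo docker run --gpus all --rm -d \
        -p 8000:8000 -p 8001:8001 -p 8002:8002 \
        -v /models:/models \
        nvcr.io/nvaie/tritonserver:21.07-py3 \
        tritonserver --model-repository=/models

    # Measure throughput and latency with perf_analyzer (from the Triton client/SDK tools).
    perf_analyzer -m resnet50 -u localhost:8001 -i grpc --concurrency-range 1:4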
