Reference Build

For NVIDIA AI Enterprise, a cluster of at least four NVIDIA-Certified Systems is recommended. Four nodes is the minimum viable size because it balances NVIDIA GPUs and NVIDIA ConnectX-6 networking across a variety of workloads. The cluster can be expanded with additional nodes as needed.

The following sections outline an example deployment and the specifications used for each of the four nodes within the cluster. All four nodes have identical hardware and software specifications.

Note

For guidance on specific hardware recommendations, such as rack configurations, sizing for power, networking, and storage for your deployment, please refer to the NVIDIA-Certified Sizing Guide.

An NVIDIA-Certified System contains powerful NVIDIA GPUs and networking, which offer maximum performance per node. Adding high-performance NVIDIA Mellanox networking yields performance gains when executing multi-node AI Enterprise workloads. The following table describes the hardware configuration of each node in the cluster.

EGX Node Configuration

Server Model: 2U NVIDIA-Certified System
CPU: Dual Intel® Xeon® Gold 6240R 2.4G, 24C/48T
RAM: 12 x 64GB RDIMM, 3200MT/s, Dual Rank
Storage: 1 x 446GB SSD SATA Mix Use 6Gbps
Storage: 1 x 6TB Enterprise NVMe
Network: Onboard networking
Power: Dual, Hot-plug, Redundant Power Supply (1+1), 1600W
Network: 1 x NVIDIA® Mellanox® ConnectX®-6 Dx 100G
GPU: 1 x NVIDIA A100 for PCIe

Note

In each node, the NVIDIA GPU and the NVIDIA Mellanox NIC are installed so that they share the same NUMA domain and PCIe root complex. For more information on this configuration, please see the Multi-Node Training Deployment Guide.

The figure below illustrates the network topology used for the cluster. As shown, there are two network switches. The management switch (mgmt-leaf01) carries all management and VM traffic and is the infrastructure Top of Rack switch. The workload switch (gpu-leaf01) is the high-performance 100G NVIDIA Mellanox switch; it provides more throughput between the nodes, resulting in performance gains when executing multi-node AI workloads.

[Figure: cluster network topology (referencebuild-01.png)]

Each VM has two network connections. The first is used for management access through the vSwitch. The second carries the Workload Network through PCI passthrough of the NVIDIA ConnectX-6 Dx. The following graphic illustrates the VM network configuration.

[Figure: VM network configuration (referencebuild-02.png)]

The following table describes the network hardware configuration further:

Network Hardware Configuration

| | Workload Switch (gpu-leaf01) | Management Switch (mgmt-leaf01) | vSwitch0 |
|---|---|---|---|
| Make/Model | NVIDIA MSN3700-CS2F | Generic Vendor | VMware |
| Switch OS | Cumulus Linux 4.4 | N/A | Standard |
| Port Speed | 100G | 10G | N/A |
| Port MTU | 9216 | 1500 | 1500 |
| Port Mode | Access | Access | Management Network - Access |
| Port VLAN | 111 | 805 | 805 |
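The port settings in the table can be captured as data and checked for consistency. The sketch below is illustrative only: the dictionary layout and the `mismatches` helper are hypothetical, and it assumes that vSwitch0 uplinks to the management switch (which the shared VLAN 805 suggests but the table does not state outright).

```python
# Port settings transcribed from the Network Hardware Configuration table.
# The data layout and helper below are hypothetical, for illustration only.
NETWORK_PORTS = {
    "gpu-leaf01": {"speed": "100G", "mtu": 9216, "mode": "Access", "vlan": 111},
    "mgmt-leaf01": {"speed": "10G", "mtu": 1500, "mode": "Access", "vlan": 805},
    "vSwitch0": {
        "speed": None,  # listed as N/A in the table
        "mtu": 1500,
        "mode": "Management Network - Access",
        "vlan": 805,
    },
}


def mismatches(a, b, fields=("mtu", "vlan")):
    """Return the fields whose values differ between two named ports."""
    return [f for f in fields if NETWORK_PORTS[a][f] != NETWORK_PORTS[b][f]]


# Assumption: vSwitch0 uplinks to mgmt-leaf01, so MTU and VLAN should agree.
print(mismatches("vSwitch0", "mgmt-leaf01"))

# The workload switch is a separate network with jumbo frames and its own VLAN.
print(mismatches("gpu-leaf01", "mgmt-leaf01"))
```

A check like this is mainly useful when the cluster is expanded and new ports have to match the existing ones.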

Remote Direct Memory Access (RDMA) With Address Translation Services (ATS)

Deep Learning training workflows can benefit from RDMA with ATS for executing high-performance multi-node training. RDMA allows direct memory access from the memory of one computer to the memory of another computer without involving the operating system or CPU. NVIDIA ConnectX network adapters are certified for RDMA on VMware vSphere.

The switch and NICs are configured to support RDMA over Converged Ethernet (RoCE) in this deployment.

Note

Please refer to the Reference Architecture appendix for more details.

The following table provides an overview of the ESXi configuration:

ESXi Configuration

Version: 7.0 U2 (VMware ESXi, 7.0.2, 17538105)
vGPU Host Driver: 13.0
Device Configuration: NVIDIA NICs passthrough, A100s vGPU
ATS: Enabled

ESXi 7.0.2 was installed on each of the four nodes in the cluster, followed by the NVIDIA vGPU software. Using vCenter, a cluster was created and the four nodes were added to it. ATS was enabled on all four servers, and a local datastore was then created on the NVMe drive of each node.

Note

Many data scientists want to scale compute resources to reduce the time it takes to train a neural network and to produce results in real time. A multi-node approach brings them closer to a breakthrough because they can experiment more rapidly with different neural networks and algorithms. To maximize throughput, our cluster used the multi-node topology for Deep Learning Training workloads: one VM per node, with each VM assigned a full GPU through the A100-40C vGPU profile.

Compute workloads can also benefit from using separate GPU partitions. GPU partitioning allows a single GPU to be shared by small, medium, and large workloads. To lower the Total Cost of Ownership (TCO) and improve overall efficiency, we used GPU partitioning for Deep Learning Inferencing workloads. GPUs can be partitioned using either NVIDIA vGPU software temporal partitioning or Multi-Instance GPU (MIG) spatial partitioning. MIG spatial partitioning is best suited to use cases that require high quality of service, low-latency response, and error isolation. With this in mind, we reconfigured the cluster to use MIG mode for Deep Learning Inferencing workloads.

Note

Please refer to the GPU Partitioning Technical Brief to understand the differences between types of GPU partitioning. For more information on how to deploy MIG partitions, please refer to the NVIDIA AI Enterprise Deployment Guide.

Deep Learning Training

In this example, multi-node Deep Learning Training workflows use a high-performance multi-node cluster of VMs. A single VM is deployed on each of the servers in the four-node cluster, resulting in four VMs in our cluster. The following tables describe the VM configuration and software used for all four VMs.

VM Configuration for Multi-Node Training

vCPUs: 16 cores (see the Multi-Node Training Deployment Guide)
Memory: 64 GB
Storage: 800 GB thin provisioned virtual disk on local NVMe datastore
GPU Profile: grid_a100-40c
Management Network: VMXNet3 NIC connected to network
Workload Network: 1 x NVIDIA® Mellanox® ConnectX®-6 Dx PCI device

Advanced Configuration

64-bit MMIO: pciPassthru.use64bitMMIO="TRUE"
MMIO Space: pciPassthru.64bitMMIOSizeGB="128"
P2P: pciPassthru.allowP2P=true
relaxACS: pciPassthru.RelaxACSforP2P=true
NUMA Affinity: numa.nodeAffinity (see the Multi-Node Training Deployment Guide)
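The advanced configuration parameters above are plain key/value pairs, so they are easy to generate or audit with a short script. The following is a minimal sketch, not an official tool: `render_vmx_lines` and `missing_keys` are hypothetical helper names, and the `key = "value"` output form is assumed to match the usual text layout of a VM's .vmx file. `numa.nodeAffinity` is left out because its value depends on the host and is covered in the Multi-Node Training Deployment Guide.

```python
# Advanced VM parameters copied from the table above.
# numa.nodeAffinity is omitted: its value is host-specific.
ADVANCED_SETTINGS = {
    "pciPassthru.use64bitMMIO": "TRUE",
    "pciPassthru.64bitMMIOSizeGB": "128",
    "pciPassthru.allowP2P": "true",
    "pciPassthru.RelaxACSforP2P": "true",
}


def render_vmx_lines(settings):
    """Render settings as `key = "value"` lines (assumed .vmx text form)."""
    return [f'{key} = "{value}"' for key, value in settings.items()]


def missing_keys(settings, required):
    """Return the required parameter names that are absent from settings."""
    return sorted(key for key in required if key not in settings)


for line in render_vmx_lines(ADVANCED_SETTINGS):
    print(line)
```

The same dictionary can be compared against the parameters read back from a VM with `missing_keys` to confirm that nothing was skipped during setup.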

Software Configurations for Multi-Node Training

OS: Ubuntu 20.04.2 LTS
vGPU Driver: 460.32.03
OFED: MLNX_OFED_LINUX-5.0-2.1.8.0
Docker: 20.10
NVIDIA Container Toolkit: nvidia-docker2
Container: nvcr.io/nvaie/tensorflow:21.07-tf1-py3
TensorFlow: 1.15.5
Model Version: ResNet-50

The following graph illustrates the performance of a multi-node training run on this cluster for an object detection workload using Horovod, which can leverage features of high-performance networks such as RDMA over Converged Ethernet (RoCE). Training an AI model can be incredibly data-intensive and requires scale-out performance across multiple GPUs within the cluster.

[Figure: multi-node training performance (referencebuild-03.png)]

For details on setting up and running multi-node training workloads, refer to the Multi-Node Training Deployment Guide.
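As a rough illustration of how the four single-GPU VMs map onto a Horovod launch, the sketch below assembles a `horovodrun` command line with one process slot per VM. The hostnames and the `train.py` script name are placeholders invented for this example, and `horovodrun_command` is a hypothetical helper; the actual launch procedure is the one given in the Multi-Node Training Deployment Guide.

```python
import shlex


def horovodrun_command(hosts, gpus_per_host, script):
    """Build a horovodrun argument list with one process per GPU."""
    host_arg = ",".join(f"{host}:{gpus_per_host}" for host in hosts)
    total_processes = len(hosts) * gpus_per_host
    return [
        "horovodrun",
        "-np", str(total_processes),  # total number of training processes
        "-H", host_arg,               # host:slots pairs, comma separated
        "python", script,
    ]


# Hypothetical hostnames for the four training VMs (one per node).
VM_HOSTS = ["vm-node1", "vm-node2", "vm-node3", "vm-node4"]

# Each VM holds one full A100 (A100-40C profile), so one slot per host.
cmd = horovodrun_command(VM_HOSTS, gpus_per_host=1, script="train.py")
print(shlex.join(cmd))
```

Keeping the host list in one place means that adding nodes to the cluster only changes `VM_HOSTS`; the process count follows automatically.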

Deep Learning Inferencing

In this example, GPU MIG partitioning is used for Deep Learning Inferencing workloads. Multiple VMs are deployed on each of the servers in the four-node cluster. Each of the A100 GPUs has MIG mode enabled, and a single GPU is shared between 2 VMs, using an A100-3-20C MIG vGPU profile.
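The sharing arrangement described above can be sanity-checked with some simple arithmetic. The sketch below assumes a 40 GB A100 that exposes seven MIG compute slices, and it reads the slice count and memory size directly from the vGPU profile name. The additive check is a simplification of the real MIG placement rules, and `parse_profile` and `fits_on_one_gpu` are hypothetical helpers rather than part of any NVIDIA tool.

```python
import re

A100_COMPUTE_SLICES = 7  # assumption: MIG compute slices on one A100
A100_MEMORY_GB = 40      # assumption: 40 GB A100, as implied by A100-40C


def parse_profile(name):
    """Return (compute_slices, memory_gb) for a C-series A100 profile name.

    'grid_a100-3-20c' -> (3, 20)    MIG-backed profile
    'grid_a100-40c'   -> (None, 40) time-sliced profile, no slice count
    """
    match = re.fullmatch(r"grid_a100-(?:(\d+)-)?(\d+)c", name.lower())
    if match is None:
        raise ValueError(f"unrecognized profile name: {name}")
    slices = int(match.group(1)) if match.group(1) else None
    return slices, int(match.group(2))


def fits_on_one_gpu(profiles):
    """Simplified check: do these MIG-backed profiles fit on one A100?"""
    parsed = [parse_profile(p) for p in profiles]
    if any(slices is None for slices, _ in parsed):
        raise ValueError("only MIG-backed profiles are handled in this sketch")
    total_slices = sum(slices for slices, _ in parsed)
    total_memory = sum(memory for _, memory in parsed)
    return total_slices <= A100_COMPUTE_SLICES and total_memory <= A100_MEMORY_GB


# Two VMs sharing one GPU, each with the A100-3-20C profile.
print(fits_on_one_gpu(["grid_a100-3-20c", "grid_a100-3-20c"]))
```

Under these assumptions, two 3-20C profiles use six of the seven compute slices and all 40 GB of memory, which is why one GPU is shared by exactly two VMs in this build.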

The following tables describe the VM configuration and software used for Deep Learning Inferencing VMs.

VM Configuration for Inference

vCPUs: 16 cores
Memory: 64 GB
Storage: 800 GB thin provisioned virtual disk on local NVMe datastore
GPU Profile: grid_a100-3-20c, grid_a100-7-40c for benchmark purposes
Management Network: VMXNet3 NIC connected to network

Advanced Configuration

64-bit MMIO: pciPassthru.use64bitMMIO="TRUE"
MMIO Space: pciPassthru.64bitMMIOSizeGB="128"

Software Configurations for Inference

OS: Ubuntu 20.04.2 LTS
vGPU Driver: 460.63.01
NCCL Version: 2.7.8
TensorRT: 7.2.2 (Built from source)
Dataset: ResNet-50

The graph below illustrates the performance of a ResNet-50 object detection inference workload running on a VM without a GPU, VMs with vGPU using MIG, and a Bare Metal server with an A100. This workload uses NVIDIA Triton Inference Server, one of the enterprise-grade AI tools and frameworks optimized, certified, and supported by NVIDIA to run on VMware vSphere.

[Figure: ResNet-50 inference performance comparison (referencebuild-04.png)]

For details on how to set up and run Inferencing workloads using Triton Inference Server, refer to the Triton Inference Server deployment guide.

© Copyright 2022-2023, NVIDIA. Last updated on Jan 23, 2023.