Sizing Guide

To estimate the performance improvement for an entire rack, 4-node clusters were run simultaneously, which established a linear scaling model for the rack. The predicted performance improvement for each configuration is measured against the CPU-only results, which were linearly extrapolated to a full CPU-only rack of 20 nodes (five 4-node clusters).

Note that power requirements limit rack density. Because most enterprise data centers use a 14kW redundant PDU per rack and dual 1600W PSUs per server, fewer GPU nodes than CPU-only nodes fit in a rack. The relative performance numbers shown below assume these hard power limits. Even so, GPU nodes deliver significant performance improvements over CPU-only nodes. Enterprises that can exceed these rack power limits may see even larger performance gains.
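The linear extrapolation described above can be sketched as follows. This is a minimal illustration, not the exact method used to produce the guide's figures: the function name and parameters are hypothetical, and the 212.87 images/sec input is the 4-node CPU-only result reported in the Scalability Testing section.

```python
def rack_throughput(cluster_images_per_sec, nodes_in_rack, nodes_per_cluster=4):
    """Linearly extrapolate one cluster's measured throughput to a full rack.

    Hypothetical helper: assumes throughput scales linearly with the number
    of clusters, including fractional clusters.
    """
    clusters_in_rack = nodes_in_rack / nodes_per_cluster  # may be fractional
    return cluster_images_per_sec * clusters_in_rack


# CPU-only baseline: one 4-node cluster measured at 212.87 images/sec,
# extrapolated to a 20-node rack (five 4-node clusters).
cpu_rack = rack_throughput(212.87, nodes_in_rack=20)
print(f"CPU-only rack estimate: {cpu_rack:.2f} images/sec")
```

Because the model is linear, a rack that holds a non-multiple of four nodes is simply treated as a fractional number of clusters.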

Entry Performance Cluster

For the Entry configuration, 20 physical nodes fit into the rack, allowing for five 4-node clusters. This configuration uses an A30, which is expected to deliver approximately 60% of the throughput of the A100. These assumptions are used to calculate expected performance compared to the CPU-only rack scenario mentioned above. The expected performance improvement for the Entry configuration compared to a CPU-only rack is 20X.

The CPU-only VMs were assigned more vCPU resources than the vGPU-accelerated VMs: 72 vCPUs versus 16 vCPUs, so the accelerated VMs used roughly 75% fewer vCPUs. You may see smaller performance improvements if your processor has a higher core count and clock speed.

Mainstream Performance Cluster

For the Mainstream configuration, 15 physical nodes fit into the rack, resulting in 3.75 4-node clusters in the rack. Fractional 4-node clusters can be used because performance improvements are shown to be linear at this scale. This configuration also uses RoCE with ATS via an NVIDIA ConnectX-6, allowing for further performance improvement on a 100GbE network. These assumptions are used to calculate expected performance compared to the CPU-only rack scenario mentioned above. The expected performance improvement for the Mainstream configuration compared to a CPU-only rack is 30X.

Best Performance Cluster

For the Best Performance configuration, 11 physical nodes fit into the rack, resulting in 2.75 4-node clusters in the rack. Fractional 4-node clusters can be used because performance improvements are shown to be linear at this scale. This configuration also uses RoCE with ATS via an NVIDIA ConnectX-6, allowing for further performance improvement on a 100GbE network. Additionally, 2x A100 GPUs are placed in each server, allowing for 2 VMs per server and doubling the number of virtual nodes per rack. These assumptions are used to calculate expected performance compared to the CPU-only rack scenario mentioned above. The expected performance improvement for the Best Performance configuration compared to a CPU-only rack is 44X.
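The guide does not spell out the formula behind the 20X, 30X, and 44X figures. The sketch below is one plausible reconstruction, assuming a purely linear model and the Images/sec results from the Scalability Testing section. Two assumptions are not stated in the source: that the Entry estimate starts from the 10GbE GPU result, and that the Mainstream and Best Performance estimates start from the 100GbE + RoCE result. The function and variable names are hypothetical.

```python
# Measured 4-node results (images/sec) from the Scalability Testing section.
CPU_10G = 212.87         # CPU-only, 10GbE
GPU_10G = 6964.61        # 1x A100 per node, 10GbE
GPU_100G_ROCE = 8484.07  # 1x A100 per node, 100GbE + RoCE

CPU_RACK_NODES = 20      # CPU-only reference rack: five 4-node clusters


def rack_speedup(gpu_images_per_sec, nodes, vms_per_node=1, gpu_factor=1.0):
    """Estimate rack-level speedup over the CPU-only rack (assumed linear model).

    gpu_factor scales A100 throughput for a different GPU (0.6 for the A30,
    per the Entry configuration's stated assumption).
    """
    per_cluster_speedup = gpu_images_per_sec / CPU_10G * gpu_factor
    virtual_nodes = nodes * vms_per_node
    return per_cluster_speedup * virtual_nodes / CPU_RACK_NODES


configs = {
    # Assumption: Entry is based on the 10GbE GPU run.
    "Entry": rack_speedup(GPU_10G, nodes=20, gpu_factor=0.6),
    # Assumption: Mainstream and Best Performance use the 100GbE + RoCE run.
    "Mainstream": rack_speedup(GPU_100G_ROCE, nodes=15),
    "Best Performance": rack_speedup(GPU_100G_ROCE, nodes=11, vms_per_node=2),
}

for name, speedup in configs.items():
    print(f"{name}: ~{speedup:.1f}X")
```

Under these assumptions the estimates round to the 20X, 30X, and 44X figures quoted above, which suggests the published numbers follow a similar linear model, although the source does not confirm it.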

Scalability Testing

All scalability tests were completed using the ResNet-50 Training model with FP16 precision, 2 epochs, and a batch size of 512. The model was executed on 4 nodes with 1 VM per node for each test.

                      CPU Run Over 10G     GPU Run Over 10G     GPU Run Over 100G + RoCE
GPU Profile           N/A                  Bare-metal           Bare-metal
Images/sec (Total)    212.87               6964.61              8484.07
Watts (Total)         2304                 2337                 2453
Number of Nodes       4 (1 VM per Node)    4 (1 VM per Node)    4 (1 VM per Node)
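The throughput and power figures above can be combined into a rough efficiency metric. The following sketch computes images/sec per watt for each run. The metric itself is a derived illustration that does not appear in the guide, and the variable names are hypothetical.

```python
# (images/sec total, watts total) for each 4-node run, from the results above.
runs = {
    "CPU Run Over 10G": (212.87, 2304),
    "GPU Run Over 10G": (6964.61, 2337),
    "GPU Run Over 100G + RoCE": (8484.07, 2453),
}

# Images/sec delivered per watt of total power draw.
efficiency = {name: ips / watts for name, (ips, watts) in runs.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.3f} images/sec per watt")
```

Total power draw differs by only a few percent across the three runs, so nearly all of the throughput gain appears as improved efficiency.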

The first test compared CPU-only nodes to nodes with a single A100 GPU on a 10GbE network.

../_images/appendix-03.png

Note

  • Relative performance, 4 nodes, TensorFlow ResNet-50 V1.5 Training using Horovod, FP16, BS:512

  • Server Configuration: 2x Intel(R) Xeon(R) (Gold 6240R @2.4GHz), VMware vSphere 7.0u2

  • CPU results: Ubuntu 18.04, 72 vCPU, 64 GB RAM, on-board 10GbE networking.

  • GPU results: Ubuntu 18.04, 16 vCPU, 64 GB RAM, NVIDIA vGPU 12.0 40C Profile, 1 x NVIDIA A100 per node, Driver 460.32.04, on-board 10GbE networking.

The above graph demonstrates a potential ~30x increase in training performance when using servers with a single A100 GPU each, compared to servers using only CPUs. Note that the number of vCPUs required for GPU-accelerated VMs was reduced by approximately 75% compared to non-GPU-accelerated VMs.
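Both figures in this paragraph can be checked against the reported results. A minimal sketch, using the Images/sec totals from the Scalability Testing results and the vCPU counts from the note above (variable names are hypothetical):

```python
# 4-node totals over 10GbE, from the Scalability Testing results.
cpu_images_per_sec = 212.87
gpu_images_per_sec = 6964.61

# vCPUs per VM, from the configuration note.
cpu_vm_vcpus = 72
gpu_vm_vcpus = 16

speedup = gpu_images_per_sec / cpu_images_per_sec
vcpu_reduction = 1 - gpu_vm_vcpus / cpu_vm_vcpus

print(f"Training speedup: ~{speedup:.1f}x")
print(f"vCPU reduction: ~{vcpu_reduction:.0%}")
```

The computed speedup is a little under 33x and the vCPU reduction is about 78%, consistent with the rounded ~30x and approximately 75% quoted in the text.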

A high-performance 100GbE NVIDIA Mellanox Networking switch provides more throughput between the nodes, resulting in performance gains when executing multi-node AI Enterprise workloads. The following chart illustrates an example of the multi-node cluster performance gains when running Training workloads.

../_images/appendix-04.png

Note

  • Relative performance, 4 nodes, TensorFlow ResNet-50 V1.5 Training using Horovod, FP16, BS:512

  • Server Configuration: 2x Intel(R) Xeon(R) (Gold 6240R @2.4GHz), VMware vSphere 7.0u2

  • GPU results: Ubuntu 18.04, 16 vCPU, 64 GB RAM, NVIDIA vGPU 12.0 40C Profile, 1 x NVIDIA A100 per node, Driver 460.32.04, on-board 10GbE networking.

  • GPU +100GbE results: Ubuntu 18.04, 16 vCPU, 64 GB RAM, NVIDIA vGPU 12.0 40C Profile, 1 x NVIDIA A100 per node, Driver 460.32.04, NVIDIA Mellanox CX-6 Dx 100G paired with NVIDIA SN3700
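The gain attributable to the faster network can be estimated from the two GPU runs in the Scalability Testing results. This is a simple ratio of the reported 4-node totals, and the variable names are hypothetical:

```python
# 4-node GPU totals (images/sec) from the Scalability Testing results.
gpu_10g = 6964.61        # on-board 10GbE
gpu_100g_roce = 8484.07  # 100GbE + RoCE

network_gain = gpu_100g_roce / gpu_10g
print(f"100GbE + RoCE vs 10GbE: {network_gain:.2f}x "
      f"({network_gain - 1:.0%} more images/sec)")
```

On these figures, moving from 10GbE to 100GbE with RoCE yields roughly 22% more throughput for the same four GPU nodes.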

A high-performance networking infrastructure provides more throughput between the nodes, resulting in performance gains when executing multi-node AI Enterprise workloads. The following graph illustrates the increased throughput as Deep Learning Training workloads are scaled out across multiple nodes. The relative performance gains are linear as the tests are scaled out.

../_images/appendix-05.png

Note

Server Config: Intel Xeon Gold (6240R @ 2.4GHz), Ubuntu 18.04, VMware vSphere 7.0u2, VM with 16 vCPUs, 64 GB RAM, NVIDIA vGPU 12.0 (40C profile), 1x NVIDIA A100 per node, Driver 460.32.04, NVIDIA Mellanox ConnectX-6 Dx, RoCE enabled, ATS enabled, TensorFlow ResNet-50 V1.5 Training using Horovod, FP16, BS: 512