Compute Workflows

Multi-Node Deep Learning Training with TensorFlow (0.1.0)

Deploying virtual GPUs (vGPU) for Deep Learning Training can be architected using three different approaches within a virtualized environment:

  • Single VM assigned a full or fractionalized-partitioned vGPU

  • Single VM with multiple NVIDIA NVLink vGPU devices

  • Multiple nodes (VMs)

Many data scientists are looking to scale compute resources to reduce the time it takes to train a neural network and produce results. A multi-GPU approach brings scientists closer to achieving a breakthrough, as they can more rapidly experiment with different neural networks and algorithms. As scientists train more and more models, model size and data consumption can grow significantly.

Models can be small enough to run on one or more GPUs within a server, but as datasets grow, so do training times. This is where multi-node distributed training lends itself well to many organizations. The goal is to build a model from large datasets that understands the patterns and relationships behind the data, rather than just the data itself. This requires an exchange of data between nodes throughout the training process, and GPUDirect RDMA with ATS provides high-performance networking between nodes. To illustrate this multi-node approach for the authoring of this guide, a minimum of two VMs is required. This architecture was used for demonstration purposes, and each VM was assigned a full GPU. It is recommended that each organization evaluate its needs and choose the correct architectural approach for executing Deep Learning workflows. The following sections describe each of the three approaches in further detail to help guide architectural choices:

Single VM – Full or Fractionalized Partitioned vGPU

GPU partitioning is particularly beneficial for workloads that do not fully saturate the GPU’s compute or memory capacity. Some AI GPU workloads do not require a full GPU. For example, if you are giving a demo, building proof-of-concept code, or testing a smaller model, you do not need the 40 GB of GPU memory offered by the NVIDIA A100 Tensor Core GPU. Without GPU partitioning, a user doing this type of work would have an entire GPU allocated, whether they are using it or not. Partitioning improves GPU utilization for small to medium sized workloads that would otherwise underutilize the GPU; examples include Deep Learning training and inferencing workflows that use smaller datasets. GPU partitioning offers an efficient way to try different hyperparameters, but because it is highly dependent on the size of the data and model, users may need to decrease batch sizes. GPUs are partitioned using one of two NVIDIA GPU technologies: NVIDIA vGPU software temporal partitioning or Multi-Instance GPU (MIG) spatial partitioning. Please refer to the GPU Partitioning technical brief to understand the differences. For more information on how to deploy MIG partitions, please refer to the NVIDIA AI Enterprise Deployment Guide.
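As a rough sizing heuristic for choosing a partition size (this is an illustrative assumption, not an NVIDIA formula; the 4x multiplier covers weights, gradients, and two Adam optimizer moments in fp32, and ignores activations and workspace memory), a model's training footprint can be estimated like this:

```python
def training_memory_gb(num_params, bytes_per_param=4, copies=4):
    """Rough training footprint: weights + gradients + two Adam moments,
    each stored once in fp32 (activations and workspace excluded)."""
    return num_params * bytes_per_param * copies / 1e9

def fits_in_profile(num_params, framebuffer_gb):
    """True if the rough estimate fits in a vGPU profile's frame buffer."""
    return training_memory_gb(num_params) <= framebuffer_gb

# A 100M-parameter model needs roughly 1.6 GB for weights and optimizer
# state, so a fractional profile such as a 10 GB partition may suffice.
print(training_memory_gb(100_000_000))   # → 1.6
print(fits_in_profile(100_000_000, 10))  # → True
```

If the estimate approaches the profile's frame buffer, either a larger partition or a smaller batch size is needed, as noted above.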


Ensure the VMs are configured to use the proper NUMA affinity.


Single VM – Multiple NVIDIA NVLink vGPU Devices

NVIDIA AI Enterprise supports peer-to-peer computing, where multiple GPUs are connected through NVIDIA NVLink. This enables a high-speed, direct GPU-to-GPU interconnect that provides higher bandwidth for multi-GPU system configurations than traditional PCIe-based solutions. The following graphic illustrates peer-to-peer NVLink.


Peer-to-peer CUDA transfers over NVLink are supported on Linux only. Peer-to-peer communication is restricted to a single VM; it does not operate between multiple VMs. There is no SLI support; therefore, only CUDA is included in this support, not graphics. Peer-to-peer CUDA transfers over NVLink are supported only on a subset of vGPUs, hypervisor releases, and guest OS releases, and peer-to-peer over PCIe is unsupported. Non-MIG and C-series full frame buffer (1:1) vGPU profiles are supported with NVLink. Refer to the latest vGPU release notes for a list of supported GPUs.


For servers with more than four GPUs, multi-vGPU configurations support only GPU passthrough configurations that are manually configured on the recommended NUMA nodes.

Multiple Nodes (VMs)

Scaling across multiple GPUs enables quicker training times; however, training libraries need to support inter-GPU communication, which can add overhead. Frequently, users are required to modify training code to facilitate inter-GPU communication. Horovod is an open-source library that enables multi-node training out of the box and is included in the NVIDIA AI Enterprise container image of TensorFlow. Horovod uses advanced algorithms and can leverage features of high-performance networks such as RDMA with ATS.
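Horovod's documented core primitive is a bandwidth-efficient ring-allreduce that averages gradients across workers. The following pure-Python sketch is an illustration of the algorithm only, not Horovod's actual implementation; it simulates the reduce-scatter and allgather phases for N workers:

```python
def ring_allreduce(vectors):
    """Simulate ring-allreduce: every worker ends up with the element-wise
    mean of all workers' gradient vectors. vectors[i] is worker i's gradient."""
    n = len(vectors)
    length = len(vectors[0])
    bounds = [length * k // n for k in range(n + 1)]
    # Each worker splits its vector into n chunks.
    chunks = [[list(v[bounds[j]:bounds[j + 1]]) for j in range(n)]
              for v in vectors]

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, list(chunks[i][(i - step) % n]))
                 for i in range(n)]
        for src, idx, data in sends:
            dst = (src + 1) % n
            chunks[dst][idx] = [a + b for a, b in zip(chunks[dst][idx], data)]

    # Phase 2: allgather. Each worker circulates its completed chunk
    # around the ring until every worker holds every summed chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(chunks[i][(i + 1 - step) % n]))
                 for i in range(n)]
        for src, idx, data in sends:
            chunks[(src + 1) % n][idx] = data

    # Average so every worker holds the same mean gradient.
    return [[x / n for chunk in worker for x in chunk] for worker in chunks]

grads = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
result = ring_allreduce(grads)
print(result[0])  # every worker holds the same mean gradient
```

Each worker sends and receives only 2·(N−1)/N of the data per allreduce, which is why the ring pattern scales well as nodes are added.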

With this guide, you will become familiar with configuring RDMA with ATS and executing high-performance multi-node training across two VMs. Using the steps outlined in this guide, enterprises can scale up to as many VMs as needed using a tool such as Horovod.
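As an illustration, a two-VM run like the one in this guide could be launched with Horovod's standard `horovodrun` launcher; the hostnames, slot counts, and script name below are placeholders to adjust for your environment:

```shell
# Launch 2 total processes, one per VM, using host:slots syntax.
horovodrun -np 2 -H vm1:1,vm2:1 python train.py
```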


Horovod enables multi-GPU as well as multi-node training; therefore, the same code can be used to scale up with minor tweaks, such as to the batch size.
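A common convention when scaling out (the linear scaling rule, a widely used heuristic rather than something mandated by Horovod) is to keep the per-GPU batch size fixed, so the effective batch size grows with the number of workers, and to scale the learning rate by the same factor:

```python
def scaled_hyperparams(per_gpu_batch, base_lr, num_workers):
    """Effective batch size and linearly scaled learning rate for
    num_workers GPUs/VMs, per the common linear scaling heuristic."""
    return per_gpu_batch * num_workers, base_lr * num_workers

# Two VMs with one A100 each: the effective batch doubles, and so does
# the learning rate.
print(scaled_hyperparams(256, 0.01, 2))  # → (512, 0.02)
```

A learning-rate warmup over the first few epochs is often paired with this rule when the scaling factor is large.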

The graphic below illustrates the multi-node architecture used. The server runs VMware vSphere 7.0 Update 2 and hosts two VMs. Because the server has two A100 GPUs, each VM was assigned a full A100-40C vGPU profile.



This configuration requires NVIDIA AI Enterprise 2.0 or later.


RDMA with ATS

RDMA allows direct memory access from the memory of one computer to the memory of another computer without involving the operating system or CPU. NVIDIA ConnectX network adapters are certified for VMware over RDMA, which offloads CPU communication tasks to boost application performance and improve infrastructure return on investment. Network protocols such as InfiniBand and RDMA over Converged Ethernet (RoCE) support RDMA. Within this guide, we will configure an NVIDIA ConnectX-6 Dx for RoCE. RDMA with ATS provides the GPU with complete access to CPU memory. Within this guide, we will enable ATS on the NVIDIA ConnectX-6 Dx NIC, VMware ESXi, and the virtual machines.
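For reference, enabling ATS in the ConnectX-6 Dx firmware is done with the NVIDIA `mlxconfig` tool; the device path below is a placeholder (list your devices with `mst status`), and a host reboot is needed for the firmware change to take effect. This sketch covers the NIC firmware step only; the ESXi and VM settings are configured separately:

```shell
# Start the Mellanox Software Tools service and list devices.
mst start
mst status

# Enable Address Translation Services in the NIC firmware
# (replace the device path with the one reported by mst status).
mlxconfig -d /dev/mst/mt4125_pciconf0 set ATS_ENABLED=true
```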

NVIDIA Container Toolkit

The NVIDIA Container Toolkit allows users to build and run GPU-accelerated Docker containers. The toolkit includes a container runtime library and utilities that automatically configure containers to leverage NVIDIA GPUs. Complete documentation and frequently asked questions are available on the repository wiki.
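With the toolkit installed, a GPU-accelerated container is started by passing `--gpus` to Docker; the image name and tag below are placeholders for the NVIDIA AI Enterprise TensorFlow image available to your organization on NGC:

```shell
# Run an interactive TensorFlow container with access to all GPUs
# (substitute the image name/tag pulled from your NGC registry).
docker run --gpus all -it --rm nvcr.io/nvaie/tensorflow:<tag> bash
```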


© Copyright 2024, NVIDIA. Last updated on Apr 2, 2024.