Created on Jun 30, 2019 by Boris Kovalev
Introduction
This Reference Deployment Guide (RDG) provides how to install and configure ML environments with GPUDirect RDMA, NVIDIA ConnectX®-4/5 VPI PCI Express Adapter Cards, NVIDIA Spectrum with ONYX OS and running RoCE over a lossless network, in DSCP-based QoS mode.
This guide assumes VMware ESXi 6.7 Update 1 native and NVIDIA Onyx™ version 3.6.8190 and above.
References
- What is RDMA over Converged Ethernet (RoCE)?
- Lossless RoCE Configuration for NVIDIA ONYX Switches in DSCP-Based QoS Mode
- NVIDIA OFED GPUDirect RDMA
- Recommended Network Configuration Examples for RoCE Deployment
- Scaling HPC and ML with GPUDirect RDMA on vSphere 6.7
- How to Enable Nvidia V100 GPU in Passthrough mode on vSphere for Machine Learning and Other HPC Workloads - Virtualize A…
- GitHub - NVIDIA/nvidia-docker: Build and run Docker containers leveraging NVIDIA GPUs
- GitHub - uber/horovod: Distributed training framework for TensorFlow, Keras, and PyTorch.
- vSphere Command-Line Interface Concepts and Examples
Overview
NVIDIA Accelerated Machine Learning
NVIDIA Solutions accelerate many of the world’s leading artificial intelligence and machine learning platforms and wide range of applications, ranging from security, finance, and image and voice recognition, to self-driving cars and smart cities. NVIDIA solutions enable companies and organizations such as Baidu, NVIDIA, JD.com, Facebook, PayPal and more to leverage machine learning platforms to enhance their competitive advantage.
In this post we will show how to build most efficient Machine Learning cluster enhanced by RoCE over 100 Gbps Ethernet network.
Device Partitioning (SR-IOV)
The PCI standard includes a specification for Single Root I/O Virtualization (SR-IOV).
A single PCI device can present as multiple logical devices (Virtual Functions or VFs) to ESX and to VMs.
An ESXi driver and a guest driver are required for SR-IOV.
NVIDIA supports ESXi SR-IOV for both InfiniBand and RoCE interconnects.
How To Configure SR-IOV for NVIDIA ConnectX-4/5 adapter cards family on ESXi 6.5/6.7 Server (Native Ethernet)
Downsides: No vMotion, No Snapshots.
VM Direct Path I/O
Allows PCI devices to be accessed directly by guest OS.
- Examples: GPUs for computation (GPGPU), ultra-low latency interconnects like InfiniBand and RoCE.
Full device is made available to a single VM – no sharing.
No ESXi driver required – just the standard vendor device driver.
How To Configure Nvidia GPU device into and from VMDirectPath I/O passthrough mode on VMware ESXi 6.x server
Downsides: No vMotion, No Snapshots.
NVIDIA OFED GPUDirect RDMA
GPUDirect RDMA is an API between IB CORE and peer memory clients, such as NVIDIA Tesla (Volta, Pascal) class GPU's. It provides access the HCA to read/write peer memory data buffers, as a result it allows RDMA-based applications to use the peer device computing power with the RDMA interconnect without the need to copy data to host memory. It works seamlessly using RoCE technology with the NVIDIA ConnectX®-4 and later VPI adapters.
The latest advancement in GPU-GPU communications is GPUDirect RDMA. This new technology provides a direct P2P (Peer-to-Peer) data path between the GPU Memory directly to/from the NVIDIA HCA devices. This provides a significant decrease in GPU-GPU communication latency and completely offloads the CPU, removing it from all GPU-GPU communications across the network.
Hardware and Software Requirements
1. A server platform with an adapter card based on one of the following NVIDIA ConnectX®-4/5 HCA devices.
2. A switch is one of the following NVIDIA switches:
3. VMware vSphere 6.7 u1 Cluster installed and configured.
4. VMware vCenter 6.7 u1 installed and configured.
5 . For Using GPUs Based on the Pascal and Volta Architectures in Pass-Through Mode
- The Tesla V100, Tesla P100, and Tesla P6 GPUs require 32 GB of MMIO space in pass-through mode.
- Pass through of GPUs with large BAR memory settings has some restrictions on VMware ESXi:
- The guest OS must be a 64-bit OS.
- 64-bit MMIO and EFI boot must be enabled for the VM.
- The guest OS must be able to be installed in EFI boot mode - to access these large memory mappings.
6. NVIDIA® Driver.
7. Installer Privileges: The installation requires administrator privileges on the target machine.
Setup Overview
Before you start, make sure you are familiar with VMware vSphere and vCenter deploy and manage procedures.
This guide does not contain step-by-step instructions for performing all of the required standard vSphere and vCenter installation and configuration tasks because they often depend on customer requirements.
Make sure you are aware of the Uber Horovod distributed training framework, see GitHub - uber/horovod: Distributed training framework for TensorFlow, Keras, and PyTorch for more info.
In the distributed TensorFlow/Horovod configuration described in this guide, we are using the following hardware specification.
Equipment
Logical Design
Bill of Materials - BOM
In the distributed ML configuration described in this guide, we are using the following hardware specification.

Physical Network Connections
vSphere Cluster Design
Network Configuration
In our reference we will use a single port per server. In case of a single port NIC we will wire the available port. In case of dual port NIC we will wire the 1st port to an Ethernet switch and will not use the 2nd port.
We will cover the procedure later in the Installing NVIDIA OFED section.
Each server is connected to the SN2700 switch by a 100GbE copper cable.
The switch port connectivity in our case is as follow:
- 1st -8nd ports – connected to ESXi Servers
Server names with network configuration provided below:
Server type | Server name | IP and NICS | |
---|---|---|---|
Internal network - | Management network - | ||
100 GigE | 1 GigE | ||
Node 01 | clx-mld-41 | enp1f0: 31.31.31.41 | eno0: From DHCP (reserved) |
Node 02 | clx-mld-42 | enp1f0: 31.31.31.42 | eno0: From DHCP (reserved) |
Node 03 | clx-mld-43 | enp1f0: 31.31.31.43 | eno0: From DHCP (reserved) |
Node 04 | clx-mld-44 | enp1f0: 31.31.31.44 | eno0: From DHCP (reserved) |
Node 05 | clx-mld-45 | enp1f0: 31.31.31.45 | eno0: From DHCP (reserved) |
Node 06 | clx-mld-46 | enp1f0: 31.31.31.46 | eno0: From DHCP (reserved) |
Node 07 | clx-mld-47 | enp1f0: 31.31.31.47 | eno0: From DHCP (reserved) |
Node 08 | clx-mld-48 | enp1f0: 31.31.31.48 | eno0: From DHCP (reserved) |
Network Switch Configuration
Please start from the How To Get Started with NVIDIA switches guide if you don't familiar with NVIDIA switch software.
For more information please refer to the NVIDIA ONYX User Manual located at support.mellanox.com orwww.mellanox.com -> Switches

We will accelerate HPC or ML application cluster by using RDMA transport.
There are several industry standard network configuration for RoCE deployment.
You are welcome to follow the Recommended Network Configuration Examples for RoCE Deployment for our recommendations and instructions.
In our deployment we will configure our network to be lossless and will use DSCP on host and switch sides:
- For switch side please configure your switch accordingly with the Lossless RoCE Configuration for NVIDIA ONYX Switches in DSCP-Based QoS Mode document
- The host side will be covered later in the Installing MLNX_OFED for Ubuntu on the Master and Workers section.
Below is our switch configuration you can use as reference. You can copy/paste it to you switch but please be aware that this is clean switch configuration and if you may corrupt your existing configuration:
swx-mld-1-2 [standalone: master] > enable swx-mld-1-2 [standalone: master] # configure terminal swx-mld-1-2 [standalone: master] (config) # show running-config
## ## Running database "initial" ## Generated at 2018/03/10 09:38:38 +0000 ## Hostname: swx-mld-1-2 ## ## ## Running-config temporary prefix mode setting ## no cli default prefix-modes enable ## ## License keys ## license install LK2-RESTRICTED_CMDS_GEN2-44T1-4H83-RWA5-G423-GY7U-8A60-E0AH-ABCD ## ## Interface Ethernet buffer configuration ## traffic pool roce type lossless traffic pool roce memory percent 50.00 traffic pool roce map switch-priority 3 ## ## LLDP configuration ## lldp ## ## QoS switch configuration ## interface ethernet 1/1-1/32 qos trust L3 interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500 ## ## DCBX ETS configuration ## interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict ## ## Other IP configuration ## hostname swx-mld-1-2 ## ## AAA remote server configuration ## # ldap bind-password ******** # radius-server key ******** # tacacs-server key ******** ## ## Network management configuration ## # web proxy auth basic password ******** ## ## X.509 certificates configuration ## # # Certificate name system-self-signed, ID 108bb9eb3e99edff47fc86e71cba530b6a6b8991 # (public-cert config omitted since private-key config is hidden) ## ## Persistent prefix mode setting ## cli default prefix-modes enable
Environment Preparation
1. Host BIOS
- Enable the “above 4G decoding” or “memory mapped I/O above 4GB” or “PCI 64-bit resource handing above 4G” in hosts BIOS.
- Make sure that SR-IOV is enabled.
- Make sure that "Intel Virtualization Technology" is enabled.
2. ESXi Host Software
ESXi host is included:

3. VM Template preparation
VM Template is included:
3.1. Configuring EFI Boot Mode.
Before installing the Guest OS into the VM, ensure that the “EFI” is enabled in the Firmware area.
For correct GPU use a guest operating system within the virtual machine must boot in "EFI" mode.
To access the setting for this:
- Right-click on the Virtual Machine and click Edit Settings.
- Click on VM Options.
- Click onBoot Options
- Select EFI in Firmware area.
3.2. Installing the Guest Operating System in the VM.
Install the Ubuntu 16.04 as Guest Operating System into the virtual machine.
3.3. Install Nvidia driver in the VM.
The standard vendor GPU driver must also be installed within the guest operating system.
3.4. Configure SR-IOV for NVIDIA ConnectX® 5 adapter card and Add a Network Adapter to the VM in SR-IOV Mode.

3.5. Configure Nvidia GPU device into VMDirectPath I/O passthrough mode and Assign a GPU Device to the VM.

3.6. Adjusting the Memory Mapped I/O Settings for the VM.
With the above requirements satisfied, two entries must be added to the VM’s VMX file, either by modifying the file directly or by using the vSphere client to add these capabilities. The first entry is:
pciPassthru.use64bitMMIO=“TRUE”
Specifying the 2nd entry requires a simple calculation. Sum the GPU memory sizes of all GPU devices(*) you intend to pass into the VM and then round up to the next power of two. For example, to use passthrough with two 16 GB P100 devices, the value would be: 16 + 16 = 32, rounded up to the next power of two to yield 64. Use this value in the 2nd entry:
pciPassthru.64bitMMIOSizeGB=“64”
With these two changes to the VMX file, follow the vSphere instructions for enabling passthrough devices at the host level and for specifying which devices should be passed into your VM. The VM should now boot correctly with your device(s) in passthrough mode.
3.7. Install NVIDIA OFED into Virtual Machine Template.

Done !
( Optional ) - Deploy and Run a Horovod framework
Docker installing and configured into VM Template.
Uninstall old versions
To uninstall old versions, we recommend run following command:
sudo apt-get remove docker docker-engine docker.io
It’s OK if apt-get reports that none of these packages are installed.
The contents of /var/lib/docker/, including images, containers, volumes, and networks, are preserved.
For Ubuntu 16.04 and higher, the Linux kernel includes support for OverlayFS, and Docker CE will use the overlay2 storage driver by default.
Before you install Docker CE for the first time on a new host machine, you need to set up the Docker repository. Afterward, you can install and update Docker from the repository.
Set Up the repository
Update the apt package index:
sudo apt-get update
Install packages to allow apt to use a repository over HTTPS:
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
Add Docker’s official GPG key:
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88:
sudo apt-key fingerprint 0EBFCD88 pub 4096R/0EBFCD88 2017-02-22 Key fingerprint = 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88 uid Docker Release (CE deb) <docker@docker.com> sub 4096R/F273FCD8 2017-02-22
Install Docker CE
Install the latest version of Docker CE, or go to the next step to install a specific version. Any existing installation of Docker is replaced:
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu$(lsb_release -cs) stable" $ sudo apt-get update $ sudo apt-get install docker-ce
Customize the docker0 bridge
The recommended way to configure the Docker daemon is to use the daemon.json file, which is located in /etc/docker/ on Linux. If the file does not exist, create it. You can specify one or more of the following settings to configure the default bridge network:
{ "bip": "172.16.41.1/24", "fixed-cidr": "172.16.41.0/24", "mtu": 1500, "dns": ["8.8.8.8","8.8.4.4"] }
The same options are presented as flags to dockerd, with an explanation for each:
- --bip=CIDR: supply a specific IP address and netmask for the docker0 bridge, using standard CIDR notation. For example: 172.16.41.1/16.
- --fixed-cidr=CIDR: restrict the IP range from the docker0 subnet, using standard CIDR notation. For example: 172.16.41.0/16 .
- --mtu=BYTES: override the maximum packet length on docker0. For example: 1500 .
- --dns=[]: The DNS servers to use. For example: --dns=8.8.8.8,8.8.4.4.
Restart Docker after making changes to the daemon.json file:
sudo /etc/init.d/docker restart
Set communicating to the outside world
Check ip forwarding is enabled in kernel:
sysctl net.ipv4.conf.all.forwarding net.ipv4.conf.all.forwarding = 1
If disabled:
net.ipv4.conf.all.forwarding = 0
Please enable and check again:
sysctl net.ipv4.conf.all.forwarding=1
For security reasons, Docker configures the iptables rules to prevent traffic forwarding to containers from outside the host machine. Docker sets the default policy of the FORWARD chain to DROP .
To override this default behavior you can manually change the default policy:
sudo iptables -P FORWARD ACCEPT
Add IP route with specific subnet
On each host you shall add routing to container subnet on other hosts. Please see example for routing to be added on one host-41 :
sudo ip route add 172.16.42.0/24 via 31.13.13.42 sudo ip route add 172.16.43.0/24 via 13.13.13.43 sudo ip route add 172.16.44.0/24 via 13.13.13.44
A quick check on each host
Give your environment a quick test by spawning simple container:
docker run hello-world
Nvidia-docker Deploy into VM Template.
To deploy nvidia-docker on Ubuntu 16.04 please go by following steps:
- If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers:
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f sudo apt-get purge -y nvidia-docker
Add the package repositories:
VM CLIcurl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \ sudo apt-key add - distribution=$(. /etc/os-release;echo $ID$VERSION_ID) curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \ sudo tee /etc/apt/sources.list.d/nvidia-docker.list sudo apt-get updatee
Install nvidia-docker2 and reload the Docker daemon configuration:
VM CLIsudo apt-get install -y nvidia-docker2 sudo pkill -SIGHUP dockerd
Test nvidia-smi with the latest official CUDA image:
VM CLIdocker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Horovod Deploy into VM Template.
The procedure is explain how to build and run a Horovod framework in Docker Container.
Install additional packages:VM CLIsudo apt install libibverbs-dev sudo apt install libmlx5-dev
- Install NVIDIA OFED by the link - How to Installing NVIDIA OFED on Linux.
Horovod VGG 16 Benchmark Results.
Horovod benchmark was ran by the link.
Done!
About the Authors
Boris Kovalev For the past several years, Boris Kovalev has worked as a solution architect at NVIDIA, responsible for complex machine learning and advanced VMware-based cloud research and design. Previously, he spend more than 15 years as a senior consultant and solution architect at multiple companies, most recently at VMware. He's written multiple reference designs covering VMware, machine learning, Kubernetes, and container solutions, which are available at the NVIDIA Documents website. |
Related Documents