Created on Jun 30, 2019 by Boris Kovalev

Introduction

This Reference Deployment Guide (RDG) describes how to install and configure an ML environment with GPUDirect RDMA, NVIDIA ConnectX®-4/5 VPI PCI Express adapter cards, and NVIDIA Spectrum switches running ONYX OS, with RoCE running over a lossless network in DSCP-based QoS mode.

This guide assumes VMware ESXi 6.7 Update 1 with the native driver and NVIDIA Onyx™ version 3.6.8190 or above.

References

Overview

NVIDIA Accelerated Machine Learning

NVIDIA Solutions accelerate many of the world’s leading artificial intelligence and machine learning platforms and a wide range of applications, ranging from security, finance, and image and voice recognition to self-driving cars and smart cities. NVIDIA solutions enable companies and organizations such as Baidu, NVIDIA, JD.com, Facebook, PayPal and more to leverage machine learning platforms to enhance their competitive advantage.
In this post we will show how to build a highly efficient machine learning cluster enhanced by RoCE over a 100 Gbps Ethernet network.

Device Partitioning (SR-IOV)

The PCI standard includes a specification for Single Root I/O Virtualization (SR-IOV).
A single PCI device can present as multiple logical devices (Virtual Functions or VFs) to ESX and to VMs.
An ESXi driver and a guest driver are required for SR-IOV.
NVIDIA supports ESXi SR-IOV for both InfiniBand and RoCE interconnects.
How To Configure SR-IOV for NVIDIA ConnectX-4/5 adapter cards family on ESXi 6.5/6.7 Server (Native Ethernet)
Downsides: no vMotion, no snapshots.
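
As a hedged illustration (this step is covered in detail by the post referenced above), SR-IOV must also be enabled in the adapter firmware. With the NVIDIA MFT tools installed, the firmware change looks roughly like this; the device name below is an example and will differ on your system:

# Query the current SR-IOV settings of the adapter firmware (example device name)
mlxconfig -d /dev/mst/mt4119_pciconf0 query
# Enable SR-IOV and expose up to 8 Virtual Functions; a host reboot is required afterwards
mlxconfig -d /dev/mst/mt4119_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=8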

VM Direct Path I/O

Allows PCI devices to be accessed directly by the guest OS.

  • Examples: GPUs for computation (GPGPU), ultra-low latency interconnects like InfiniBand and RoCE.

The full device is made available to a single VM – no sharing.
No ESXi driver is required – just the standard vendor device driver.
How To Configure Nvidia GPU device into and from VMDirectPath I/O passthrough mode on VMware ESXi 6.x server
Downsides: no vMotion, no snapshots.

NVIDIA OFED GPUDirect RDMA

GPUDirect RDMA is an API between IB CORE and peer memory clients, such as NVIDIA Tesla (Volta, Pascal) class GPUs. It enables the HCA to read and write peer memory data buffers directly; as a result, RDMA-based applications can use the peer device's computing power with the RDMA interconnect without needing to copy data to host memory. It works seamlessly over RoCE with NVIDIA ConnectX®-4 and later VPI adapters.
GPUDirect RDMA is the latest advancement in GPU-GPU communications. It provides a direct peer-to-peer (P2P) data path between GPU memory and the NVIDIA HCA devices. This significantly decreases GPU-GPU communication latency and completely offloads the CPU, removing it from all GPU-GPU communications across the network.
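
As a quick, hedged sanity check inside a VM (assuming NVIDIA OFED, the GPU driver and the nv_peer_mem module are already installed, and assuming the device names shown below), you can verify that the GPUDirect RDMA path is in place roughly as follows:

# The peer-memory kernel module must be loaded for GPUDirect RDMA
lsmod | grep nv_peer_mem
# List the RDMA devices and their link layer (should show Ethernet for RoCE)
ibv_devinfo | grep -e hca_id -e link_layer
# Show the GPU/NIC PCIe topology as seen by the NVIDIA driver
nvidia-smi topo -m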

Hardware and Software Requirements

1. A server platform with an adapter card from the NVIDIA ConnectX®-4/5 HCA device family.
2. An NVIDIA Spectrum-based switch (an SN2700 is used in this guide) running NVIDIA Onyx.

3. VMware vSphere 6.7 u1 Cluster installed and configured.
4. VMware vCenter 6.7 u1 installed and configured.
5. For using GPUs based on the Pascal and Volta architectures in pass-through mode, see the memory mapped I/O settings described in section 3.6.

6. NVIDIA® Driver.
7. Installer Privileges: The installation requires administrator privileges on the target machine. 

Setup Overview

Before you start, make sure you are familiar with VMware vSphere and vCenter deployment and management procedures.
This guide does not contain step-by-step instructions for performing all of the required standard vSphere and vCenter installation and configuration tasks because they often depend on customer requirements.
Make sure you are familiar with the Uber Horovod distributed training framework; see GitHub - uber/horovod: Distributed training framework for TensorFlow, Keras, and PyTorch for more information.
In the distributed TensorFlow/Horovod configuration described in this guide, we are using the following hardware specification.

Equipment

Logical Design

Bill of Materials - BOM

In the distributed ML configuration described in this guide, we are using the following hardware specification.


Note: This document does not cover the servers' storage aspect. You should configure the servers with the storage components appropriate to your use case (data set size).

Physical Network Connections

 
 

vSphere Cluster Design

Network Configuration

In our reference we use a single port per server. With a single-port NIC, we wire the available port; with a dual-port NIC, we wire the first port to the Ethernet switch and leave the second port unused.
We will cover the procedure later in the Installing NVIDIA OFED section.
Each server is connected to the SN2700 switch by a 100GbE copper cable.
The switch port connectivity in our case is as follows:

  • Ports 1-8 – connected to the ESXi servers

Server names and network configuration are provided below:

Server type | Server name | Internal network - 100 GigE | Management network - 1 GigE
Node 01     | clx-mld-41  | enp1f0: 31.31.31.41         | eno0: From DHCP (reserved)
Node 02     | clx-mld-42  | enp1f0: 31.31.31.42         | eno0: From DHCP (reserved)
Node 03     | clx-mld-43  | enp1f0: 31.31.31.43         | eno0: From DHCP (reserved)
Node 04     | clx-mld-44  | enp1f0: 31.31.31.44         | eno0: From DHCP (reserved)
Node 05     | clx-mld-45  | enp1f0: 31.31.31.45         | eno0: From DHCP (reserved)
Node 06     | clx-mld-46  | enp1f0: 31.31.31.46         | eno0: From DHCP (reserved)
Node 07     | clx-mld-47  | enp1f0: 31.31.31.47         | eno0: From DHCP (reserved)
Node 08     | clx-mld-48  | enp1f0: 31.31.31.48         | eno0: From DHCP (reserved)

Network Switch Configuration

Please start with the How To Get Started with NVIDIA switches guide if you are not familiar with NVIDIA switch software.

For more information, please refer to the NVIDIA ONYX User Manual located at support.mellanox.com or www.mellanox.com -> Switches.

Note: As a first step, update your switch OS to the latest Onyx OS software using the community guide HowTo Upgrade NVIDIA ONYX Software on NVIDIA switch systems.

We will accelerate the HPC or ML application cluster by using the RDMA transport.

There are several industry-standard network configurations for RoCE deployment.
You are welcome to follow the Recommended Network Configuration Examples for RoCE Deployment post for our recommendations and instructions.
In our deployment we will configure the network to be lossless and use DSCP on both the host and switch sides.

Below is our switch configuration, which you can use as a reference. You can copy and paste it to your switch, but be aware that this is a clean switch configuration and it may overwrite your existing configuration:

swx-mld-1-2 [standalone: master] > enable
swx-mld-1-2 [standalone: master] # configure terminal
swx-mld-1-2 [standalone: master] (config) # show running-config
##
## Running database "initial"
## Generated at 2018/03/10 09:38:38 +0000
## Hostname: swx-mld-1-2
##
##
## Running-config temporary prefix mode setting
##
no cli default prefix-modes enable

##
## License keys
##
license install LK2-RESTRICTED_CMDS_GEN2-44T1-4H83-RWA5-G423-GY7U-8A60-E0AH-ABCD

##
## Interface Ethernet buffer configuration
##
traffic pool roce type lossless
traffic pool roce memory percent 50.00
traffic pool roce map switch-priority 3

##
## LLDP configuration
##
lldp
##
## QoS switch configuration
##
interface ethernet 1/1-1/32 qos trust L3
interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

##
## DCBX ETS configuration
##
interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict


##
## Other IP configuration
##
hostname swx-mld-1-2

##
## AAA remote server configuration
##
# ldap bind-password ********
# radius-server key ********
# tacacs-server key ********

##
## Network management configuration
##
# web proxy auth basic password ********

##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID 108bb9eb3e99edff47fc86e71cba530b6a6b8991
# (public-cert config omitted since private-key config is hidden)

##
## Persistent prefix mode setting
##
cli default prefix-modes enable
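
On the host/VM side, the matching DSCP configuration is described in the RoCE deployment post referenced above. As a minimal sketch only (assuming NVIDIA OFED is installed in the VM, the RoCE interface is named ens160, and DSCP 26 / ToS 106 maps to the lossless priority 3 configured on the switch), it looks roughly like this:

# Trust DSCP markings on the RoCE interface (interface name is an example)
mlnx_qos -i ens160 --trust dscp
# Mark RDMA-CM traffic with ToS 106 (DSCP 26) on the mlx5_0 device
cma_roce_tos -d mlx5_0 -t 106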

Environment Preparation

1. Host BIOS

  • Enable the "above 4G decoding", "memory mapped I/O above 4GB", or "PCI 64-bit resource handling above 4G" option in the host's BIOS.
  • Make sure that SR-IOV is enabled.
  • Make sure that "Intel Virtualization Technology" is enabled.

2. ESXi Host Software

The ESXi host software includes:

Note: The ConnectX driver installation procedure on the ESXi host is explained here.
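
As a hedged sketch of that procedure (the bundle file name below is an example; use the package that matches your ESXi build), the native driver is installed on the ESXi host with esxcli, followed by a reboot:

# Install the NVIDIA (Mellanox) native driver offline bundle (example path and version)
esxcli software vib install -d /tmp/MLNX-NATIVE-ESX-ConnectX-4-5_4.17.70.1.zip
# After rebooting the host, confirm the nmlx5 modules and NICs are present
esxcli software vib list | grep nmlx
esxcli network nic list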

3. VM Template preparation

The VM template includes:

3.1. Configuring EFI Boot Mode.

Before installing the guest OS into the VM, ensure that "EFI" is selected in the Firmware area.
For correct GPU operation, the guest operating system within the virtual machine must boot in "EFI" mode.
To access the setting for this:

  1. Right-click on the Virtual Machine and click Edit Settings.
  2. Click on VM Options.
  3. Click on Boot Options.
  4. Select EFI in the Firmware area.

3.2. Installing the Guest Operating System in the VM.

Install Ubuntu 16.04 as the guest operating system in the virtual machine.

3.3. Install Nvidia driver in the VM.

The standard vendor GPU driver must also be installed within the guest operating system.
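
A minimal sketch for Ubuntu 16.04, assuming you install from the graphics-drivers PPA rather than the .run installer (the driver version below is an example; use the version recommended for your GPU):

# Add the graphics-drivers PPA and install an NVIDIA driver (version is an example)
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install -y nvidia-410
# Reboot the VM, then verify that the driver sees the GPU
nvidia-smi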

3.4. Configure SR-IOV for NVIDIA ConnectX® 5 adapter card and Add a Network Adapter to the VM in SR-IOV Mode.

Note: This post describes how to configure the NVIDIA ConnectX-5 ESXi 6.7 native driver with SR-IOV (Ethernet) and how to add the network adapter to the VM in SR-IOV mode.
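
A hedged outline of the ESXi-side part of that post (the Virtual Function count is an example): enable VFs on the nmlx5_core native driver, reboot the host, and then add a network adapter of type "SR-IOV passthrough" to the VM in the vSphere client.

# Expose 8 Virtual Functions per port on the ConnectX-5 native driver (example value)
esxcli system module parameters set -m nmlx5_core -p "max_vfs=8"
# After a host reboot, verify the parameter and that the VFs are visible
esxcli system module parameters list -m nmlx5_core | grep max_vfs
lspci | grep -i mellanox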

3.5. Configure Nvidia GPU device into VMDirectPath I/O passthrough mode and Assign a GPU Device to the VM.

Note: This post describes how to configure an Nvidia GPU device into and out of VMDirectPath I/O passthrough mode on a VMware ESXi 6.x server and how to assign the GPU device to the VM.
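
Once the GPU is passed through and the VM is powered on, a quick check inside the guest (assuming the guest driver from step 3.3 is installed):

# The passthrough GPU should appear as a regular PCI device in the guest
lspci | grep -i nvidia
# And be usable by the NVIDIA driver stack
nvidia-smi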

3.6. Adjusting the Memory Mapped I/O Settings for the VM.

With the above requirements satisfied, two entries must be added to the VM’s VMX file, either by modifying the file directly or by using the vSphere client to add these capabilities. The first entry is:


pciPassthru.use64bitMMIO="TRUE"

Specifying the 2nd entry requires a simple calculation. Sum the GPU memory sizes of all GPU devices(*) you intend to pass into the VM and then round up to the next power of two. For example, to use passthrough with two 16 GB P100 devices, the value would be: 16 + 16 = 32, rounded up to the next power of two to yield 64. Use this value in the 2nd entry:

pciPassthru.64bitMMIOSizeGB="64"

With these two changes to the VMX file, follow the vSphere instructions for enabling passthrough devices at the host level and for specifying which devices should be passed into your VM. The VM should now boot correctly with your device(s) in passthrough mode.

3.7. Install NVIDIA OFED into Virtual Machine Template.

Note: This post describes how to install NVIDIA OFED on Linux.
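
A minimal sketch of that installation, assuming the MLNX_OFED tarball for Ubuntu 16.04 has already been downloaded into the VM (the version in the path is an example):

# Unpack and run the NVIDIA (Mellanox) OFED installer (example version)
tar -xzf MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64.tgz
cd MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64
sudo ./mlnxofedinstall
# Restart the driver stack and check the RDMA device to netdev mapping
sudo /etc/init.d/openibd restart
ibdev2netdev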


Done!

(Optional) Deploy and Run a Horovod Framework

Install and Configure Docker in the VM Template.

Uninstall old versions

To uninstall old versions, we recommend running the following command:

sudo apt-get remove docker docker-engine docker.io

It’s OK if apt-get reports that none of these packages are installed.

The contents of /var/lib/docker/, including images, containers, volumes, and networks, are preserved.
For Ubuntu 16.04 and higher, the Linux kernel includes support for OverlayFS, and Docker CE will use the overlay2 storage driver by default.
Before you install Docker CE for the first time on a new host machine, you need to set up the Docker repository. Afterward, you can install and update Docker from the repository.

Set Up the repository

Update the apt package index:

sudo apt-get update

Install packages to allow apt to use a repository over HTTPS:

sudo apt-get install apt-transport-https ca-certificates curl software-properties-common

Add Docker’s official GPG key:

sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88:

sudo apt-key fingerprint 0EBFCD88
pub 4096R/0EBFCD88 2017-02-22
Key fingerprint = 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
uid Docker Release (CE deb) <docker@docker.com>
sub 4096R/F273FCD8 2017-02-22

Install Docker CE

Install the latest version of Docker CE (or install a specific version if you need one). Any existing installation of Docker is replaced:

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce

Customize the  docker0  bridge

The recommended way to configure the Docker daemon is to use the daemon.json file, which is located in /etc/docker/ on Linux. If the file does not exist, create it. You can specify one or more of the following settings to configure the default bridge network:

{
"bip": "172.16.41.1/24",
"fixed-cidr": "172.16.41.0/24",
"mtu": 1500,
"dns": ["8.8.8.8","8.8.4.4"]
}

The same options are presented as flags to dockerd, with an explanation for each:

  • --bip=CIDR: supply a specific IP address and netmask for the docker0 bridge, using standard CIDR notation. For example: 172.16.41.1/24.
  • --fixed-cidr=CIDR: restrict the IP range from the docker0 subnet, using standard CIDR notation. For example: 172.16.41.0/24.
  • --mtu=BYTES: override the maximum packet length on docker0. For example:  1500 .
  • --dns=[]: The DNS servers to use. For example: --dns=8.8.8.8,8.8.4.4.

Restart Docker after making changes to the daemon.json file:

sudo /etc/init.d/docker restart


Set up communication to the outside world

Check that IP forwarding is enabled in the kernel:

sysctl net.ipv4.conf.all.forwarding
net.ipv4.conf.all.forwarding = 1

If disabled:

net.ipv4.conf.all.forwarding = 0

Please enable and check again:

sysctl net.ipv4.conf.all.forwarding=1

For security reasons, Docker configures the iptables rules to prevent traffic forwarding to containers from outside the host machine. Docker sets the default policy of the  FORWARD  chain to  DROP .

To override this default behavior you can manually change the default policy:

sudo iptables -P FORWARD ACCEPT
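
Both settings are lost on reboot by default. A simple way to persist the IP forwarding setting on Ubuntu 16.04:

# Persist IP forwarding across reboots
echo "net.ipv4.conf.all.forwarding=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p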


Add IP route with specific subnet

On each host, you must add routes to the container subnets of the other hosts. The example below shows the routes to be added on one host, clx-mld-41:

sudo ip route add 172.16.42.0/24 via 31.31.31.42
sudo ip route add 172.16.43.0/24 via 31.31.31.43
sudo ip route add 172.16.44.0/24 via 31.31.31.44
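
After the routes are in place, a quick hedged check from host clx-mld-41 (the container IP below is an example address from host clx-mld-42's 172.16.42.0/24 range):

# Ping a container running on clx-mld-42 from clx-mld-41
ping -c 3 172.16.42.2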


A quick check on each host

Give your environment a quick test by spawning a simple container:

docker run hello-world

Deploy nvidia-docker into the VM Template.

To deploy nvidia-docker on Ubuntu 16.04, follow these steps:

  1. If you have nvidia-docker 1.0 installed, remove it and all existing GPU containers:
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker
  2. Add the package repositories:

    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
      sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update
  3. Install nvidia-docker2 and reload the Docker daemon configuration:

    sudo apt-get install -y nvidia-docker2
    sudo pkill -SIGHUP dockerd
  4. Test nvidia-smi with the latest official CUDA image:

    docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

Deploy Horovod into the VM Template.

  1. This procedure explains how to build and run the Horovod framework in a Docker container; a hedged run example follows this list.
    Install the additional packages:

    sudo apt install libibverbs-dev
    sudo apt install libmlx5-dev
  2. Install NVIDIA OFED as described in How to Install NVIDIA OFED on Linux.
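
As a hedged sketch only (the image name, worker addresses, GPU counts and script name are examples; consult the Horovod documentation for the exact invocation that matches your version), a distributed run could look like this:

# Start a Horovod container on each node; it attaches to the docker0 bridge configured earlier (image name is an example)
docker run --runtime=nvidia -it horovod/horovod:latest bash
# Inside the container on the primary node, launch training across two workers with four GPUs each
# (worker addresses, slot counts and the training script are examples)
horovodrun -np 8 -H 172.16.41.2:4,172.16.42.2:4 python train.py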

Horovod VGG 16 Benchmark Results.

The Horovod benchmark was run as described in the link.


Done!


About the Authors



Boris Kovalev

For the past several years, Boris Kovalev has worked as a solution architect at NVIDIA, responsible for complex machine learning and advanced VMware-based cloud research and design. Previously, he spent more than 15 years as a senior consultant and solution architect at multiple companies, most recently at VMware. He has written multiple reference designs covering VMware, machine learning, Kubernetes, and container solutions, which are available at the NVIDIA Documents website.



Related Documents