Getting Started#
To leverage RDMA with ATS for high-performance compute, the following steps are outlined within this guide:
- Configure the NVIDIA ConnectX-6 Dx NIC and Spectrum switch for RoCE
- Enable ATS on VMware ESXi and virtual machines
- Enable ATS on the NVIDIA ConnectX-6 Dx NIC
- Configure NUMA affinity
- Create a Dockerfile for multi-node training
- Set up keyless entry between VMs on the multi-node cluster
- Run a sample ResNet-50 multi-node training job
Configure NVIDIA ConnectX-6 Dx NIC and Spectrum switch for RoCE#
To leverage RoCE, the NVIDIA ConnectX-6 Dx NIC must run RoCE over a lossless network in DSCP-based QoS mode. The following Knowledge Article is a helpful resource for applying this configuration: https://community.mellanox.com/s/article/lossless-roce-configuration-for-mlnx-os-switches-in-dscp-based-qos-mode
For this guide, we will reference configuration steps within the Knowledge Article for version 3.8.2008 and above.
Run the following commands on the NVIDIA switch:
switch (config) # roce
Note
The RoCE feature has been automated so that all that is needed to run RoCE on a lossless fabric is running the roce command.
Create an isolated VLAN and place the NVIDIA ConnectX NICs into the created VLAN as access ports. In this example, the four servers are connected to switch ports 1/1-1/4.
switch (config) # interface vlan 111
switch (config vlan 111) # exit
switch (config) # interface ethernet 1/1-1/4 switchport access vlan 111
Set the MTU to 9216 on the interfaces (on versions below 3.9.2110, the switch’s default MTU is 1500).
switch (config) # interface ethernet 1/1-1/4 shutdown
switch (config) # interface ethernet 1/1-1/4 mtu 9216
switch (config) # interface ethernet 1/1-1/4 no shutdown
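To sanity-check the switch configuration before moving on, the MLNX-OS show commands can be used; the exact output format varies by release, so treat this as a quick check (these commands assume MLNX-OS 3.8.2008 or later, as referenced in the Knowledge Article above):
switch (config) # show roce
switch (config) # show interfaces ethernet 1/1 status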
Optionally, if you are running Cumulus Linux, follow these instructions to enable RoCE: https://docs.cumulusnetworks.com/cumulus-linux-42/Network-Solutions/RDMA-over-Converged-Ethernet-RoCE/.
Enable ATS on VMware ESXi and VMs#
To enable Peer-2-Peer (P2P) with high performance, we will enable ATS by updating the VMKernel and then the VM configuration.
Update the VMKernel for Peer-2-Peer (P2P).
To enable the ATS boot option, invoke the following command and reboot ESXi:
esxcli system settings kernel set -s atsSupport -v TRUE
To verify the value is correct after the reboot, invoke:
esxcli system settings kernel list -o atsSupport
The output should resemble the following:
Name         Type  Configured  Runtime  Default  Description
-----------  ----  ----------  -------  -------  -----------
atsSupport   Bool  TRUE        TRUE     FALSE    Enable Support for PCIe ATS
Update the VM configuration for P2P.
Edit the VM configuration settings:
pciPassthru.allowP2P=true          # enable P2P
pciPassthru.RelaxACSforP2P=true    # update ACS capabilities in switch
Note
When relaxing ACS for P2P is enabled, VMware will locate an ATS capable passthrough device, find its parent switch, and enable the ACS Direct Translated bit. The previous restriction that all functions of peer networking devices must be given to a single VM has been removed. Each function of a peer device can be given to a separate VM.
If there are multiple GPU physical devices, the VM can specify a specific device for P2P with the existing config option:
pciPassthru0.cfg.gpu-pci-id = "ssss:bb:dd.f"
Note
The gpu-pci-id is in hex SBDF format. If the GPU is in SR-IOV mode, you should specify a VF address.
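If you need to look up the GPU's SBDF address, one option (an example, not the only method) is to filter the ESXi host PCI listing for NVIDIA devices, which prints the Address field used here:
esxcli hardware pci list | grep -B 8 "NVIDIA Corporation"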
Enable ATS on the NVIDIA ConnectX-6 Dx NIC#
Install python 2.7 with the command below:
sudo apt-get install python
Download and install MLNX_OFED (this guide uses version 5.2-2.2.4.0): https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/.
Select the OS, version, and architecture and download the tar file; for example, Ubuntu/20.04/x86_64.
Copy the downloaded package to the VMs, then run the following commands to extract and install:
tar xvf MLNX_OFED_LINUX-5.2-2.2.4.0-ubuntu20.04-x86_64.tgz
cd MLNX_OFED_LINUX-5.2-2.2.4.0-ubuntu20.04-x86_64
sudo ./mlnxofedinstall
Note
The above step will also update the firmware for all ConnectX-5 or ConnectX-6 cards.
Run the following command after the install is complete:
sudo /etc/init.d/openibd restart
Note
During the install process, the CX-6 NICs are detected, and OFED should update the firmware. If this fails, download the latest firmware and update manually. Repeat the OFED install after.
Check OFED and Firmware versions using the following commands:
dpkg -l | grep mlnx-ofed
cat /sys/class/infiniband/mlx5*/fw_ver
Start Mellanox software tools:
sudo mst start
Check the status of the ATS_ENABLED configuration for the ConnectX-6 NIC using the command below. You should see output similar to the following:
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep -i ATS
ATS_ENABLED                         False(0)
If it is not present, the firmware does not support ATS. Update to a version of the firmware that does. If set to False, use the following command to enable ATS:
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 set ATS_ENABLED=true
Device #1:
----------
Device type:    ConnectX6
Name:           MCX653105A-HDA_Ax
Description:    ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
Device:         /dev/mst/mt4123_pciconf0

Configurations:          Next Boot       New
ATS_ENABLED              False(0)        True(1)
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Once you have enabled ATS on the CX-6 on both VMs, put the host in maintenance mode and reboot the ESXi host.
Note
If you have vMotion configured between two hosts, then VMs on a host can move to another running host while the necessary reboots occur to enable ATS.
Note
Remember to re-submit the command to enable the ACS Direct Translated bit on the PCIe switch.
After the ESXi host reboot is complete, power back on vCenter and the VMs.
Next, verify that ATS is enabled on VMs by running the following commands:
sudo mst start
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep -i ATS
sudo lspci -vvv
Search for the Mellanox CX-6 device, and verify that the output contains the ATS capability shown below:
Capabilities: [480 v1] Address Translation Service (ATS)
    ATSCap: Invalidate Queue Depth: 00
    ATSCtl: Enable+, Smallest Translation Unit: 00
Note
Enable+ indicates it’s been successfully enabled.
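Because the full lspci -vvv output is long, you can optionally narrow it to the Mellanox devices and the ATS capability; the snippet below assumes the standard Mellanox PCI vendor ID 0x15b3:
sudo lspci -vvv -d 15b3: | grep -A 2 "Address Translation Service"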
Configure NUMA Affinity for the VMs#
To check which NUMA node your NICs and GPUs are attached to, run the following commands on the ESXi host:
esxcli hardware pci list | grep -A 30 -B 10 NVIDIA
esxcli hardware pci list | grep -A 30 -B 10 Mellanox
The following output shows the device's NUMA node:
0000:3b:02.3
   Address: 0000:3b:02.3
   Segment: 0x0000
   Bus: 0x3b
   Slot: 0x02
   Function: 0x3
   VMkernel Name: PF_0.59.0_VF_15
   Vendor Name: NVIDIA Corporation
   Device Name: NVIDIAA100-PCIE-40GB
   Configured Owner: VMkernel
   Current Owner: VMkernel
   Vendor ID: 0x10de
   Device ID: 0x20f1
   SubVendor ID: 0x10de
   SubDevice ID: 0x0000
   Device Class: 0x0302
   Device Class Name: 3D controller
   Programming Interface: 0x00
   Revision ID: 0xa1
   Interrupt Line: 0xff
   IRQ: 255
   Interrupt Vector: 0x00
   PCI Pin: 0xff
   Spawned Bus: 0x00
   Flags: 0x0001
   Module ID: 54
   Module Name: nvidia
   Chassis: 0
   Physical Slot: -1
   Slot Description:
   Device Layer Bus Address: s00000001.00.vf15
   Passthru Capable: true
   Parent Device: PCI 0:58:0:0
   Dependent Device: PCI 0:59:2:3
   Reset Method: Function reset
   FPT Sharable: true
   NUMA Node: 0
   Extended Device ID: 65535
   Extended Device Name:
Make sure the NIC and the GPU are on the same NUMA node.
Within the VM configuration, add a new key-value:
numa.nodeAffinity = <numa node value>
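For example, if the esxcli output above reports NUMA Node: 0 for both the GPU and the NIC, the entry would be the following (adjust the value to match the node reported in your environment):
numa.nodeAffinity = 0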
Creating a Dockerfile for Multi-Node Training#
Make a Docker image following the Dockerfile below:
FROM nvcr.io/nvaie/tensorflow:21.07-tf1-py3

ARG DEBIAN_FRONTEND=noninteractive

# Set MOFED version, OS version and platform
ENV MOFED_VERSION 5.2-2.2.4.0

#http://content.mellanox.com/ofed/MLNX_OFED-5.2-2.2.4.0/MLNX_OFED_LINUX-5.2-2.2.4.0-ubuntu20.04-x86_64.tgz
ENV OS_VERSION ubuntu20.04

ENV PLATFORM x86_64


RUN pip3 install --user --upgrade pip && \
    pip3 install --no-cache-dir absl-py

RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
    apt-utils build-essential cmake tcsh tcl tk \
    make git curl vim wget ca-certificates \
    iputils-ping net-tools ethtool \
    perl lsb-release python-libxml2 \
    iproute2 pciutils libnl-route-3-200 \
    kmod libnuma1 lsof openssh-server \
    swig libelf1 automake libglib2.0-0 \
    autoconf graphviz chrpath flex libnl-3-200 m4 \
    debhelper autotools-dev gfortran libltdl-dev \
    dmidecode build-essential cmake git zip pciutils hwloc numactl \
    dpatch bison pkg-config numactl dkms udev libnl-route-3-dev libnl-3-dev \
    libmnl0 libmnl-dev expect-dev ncat \
    usbutils iperf3 bc tree \
    quilt \
    landscape-common libpci-dev && \
    rm -rf /var/lib/apt/lists/*
# hugepages libgfortran3 netcat
# linux-headers-$(uname -r)


WORKDIR /workspace
RUN wget http://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VERSION}/MLNX_OFED_LINUX-$MOFED_VERSION-$OS_VERSION-$PLATFORM.tgz && \
    tar -xvf MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}.tgz && \
    MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/mlnxofedinstall --user-space-only --without-fw-update --force && \
    tree /workspace/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/
    #dpkg -i /workspace/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/DEBS/libibumad-dev*.deb && \
    #dpkg -i /workspace/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/DEBS/libibumad3*.deb


# MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/mlnxofedinstall --dpdk --upstream-libs --without-fw-update --force --umad-dev-rw -q
#--user-space-only
# MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/mlnxofedinstall --dpdk --without-fw-update --force -q

#WORKDIR /workspace
#RUN wget https://www.mellanox.com/downloads/MFT/mft-4.16.1-9-x86_64-deb.tgz && \
#tar xzvf mft-4.16.1-9-x86_64-deb.tgz && \
#cd mft-4.16.1-9-x86_64-deb && \
#./install.sh


WORKDIR /workspace
RUN git clone -b cnn_tf_v1.15_compatible https://github.com/tensorflow/benchmarks.git


WORKDIR /workspace
RUN git clone https://github.com/NVIDIA/nccl-tests && \
    cd nccl-tests && \
    make MPI=1 MPI_HOME=/usr/local/mpi


WORKDIR /workspace
RUN git clone https://github.com/linux-rdma/perftest && \
    cd perftest && \
    ./autogen.sh && \
    CUDA_H_PATH=/usr/local/cuda/include/cuda.h ./configure && \
    make install


WORKDIR /test


RUN rm -f ${_CUDA_COMPAT_PATH}/.*.checked
Run the following command, from the same folder as the Dockerfile, to build the multi-node Docker container image:
sudo docker build -t multinode:latest .
Tag and upload the image to your NVIDIA AI Enterprise private registry:
sudo docker tag multinode <NVIDIA_AI_Enterprise_private_registry_username>/multinode
sudo docker push <NVIDIA_AI_Enterprise_private_registry_username>/multinode
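If you have not yet authenticated against your private registry, a docker login is typically required before the push will succeed; the registry address below is a placeholder for your environment:
sudo docker login <NVIDIA_AI_Enterprise_private_registry>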
Set Up Keyless Entry Between VMs on the Multi-Node Cluster#
On a clean install of a system, the ~/.ssh directory is typically empty. However, the following files will be generated or added using the steps within this guide:
- id_rsa and id_rsa.pub: SSH keys used for keyless entry between nodes.
- authorized_keys: A list of RSA public keys from other nodes/systems recognized by a server for SSH access.
- config: A file that provides SSH security key checking settings when accessing other nodes.
- mpicont.sh: A script that we will create to allow MPI to talk between containers on separate nodes.
- ssh_container/: A directory that contains the files above, but for internode container communication.
- known_hosts: This file is auto-generated by SSH and lists keys for all hosts that a user has ever connected to.
Generating SSH Keys#
On the master node, we will create a pair of SSH keys shared between the nodes. Then another pair will be generated for use between containers running on the nodes. We will name each set of keys accordingly for this guide, but the default key names id_rsa and id_rsa.pub are also fine.
Host/Worker SSH Keys#
Within the command-line terminal, create a new SSH key:
ssh-keygen -t rsa
At the prompt "Enter file in which to save the key (/home/nvidia/.ssh/id_rsa):", enter:
id_rsa_host
This will generate the following files:
id_rsa_host
id_rsa_host.pub
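The host public key also needs to be present in ~/.ssh/authorized_keys for keyless entry to work (this is the authorized_keys file that appears in the permissions listing later in this section); a minimal way to add it is:
cat ~/.ssh/id_rsa_host.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys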
Container SSH Keys#
Make a directory named ssh_container. This directory can be created anywhere, but for this example we will put it in our ~/.ssh directory:
mkdir ssh_container
cd ssh_container
ssh-keygen -t rsa
At the prompt "Enter file in which to save the key (/home/nvidia/.ssh/id_rsa):", enter:
<path/to>/ssh_container/id_rsa_cont
Within the ssh_container directory, this will generate:
id_rsa_cont
id_rsa_cont.pub
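As with the host keys, the container public key needs to be added to an authorized_keys file inside ssh_container (this file shows up in the ssh_container listing later in this section); from within ~/.ssh/ssh_container:
cat id_rsa_cont.pub >> authorized_keys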
Creating Config Files for Keyless Entry#
In our lab environment, the username is nvidia for our Ubuntu VMs. Please substitute the username in the following steps to reflect the user in your environment. On the master node, create a file called config (~/.ssh/config) and put in the following contents:
Host *
    User nvidia
    IdentityFile ~/.ssh/id_rsa_host
    StrictHostKeyChecking no
    UserKnownHostsFile=/dev/null
Within the ssh_container directory (~/.ssh/ssh_container/config), create another config file for the keyless entry between containers:
Host *
    User nvidia
    IdentityFile /root/.ssh/id_rsa_cont
    StrictHostKeyChecking no
    UserKnownHostsFile=/dev/null
    LogLevel=Error
    ServerAliveInterval=30
Create mpicont.sh script#
Within the ~/.ssh directory, create a script called mpicont.sh with the following contents:
docker exec mpicont /bin/bash -c "$SSH_ORIGINAL_COMMAND"
Then make the script executable:
chmod +x mpicont.sh
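This guide does not show explicitly how mpicont.sh gets invoked on an incoming SSH connection. One common pattern, included here only as a hypothetical illustration, is a forced command entry in the container authorized_keys so that every login with the container key runs the script:
command="/home/nvidia/.ssh/mpicont.sh" ssh-rsa AAAA... (contents of id_rsa_cont.pub)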
Copy ~/.ssh to Worker Nodes and Confirm Keyless Entry#
Now we can copy all the files from the master node's ~/.ssh directory to all of the worker nodes we specified in our nodelist:
for worker_node_IP in <worker1_IP> <worker2_IP> <worker3_IP>; do scp -r ~/.ssh $worker_node_IP:/home/nvidia/; done
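To confirm keyless entry, an SSH from the master node to each worker should now succeed without a password prompt, for example:
ssh <worker1_IP> hostname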
Change Permissions in the ssh_container Directory on All Nodes#
On all the nodes, from within the ~/.ssh/ssh_container directory, change the ownership of the config file so that the owner is root:
sudo chown root:root config
Then change the permissions to 600 for all files in the ssh_container folder:
sudo chmod 600 *
Below is a list of all the files that were copied over to the workers and their proper permissions:
~/.ssh$ ll *
-rw------- 1 nvidia nvidia  894 Jan 24 17:46 authorized_keys
-rw-r--r-- 1 nvidia nvidia  125 Jan 24 14:21 config
-rw------- 1 nvidia nvidia 1675 Jan 24 14:19 id_rsa_host
-rw-r--r-- 1 nvidia nvidia  396 Jan 24 14:19 id_rsa_host.pub
-rwxrwxr-x 1 nvidia nvidia   57 Jan 24 15:55 mpicont.sh*
ssh_container:
total 24
drwxrwxr-x 2 nvidia nvidia 4096 Feb  6 16:50 ./
drwxrwxr-x 4 nvidia nvidia 4096 Feb  7 11:29 ../
-rw------- 1 nvidia nvidia  396 Jan 24 15:58 authorized_keys
-rw------- 1 root   root    161 Jan 24 17:54 config
-rw------- 1 nvidia nvidia 1675 Jan 24 15:58 id_rsa_cont
-rw------- 1 nvidia nvidia  396 Jan 24 15:58 id_rsa_cont.pub
Now run Docker containers on all the worker nodes, using the following command:
sudo docker run -it --gpus=all --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 --shm-size=1g --name=mpicont --device=/dev/infiniband -v /home/nvidia/.ssh/ssh_container:/root/.ssh <NVIDIA_AI_Enterprise_private_registry_username>/multinode:latest sleep infinity
On the master node, run:
sudo docker run -it --gpus=all --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 --shm-size=1g --name=mpicont --device=/dev/infiniband -v /home/nvidia/.ssh/ssh_container:/root/.ssh <NVIDIA_AI_Enterprise_private_registry_username>/multinode:latest /bin/bash
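Before launching MPI jobs, it can be useful to confirm from inside the master container that both the GPUs and the RDMA device are visible; ibv_devinfo is assumed to be available from the OFED user-space packages installed by the Dockerfile:
nvidia-smi
ibv_devinfo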
To test that keyless SSH MPI commands are working, run the following command, adjusting for how many workers you have:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" hostname
To verify the available GPUs on all worker nodes, run the following command:
mpirun --allow-run-as-root -H <worker1_IP>,<worker2_IP>,<worker3_IP> -np "3" nvidia-smi
Note
In our lab environment, the np parameter (number of processes, or in other words, the number of GPUs) is 4. Please modify the np parameter to reflect your environment.
The output should reflect the hostnames for all four nodes.
Install nv_peer_memory#
On each of the nodes, install the nv_peer_mem module:
git clone https://github.com/Mellanox/nv_peer_memory.git
cd nv_peer_memory
./build_module.sh
cd /tmp
tar xzf /tmp/nvidia-peer-memory_1.0.orig.tar.gz
cd nvidia-peer-memory-1.0
dpkg-buildpackage -us -uc
dpkg -i <path to generated deb files>
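After installing the package on each node, confirm the module is loaded; load it manually if it is not listed (the module name is nv_peer_mem):
sudo modprobe nv_peer_mem
lsmod | grep nv_peer_mem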
Run Sample ResNet-50 Multi-Node Training#
Note
Ensure that keyless SSH MPI is working by running the command below:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" hostname
Run the following command to run the example ResNet-50 multi-node benchmark, adjusting for your worker node count:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO python3 /workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 512 --use_fp16 --variable_update=horovod --xla=True
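As an optional cross-check before interpreting the ResNet-50 numbers, the nccl-tests binaries built in the Dockerfile can be used to confirm that NCCL traffic runs over RoCE; this is a sketch assuming one GPU per process and the same host list as above:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO /workspace/nccl-tests/build/all_reduce_perf -b 8 -e 256M -f 2 -g 1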
Interpreting the Results#
This benchmark reports images per second training performance at each reporting iteration. Use the last few values reported to represent training performance.
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
1   images/sec: 2100.6 +/- 0.0 (jitter = 0.0) 7.738
1   images/sec: 2100.8 +/- 0.0 (jitter = 0.0) 7.742
1   images/sec: 2100.2 +/- 0.0 (jitter = 0.0) 7.734
1   images/sec: 2100.8 +/- 0.0 (jitter = 0.0) 7.770
10  images/sec: 2100.0 +/- 61.9 (jitter = 6.6) 7.607
10  images/sec: 2100.4 +/- 60.4 (jitter = 189.7) 7.656
10  images/sec: 2100.9 +/- 59.2 (jitter = 88.7) 7.611
10  images/sec: 2100.9 +/- 59.0 (jitter = 175.8) 7.647
20  images/sec: 2100.2 +/- 39.4 (jitter = 92.3) 7.527
20  images/sec: 2100.2 +/- 43.8 (jitter = 198.3) 7.515
20  images/sec: 2100.1 +/- 41.1 (jitter = 181.8) 7.512
20  images/sec: 2100.1 +/- 43.0 (jitter = 14.7) 7.501
30  images/sec: 2100.9 +/- 34.9 (jitter = 198.3) 7.490
30  images/sec: 2100.4 +/- 35.3 (jitter = 11.1) 7.474
30  images/sec: 2100.7 +/- 33.3 (jitter = 92.9) 7.483
30  images/sec: 2100.3 +/- 34.9 (jitter = 157.3) 7.493
40  images/sec: 2100.5 +/- 28.3 (jitter = 76.4) 7.476
40  images/sec: 2100.9 +/- 31.2 (jitter = 193.8) 7.476
40  images/sec: 2100.5 +/- 31.2 (jitter = 186.9) 7.483
40  images/sec: 2100.2 +/- 31.5 (jitter = 18.9) 7.474
50  images/sec: 2100.8 +/- 28.1 (jitter = 15.0) 7.480
50  images/sec: 2100.3 +/- 28.3 (jitter = 168.8) 7.468
50  images/sec: 2100.7 +/- 25.7 (jitter = 76.4) 7.485
50  images/sec: 2100.2 +/- 27.4 (jitter = 218.1) 7.485
60  images/sec: 2100.2 +/- 25.6 (jitter = 173.0) 7.485
60  images/sec: 2100.3 +/- 23.3 (jitter = 66.1) 7.501
60  images/sec: 2100.4 +/- 24.8 (jitter = 190.7) 7.480
60  images/sec: 2100.2 +/- 26.4 (jitter = 20.6) 7.493
70  images/sec: 2100.4 +/- 24.3 (jitter = 16.4) 7.495
70  images/sec: 2100.4 +/- 23.9 (jitter = 157.3) 7.498
70  images/sec: 2100.0 +/- 22.1 (jitter = 52.3) 7.503
70  images/sec: 2100.5 +/- 23.4 (jitter = 218.3) 7.509
80  images/sec: 2100.3 +/- 22.4 (jitter = 157.3) 7.490
80  images/sec: 2100.2 +/- 20.6 (jitter = 50.7) 7.510
80  images/sec: 2100.6 +/- 21.7 (jitter = 195.2) 7.520
80  images/sec: 2100.2 +/- 22.4 (jitter = 30.3) 7.508
90  images/sec: 2100.8 +/- 21.2 (jitter = 22.3) 7.481
90  images/sec: 2100.1 +/- 20.8 (jitter = 157.3) 7.489
90  images/sec: 2100.7 +/- 19.7 (jitter = 35.1) 7.496
90  images/sec: 2100.7 +/- 20.7 (jitter = 218.1) 7.471
100 images/sec: 2100.2 +/- 20.2 (jitter = 30.3) 7.501
----------------------------------------------------------------
total images/sec: 8400.46
----------------------------------------------------------------
100 images/sec: 1520.1 +/- 19.9 (jitter = 166.6) 7.522
----------------------------------------------------------------
total images/sec: 8400.99
----------------------------------------------------------------
100 images/sec: 1517.6 +/- 18.6 (jitter = 52.3) 7.507
----------------------------------------------------------------
total images/sec: 8400.84
----------------------------------------------------------------
100 images/sec: 1517.9 +/- 19.6 (jitter = 219.0) 7.500
----------------------------------------------------------------
total images/sec: 8400.58
----------------------------------------------------------------