To leverage RDMA with ATS for high-performance computing, this guide outlines the following steps:
Configure NVIDIA ConnectX-6 Dx for RoCE
Enable ATS on VMware ESXi and Virtual Machines
Enable ATS on the NVIDIA ConnectX-6 Dx NIC
Configure NUMA Affinity
Create a Dockerfile for Multi-Node Training
Set Up Keyless Entry Between VMs on the Multi-Node Cluster
Run Sample ResNet-50 Multi-Node Training
To leverage RoCE, the NVIDIA ConnectX-6 Dx NIC must run RoCE over a lossless network in DSCP-based QoS mode. The following Knowledge Article is a helpful resource for applying this configuration: https://community.mellanox.com/s/article/lossless-roce-configuration-for-mlnx-os-switches-in-dscp-based-qos-mode
For this guide, we will reference configuration steps within the Knowledge Article for version 3.8.2008 and above.
Run the following commands on the NVIDIA switch:
switch (config) # roce
Note: The RoCE feature has been automated, so all that is needed to run RoCE on a lossless fabric is the roce command.
Create an isolated VLAN and place the NVIDIA ConnectX NICs into the created VLAN as access ports. In this example, the four servers are connected to switch ports 1/1-1/4.
switch (config) # interface vlan 111
switch (config vlan 111) # exit
switch (config) # interface ethernet 1/1-1/4 switchport access vlan 111
Set the MTU to 9216 on the interfaces (on versions below 3.9.2110, the switch’s default MTU is 1500).
switch (config) # interface ethernet 1/1-1/4 shutdown
switch (config) # interface ethernet 1/1-1/4 mtu 9216
switch (config) # interface ethernet 1/1-1/4 no shutdown
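Optionally, verify that the automated RoCE configuration and the interface settings took effect. This is a sketch only; the exact show commands assume an NVIDIA Onyx (MLNX-OS) switch:
switch (config) # show roce
switch (config) # show interfaces ethernet 1/1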
Optionally, if you are running Cumulus Linux, follow these instructions to enable RoCE: https://docs.cumulusnetworks.com/cumulus-linux-42/Network-Solutions/RDMA-over-Converged-Ethernet-RoCE/.
To enable Peer-2-Peer (P2P) with high performance, we will enable ATS by updating the VMKernel and then the VM configuration.
Update the VMKernel for Peer-2-Peer (P2P).
To enable the ATS boot option, invoke the following command and reboot ESXi:
esxcli system settings kernel set -s atsSupport -v TRUE
To verify the value is correct after reboot, invoke:
esxcli system settings kernel list -o atsSupport
The output should resemble the following:
Name         Type   Configured  Runtime  Default  Description
-----------  -----  ----------  -------  -------  -----------
atsSupport   Bool   TRUE        TRUE     FALSE    Enable Support for PCIe ATS
Update the VM configuration for P2P.
Edit the VM configuration settings:
pciPassthru.allowP2P=true          # enable P2P
pciPassthru.RelaxACSforP2P=true    # update ACS capabilities in switch
Note: When relaxing ACS for P2P is enabled, VMware will locate an ATS-capable passthrough device, find its parent switch, and enable the ACS Direct Translated bit. The previous restriction that all functions of peer networking devices must be given to a single VM has been removed; each function of a peer device can be given to a separate VM.
If there are multiple physical GPU devices, the VM can specify a specific device for P2P with the existing config:
pciPassthru0.cfg.gpu-pci-id = "ssss:bb:dd.f"
Note: The gpu-pci-id is in hexadecimal SBDF format. If the GPU is in SR-IOV mode, you should specify a VF address.
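Taken together, a VM's advanced configuration for a single passthrough GPU might contain entries like the following. The PCI address shown is only a placeholder and must be replaced with the SBDF address of your own GPU (or VF):
pciPassthru.allowP2P=true
pciPassthru.RelaxACSforP2P=true
pciPassthru0.cfg.gpu-pci-id = "0000:3b:00.0"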
Install Python 2.7 with the command below:
sudo apt-get install python
Download and install MLNX OFED 5.0: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/.
Select the OS, version, and architecture, then download the tar file; for example, Ubuntu/20.04/x86_64.
Download, then copy the package to the VMs, and run the following commands to extract and install:
tar xvf MLNX_OFED_LINUX-5.2-2.2.4.0-ubuntu20.04-x86_64.tgz
cd MLNX_OFED_LINUX-5.2-2.2.4.0-ubuntu20.04-x86_64
sudo ./mlnxofedinstall
Note: The above step will also update the firmware for all ConnectX-5 or ConnectX-6 cards.
Run the following command after the install is complete:
sudo /etc/init.d/openibd restart
Note: During the install process, the ConnectX-6 NICs are detected and OFED should update the firmware. If this fails, download the latest firmware and update it manually, then repeat the OFED install.
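If a manual firmware update is required, one possible approach is sketched below. It assumes the NVIDIA Firmware Tools (MFT) are installed and that a firmware image matching your exact card model has already been downloaded; the image filename is a placeholder:
sudo mst start
sudo mlxfwmanager --query                                                        # report current firmware per device
sudo flint -d /dev/mst/mt4123_pciconf0 -i <downloaded_firmware_image>.bin burn   # burn the downloaded image
After the burn completes, reboot the VM so the new firmware is loaded, then re-run the OFED install.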
Check OFED and Firmware versions using the following commands:
dpkg -l | grep mlnx-ofed
cat /sys/class/infiniband/mlx5*/fw_ver
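You can also confirm that the RDMA devices are visible and mapped to their network interfaces. A minimal check, assuming the standard utilities installed by mlnxofedinstall:
ibdev2netdev        # maps each mlx5_* device to its Ethernet interface and link state
ibstat              # reports port state, rate, and firmware version for each HCA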
Start Mellanox software tools:
sudo mst start
Check the status of the ATS_ENABLED configuration for the ConnectX-6 NIC using the command below. You should see output similar to the following:
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep -i ATS
        ATS_ENABLED                         False(0)
If it is not present, the firmware does not support ATS. Update to a version of the firmware that does. If set to False, use the following command to enable ATS:
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 set ATS_ENABLED=true

Device #1:
----------
Device type:    ConnectX6
Name:           MCX653105A-HDA_Ax
Description:    ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
Device:         /dev/mst/mt4123_pciconf0

Configurations:                      Next Boot       New
        ATS_ENABLED                  False(0)        True(1)

Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Once you have enabled ATS on the ConnectX-6 NICs in both VMs, put the host in maintenance mode and reboot the ESXi host.
Note: If you have vMotion configured between two hosts, VMs on a host can move to another running host while the necessary reboots occur to enable ATS.
Note: Remember to re-submit the command to enable the ACS Direct Translated bit on the PCIe switch.
After the ESXi host reboot is complete, power back on vCenter and the VMs.
Next, verify that ATS is enabled on the VMs by running the following commands:
sudo mst start
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep -i ATS
sudo lspci -vvv
Search for the Mellanox ConnectX-6 device and verify the output contains the ATS capability, as shown below:
Capabilities: [480 v1] Address Translation Service (ATS)
        ATSCap: Invalidate Queue Depth: 00
        ATSCtl: Enable+, Smallest Translation Unit: 00
Note: Enable+ indicates ATS has been successfully enabled.
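To narrow the lspci output down to just the ATS capability block, a simple filter can be used:
sudo lspci -vvv | grep -A 2 "Address Translation Service"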
To check which NUMA node your NICs and GPUs are attached to, run the following commands on the ESXi host:
esxcli hardware pci list | grep -A 30 -B 10 NVIDIA
esxcli hardware pci list | grep -A 30 -B 10 Mellanox
The following output shows the device's NUMA node:
0000:3b:02.3
   Address: 0000:3b:02.3
   Segment: 0x0000
   Bus: 0x3b
   Slot: 0x02
   Function: 0x3
   VMkernel Name: PF_0.59.0_VF_15
   Vendor Name: NVIDIA Corporation
   Device Name: NVIDIAA100-PCIE-40GB
   Configured Owner: VMkernel
   Current Owner: VMkernel
   Vendor ID: 0x10de
   Device ID: 0x20f1
   SubVendor ID: 0x10de
   SubDevice ID: 0x0000
   Device Class: 0x0302
   Device Class Name: 3D controller
   Programming Interface: 0x00
   Revision ID: 0xa1
   Interrupt Line: 0xff
   IRQ: 255
   Interrupt Vector: 0x00
   PCI Pin: 0xff
   Spawned Bus: 0x00
   Flags: 0x0001
   Module ID: 54
   Module Name: nvidia
   Chassis: 0
   Physical Slot: -1
   Slot Description:
   Device Layer Bus Address: s00000001.00.vf15
   Passthru Capable: true
   Parent Device: PCI 0:58:0:0
   Dependent Device: PCI 0:59:2:3
   Reset Method: Function reset
   FPT Sharable: true
   NUMA Node: 0
   Extended Device ID: 65535
   Extended Device Name:
Make sure the NIC and the GPU are on the same NUMA node.
Within the VM configuration, add a new key-value pair:
numa.nodeAffinity = <numa node value>
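For example, if both the GPU and the NIC report NUMA Node 0, as in the esxcli output above, the entry would be:
numa.nodeAffinity = 0
Inside the guest, nvidia-smi topo -m can additionally be used to cross-check GPU/NIC locality, assuming the NVIDIA driver is already installed in the VM.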
Creating a Dockerfile for Multi-Node Training
Build a Docker image from the Dockerfile below:
FROM nvcr.io/nvaie/tensorflow:21.07-tf1-py3
ARG DEBIAN_FRONTEND=noninteractive

# Set MOFED version, OS version and platform
ENV MOFED_VERSION 5.2-2.2.4.0
# http://content.mellanox.com/ofed/MLNX_OFED-5.2-2.2.4.0/MLNX_OFED_LINUX-5.2-2.2.4.0-ubuntu20.04-x86_64.tgz
ENV OS_VERSION ubuntu20.04
ENV PLATFORM x86_64

RUN pip3 install --user --upgrade pip && \
    pip3 install --no-cache-dir absl-py

RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
        apt-utils build-essential cmake tcsh tcl tk \
        make git curl vim wget ca-certificates \
        iputils-ping net-tools ethtool \
        perl lsb-release python-libxml2 \
        iproute2 pciutils libnl-route-3-200 \
        kmod libnuma1 lsof openssh-server \
        swig libelf1 automake libglib2.0-0 \
        autoconf graphviz chrpath flex libnl-3-200 m4 \
        debhelper autotools-dev gfortran libltdl-dev \
        dmidecode build-essential cmake git zip pciutils hwloc numactl \
        dpatch bison pkg-config numactl dkms udev libnl-route-3-dev libnl-3-dev \
        libmnl0 libmnl-dev expect-dev ncat \
        usbutils iperf3 bc tree \
        quilt \
        landscape-common libpci-dev && \
    rm -rf /var/lib/apt/lists/*
# hugepages libgfortran3 netcat
# linux-headers-$(uname -r)

WORKDIR /workspace
RUN wget http://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VERSION}/MLNX_OFED_LINUX-$MOFED_VERSION-$OS_VERSION-$PLATFORM.tgz && \
    tar -xvf MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}.tgz && \
    MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/mlnxofedinstall --user-space-only --without-fw-update --force && \
    tree /workspace/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/
# dpkg -i /workspace/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/DEBS/libibumad-dev*.deb && \
# dpkg -i /workspace/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/DEBS/libibumad3*.deb
# MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/mlnxofedinstall --dpdk --upstream-libs --without-fw-update --force --umad-dev-rw -q #--user-space-only
# MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/mlnxofedinstall --dpdk --without-fw-update --force -q

# WORKDIR /workspace
# RUN wget https://www.mellanox.com/downloads/MFT/mft-4.16.1-9-x86_64-deb.tgz && \
#     tar xzvf mft-4.16.1-9-x86_64-deb.tgz && \
#     cd mft-4.16.1-9-x86_64-deb && \
#     ./install.sh

WORKDIR /workspace
RUN git clone -b cnn_tf_v1.15_compatible https://github.com/tensorflow/benchmarks.git

WORKDIR /workspace
RUN git clone https://github.com/NVIDIA/nccl-tests && \
    cd nccl-tests && \
    make MPI=1 MPI_HOME=/usr/local/mpi

WORKDIR /workspace
RUN git clone https://github.com/linux-rdma/perftest && \
    cd perftest && \
    ./autogen.sh && \
    CUDA_H_PATH=/usr/local/cuda/include/cuda.h ./configure && \
    make install

WORKDIR /test
RUN rm -f ${_CUDA_COMPAT_PATH}/.*.checked
Run the following command in the same folder as the Dockerfile to build the multi-node Docker image:
sudo docker build -t multinode:latest .
Tag and upload the image to your NVIDIA AI Enterprise private registry:
sudo docker tag multinode <NVIDIA_AI_Enterprise_private_registry_username>/multinode
sudo docker push <NVIDIA_AI_Enterprise_private_registry_username>/multinode
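If you have not yet authenticated against your private registry, log in before pushing; the registry host below is a placeholder:
sudo docker login <NVIDIA_AI_Enterprise_private_registry>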
On a clean install of a system, the ~/.ssh directory is typically empty. However, the following files will be generated or added using the steps within this guide:
- id_rsa and id_rsa.pub: SSH keys used for keyless entry between nodes.
- authorized_keys: A list of RSA public keys from other nodes/systems recognized by a server for SSH access.
- config: A file that provides SSH host key checking settings when accessing other nodes.
- mpicont.sh: A script that we will create to allow MPI to communicate between containers on separate nodes.
- ssh_container/: A directory that contains the files above, but for internode container communication.
- known_hosts: This file is auto-generated by SSH and lists keys for all hosts that a user has ever connected to.
Generating SSH Keys
On the master node, we will create a pair of SSH keys shared between the nodes. Then another pair will be generated for use between containers running on the nodes. We will name each set of keys accordingly for this guide, but the default key names id_rsa and id_rsa.pub are fine.
Host/Worker SSH Keys
Within the command-line terminal, create a new SSH key:
ssh-keygen -t rsa
Enter file in which to save the key (/home/nvidia/.ssh/id_rsa): id_rsa_host
This will generate the following files:
id_rsa_host
id_rsa_host.pub
Container SSH Keys
Make a directory named ssh_container. This directory can be created anywhere, but we will just put it in our ~/.ssh directory for our example:
mkdir ssh_container
cd ssh_container
ssh-keygen -t rsa
Enter file in which to save the key (/home/nvidia/.ssh/id_rsa): <path/to>/ssh_container/id_rsa_cont
Within the ssh_container directory, this will generate:
id_rsa_cont
id_rsa_cont.pub
Creating Config Files for Keyless Entry
In our lab environment, the username is nvidia for our Ubuntu VMs. Please substitute the username in the following steps to reflect the user in your environment. On the master node, create a file called config (~/.ssh/config) and put in the following contents:
Host *
User nvidia
IdentityFile ~/.ssh/id_rsa_host
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
Within the ssh_container directory, create another config file (~/.ssh/ssh_container/config) for the keyless entry between containers:
Host *
User nvidia
IdentityFile /root/.ssh/id_rsa_cont
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
LogLevel=Error
ServerAliveInterval=30
Add Public SSH Keys to authorized_keys
For keyless entry to work on the worker nodes, the contents of the public SSH keys need to be copied to the authorized_keys file, both for internode communication and for communication between the containers on separate nodes.
In the ~/.ssh folder:
echo `cat id_rsa_host.pub` > authorized_keys
In the ~/.ssh/ssh_container folder:
echo `cat id_rsa_cont.pub` > authorized_keys
Create the mpicont.sh Script
Within the ~/.ssh directory, create a script called mpicont.sh with the following contents:
docker exec mpicont /bin/bash -c "$SSH_ORIGINAL_COMMAND"
Then make the script executable:
chmod +x mpicont.sh
Add the Container SSH Key to the Master's authorized_keys File
Add the following line to the master's authorized_keys file:
command="bash /home/nvidia/.ssh/mpicont.sh",no-port-forwarding,no-agent-forwarding,no-X11-forwarding <add contents of id_rsa_cont.pub>
Copy ~/.ssh to Worker Nodes and Confirm Keyless Entry
Now we can copy all the files from the master node's ~/.ssh directory to all of the worker nodes we specified in our node list.
for node in <worker1_IP> <worker2_IP> <worker3_IP>; do scp -r ~/.ssh ${node}:/home/nvidia/; done
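To confirm keyless entry, an SSH session from the master to each worker should now complete without a password prompt; for example (the worker IP is a placeholder):
ssh <worker1_IP> hostname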
Change Permissions in the ssh_container Directory on All Nodes
On all the nodes, change the ownership of the ssh_container/config file so that the owner is root:
sudo chown root:root config
Then change the permissions to 600 for all files in the ssh_container folder:
sudo chmod 600 *
Below is a list of all the files that were copied over to the workers and their proper permissions:
~/.ssh$ ll *
-rw------- 1 nvidia nvidia 894 Jan 24 17:46 authorized_keys
-rw-r--r-- 1 nvidia nvidia 125 Jan 24 14:21 config
-rw------- 1 nvidia nvidia 1675 Jan 24 14:19 id_rsa_host
-rw-r--r-- 1 nvidia nvidia 396 Jan 24 14:19 id_rsa_host.pub
-rwxrwxr-x 1 nvidia nvidia 57 Jan 24 15:55 mpicont.sh*
ssh_container:
total 24
drwxrwxr-x 2 nvidia nvidia 4096 Feb 6 16:50 ./
drwxrwxr-x 4 nvidia nvidia 4096 Feb 7 11:29 ../
-rw------- 1 nvidia nvidia 396 Jan 24 15:58 authorized_keys
-rw------- 1 root root 161 Jan 24 17:54 config
-rw------- 1 nvidia nvidia 1675 Jan 24 15:58 id_rsa_cont
-rw------- 1 nvidia nvidia 396 Jan 24 15:58 id_rsa_cont.pub
Now run Docker containers on all the worker nodes, using the following command:
sudo docker run -it --gpus=all --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 --shm-size=1g --name=mpicont --device=/dev/infiniband -v /home/nvidia/.ssh/ssh_container:/root/.ssh <NVIDIA_AI_Enterprise_private_registry_username>/multinode:latest sleep infinity
On the master node, run:
sudo docker run -it --gpus=all --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 --shm-size=1g --name=mpicont --device=/dev/infiniband -v /home/nvidia/.ssh/ssh_container:/root/.ssh <NVIDIA_AI_Enterprise_private_registry_username>/multinode:latest /bin/bash
To test that the SSH keyless MPI setup is working, run the following command, adjusting the host list and -np value for the number of workers you have:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" hostname
To verify the available GPUs on all worker nodes, run the following command:
mpirun --allow-run-as-root -H <worker1_IP>,<worker2_IP>,<worker3_IP> -np "3" nvidia-smi
In our lab environment, the np parameter (the number of processes or, in other words, the number of GPUs) is 4. Please modify the np parameter to reflect your environment.
The output of the hostname command should list the hostnames of all four nodes.
Install nv_peer_memory
On each of the nodes, install the nv_peer_mem module:
git clone https://github.com/Mellanox/nv_peer_memory.git
cd nv_peer_memory
./build_module.sh
cd /tmp
tar xzf /tmp/nvidia-peer-memory_1.0.orig.tar.gz
cd nvidia-peer-memory-1.0
dpkg-buildpackage -us -uc
sudo dpkg -i <path to generated deb files>
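After installing the package on each node, confirm that the kernel module loads; a minimal check (nv_peer_mem is the module name used by the nv_peer_memory project):
sudo modprobe nv_peer_mem
lsmod | grep nv_peer_mem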
Ensure that the SSH keyless MPI setup is still working with the command below:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" hostname
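Before launching the full training benchmark, you can optionally use the nccl-tests binaries built into the container image to confirm that NCCL traffic runs over RoCE between the nodes. This is a sketch only, assuming one GPU per node and the same four-node host list; adjust -np and the message size range as needed:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO /workspace/nccl-tests/build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
With NCCL_DEBUG=INFO, the log should report an mlx5 device being selected for the NET/IB transport when RoCE is in use.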
Run the following command to launch the example ResNet-50 multi-node benchmark, adjusting the host list and -np value for your worker node count:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO python3 /workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 512 --use_fp16 --variable_update=horovod --xla=True
Interpreting the Results
This benchmark reports images per second training performance at each reporting iteration. Use the last few values reported to represent training performance.
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
1 images/sec: 2100.6 +/- 0.0 (jitter = 0.0) 7.738
1 images/sec: 2100.8 +/- 0.0 (jitter = 0.0) 7.742
1 images/sec: 2100.2 +/- 0.0 (jitter = 0.0) 7.734
1 images/sec: 2100.8 +/- 0.0 (jitter = 0.0) 7.770
10 images/sec: 2100.0 +/- 61.9 (jitter = 6.6) 7.607
10 images/sec: 2100.4 +/- 60.4 (jitter = 189.7) 7.656
10 images/sec: 2100.9 +/- 59.2 (jitter = 88.7) 7.611
10 images/sec: 2100.9 +/- 59.0 (jitter = 175.8) 7.647
20 images/sec: 2100.2 +/- 39.4 (jitter = 92.3) 7.527
20 images/sec: 2100.2 +/- 43.8 (jitter = 198.3) 7.515
20 images/sec: 2100.1 +/- 41.1 (jitter = 181.8) 7.512
20 images/sec: 2100.1 +/- 43.0 (jitter = 14.7) 7.501
30 images/sec: 2100.9 +/- 34.9 (jitter = 198.3) 7.490
30 images/sec: 2100.4 +/- 35.3 (jitter = 11.1) 7.474
30 images/sec: 2100.7 +/- 33.3 (jitter = 92.9) 7.483
30 images/sec: 2100.3 +/- 34.9 (jitter = 157.3) 7.493
40 images/sec: 2100.5 +/- 28.3 (jitter = 76.4) 7.476
40 images/sec: 2100.9 +/- 31.2 (jitter = 193.8) 7.476
40 images/sec: 2100.5 +/- 31.2 (jitter = 186.9) 7.483
40 images/sec: 2100.2 +/- 31.5 (jitter = 18.9) 7.474
50 images/sec: 2100.8 +/- 28.1 (jitter = 15.0) 7.480
50 images/sec: 2100.3 +/- 28.3 (jitter = 168.8) 7.468
50 images/sec: 2100.7 +/- 25.7 (jitter = 76.4) 7.485
50 images/sec: 2100.2 +/- 27.4 (jitter = 218.1) 7.485
60 images/sec: 2100.2 +/- 25.6 (jitter = 173.0) 7.485
60 images/sec: 2100.3 +/- 23.3 (jitter = 66.1) 7.501
60 images/sec: 2100.4 +/- 24.8 (jitter = 190.7) 7.480
60 images/sec: 2100.2 +/- 26.4 (jitter = 20.6) 7.493
70 images/sec: 2100.4 +/- 24.3 (jitter = 16.4) 7.495
70 images/sec: 2100.4 +/- 23.9 (jitter = 157.3) 7.498
70 images/sec: 2100.0 +/- 22.1 (jitter = 52.3) 7.503
70 images/sec: 2100.5 +/- 23.4 (jitter = 218.3) 7.509
80 images/sec: 2100.3 +/- 22.4 (jitter = 157.3) 7.490
80 images/sec: 2100.2 +/- 20.6 (jitter = 50.7) 7.510
80 images/sec: 2100.6 +/- 21.7 (jitter = 195.2) 7.520
80 images/sec: 2100.2 +/- 22.4 (jitter = 30.3) 7.508
90 images/sec: 2100.8 +/- 21.2 (jitter = 22.3) 7.481
90 images/sec: 2100.1 +/- 20.8 (jitter = 157.3) 7.489
90 images/sec: 2100.7 +/- 19.7 (jitter = 35.1) 7.496
90 images/sec: 2100.7 +/- 20.7 (jitter = 218.1) 7.471
100 images/sec: 2100.2 +/- 20.2 (jitter = 30.3) 7.501
----------------------------------------------------------------
total images/sec: 8400.46
----------------------------------------------------------------
100 images/sec: 1520.1 +/- 19.9 (jitter = 166.6) 7.522
----------------------------------------------------------------
total images/sec: 8400.99
----------------------------------------------------------------
100 images/sec: 1517.6 +/- 18.6 (jitter = 52.3) 7.507
----------------------------------------------------------------
total images/sec: 8400.84
----------------------------------------------------------------
100 images/sec: 1517.9 +/- 19.6 (jitter = 219.0) 7.500
----------------------------------------------------------------
total images/sec: 8400.58
----------------------------------------------------------------