To leverage RDMA with ATS for high-performance computing, this guide outlines the following steps:
Configure NVIDIA ConnectX-6 Dx for RoCE
Enable ATS on VMware ESXi and Virtual Machines
Enable ATS on the NVIDIA ConnectX-6 Dx NIC
Configure NUMA Affinity
Create a Dockerfile for Multi-Node Training
Set Up Keyless Entry Between VMs on the Multi-Node Cluster
Run Sample ResNet-50 Multi-Node Training
To leverage RoCE, the NVIDIA ConnectX-6 Dx NIC must run RoCE over a lossless network in DSCP-based QoS mode. The following Knowledge Article is a helpful resource for applying this configuration: https://community.mellanox.com/s/article/lossless-roce-configuration-for-mlnx-os-switches-in-dscp-based-qos-mode
For this guide, we will reference configuration steps within the Knowledge Article for version 3.8.2008 and above.
Run the following commands on the NVIDIA switch:
switch (config) # roce
Note: The RoCE feature has been automated, so all that is needed to run RoCE on a lossless fabric is the roce command.
Create an isolated VLAN and place the NVIDIA ConnectX NICs into the created VLAN as access ports. In this example, the four servers are connected to switch ports 1/1-1/4.
switch (config) # interface vlan 111
switch (config vlan 111) # exit
switch (config) # interface ethernet 1/1-1/4 switchport access vlan 111
Set the MTU to 9216 on the interfaces (on versions below 3.9.2110, the switch’s default MTU is 1500).
switch (config) # interface ethernet 1/1-1/4 shutdown
switch (config) # interface ethernet 1/1-1/4 mtu 9216
switch (config) # interface ethernet 1/1-1/4 no shutdown
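Optionally, verify that the automated RoCE configuration and the interface settings took effect. This is a sketch only; the exact show commands assume an NVIDIA Onyx (MLNX-OS) switch:
switch (config) # show roce
switch (config) # show interfaces ethernet 1/1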
Optionally, if you are running Cumulus Linux, follow these instructions to enable RoCE: https://docs.cumulusnetworks.com/cumulus-linux-42/Network-Solutions/RDMA-over-Converged-Ethernet-RoCE/.
To enable Peer-2-Peer (P2P) with high performance, we will enable ATS by updating the VMKernel and then the VM configuration.
Update the VMKernel for Peer-2-Peer (P2P).
To enable the ATS boot option, invoke the following command and reboot ESXi:
esxcli system settings kernel set -s atsSupport -v TRUE
To verify the value is correct after reboot, invoke:
esxcli system settings kernel list -o atsSupport
The output should resemble the following:
Name         Type   Configured  Runtime  Default  Description
-----------  -----  ----------  -------  -------  -----------
atsSupport   Bool   TRUE        TRUE     FALSE    Enable Support for PCIe ATS
Update the VM configuration for P2P.
Edit the VM configuration settings:
pciPassthru.allowP2P=true          # enable P2P
pciPassthru.RelaxACSforP2P=true    # update ACS capabilities in switch
Note: When relaxing ACS for P2P is enabled, VMware will locate an ATS-capable passthrough device, find its parent switch, and enable the ACS Direct Translated bit. The previous restriction that all functions of peer networking devices must be given to a single VM has been removed; each function of a peer device can be given to a separate VM.
If there are multiple physical GPU devices, the VM can specify a specific device for P2P with the existing config:
pciPassthru0.cfg.gpu-pci-id = "ssss:bb:dd.f"
Note: The gpu-pci-id is in hexadecimal SBDF format. If the GPU is in SR-IOV mode, you should specify a VF address.
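Taken together, a VM's advanced configuration for a single passthrough GPU might contain entries like the following. The PCI address shown is only a placeholder and must be replaced with the SBDF address of your own GPU (or VF):
pciPassthru.allowP2P=true
pciPassthru.RelaxACSforP2P=true
pciPassthru0.cfg.gpu-pci-id = "0000:3b:00.0"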
Install Python 2.7 with the command below:
sudo apt-get install python
Download and install MLNX OFED 5.0: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/.
Select the OS, version, and architecture, then download the tar file; for example, Ubuntu/20.04/x86_64.
Download, then copy the package to the VMs, and run the following commands to extract and install:
tar xvf MLNX_OFED_LINUX-5.2-2.2.4.0-ubuntu20.04-x86_64.tgz
cd MLNX_OFED_LINUX-5.2-2.2.4.0-ubuntu20.04-x86_64
sudo ./mlnxofedinstall
Note: The above step will also update the firmware for all ConnectX-5 or ConnectX-6 cards.
Run the following command after the install is complete:
sudo /etc/init.d/openibd restart
Note: During the install process, the ConnectX-6 NICs are detected and OFED should update the firmware. If this fails, download the latest firmware and update it manually, then repeat the OFED install.
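If a manual firmware update is required, one possible approach is sketched below. It assumes the NVIDIA Firmware Tools (MFT) are installed and that a firmware image matching your exact card model has already been downloaded; the image filename is a placeholder:
sudo mst start
sudo mlxfwmanager --query                                                        # report current firmware per device
sudo flint -d /dev/mst/mt4123_pciconf0 -i <downloaded_firmware_image>.bin burn   # burn the downloaded image
After the burn completes, reboot the VM so the new firmware is loaded, then re-run the OFED install.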
Check OFED and Firmware versions using the following commands:
dpkg -l | grep mlnx-ofed
cat /sys/class/infiniband/mlx5*/fw_ver
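You can also confirm that the RDMA devices are visible and mapped to their network interfaces. A minimal check, assuming the standard utilities installed by mlnxofedinstall:
ibdev2netdev        # maps each mlx5_* device to its Ethernet interface and link state
ibstat              # reports port state, rate, and firmware version for each HCA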
Start Mellanox software tools:
sudo mst start
Check the status of the ATS_ENABLED configuration for the ConnectX-6 NIC using the command below. You should see output similar to the following:
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep -i ATS
        ATS_ENABLED                         False(0)
If it is not present, the firmware does not support ATS. Update to a version of the firmware that does. If set to False, use the following command to enable ATS:
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 set ATS_ENABLED=true

Device #1:
----------
Device type:    ConnectX6
Name:           MCX653105A-HDA_Ax
Description:    ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
Device:         /dev/mst/mt4123_pciconf0

Configurations:                      Next Boot       New
        ATS_ENABLED                  False(0)        True(1)

Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Once you have enabled ATS on the ConnectX-6 NICs in both VMs, put the host in maintenance mode and reboot the ESXi host.
Note: If you have vMotion configured between two hosts, VMs on a host can move to another running host while the necessary reboots occur to enable ATS.
Note: Remember to re-submit the command to enable the ACS Direct Translated bit on the PCIe switch.
After the ESXi host reboot is complete, power back on vCenter and the VMs.
Next, verify that ATS is enabled on the VMs by running the following commands:
sudo mst start
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep -i ATS
sudo lspci -vvv
Search for the Mellanox ConnectX-6 device and verify the output contains the ATS capability, as shown below:
Capabilities: [480 v1] Address Translation Service (ATS)
        ATSCap: Invalidate Queue Depth: 00
        ATSCtl: Enable+, Smallest Translation Unit: 00
Note: Enable+ indicates ATS has been successfully enabled.
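To narrow the lspci output down to just the ATS capability block, a simple filter can be used:
sudo lspci -vvv | grep -A 2 "Address Translation Service"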
To check which NUMA node your NICs and GPUs are attached to, run the following commands on the ESXi host:
esxcli hardware pci list | grep -A 30 -B 10 NVIDIA
esxcli hardware pci list | grep -A 30 -B 10 Mellanox
The following output shows the device's NUMA node:
0000:3b:02.3
   Address: 0000:3b:02.3
   Segment: 0x0000
   Bus: 0x3b
   Slot: 0x02
   Function: 0x3
   VMkernel Name: PF_0.59.0_VF_15
   Vendor Name: NVIDIA Corporation
   Device Name: NVIDIAA100-PCIE-40GB
   Configured Owner: VMkernel
   Current Owner: VMkernel
   Vendor ID: 0x10de
   Device ID: 0x20f1
   SubVendor ID: 0x10de
   SubDevice ID: 0x0000
   Device Class: 0x0302
   Device Class Name: 3D controller
   Programming Interface: 0x00
   Revision ID: 0xa1
   Interrupt Line: 0xff
   IRQ: 255
   Interrupt Vector: 0x00
   PCI Pin: 0xff
   Spawned Bus: 0x00
   Flags: 0x0001
   Module ID: 54
   Module Name: nvidia
   Chassis: 0
   Physical Slot: -1
   Slot Description:
   Device Layer Bus Address: s00000001.00.vf15
   Passthru Capable: true
   Parent Device: PCI 0:58:0:0
   Dependent Device: PCI 0:59:2:3
   Reset Method: Function reset
   FPT Sharable: true
   NUMA Node: 0
   Extended Device ID: 65535
   Extended Device Name:
Make sure the NIC and the GPU are on the same NUMA node.
Within the VM configuration, add a new key-value pair:
numa.nodeAffinity = <numa node value>
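For example, if both the GPU and the NIC report NUMA Node 0, as in the esxcli output above, the entry would be:
numa.nodeAffinity = 0
Inside the guest, nvidia-smi topo -m can additionally be used to cross-check GPU/NIC locality, assuming the NVIDIA driver is already installed in the VM.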
Creating a Dockerfile for Multi-Node Training
Build a Docker image from the Dockerfile below:
FROM nvcr.io/nvaie/tensorflow:21.07-tf1-py3
ARG DEBIAN_FRONTEND=noninteractive

# Set MOFED version, OS version and platform
ENV MOFED_VERSION 5.2-2.2.4.0
# http://content.mellanox.com/ofed/MLNX_OFED-5.2-2.2.4.0/MLNX_OFED_LINUX-5.2-2.2.4.0-ubuntu20.04-x86_64.tgz
ENV OS_VERSION ubuntu20.04
ENV PLATFORM x86_64

RUN pip3 install --user --upgrade pip && \
    pip3 install --no-cache-dir absl-py

RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
        apt-utils build-essential cmake tcsh tcl tk \
        make git curl vim wget ca-certificates \
        iputils-ping net-tools ethtool \
        perl lsb-release python-libxml2 \
        iproute2 pciutils libnl-route-3-200 \
        kmod libnuma1 lsof openssh-server \
        swig libelf1 automake libglib2.0-0 \
        autoconf graphviz chrpath flex libnl-3-200 m4 \
        debhelper autotools-dev gfortran libltdl-dev \
        dmidecode build-essential cmake git zip pciutils hwloc numactl \
        dpatch bison pkg-config numactl dkms udev libnl-route-3-dev libnl-3-dev \
        libmnl0 libmnl-dev expect-dev ncat \
        usbutils iperf3 bc tree \
        quilt \
        landscape-common libpci-dev && \
    rm -rf /var/lib/apt/lists/*
# hugepages libgfortran3 netcat
# linux-headers-$(uname -r)

WORKDIR /workspace
RUN wget http://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VERSION}/MLNX_OFED_LINUX-$MOFED_VERSION-$OS_VERSION-$PLATFORM.tgz && \
    tar -xvf MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}.tgz && \
    MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/mlnxofedinstall --user-space-only --without-fw-update --force && \
    tree /workspace/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/
# dpkg -i /workspace/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/DEBS/libibumad-dev*.deb && \
# dpkg -i /workspace/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/DEBS/libibumad3*.deb
# MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/mlnxofedinstall --dpdk --upstream-libs --without-fw-update --force --umad-dev-rw -q #--user-space-only
# MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/mlnxofedinstall --dpdk --without-fw-update --force -q

# WORKDIR /workspace
# RUN wget https://www.mellanox.com/downloads/MFT/mft-4.16.1-9-x86_64-deb.tgz && \
#     tar xzvf mft-4.16.1-9-x86_64-deb.tgz && \
#     cd mft-4.16.1-9-x86_64-deb && \
#     ./install.sh

WORKDIR /workspace
RUN git clone -b cnn_tf_v1.15_compatible https://github.com/tensorflow/benchmarks.git

WORKDIR /workspace
RUN git clone https://github.com/NVIDIA/nccl-tests && \
    cd nccl-tests && \
    make MPI=1 MPI_HOME=/usr/local/mpi

WORKDIR /workspace
RUN git clone https://github.com/linux-rdma/perftest && \
    cd perftest && \
    ./autogen.sh && \
    CUDA_H_PATH=/usr/local/cuda/include/cuda.h ./configure && \
    make install

WORKDIR /test
RUN rm -f ${_CUDA_COMPAT_PATH}/.*.checked
Run the following command in the same folder as the Dockerfile to build the multi-node Docker image:
sudo docker build -t multinode:latest .
Tag and upload the image to your NVIDIA AI Enterprise private registry:
sudo docker tag multinode <NVIDIA_AI_Enterprise_private_registry_username>/multinode
sudo docker push <NVIDIA_AI_Enterprise_private_registry_username>/multinode
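If you have not yet authenticated against your private registry, log in before pushing; the registry host below is a placeholder:
sudo docker login <NVIDIA_AI_Enterprise_private_registry>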
On a clean install of a system, the ~/.ssh directory is typically empty. However, the following files will be generated or added using the steps within this guide:
- id_rsa and id_rsa.pub: SSH keys used for keyless entry between nodes.
- authorized_keys: A list of RSA public keys from other nodes/systems recognized by a server for SSH access.
- config: A file that provides SSH host key checking settings when accessing other nodes.
- mpicont.sh: A script that we will create to allow MPI to communicate between containers on separate nodes.
- ssh_container/: A directory that contains the files above, but for internode container communication.
- known_hosts: This file is auto-generated by SSH and lists keys for all hosts that a user has ever connected to.
Generating SSH Keys
On the master node, we will create a pair of SSH keys shared between the nodes. Then another pair will be generated for use between containers running on the nodes. We will name each set of keys accordingly for this guide, but the default key names id_rsa and id_rsa.pub are fine.
Host/Worker SSH Keys
Within the command-line terminal, create a new SSH key:
ssh-keygen -t rsa
Enter file in which to save the key (/home/nvidia/.ssh/id_rsa): id_rsa_host
This will generate the following files:
id_rsa_host
id_rsa_host.pub
Container SSH Keys
Make a directory named ssh_container. This directory can be created anywhere, but we will just put it in our ~/.ssh directory for our example:
mkdir ssh_container
cd ssh_container
ssh-keygen -t rsa
Enter file in which to save the key (/home/nvidia/.ssh/id_rsa): <path/to>/ssh_container/id_rsa_cont
Within the ssh_container directory, this will generate:
id_rsa_cont
id_rsa_cont.pub
Creating Config Files for Keyless Entry
In our lab environment, the username is nvidia for our Ubuntu VMs. Please substitute the username in the following steps to reflect the user in your environment. On the master node, create a file called config (~/.ssh/config) and put in the following contents:
Host *
User nvidia
IdentityFile ~/.ssh/id_rsa_host
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
Within the ssh_container directory, create another config file (~/.ssh/ssh_container/config) for the keyless entry between containers:
Host *
User nvidia
IdentityFile /root/.ssh/id_rsa_cont
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
LogLevel=Error
ServerAliveInterval=30
Add Public SSH Keys to authorized_keys
For keyless entry to work on the worker nodes, the contents of the public SSH keys need to be copied to the authorized_keys file, both for internode communication and for communication between the containers on separate nodes.
In the ~/.ssh folder:
echo `cat id_rsa_host.pub` > authorized_keys
In the ~/.ssh/ssh_container folder:
echo `cat id_rsa_cont.pub` > authorized_keys
Create the mpicont.sh Script
Within the ~/.ssh directory, create a script called mpicont.sh with the following contents:
docker exec mpicont /bin/bash -c "$SSH_ORIGINAL_COMMAND"
Then make the script executable:
chmod +x mpicont.sh
Add the Container SSH Key to the Master's authorized_keys File
Add the following line to the master's authorized_keys file:
command="bash /home/nvidia/.ssh/mpicont.sh",no-port-forwarding,no-agent-forwarding,no-X11-forwarding <add contents of id_rsa_cont.pub>
Copy ~/.ssh to Worker Nodes and Confirm Keyless Entry
Now we can copy all the files from the master node's ~/.ssh directory to all of the worker nodes we specified in our node list.
for node in <worker1_IP> <worker2_IP> <worker3_IP>; do scp -r ~/.ssh ${node}:/home/nvidia/; done
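To confirm keyless entry, an SSH session from the master to each worker should now complete without a password prompt; for example (the worker IP is a placeholder):
ssh <worker1_IP> hostname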
Change Permissions in the ssh_container Directory on All Nodes
On all the nodes, change the ownership of the ssh_container/config file so that the owner is root:
sudo chown root:root config
Then change the permissions to 600 for all files in the ssh_container folder:
sudo chmod 600 *
Below is a list of all the files that were copied over to the workers and their proper permissions:
~/.ssh$ ll *
-rw------- 1 nvidia nvidia 894 Jan 24 17:46 authorized_keys
-rw-r--r-- 1 nvidia nvidia 125 Jan 24 14:21 config
-rw------- 1 nvidia nvidia 1675 Jan 24 14:19 id_rsa_host
-rw-r--r-- 1 nvidia nvidia 396 Jan 24 14:19 id_rsa_host.pub
-rwxrwxr-x 1 nvidia nvidia 57 Jan 24 15:55 mpicont.sh*
ssh_container:
total 24
drwxrwxr-x 2 nvidia nvidia 4096 Feb 6 16:50 ./
drwxrwxr-x 4 nvidia nvidia 4096 Feb 7 11:29 ../
-rw------- 1 nvidia nvidia 396 Jan 24 15:58 authorized_keys
-rw------- 1 root root 161 Jan 24 17:54 config
-rw------- 1 nvidia nvidia 1675 Jan 24 15:58 id_rsa_cont
-rw------- 1 nvidia nvidia 396 Jan 24 15:58 id_rsa_cont.pub
Now run Docker containers on all the worker nodes, using the following command:
sudo docker run -it --gpus=all --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 --shm-size=1g --name=mpicont --device=/dev/infiniband -v /home/nvidia/.ssh/ssh_container:/root/.ssh <NVIDIA_AI_Enterprise_private_registry_username>/multinode:latest sleep infinity
On the master node, run:
sudo docker run -it --gpus=all --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 --shm-size=1g --name=mpicont --device=/dev/infiniband -v /home/nvidia/.ssh/ssh_container:/root/.ssh <NVIDIA_AI_Enterprise_private_registry_username>/multinode:latest /bin/bash
To test that the SSH keyless MPI setup is working, run the following command, adjusting the host list and -np value for the number of workers you have:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" hostname
To verify the available GPUs on all worker nodes, run the following command:
mpirun --allow-run-as-root -H <worker1_IP>,<worker2_IP>,<worker3_IP> -np "3" nvidia-smi
In our lab environment, the np parameter (the number of processes or, in other words, the number of GPUs) is 4. Please modify the np parameter to reflect your environment.
The output of the hostname command should list the hostnames of all four nodes.
Install nv_peer_memory
On each of the nodes, install the nv_peer_mem module:
git clone https://github.com/Mellanox/nv_peer_memory.git
cd nv_peer_memory
./build_module.sh
cd /tmp
tar xzf /tmp/nvidia-peer-memory_1.0.orig.tar.gz
cd nvidia-peer-memory-1.0
dpkg-buildpackage -us -uc
sudo dpkg -i <path to generated deb files>
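After installing the package on each node, confirm that the kernel module loads; a minimal check (nv_peer_mem is the module name used by the nv_peer_memory project):
sudo modprobe nv_peer_mem
lsmod | grep nv_peer_mem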
Ensure that the SSH keyless MPI setup is still working with the command below:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" hostname
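Before launching the full training benchmark, you can optionally use the nccl-tests binaries built into the container image to confirm that NCCL traffic runs over RoCE between the nodes. This is a sketch only, assuming one GPU per node and the same four-node host list; adjust -np and the message size range as needed:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO /workspace/nccl-tests/build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
With NCCL_DEBUG=INFO, the log should report an mlx5 device being selected for the NET/IB transport when RoCE is in use.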
Run the following command to launch the example ResNet-50 multi-node benchmark, adjusting the host list and -np value for your worker node count:
mpirun --allow-run-as-root -H <master_IP>,<worker1_IP>,<worker2_IP>,<worker3_IP> -np "4" -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO python3 /workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 512 --use_fp16 --variable_update=horovod --xla=True
Interpreting the Results
This benchmark reports images per second training performance at each reporting iteration. Use the last few values reported to represent training performance.
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
1 images/sec: 2100.6 +/- 0.0 (jitter = 0.0) 7.738
1 images/sec: 2100.8 +/- 0.0 (jitter = 0.0) 7.742
1 images/sec: 2100.2 +/- 0.0 (jitter = 0.0) 7.734
1 images/sec: 2100.8 +/- 0.0 (jitter = 0.0) 7.770
10 images/sec: 2100.0 +/- 61.9 (jitter = 6.6) 7.607
10 images/sec: 2100.4 +/- 60.4 (jitter = 189.7) 7.656
10 images/sec: 2100.9 +/- 59.2 (jitter = 88.7) 7.611
10 images/sec: 2100.9 +/- 59.0 (jitter = 175.8) 7.647
20 images/sec: 2100.2 +/- 39.4 (jitter = 92.3) 7.527
20 images/sec: 2100.2 +/- 43.8 (jitter = 198.3) 7.515
20 images/sec: 2100.1 +/- 41.1 (jitter = 181.8) 7.512
20 images/sec: 2100.1 +/- 43.0 (jitter = 14.7) 7.501
30 images/sec: 2100.9 +/- 34.9 (jitter = 198.3) 7.490
30 images/sec: 2100.4 +/- 35.3 (jitter = 11.1) 7.474
30 images/sec: 2100.7 +/- 33.3 (jitter = 92.9) 7.483
30 images/sec: 2100.3 +/- 34.9 (jitter = 157.3) 7.493
40 images/sec: 2100.5 +/- 28.3 (jitter = 76.4) 7.476
40 images/sec: 2100.9 +/- 31.2 (jitter = 193.8) 7.476
40 images/sec: 2100.5 +/- 31.2 (jitter = 186.9) 7.483
40 images/sec: 2100.2 +/- 31.5 (jitter = 18.9) 7.474
50 images/sec: 2100.8 +/- 28.1 (jitter = 15.0) 7.480
50 images/sec: 2100.3 +/- 28.3 (jitter = 168.8) 7.468
50 images/sec: 2100.7 +/- 25.7 (jitter = 76.4) 7.485
50 images/sec: 2100.2 +/- 27.4 (jitter = 218.1) 7.485
60 images/sec: 2100.2 +/- 25.6 (jitter = 173.0) 7.485
60 images/sec: 2100.3 +/- 23.3 (jitter = 66.1) 7.501
60 images/sec: 2100.4 +/- 24.8 (jitter = 190.7) 7.480
60 images/sec: 2100.2 +/- 26.4 (jitter = 20.6) 7.493
70 images/sec: 2100.4 +/- 24.3 (jitter = 16.4) 7.495
70 images/sec: 2100.4 +/- 23.9 (jitter = 157.3) 7.498
70 images/sec: 2100.0 +/- 22.1 (jitter = 52.3) 7.503
70 images/sec: 2100.5 +/- 23.4 (jitter = 218.3) 7.509
80 images/sec: 2100.3 +/- 22.4 (jitter = 157.3) 7.490
80 images/sec: 2100.2 +/- 20.6 (jitter = 50.7) 7.510
80 images/sec: 2100.6 +/- 21.7 (jitter = 195.2) 7.520
80 images/sec: 2100.2 +/- 22.4 (jitter = 30.3) 7.508
90 images/sec: 2100.8 +/- 21.2 (jitter = 22.3) 7.481
90 images/sec: 2100.1 +/- 20.8 (jitter = 157.3) 7.489
90 images/sec: 2100.7 +/- 19.7 (jitter = 35.1) 7.496
90 images/sec: 2100.7 +/- 20.7 (jitter = 218.1) 7.471
100 images/sec: 2100.2 +/- 20.2 (jitter = 30.3) 7.501
----------------------------------------------------------------
total images/sec: 8400.46
----------------------------------------------------------------
100 images/sec: 1520.1 +/- 19.9 (jitter = 166.6) 7.522
----------------------------------------------------------------
total images/sec: 8400.99
----------------------------------------------------------------
100 images/sec: 1517.6 +/- 18.6 (jitter = 52.3) 7.507
----------------------------------------------------------------
total images/sec: 8400.84
----------------------------------------------------------------
100 images/sec: 1517.9 +/- 19.6 (jitter = 219.0) 7.500
----------------------------------------------------------------
total images/sec: 8400.58
----------------------------------------------------------------