Step #2: Multi-Node Training

Multi-Node Training for AI on Kubernetes (VMware Tanzu)

Now that you have trained the image classification model on a single node, you will take a look at how to scale deep learning training across two nodes using the MPI operator on Kubernetes with VMware Tanzu.

Note

As part of this lab, you have been given access to a three-node Tanzu Kubernetes Cluster. Each node of this cluster is a VM with 16 CPU cores, 64 GB of RAM, and an NVIDIA A30 GPU with 24 GB of GPU memory attached. The cluster has the GPU operator already installed.

The GPU operator installs the GPU driver on each node of the cluster as well as plugins that make your Kubernetes cluster GPU aware. If you want to learn more about how to create a Tanzu Kubernetes cluster and install the GPU operator, request access to the VMware Tanzu IT administrator experience on NVIDIA Launchpad.

Prior to running the multi-node training, you will install the MPI operator.

  1. Open the VM Console link on the left navigation pane.

    Note

    This console gives you access to the kubectl client.


  2. Run the following command on the console to log in to the Kubernetes cluster.


    bash login.sh


  3. Enter the following password when prompted.

    Login password: ${sso_password}

    A pod security policy is required to deploy workloads on a Tanzu Kubernetes Cluster, so the mpi-operator namespace needs permission to launch pods. A Kubernetes RoleBinding with the default security policy has already been created for you. Its contents are shown below for reference.


    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: psp:vmware-system-privileged:default
      namespace: mpi-operator
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: psp:vmware-system-privileged
    subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: Group


  4. Apply the role binding by running the following command.


    kubectl apply -f rolebinding.yaml


  5. Clone the MPI operator repository and apply its manifest to install the MPI operator on the cluster.


    git clone https://github.com/kubeflow/mpi-operator
    cd mpi-operator
    kubectl apply -f deploy/v2beta1/mpi-operator.yaml

    Note

    The MPI operator launches a master container/pod and multiple worker containers/pods (as many as the number of GPU workers specified). All the workers talk to the master pod by first establishing an SSH connection. The master pod acts as the parameter server during the allreduce, as discussed previously.
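    As a rough illustration of the allreduce described above, here is a plain-Python sketch (not the MPI operator's or Horovod's actual implementation): each worker contributes its local gradient and every worker receives the elementwise mean back.

```python
# Illustrative sketch of data-parallel gradient averaging via allreduce.
# Each worker computes a local gradient on its shard of the batch; the
# allreduce returns the elementwise mean, which every worker then applies.

def allreduce_mean(worker_grads):
    """Average per-worker gradient vectors elementwise, as an allreduce would."""
    n = len(worker_grads)
    summed = [sum(vals) for vals in zip(*worker_grads)]
    return [s / n for s in summed]

# Two workers, each holding a local gradient for the same three parameters.
grads = [[0.2, 0.4, 0.6], [0.4, 0.6, 0.8]]
avg = allreduce_mean(grads)  # every worker applies this same averaged update
```

    In the real system this exchange happens over the workers' SSH/MPI channels rather than in one process, but the arithmetic is the same.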

    Next, you will use the NVIDIA AI Enterprise TensorFlow container as the base image and add SSH libraries for inter-container communication. You will also disable SSH strict host key checking to allow passwordless communication. The Dockerfile for the MPI operator is readily available on the operator's GitHub page.


    FROM nvcr.io/nvaie/tensorflow:22.05-tf2-py3

    ARG port=2222

    RUN apt update && apt install -y --no-install-recommends \
        openssh-server \
        openssh-client \
        libcap2-bin \
        && rm -rf /var/lib/apt/lists/*

    # Add privilege separation directory to run sshd as root.
    RUN mkdir -p /var/run/sshd
    RUN mkdir -p /opt/nvidia
    COPY image_classification.py /opt/nvidia/image_classification.py

    # Add capability to run sshd as non-root.
    RUN setcap CAP_NET_BIND_SERVICE=+eip /usr/sbin/sshd

    # Allow OpenSSH to talk to containers without asking for confirmation
    # by disabling StrictHostKeyChecking.
    # mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
    # to disable UserKnownHostsFile to avoid write permissions.
    # Disabling StrictModes avoids directory and files read permission checks.
    RUN sed -i "s/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g" /etc/ssh/ssh_config \
        && echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config \
        && sed -i "s/[ #]\(.*Port \).*/ \1$port/g" /etc/ssh/ssh_config \
        && sed -i "s/#\(StrictModes \).*/\1no/g" /etc/ssh/sshd_config \
        && sed -i "s/#\(Port \).*/\1$port/g" /etc/ssh/sshd_config

    RUN useradd -m mpiuser
    WORKDIR /home/mpiuser

    # Configurations for running sshd as non-root.
    COPY --chown=mpiuser sshd_config .sshd_config
    RUN echo "Port $port" >> /home/mpiuser/.sshd_config
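    To see what the sed edits in the Dockerfile accomplish, here is a hedged Python sketch that applies equivalent substitutions to a small sample of ssh_config (the sample lines are illustrative, not the container's full file):

```python
import re

port = "2222"

# A sample snippet of /etc/ssh/ssh_config (illustrative, not the full file).
ssh_config = "Host *\n#   StrictHostKeyChecking ask\n#   Port 22\n"

# Mirrors: sed "s/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g"
# -> uncomment the option and force it to "no".
ssh_config = re.sub(r"[ #](.*StrictHostKeyChecking ).*", r" \1no", ssh_config)

# Mirrors: echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config
# -> avoid writing to known_hosts (the .ssh folder comes from a Secret).
ssh_config += "    UserKnownHostsFile /dev/null\n"

# Mirrors: sed "s/[ #]\(.*Port \).*/ \1$port/g"
# -> point the SSH client at the non-default sshd port.
ssh_config = re.sub(r"[ #](.*Port ).*", r" \g<1>" + port, ssh_config)
```

    After these substitutions the sample config reads `StrictHostKeyChecking no` and `Port 2222`, which is exactly what the passwordless container-to-container SSH setup relies on.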

    Important

    In a production scenario, you would build the container (docker build) and upload it to your private container registry. For this lab, however, the container has already been built and uploaded to NVIDIA LaunchPad's private registry on NGC (NVIDIA GPU Cloud) and is ready for you to use: nvcr.io/nvlp-aienterprise/nvaie-mpi-operator.


  6. Let’s take a look at the MPI Job spec used to run the multi-node training:

    The key values of the spec are marked (1) through (14) below, and we will go over each of them in detail to understand how the Kubernetes MPI job is created for multi-node training. The spec is already available in the VM Console for you to run the training.


    apiVersion: kubeflow.org/v2beta1
    kind: MPIJob                                                           # (1)
    metadata:
      name: tensorflow-launchpad                                           # (2)
    spec:
      slotsPerWorker: 1
      runPolicy:
        cleanPodPolicy: Running                                            # (3)
      mpiReplicaSpecs:
        Launcher:                                                          # (4)
          replicas: 1                                                      # (5)
          template:
            spec:
              imagePullSecrets:                                            # (6)
                - name: imagepullsecret
              containers:
                - image: nvcr.io/nvlp-aienterprise/nvaie-mpi-operator:latest  # (7)
                  name: tensorflow-launchpad
                  command:
                    - mpirun                                               # (8)
                    - --allow-run-as-root
                    - -np
                    - "2"
                    - -bind-to
                    - none
                    - -map-by
                    - slot
                    - -x
                    - NCCL_DEBUG=INFO
                    - -x
                    - LD_LIBRARY_PATH
                    - -x
                    - PATH
                    - -mca
                    - pml
                    - ob1
                    - -mca
                    - btl
                    - ^openib
                    - python3
                    - /opt/nvidia/image_classification.py
                    - --dataset_path=/datasets
        Worker:                                                            # (9)
          replicas: 2                                                      # (10)
          template:
            spec:
              containers:
                - image: nvcr.io/nvlp-aienterprise/nvaie-mpi-operator:latest
                  name: tensorflow-launchpad
                  volumeMounts:                                            # (11)
                    - mountPath: /datasets
                      name: dataset-volume
                  resources:
                    limits:
                      nvidia.com/gpu: 1                                    # (12)
              imagePullSecrets:
                - name: imagepullsecret
              initContainers:                                              # (13)
                - name: dataset-download
                  image: ubuntu:latest
                  volumeMounts:
                    - mountPath: /datasets
                      name: dataset-volume
                  command: ["/bin/sh", "-c"]
                  args:
                    - "apt-get update; apt install -y wget unzip; \
                      wget ftp://cs.stanford.edu/cs/cvgl/Stanford_Online_Products.zip -P /datasets; \
                      unzip /datasets/Stanford_Online_Products.zip -d /datasets/; \
                      chmod 777 /datasets/*"
              volumes:                                                     # (14)
                - name: dataset-volume
                  hostPath:
                    path: /opt/datasets
                    type: DirectoryOrCreate

    1. We first define the type/kind of Kubernetes resource we want. In this case it is a custom resource of type MPIJob (from mpi-operator).

    2. The name metadata tag names the MPI master and worker pods in the cluster; here it is set to tensorflow-launchpad.

    3. The cleanPodPolicy controls the deletion of pods when the job terminates. It is set to Running, i.e. only pods that are still running are deleted when the job finishes.

    4. The mpiReplicaSpecs is where the actual specification of the master and worker pods is defined. The specification for the master pod/container is defined under the Launcher sub spec and the specification for the worker under the Worker sub spec.

    5. The number of replicas on the Launcher is set to 1 as we need only one parameter server.

    6. The imagePullSecrets entry references a Kubernetes Secret that is required to pull the container image from a private registry, in our case LaunchPad’s private registry on NGC.

    7. The container spec’s image key is set to the container image we built from the Dockerfile shown previously and uploaded to the LaunchPad registry.

    8. The command entry holds the actual MPI command that runs on the master pod/container, in this case the mpirun utility. We set -np, i.e. the number of worker processes, to 2, and the MPI operator will use two GPUs, one per process, to run the training. We also run image_classification.py, the same script we used to explain Horovod in the single-node example, and we set --dataset_path to the dataset volume mounted in the container.

    9. Next, we show the specification for the worker as part of the Worker sub spec.

    10. We set the number of worker replicas to 2, as we want to use two GPUs, one per process, as specified by the -np flag of the mpirun command.

    11. The volumeMounts spec mounts the dataset-volume created under (14) into the worker pods at the /datasets path, which is then passed to the image_classification.py script as part of the mpirun command shown previously.

    12. We also request one GPU per worker pod/container.

    13. Before running the training, we create an init container which downloads and unzips the Stanford Online Products dataset to the /datasets path where the volume is mounted, as part of the training initialization process.

    14. Finally, the volumes sub-spec creates a folder (i.e. a hostPath volume) on each node/VM in the Kubernetes cluster, which is mounted into each worker pod.
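One practical consistency rule in the spec above: mpirun's -np value must not exceed Worker replicas × slotsPerWorker, or the launcher will wait for worker slots that never register. A hypothetical sanity check (plain Python, not part of the lab files) over a cut-down version of the spec:

```python
# A cut-down dict mirroring the MPIJob spec above (illustrative only).
mpijob = {
    "kind": "MPIJob",
    "spec": {
        "slotsPerWorker": 1,
        "mpiReplicaSpecs": {
            "Launcher": {"replicas": 1},
            "Worker": {
                "replicas": 2,
                "gpu_limit_per_pod": 1,  # stands in for nvidia.com/gpu: 1
            },
        },
    },
}
np_flag = 2  # the "-np 2" passed to mpirun in the Launcher command

worker = mpijob["spec"]["mpiReplicaSpecs"]["Worker"]
total_slots = worker["replicas"] * mpijob["spec"]["slotsPerWorker"]

# One MPI process per slot; each slot here is backed by one GPU.
assert np_flag <= total_slots, "mpirun -np must not exceed worker slots"
```

With 2 worker replicas, 1 slot per worker, and 1 GPU per worker pod, -np 2 lines up exactly: two processes, two slots, two GPUs.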

Now that we have learnt a bit about the Kubernetes spec for multi-node training, let’s see the training in action.

On your VM Console you should be able to see multinode-training.yaml in the home directory. Run the following command to start the multinode training.


kubectl apply -f multinode-training.yaml -n mpi-operator


The training will take a few minutes to begin while Kubernetes pulls the image from the private registry and starts the pods. After a few minutes, you should see two worker pods and one master pod running the training when you run the following command.


kubectl get pods -n mpi-operator


Finally, you can follow the training logs using:


kubectl logs -f <name_of_the_master_pod> -n mpi-operator


© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.