Step #2: Multi-Node Training

Multi-Node Training for AI on Kubernetes (VMware Tanzu)

Now that you have trained the image classification model on a single node, you will take a look at how to scale deep learning training across two nodes using the MPI operator on Kubernetes with VMware Tanzu.

Note

As part of this lab, you have been given access to a three-node Tanzu Kubernetes Cluster. Each node of this cluster is a VM with 16 CPU cores, 64 GB of RAM, and an NVIDIA A30 GPU with 24 GB of GPU memory attached. The cluster has the GPU operator already installed.

The GPU operator installs the GPU driver on each node of the cluster as well as plugins that make your Kubernetes cluster GPU aware. If you want to learn more about how to create a Tanzu Kubernetes cluster and install the GPU operator, request access to the VMware Tanzu IT administrator experience on NVIDIA Launchpad.

Prior to running the multi-node training, you will install the MPI operator.

  1. Open the VM Console link on the left navigation pane.

    Note

    This console gives you access to the kubectl client.


  2. Run the following command on the console to log in to the Kubernetes cluster.


    bash login.sh


  3. Enter the following password when prompted.

    Login password: ${sso_password}

    A pod security policy is required to deploy workloads on a Tanzu Kubernetes Cluster, so the mpi-operator namespace needs permission to launch pods. A Kubernetes RoleBinding with the default security policy has already been created for you. Its contents are shown below for reference.


    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: psp:vmware-system-privileged:default
      namespace: mpi-operator
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: psp:vmware-system-privileged
    subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: Group


  4. Apply the role binding by running the following command.


    kubectl apply -f rolebinding.yaml


  5. Clone the MPI operator repository and apply its manifest to install the MPI operator on the cluster.


    git clone https://github.com/kubeflow/mpi-operator
    cd mpi-operator
    kubectl apply -f deploy/v2beta1/mpi-operator.yaml

    Note

    The MPI operator launches a master container/pod and multiple worker containers/pods (as many as the number of GPU workers specified). All the workers talk to the master pod by first establishing an SSH connection. The master pod acts as the parameter server during the allreduce, as discussed previously.
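    As a rough illustration of the allreduce described above, here is a plain-Python sketch (not the MPI operator's or Horovod's actual implementation): each worker contributes its local gradient and every worker receives the elementwise mean back.

```python
# Illustrative sketch of data-parallel gradient averaging via allreduce.
# Each worker computes a local gradient on its shard of the batch; the
# allreduce returns the elementwise mean, which every worker then applies.

def allreduce_mean(worker_grads):
    """Average per-worker gradient vectors elementwise, as an allreduce would."""
    n = len(worker_grads)
    summed = [sum(vals) for vals in zip(*worker_grads)]
    return [s / n for s in summed]

# Two workers, each holding a local gradient for the same three parameters.
grads = [[0.2, 0.4, 0.6], [0.4, 0.6, 0.8]]
avg = allreduce_mean(grads)  # every worker applies this same averaged update
```

    In the real system this exchange happens over the workers' SSH/MPI channels rather than in one process, but the arithmetic is the same.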

    Next, you will use the NVIDIA AI Enterprise TensorFlow container as the base image and add SSH libraries for inter-container communication. You will also disable SSH strict host key checking to allow passwordless communication. The Dockerfile for the MPI operator is readily available on the operator's GitHub page.


    FROM nvcr.io/nvaie/tensorflow:22.05-tf2-py3

    ARG port=2222

    RUN apt update && apt install -y --no-install-recommends \
        openssh-server \
        openssh-client \
        libcap2-bin \
        && rm -rf /var/lib/apt/lists/*

    # Add privilege separation directory to run sshd as root.
    RUN mkdir -p /var/run/sshd
    RUN mkdir -p /opt/nvidia
    COPY image_classification.py /opt/nvidia/image_classification.py

    # Add capability to run sshd as non-root.
    RUN setcap CAP_NET_BIND_SERVICE=+eip /usr/sbin/sshd

    # Allow OpenSSH to talk to containers without asking for confirmation
    # by disabling StrictHostKeyChecking.
    # mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
    # to disable UserKnownHostsFile to avoid write permissions.
    # Disabling StrictModes avoids directory and files read permission checks.
    RUN sed -i "s/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g" /etc/ssh/ssh_config \
        && echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config \
        && sed -i "s/[ #]\(.*Port \).*/ \1$port/g" /etc/ssh/ssh_config \
        && sed -i "s/#\(StrictModes \).*/\1no/g" /etc/ssh/sshd_config \
        && sed -i "s/#\(Port \).*/\1$port/g" /etc/ssh/sshd_config

    RUN useradd -m mpiuser
    WORKDIR /home/mpiuser

    # Configurations for running sshd as non-root.
    COPY --chown=mpiuser sshd_config .sshd_config
    RUN echo "Port $port" >> /home/mpiuser/.sshd_config
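    To see what the sed edits in the Dockerfile accomplish, here is a hedged Python sketch that applies equivalent substitutions to a small sample of ssh_config (the sample lines are illustrative, not the container's full file):

```python
import re

port = "2222"

# A sample snippet of /etc/ssh/ssh_config (illustrative, not the full file).
ssh_config = "Host *\n#   StrictHostKeyChecking ask\n#   Port 22\n"

# Mirrors: sed "s/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g"
# -> uncomment the option and force it to "no".
ssh_config = re.sub(r"[ #](.*StrictHostKeyChecking ).*", r" \1no", ssh_config)

# Mirrors: echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config
# -> avoid writing to known_hosts (the .ssh folder comes from a Secret).
ssh_config += "    UserKnownHostsFile /dev/null\n"

# Mirrors: sed "s/[ #]\(.*Port \).*/ \1$port/g"
# -> point the SSH client at the non-default sshd port.
ssh_config = re.sub(r"[ #](.*Port ).*", r" \g<1>" + port, ssh_config)
```

    After these substitutions the sample config reads `StrictHostKeyChecking no` and `Port 2222`, which is exactly what the passwordless container-to-container SSH setup relies on.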

    Important

    In a production scenario, you would build the container (docker build) and upload it to your private container registry. For this lab, however, the container has already been built and uploaded to NVIDIA LaunchPad's private registry on NGC (NVIDIA GPU Cloud) and is ready for you to use: nvcr.io/nvlp-aienterprise/nvaie-mpi-operator.


  6. Let’s take a look at the MPI Job spec used to run the multi-node training:

    The key values of the spec are marked (1) through (14) below, and we will go over each of them in detail to understand how the Kubernetes MPI job is created for multi-node training. The spec is already available in the VM Console for you to run the training.


    apiVersion: kubeflow.org/v2beta1
    kind: MPIJob                                                           # (1)
    metadata:
      name: tensorflow-launchpad                                           # (2)
    spec:
      slotsPerWorker: 1
      runPolicy:
        cleanPodPolicy: Running                                            # (3)
      mpiReplicaSpecs:
        Launcher:                                                          # (4)
          replicas: 1                                                      # (5)
          template:
            spec:
              imagePullSecrets:                                            # (6)
                - name: imagepullsecret
              containers:
                - image: nvcr.io/nvlp-aienterprise/nvaie-mpi-operator:latest  # (7)
                  name: tensorflow-launchpad
                  command:
                    - mpirun                                               # (8)
                    - --allow-run-as-root
                    - -np
                    - "2"
                    - -bind-to
                    - none
                    - -map-by
                    - slot
                    - -x
                    - NCCL_DEBUG=INFO
                    - -x
                    - LD_LIBRARY_PATH
                    - -x
                    - PATH
                    - -mca
                    - pml
                    - ob1
                    - -mca
                    - btl
                    - ^openib
                    - python3
                    - /opt/nvidia/image_classification.py
                    - --dataset_path=/datasets
        Worker:                                                            # (9)
          replicas: 2                                                      # (10)
          template:
            spec:
              containers:
                - image: nvcr.io/nvlp-aienterprise/nvaie-mpi-operator:latest
                  name: tensorflow-launchpad
                  volumeMounts:                                            # (11)
                    - mountPath: /datasets
                      name: dataset-volume
                  resources:
                    limits:
                      nvidia.com/gpu: 1                                    # (12)
              imagePullSecrets:
                - name: imagepullsecret
              initContainers:                                              # (13)
                - name: dataset-download
                  image: ubuntu:latest
                  volumeMounts:
                    - mountPath: /datasets
                      name: dataset-volume
                  command: ["/bin/sh", "-c"]
                  args:
                    - "apt-get update; apt install -y wget unzip; \
                      wget ftp://cs.stanford.edu/cs/cvgl/Stanford_Online_Products.zip -P /datasets; \
                      unzip /datasets/Stanford_Online_Products.zip -d /datasets/; \
                      chmod 777 /datasets/*"
              volumes:                                                     # (14)
                - name: dataset-volume
                  hostPath:
                    path: /opt/datasets
                    type: DirectoryOrCreate

    1. We first define the type/kind of Kubernetes resource we want. In this case it is a custom resource of type MPIJob (from mpi-operator).

    2. The name metadata tag names the MPI master and worker pods in the cluster; here it is set to tensorflow-launchpad.

    3. The cleanPodPolicy controls the deletion of pods when the job terminates. It is set to Running, i.e. only pods that are still running are deleted when the job finishes.

    4. The mpiReplicaSpecs is where the actual specification of the master and worker pods is defined. The specification for the master pod/container is defined under the Launcher sub spec and the specification for the worker under the Worker sub spec.

    5. The number of replicas on the Launcher is set to 1 as we need only one parameter server.

    6. The imagePullSecrets entry references a Kubernetes Secret that is required to pull the container image from a private registry, in our case LaunchPad’s private registry on NGC.

    7. The container spec’s image key is set to the container image we built from the Dockerfile shown previously and uploaded to the LaunchPad registry.

    8. The command entry holds the actual MPI command that runs on the master pod/container, in this case the mpirun utility. We set -np, i.e. the number of worker processes, to 2, and the MPI operator will use two GPUs, one per process, to run the training. We also run image_classification.py, the same script we used to explain Horovod in the single-node example, and we set --dataset_path to the dataset volume mounted in the container.

    9. Next, we show the specification for the worker as part of the Worker sub spec.

    10. We set the number of worker replicas to 2, as we want to use two GPUs, one per process, as specified by the -np flag of the mpirun command.

    11. The volumeMounts spec mounts the dataset-volume created under (14) into the worker pods at the /datasets path, which is then passed to the image_classification.py script as part of the mpirun command shown previously.

    12. We also request one GPU per worker pod/container.

    13. Before running the training, we create an init container which downloads and unzips the Stanford Online Products dataset to the /datasets path where the volume is mounted, as part of the training initialization process.

    14. Finally, the volumes sub-spec creates a folder (i.e. a hostPath volume) on each node/VM in the Kubernetes cluster, which is mounted into each worker pod.
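One practical consistency rule in the spec above: mpirun's -np value must not exceed Worker replicas × slotsPerWorker, or the launcher will wait for worker slots that never register. A hypothetical sanity check (plain Python, not part of the lab files) over a cut-down version of the spec:

```python
# A cut-down dict mirroring the MPIJob spec above (illustrative only).
mpijob = {
    "kind": "MPIJob",
    "spec": {
        "slotsPerWorker": 1,
        "mpiReplicaSpecs": {
            "Launcher": {"replicas": 1},
            "Worker": {
                "replicas": 2,
                "gpu_limit_per_pod": 1,  # stands in for nvidia.com/gpu: 1
            },
        },
    },
}
np_flag = 2  # the "-np 2" passed to mpirun in the Launcher command

worker = mpijob["spec"]["mpiReplicaSpecs"]["Worker"]
total_slots = worker["replicas"] * mpijob["spec"]["slotsPerWorker"]

# One MPI process per slot; each slot here is backed by one GPU.
assert np_flag <= total_slots, "mpirun -np must not exceed worker slots"
```

With 2 worker replicas, 1 slot per worker, and 1 GPU per worker pod, -np 2 lines up exactly: two processes, two slots, two GPUs.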

Now that we have learnt a bit about the Kubernetes spec for multi-node training, let’s see the training in action.

On your VM Console you should be able to see multinode-training.yaml in the home directory. Run the following command to start the multinode training.


kubectl apply -f multinode-training.yaml -n mpi-operator


The training will take a few minutes to begin while Kubernetes pulls the image from the private registry and starts the pods. After a few minutes, you should see two worker pods and one master pod running the training when you run the following command.


kubectl get pods -n mpi-operator


Finally, you can follow the training logs using:


kubectl logs -f <name_of_the_master_pod> -n mpi-operator


© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.