Setup

The TAO Toolkit API service can run on any Kubernetes platform. The two platforms officially supported are AWS EKS and Bare-Metal.

Hardware minimum requirements

1 or more GPU node(s) where all GPUs within a given node match.

  • 32 GB system RAM

  • 32 GB of GPU RAM

  • 8 core CPU

  • 1 NVIDIA Discrete GPU: Volta, Turing, Ampere architecture

  • 16 GB of SSD space
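
You can quickly sanity-check a node against these minimums with standard Linux commands; this is an optional check, not part of the installation, and GPU memory cannot be verified until a driver is installed.

free -h                  # system RAM (expect 32 GB or more)
nproc                    # CPU cores (expect 8 or more)
df -h /                  # free disk space (expect 16 GB or more on SSD)
lspci | grep -i nvidia   # confirm an NVIDIA discrete GPU is present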

Software requirement

Before installing Kubernetes, each node in the cluster should be set up with a fresh installation of Ubuntu 18.04 or later.

No NVIDIA driver

List installed drivers.

apt list --installed | grep '^nvidia-.*'

Remove any that are installed.

sudo apt-get remove --purge '^nvidia-.*'


No nouveau driver

Blacklist the nouveau driver.

echo -e "\nblacklist nouveau\noptions nouveau modeset=0\n" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf

Regenerate the kernel initramfs.

sudo update-initramfs -u

Reboot.

sudo reboot
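
After the reboot, you can optionally confirm that the nouveau module is no longer loaded; the command should print nothing.

lsmod | grep nouveau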


SSH Passwordless

Before installing Kubernetes, one needs to set up passwordless SSH so that the node driving the installation can run remote commands on each node of your cluster.

First, one must make sure an ssh key has been generated for the local current user.

[ ! -f ~/.ssh/id_rsa ] && ssh-keygen -t rsa -b 4096

Then, one must allow this user to log in as a sudoer on each node of your cluster.

ssh-copy-id $USER@node1
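
If your cluster has several nodes, the key can be copied to each of them in one pass. The node names below (node1, node2, node3) are placeholders for your own hostnames or IP addresses.

for node in node1 node2 node3; do ssh-copy-id $USER@$node; done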

If the user is not already a sudoer on each node of your cluster, then one must log in as root on each node and add the user to the node’s /etc/sudoers file, replacing $user below with the actual username.

echo "$user ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers


Kubernetes

We suggest deploying Kubernetes using Kubespray. The most recent tested version is v2.18.0.

First, check /etc/resolv.conf on each node of your cluster and make sure that only one or two search domains are specified, as in the following example.

cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search nvidia.com nvc.nvidia.com

Then, swap should be disabled on each node of your cluster.

sudo swapoff -a
sudo sed -i.bak '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

On your node driving the installation, get a local copy of the Kubespray project.

git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
git checkout v2.18.0

Make sure Ansible is not already installed from the Ubuntu distribution.

sudo apt remove ansible

Install dependencies.

sudo pip3 install -r requirements.txt

Create a new inventory.

cp -rfp inventory/sample inventory/mycluster

Customize your inventory. Please use IP addresses of the nodes in your cluster. A one-node cluster is supported.

declare -a IPS=(172.17.171.248 172.17.171.249 172.17.171.250)
CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
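
You can review and, if needed, hand-edit the generated inventory file before proceeding.

cat inventory/mycluster/hosts.yaml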

Proceed with the installation of Kubernetes in your cluster.

ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

If something went wrong, one can undo the cluster installation with:

ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root reset.yml

After a successful installation, one must copy the kubeconfig file for the cluster admin user. The command below assumes that the installation node is one of your cluster nodes.

mkdir ~/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $USER ~/.kube/config

The most commonly used Kubernetes commands are the following. Note that the -n option to specify a namespace is optional and defaults to the default namespace; the -A option selects all namespaces.

kubectl get nodes
kubectl get pods -A
kubectl get pods -n default
kubectl get services -n default
kubectl get storageclasses -n default
kubectl get persistentvolumes -n default
kubectl get persistentvolumeclaims -n default
kubectl get pvc -n default
kubectl describe pods tao-toolkit-api -n default
kubectl logs tao-toolkit-api -n default

For a more exhaustive cheat sheet, please refer to: https://kubernetes.io/docs/reference/kubectl/cheatsheet/

Helm

Several online sites provide tutorials and cheat sheets for Helm commands, for example: https://www.tutorialworks.com/helm-cheatsheet/

One can install Helm manually with:

sudo apt-get install helm
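
Whichever installation method you use, you can verify the Helm client afterwards with:

helm version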


NVIDIA GPU Operator

One can make sure nodes are schedulable with:

kubectl get nodes -o name | xargs kubectl uncordon

If GPU Operator was previously uninstalled, you might need to run the following before a new install:

kubectl delete crd clusterpolicies.nvidia.com

One can then deploy the GPU Operator with:

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator --set driver.repository=nvcr.io/nvidia --set driver.version="510.47.03" --set driver.imagePullPolicy=Always

If the Kubernetes default runtime is containerd, one must add the parameter --set operator.defaultRuntime="containerd" to the above command.
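
For example, the complete install command for a containerd-based cluster would be:

helm install gpu-operator nvidia/gpu-operator \
  --set driver.repository=nvcr.io/nvidia \
  --set driver.version="510.47.03" \
  --set driver.imagePullPolicy=Always \
  --set operator.defaultRuntime="containerd"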

Wait a few minutes, then check that all GPU-related pods are in good health, meaning in the Running or Completed state:

kubectl get pods -A

If the GPU pods are failing, check once more that no NVIDIA or nouveau drivers are installed on the nodes.
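
One quick way to double-check a node from the installation machine (node1 below is a placeholder for the node's hostname or IP address) is to list NVIDIA packages and loaded nouveau modules remotely; both commands should print nothing.

ssh node1 "apt list --installed 2>/dev/null | grep '^nvidia-'"
ssh node1 "lsmod | grep nouveau"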

Set Node Labels

List nodes and labels with:

kubectl get nodes --show-labels

Then, set node label with (example):

kubectl label nodes node1 accelerator=a100


NGINX Ingress Controller

One can install the ingress controller with:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx
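
You can then verify that the ingress controller pod and service were created; with the release name above, the resources are typically prefixed with ingress-nginx.

kubectl get pods,services | grep ingress-nginx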


NFS Server

If you do not have an NFS share already available, below is the procedure for installing one.

Install the server package.

sudo apt install nfs-kernel-server

Create an Export Directory.

sudo mkdir -p /mnt/nfs_share
sudo chown -R nobody:nogroup /mnt/nfs_share/
sudo chmod 777 /mnt/nfs_share/

Grant access to Client Systems.

echo "/mnt/nfs_share *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports sudo exportfs -a sudo systemctl restart nfs-kernel-server


Storage Provisioner

Below is an example with local NFS (requires a local NFS server). One must replace the NFS server IP and exported path.

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=172.17.171.248 \
  --set nfs.path=/mnt/nfs_share
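
After the provisioner is deployed, a new storage class (named nfs-client by default for this chart) should be listed.

kubectl get storageclasses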


Image Pull Secret for nvcr.io

In this example, one must set the ngc-api-key, ngc-email, and deployment namespace values.

kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password='ngc-api-key' --docker-email='ngc-email' --namespace='default'

Where:

  • ngc-api-key can be obtained from https://catalog.ngc.nvidia.com/ after signing in, by selecting Setup Menu :: API KEY :: Get API KEY :: Generate API Key

  • ngc-email is the email one used to sign in above

  • namespace is the Kubernetes namespace one uses for deployment, or “default”
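
You can confirm that the secret was created in the target namespace with:

kubectl get secrets -n default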

Prerequisite installation on your local machine (AWS EKS)

Install AWS CLI

Follow the official AWS CLI installation instructions for your platform.

Install kubectl

Follow the official kubectl installation instructions for your local machine.

Install eksctl

Follow the official eksctl installation instructions for your local machine.

Install helm

Follow the official Helm installation instructions for your local machine.

Cluster setup

AWS CLI Configure

aws configure

The ‘Access Key ID’ and ‘Secret Access Key’ are obtained from the csv file that you download when creating your AWS credentials.

Region name: an AWS region that offers the services and instance types you need; for example, the p3.2xlarge EC2 instance type is available in us-west-2 but not in us-west-1.

Output format: json
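
A typical interactive session looks like the following; the two key values shown are placeholders for the ones in your csv file.

aws configure
AWS Access Key ID [None]: <access-key-id-from-csv>
AWS Secret Access Key [None]: <secret-access-key-from-csv>
Default region name [None]: us-west-2
Default output format [None]: json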

Create EC2 Key Pair

Create an EC2 key pair to enable SSH connections to the Kubernetes worker nodes.

aws ec2 create-key-pair --key-name key_name --key-type rsa --query "KeyMaterial" --output text > key_name.pem

aws ec2 create-key-pair --key-name ec2_dev_key --key-type rsa --query "KeyMaterial" --output text > ec2_dev_key.pem

Change the permissions of the .pem key.

chmod 400 key_name.pem

To confirm the creation of the key, log in to the AWS console (webpage).

The login link is present in the csv file you used to obtain the secret and access key ID for ‘aws configure’.

In the top right-hand corner, choose the same region you used at the ‘aws configure’ step.

In the top-left search box, type ‘key pairs’ and select the ‘Key Pairs’ feature associated with EC2.

If the creation was successful, you should see your key listed there.
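
Alternatively, the key pair can be verified from the command line, where key_name is the name used at creation.

aws ec2 describe-key-pairs --key-names key_name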

EKS Cluster Creation

eksctl create cluster --name cluster_name --node-type ec2_instance_type --nodes num_worker_nodes \
  --region aws_region --ssh-access --ssh-public-key key_name

eksctl create cluster --name tao-eks-cluster --node-type p3.2xlarge --nodes 1 \
  --region us-west-2 --ssh-access --ssh-public-key ec2_dev_key

It will take around 15 minutes for the cluster creation to complete.

You can verify the cluster creation in the AWS console by navigating to ‘Elastic Kubernetes Service’ under AWS Services and then selecting the ‘Clusters’ feature.

You will see your cluster name on the webpage.

You can also go to the ‘EC2’ service and select the ‘Instances’ feature to see your worker node instances.
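
The same checks can be done from the command line; the region should match the one used at creation.

eksctl get cluster --region us-west-2
kubectl get nodes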

Configuring Kubernetes pods to access GPU resources

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml

Once the daemon set is running on the GPU-powered worker nodes, use the following command to verify that each node has allocatable GPUs.

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

You should see all the nodes in your cluster, along with the number of GPUs allocatable on each instance.

Software setup

Install NGINX Ingress Controller

Carry out the following commands on your local machine

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx


Set Node Labels

List nodes and labels with:

kubectl get nodes --show-labels

Then, set node label with (example):

kubectl label nodes node1 accelerator=v100


NFS Server

If you do not have an NFS share already available within your cluster, below is the procedure for installing one on a Red Hat-based EC2 instance. Please plan for enough storage to accommodate all your datasets and model experiments.

Install the nfs package.

sudo yum install nfs-utils
sudo systemctl enable --now nfs-server rpcbind

Create an Export Directory.

sudo mkdir -p /mnt/nfs_share
sudo chown -R nobody:nobody /mnt/nfs_share/
sudo chmod 777 /mnt/nfs_share/

Grant access to Client Systems.

echo "/mnt/nfs_share *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports sudo exportfs -rav sudo systemctl restart nfs-server


Storage Provisioner

Below is an example with local NFS (requires a local NFS server). One must replace the NFS server IP and exported path.

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=172.17.171.248 \
  --set nfs.path=/mnt/nfs_share


Image Pull Secret for nvcr.io

In this example, one must set the ngc-api-key, ngc-email, and deployment namespace values.

kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password='ngc-api-key' --docker-email='ngc-email' --namespace='default'

Where:

  • ngc-api-key can be obtained from https://catalog.ngc.nvidia.com/ after signing in, by selecting Setup Menu :: API KEY :: Get API KEY :: Generate API Key

  • ngc-email is the email one used to sign in above

  • namespace is the Kubernetes namespace one uses for deployment, or “default”

Deleting cluster

The following command deletes your cluster when it is no longer needed.

eksctl delete cluster --name=cluster_name

It will take around 10 minutes for the cluster and its associated services to be deleted.

© Copyright 2022, NVIDIA. Last updated on Dec 13, 2022.