Setup

The TAO Toolkit API service can run on any Kubernetes platform. The two platforms officially supported are AWS EKS and Bare-Metal.

Hardware minimum requirements

1 or more GPU node(s) where all GPUs within a given node match.

  • 32 GB system RAM

  • 32 GB of GPU RAM

  • 8 core CPU

  • 1 NVIDIA Discrete GPU: Volta, Turing, Ampere architecture

  • 16 GB of SSD space
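
You can quickly sanity-check a node against these minimums with standard Linux commands; this is an optional check, not part of the installation, and GPU memory cannot be verified until a driver is installed.

free -h                  # system RAM (expect 32 GB or more)
nproc                    # CPU cores (expect 8 or more)
df -h /                  # free disk space (expect 16 GB or more on SSD)
lspci | grep -i nvidia   # confirm an NVIDIA discrete GPU is present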

Software requirement

Before installing Kubernetes, each node in the cluster should be set up with a fresh installation of Ubuntu 18.04 or later.

No NVIDIA driver

List installed drivers.

apt list --installed | grep '^nvidia-.*'

Remove any that are installed.

sudo apt-get remove --purge '^nvidia-.*'


No nouveau driver

Blacklist the nouveau driver.

echo -e "\nblacklist nouveau\noptions nouveau modeset=0\n" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf

Regenerate the kernel initramfs.

sudo update-initramfs -u

Reboot.

sudo reboot
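
After the reboot, you can optionally confirm that the nouveau module is no longer loaded; the command should print nothing.

lsmod | grep nouveau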


SSH Passwordless

Before installing Kubernetes, one needs to set up passwordless SSH so that the node driving the installation can run remote commands on each node of your cluster.

First, one must make sure an ssh key has been generated for the local current user.

[ ! -f ~/.ssh/id_rsa ] && ssh-keygen -t rsa -b 4096

Then, one must allow this user to log in as a sudoer on each node of your cluster.

ssh-copy-id $USER@node1
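
If your cluster has several nodes, the key can be copied to each of them in one pass. The node names below (node1, node2, node3) are placeholders for your own hostnames or IP addresses.

for node in node1 node2 node3; do ssh-copy-id $USER@$node; done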

If the user is not already a sudoer on each node of your cluster, then one must log in as root on each node and add the user to the node’s /etc/sudoers file, replacing $user below with the actual username.

echo "$user ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers


Kubernetes

We suggest deploying Kubernetes using Kubespray. The most recent tested version is v2.18.0.

First, check /etc/resolv.conf on each node of your cluster and make sure that only one or two search domains are specified, as in the following example.

cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search nvidia.com nvc.nvidia.com

Then, swap should be disabled on each node of your cluster.

sudo swapoff -a
sudo sed -i.bak '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

On your node driving the installation, get a local copy of the Kubespray project.

git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
git checkout v2.18.0

Make sure Ansible is not already installed from the Ubuntu distribution.

sudo apt remove ansible

Install dependencies.

sudo pip3 install -r requirements.txt

Create a new inventory.

cp -rfp inventory/sample inventory/mycluster

Customize your inventory. Please use IP addresses of the nodes in your cluster. A one-node cluster is supported.

declare -a IPS=(172.17.171.248 172.17.171.249 172.17.171.250)
CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
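
You can review and, if needed, hand-edit the generated inventory file before proceeding.

cat inventory/mycluster/hosts.yaml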

Proceed with the installation of Kubernetes in your cluster.

ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

If something went wrong, one can undo the cluster installation with:

ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root reset.yml

After a successful installation, one must copy the kubeconfig file for the cluster admin user. The command below assumes that the installation node is one of your cluster nodes.

mkdir ~/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $USER ~/.kube/config

The most commonly used Kubernetes commands are the following. Note that the -n option to specify a namespace is optional and defaults to the default namespace; the -A option selects all namespaces.

kubectl get nodes
kubectl get pods -A
kubectl get pods -n default
kubectl get services -n default
kubectl get storageclasses -n default
kubectl get persistentvolumes -n default
kubectl get persistentvolumeclaims -n default
kubectl get pvc -n default
kubectl describe pods tao-toolkit-api -n default
kubectl logs tao-toolkit-api -n default

For a more exhaustive cheat sheet, please refer to: https://kubernetes.io/docs/reference/kubectl/cheatsheet/

Helm

Several online sites provide tutorials and cheat sheets for Helm commands, for example: https://www.tutorialworks.com/helm-cheatsheet/

One can install Helm manually with:

sudo apt-get install helm
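
Whichever installation method you use, you can verify the Helm client afterwards with:

helm version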


NVIDIA GPU Operator

One can make sure nodes are schedulable with:

kubectl get nodes -o name | xargs kubectl uncordon

If GPU Operator was previously uninstalled, you might need to run the following before a new install:

kubectl delete crd clusterpolicies.nvidia.com

One can then deploy the GPU Operator with:

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator --set driver.repository=nvcr.io/nvidia --set driver.version="510.47.03" --set driver.imagePullPolicy=Always

If the Kubernetes default runtime is containerd, one must add the parameter --set operator.defaultRuntime="containerd" to the above command.
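
For example, the complete install command for a containerd-based cluster would be:

helm install gpu-operator nvidia/gpu-operator \
  --set driver.repository=nvcr.io/nvidia \
  --set driver.version="510.47.03" \
  --set driver.imagePullPolicy=Always \
  --set operator.defaultRuntime="containerd"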

Wait a few minutes, then check that all GPU-related pods are in good health, meaning in the Running or Completed state:

kubectl get pods -A

If the GPU pods are failing, check once more that no NVIDIA or nouveau drivers are installed on the nodes.
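
One quick way to double-check a node from the installation machine (node1 below is a placeholder for the node's hostname or IP address) is to list NVIDIA packages and loaded nouveau modules remotely; both commands should print nothing.

ssh node1 "apt list --installed 2>/dev/null | grep '^nvidia-'"
ssh node1 "lsmod | grep nouveau"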

Set Node Labels

List nodes and labels with:

kubectl get nodes --show-labels

Then, set node label with (example):

kubectl label nodes node1 accelerator=a100


NGINX Ingress Controller

One can install the ingress controller with:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx
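
You can then verify that the ingress controller pod and service were created; with the release name above, the resources are typically prefixed with ingress-nginx.

kubectl get pods,services | grep ingress-nginx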


NFS Server

If you do not have an NFS share already available, below is the procedure for installing one.

Install the server package.

sudo apt install nfs-kernel-server

Create an Export Directory.

sudo mkdir -p /mnt/nfs_share
sudo chown -R nobody:nogroup /mnt/nfs_share/
sudo chmod 777 /mnt/nfs_share/

Grant access to Client Systems.

echo "/mnt/nfs_share *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports sudo exportfs -a sudo systemctl restart nfs-kernel-server


Storage Provisioner

Below is an example with local NFS (requires a local NFS server). One must replace the NFS server IP and exported path.

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=172.17.171.248 \
  --set nfs.path=/mnt/nfs_share
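
After the provisioner is deployed, a new storage class (named nfs-client by default for this chart) should be listed.

kubectl get storageclasses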


Image Pull Secret for nvcr.io

In this example, one must set the ngc-api-key, ngc-email, and deployment namespace values.

kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password='ngc-api-key' --docker-email='ngc-email' --namespace='default'

Where:

  • ngc-api-key can be obtained from https://catalog.ngc.nvidia.com/ after signing in, by selecting Setup Menu :: API KEY :: Get API KEY :: Generate API Key

  • ngc-email is the email one used to sign in above

  • namespace is the Kubernetes namespace one uses for deployment, or “default”
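
You can confirm that the secret was created in the target namespace with:

kubectl get secrets -n default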

Prerequisite installation on your local machine (AWS EKS)

Install AWS CLI

Follow the official AWS CLI installation instructions for your platform.

Install kubectl

Follow the official kubectl installation instructions for your local machine.

Install eksctl

Follow the official eksctl installation instructions for your local machine.

Install helm

Follow the official Helm installation instructions for your local machine.

Cluster setup

AWS CLI Configure

aws configure

The ‘Access Key ID’ and ‘Secret Access Key’ are obtained from the csv file that you download when creating your AWS credentials.

Region name: an AWS region that offers the services and instance types you need; for example, the p3.2xlarge EC2 instance type is available in us-west-2 but not in us-west-1.

Output format: json
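
A typical interactive session looks like the following; the two key values shown are placeholders for the ones in your csv file.

aws configure
AWS Access Key ID [None]: <access-key-id-from-csv>
AWS Secret Access Key [None]: <secret-access-key-from-csv>
Default region name [None]: us-west-2
Default output format [None]: json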

Create EC2 Key Pair

Create an EC2 key pair to enable SSH connections to the Kubernetes worker nodes.

aws ec2 create-key-pair --key-name key_name --key-type rsa --query "KeyMaterial" --output text > key_name.pem

aws ec2 create-key-pair --key-name ec2_dev_key --key-type rsa --query "KeyMaterial" --output text > ec2_dev_key.pem

Change the permissions of the .pem key.

chmod 400 key_name.pem

To confirm the creation of the key, log in to the AWS console (webpage).

The login link is present in the csv file you used to obtain the secret and access key ID for ‘aws configure’.

In the top right-hand corner, choose the same region you used at the ‘aws configure’ step.

In the top-left search box, type ‘key pairs’ and select the ‘Key Pairs’ feature associated with EC2.

If the creation was successful, you should see your key listed there.
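
Alternatively, the key pair can be verified from the command line, where key_name is the name used at creation.

aws ec2 describe-key-pairs --key-names key_name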

EKS Cluster Creation

eksctl create cluster --name cluster_name --node-type ec2_instance_type --nodes num_worker_nodes \
  --region aws_region --ssh-access --ssh-public-key key_name

eksctl create cluster --name tao-eks-cluster --node-type p3.2xlarge --nodes 1 \
  --region us-west-2 --ssh-access --ssh-public-key ec2_dev_key

It will take around 15 minutes for the cluster creation to complete.

You can verify the cluster creation in the AWS console by navigating to ‘Elastic Kubernetes Service’ under AWS Services and then selecting the ‘Clusters’ feature.

You will see your cluster name on the webpage.

You can also go to the ‘EC2’ service and select the ‘Instances’ feature to see your worker node instances.
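
The same checks can be done from the command line; the region should match the one used at creation.

eksctl get cluster --region us-west-2
kubectl get nodes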

Configuring Kubernetes pods to access GPU resources

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml

Once the daemon set is running on the GPU-powered worker nodes, use the following command to verify that each node has allocatable GPUs.

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

You should see all the nodes in your cluster, along with the number of GPUs allocatable on each instance.

Software setup

Install NGINX Ingress Controller

Carry out the following commands on your local machine

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx


Set Node Labels

List nodes and labels with:

kubectl get nodes --show-labels

Then, set node label with (example):

kubectl label nodes node1 accelerator=v100


NFS Server

If you do not have an NFS share already available within your cluster, below is the procedure for installing one on a Red Hat-based EC2 instance. Please plan for enough storage to accommodate all your datasets and model experiments.

Install the nfs package.

sudo yum install nfs-utils
sudo systemctl enable --now nfs-server rpcbind

Create an Export Directory.

sudo mkdir -p /mnt/nfs_share
sudo chown -R nobody:nobody /mnt/nfs_share/
sudo chmod 777 /mnt/nfs_share/

Grant access to Client Systems.

echo "/mnt/nfs_share *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports sudo exportfs -rav sudo systemctl restart nfs-server


Storage Provisioner

Below is an example with local NFS (requires a local NFS server). One must replace the NFS server IP and exported path.

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=172.17.171.248 \
  --set nfs.path=/mnt/nfs_share


Image Pull Secret for nvcr.io

In this example, one must set the ngc-api-key, ngc-email, and deployment namespace values.

kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password='ngc-api-key' --docker-email='ngc-email' --namespace='default'

Where:

  • ngc-api-key can be obtained from https://catalog.ngc.nvidia.com/ after signing in, by selecting Setup Menu :: API KEY :: Get API KEY :: Generate API Key

  • ngc-email is the email one used to sign in above

  • namespace is the Kubernetes namespace one uses for deployment, or “default”

Deleting cluster

The following command deletes your cluster when it is no longer needed.

eksctl delete cluster --name=cluster_name

It will take around 10 minutes for the cluster and its associated services to be deleted.

© Copyright 2022, NVIDIA. Last updated on Dec 13, 2022.