Setup
The TAO Toolkit API service can run on any Kubernetes platform. The two platforms officially supported are AWS EKS and Bare-Metal.
Bare-Metal
Hardware minimum requirements
1 or more GPU node(s) where all GPUs within a given node match.
32 GB system RAM
32 GB of GPU RAM
8 core CPU
1 NVIDIA Discrete GPU: Volta, Turing, Ampere architecture
16 GB of SSD space
Software requirement
Before installing Kubernetes, each node in the cluster should be set up with a fresh Ubuntu 18.04 or later.
No NVIDIA driver
List installed drivers.
apt list --installed | grep '^nvidia-.*'
Remove any installed drivers.
sudo apt-get remove --purge '^nvidia-.*'
No nouveau driver
Blacklist the nouveau driver.
echo -e "\nblacklist nouveau\noptions nouveau modeset=0\n" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
Regenerate the kernel initramfs.
sudo update-initramfs -u
Reboot.
sudo reboot
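After the reboot, one can confirm that neither the nouveau nor an NVIDIA kernel module is loaded; both commands below should print nothing.
lsmod | grep nouveau
lsmod | grep nvidia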
SSH Passwordless
Before installing Kubernetes, one needs to set up passwordless SSH so the node used for driving the installation can run remote commands on each node of your cluster.
First, one must make sure an SSH key has been generated for the current local user.
[ ! -f ~/.ssh/id_rsa ] && ssh-keygen -t rsa -b 4096
Then, one must allow this user to log in as a sudoer on each node of your cluster (node1 is an example hostname; repeat for each node).
ssh-copy-id $USER@node1
If the user is not already a sudoer on each node of your cluster, then one must log in as root on each node and add the user to the node’s /etc/sudoers file, where $user is the username from the previous step.
echo "$user ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
Kubernetes
We suggest deploying Kubernetes using Kubespray. The most recent tested version is v2.18.0.
First, check /etc/resolv.conf on each node in your cluster and make sure only one or two search domains are specified, as follows.
cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search nvidia.com nvc.nvidia.com
Then, swap should be disabled for each node on your cluster.
sudo swapoff -a
sudo sed -i.bak '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
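One can confirm that swap is disabled; the following command should produce no output.
swapon --show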
On your node driving the installation, get a local copy of the Kubespray project.
git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
git checkout v2.18.0
Make sure Ansible is not already installed from the Ubuntu distribution.
sudo apt remove ansible
Install dependencies.
sudo pip3 install -r requirements.txt
Create a new inventory.
cp -rfp inventory/sample inventory/mycluster
Customize your inventory, using the IP address(es) of the node(s) in your cluster.
declare -a IPS=(172.17.171.248 172.17.171.249 172.17.171.250)
CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
Proceed with the installation of Kubernetes in your cluster.
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
If something went wrong, one can undo the cluster installation with:
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root reset.yml
After a successful installation, one must copy the kube config file for the cluster admin user. The commands below assume the installation node is one of your cluster nodes.
mkdir -p ~/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $USER ~/.kube/config
The most popular Kubernetes commands are the following. Note that the -n option to specify the namespace is optional, with default as the default namespace name. The -A option is for all namespaces.
kubectl get nodes
kubectl get pods -A
kubectl get pods -n default
kubectl get services -n default
kubectl get storageclasses -n default
kubectl get persistentvolumes -n default
kubectl get persistentvolumeclaims -n default
kubectl get pvc -n default
kubectl describe pods tao-toolkit-api -n default
kubectl logs tao-toolkit-api -n default
For a more exhaustive cheat sheet, please refer to: https://kubernetes.io/docs/reference/kubectl/cheatsheet/
Helm
Several online sites provide tutorials and cheat sheets for Helm commands, for example: https://www.tutorialworks.com/helm-cheatsheet/
One can install Helm manually with apt (this requires the Helm apt repository to be configured first; see https://helm.sh/docs/intro/install/):
sudo apt-get install helm
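One can then verify the installation with:
helm version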
NVIDIA GPU Operator
One can make sure nodes are schedulable with:
kubectl get nodes -o name | xargs kubectl uncordon
If GPU Operator was previously uninstalled, you might need to run the following before a new install:
kubectl delete crd clusterpolicies.nvidia.com
One can deploy the NVIDIA GPU Operator with:
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator --set driver.repository=nvcr.io/nvidia --set driver.version="510.47.03"
In case the Kubernetes default runtime is containerd, one must add the parameter --set operator.defaultRuntime="containerd" to the above command.
You can wait a few minutes and check that all GPU-related pods are in good health, meaning in the Running or Completed states:
kubectl get pods -A
If the GPU pods are failing, check once more that no NVIDIA or nouveau drivers are installed on the nodes.
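As an optional smoke test, one can run a short-lived pod that requests one GPU and prints the output of nvidia-smi. The pod name gpu-smoke-test and the CUDA base image tag below are illustrative choices, not requirements:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
Once the pod reaches the Completed state, check its log and clean up:
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test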
Set Node Labels
List nodes and labels with:
kubectl get nodes --show-labels
Then, set node label with (example):
kubectl label nodes node1 accelerator=a100
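One can verify the label was applied with:
kubectl get nodes -l accelerator=a100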
NGINX Ingress Controller
One can install the ingress controller with:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx
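One can check that the controller is running and its service was created (resource names below assume the ingress-nginx release name used above):
kubectl get pods -l app.kubernetes.io/name=ingress-nginx
kubectl get service ingress-nginx-controller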
NFS Server
If you do not have an NFS share already available, below is the procedure for installing one.
Install the server package.
sudo apt install nfs-kernel-server
Create an Export Directory.
sudo mkdir -p /mnt/nfs_share
sudo chown -R nobody:nogroup /mnt/nfs_share/
sudo chmod 777 /mnt/nfs_share/
Grant access to Client Systems.
echo "/mnt/nfs_share *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a
sudo systemctl restart nfs-kernel-server
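One can verify the export from any client machine with showmount (provided on Ubuntu by the nfs-common package), replacing the IP address with that of your NFS server:
sudo apt install nfs-common
showmount -e 172.17.171.248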
Storage Provisioner
Below is an example with local NFS (requires a local NFS server). One must replace the NFS server IP and exported path.
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
--set nfs.server=172.17.171.248 \
--set nfs.path=/mnt/nfs_share
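As an optional check, one can create a small test claim against the provisioner’s default storage class, nfs-client, and confirm it becomes Bound before deleting it. The claim name nfs-test-claim is illustrative:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test-claim
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
EOF
kubectl get pvc nfs-test-claim
kubectl delete pvc nfs-test-claim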
Image Pull Secret for nvcr.io
In this example, one must substitute their own ngc-api-key, ngc-email, and deployment namespace.
kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password='ngc-api-key' --docker-email='ngc-email' --namespace='default'
Where:
ngc-api-key can be obtained from https://catalog.ngc.nvidia.com/ after signing in, by selecting Setup Menu :: API KEY :: Get API KEY :: Generate API Key
ngc-email is the email one used to sign in above
namespace is the Kubernetes namespace one uses for deployment, or “default”
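One can confirm that the secret was created with:
kubectl get secret imagepullsecret -n default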
AWS EKS
Pre-requisite installation on your local machine
Install AWS CLI
Follow this link to install the AWS CLI on your platform: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
Install kubectl
Follow this link to install kubectl on your local machine: https://kubernetes.io/docs/tasks/tools/
Install eksctl
Follow this link to install eksctl on your local machine: https://eksctl.io/installation/
Install helm
Follow this link to install helm on your local machine: https://helm.sh/docs/intro/install/
Cluster setup
AWS CLI Configure
aws configure
The ‘Access Key ID’ and ‘Secret Access Key’ are obtained from the CSV file that you can download when creating an access key for your AWS account.
region name: an AWS region that has the services and instances you need; for example, the p3.2xlarge EC2 instance is available in us-west-2 but not in us-west-1
output format: json
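A typical session looks like the following; the access key values shown are the placeholder examples from the AWS documentation.
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json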
Create EC2 Key Pair
Create an EC2 key pair to enable an SSH connection to the Kubernetes worker nodes.
aws ec2 create-key-pair --key-name key_name --key-type rsa --query "KeyMaterial" --output text > key_name.pem
For example:
aws ec2 create-key-pair --key-name ec2_dev_key --key-type rsa --query "KeyMaterial" --output text > ec2_dev_key.pem
Change the permissions of the .pem key.
chmod 400 key_name.pem
To confirm the creation of the key, log in to the AWS console (webpage).
The login link is present in the CSV file you used to obtain the Secret and Access Key ID for the ‘aws configure’ step.
Choose the region in the top right-hand corner matching the one you used at the ‘aws configure’ step.
In the top-left search box near Services, type ‘key pairs’ and select ‘Key Pairs’ associated with the EC2 feature.
If the creation was successful, you should see your key listed there.
EKS Cluster Creation
eksctl create cluster --name cluster_name --node-type ec2_instance_type --nodes num_worker_nodes \
--region aws_region --ssh-access --ssh-public-key key_name
For example:
eksctl create cluster --name tao-eks-cluster --node-type p3.2xlarge --nodes 1 \
--region us-west-2 --ssh-access --ssh-public-key ec2_dev_key
It will take around 15 minutes for the cluster creation to complete.
You can verify the cluster creation on the AWS Console by navigating to ‘Elastic Kubernetes Service’ under AWS Services and then selecting the ‘Clusters’ feature.
You’ll see your cluster name on the webpage.
You can also go to the ‘EC2’ service and then select the ‘Instances’ feature to see your worker node instances.
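Alternatively, one can verify from the command line. eksctl writes the cluster’s kubeconfig automatically, so kubectl should already point at the new cluster (the region matches the example above):
eksctl get cluster --region us-west-2
kubectl get nodes -o wide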
Configuring Kubernetes pods to access GPU resources
Deploy the NVIDIA device plugin daemon set:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
Once the daemon set is running on the GPU-powered worker nodes, use the following command to verify that each node has allocatable GPUs.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
You should see all the nodes in your cluster along with the number of GPUs for each instance.
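For the single-node example cluster above, the output should look similar to the following (the node name is illustrative):
NAME                                          GPU
ip-192-168-12-34.us-west-2.compute.internal   1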
Software setup
Install NGINX Ingress Controller
Carry out the following commands on your local machine
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx
Set Node Labels
List nodes and labels with:
kubectl get nodes --show-labels
Then, set node label with (example):
kubectl label nodes node1 accelerator=v100
NFS Server
If you do not have an NFS share already available within your cluster, below is the procedure for installing one on a RHEL-based EC2 instance. Please plan for enough storage to accommodate all your datasets and model experiments.
Install the nfs package.
sudo yum install nfs-utils
sudo systemctl enable --now nfs-server rpcbind
Create an Export Directory.
sudo mkdir -p /mnt/nfs_share
sudo chown -R nobody:nobody /mnt/nfs_share/
sudo chmod 777 /mnt/nfs_share/
Grant access to Client Systems.
echo "/mnt/nfs_share *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -rav
sudo systemctl restart nfs-server
Storage Provisioner
Below is an example with local NFS (requires a local NFS server). One must replace the NFS server IP and exported path.
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
--set nfs.server=172.17.171.248 \
--set nfs.path=/mnt/nfs_share
Image Pull Secret for nvcr.io
In this example, one must substitute their own ngc-api-key, ngc-email, and deployment namespace.
kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password='ngc-api-key' --docker-email='ngc-email' --namespace='default'
Where:
ngc-api-key can be obtained from https://catalog.ngc.nvidia.com/ after signing in, by selecting Setup Menu :: API KEY :: Get API KEY :: Generate API Key
ngc-email is the email one used to sign in above
namespace is the Kubernetes namespace one uses for deployment, or “default”
Deleting cluster
The following command deletes your cluster when it is no longer needed.
eksctl delete cluster --name=cluster_name
It will take around 10 minutes for the cluster and its associated services to be deleted.
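One can confirm the deletion by listing clusters again; the deleted cluster should no longer appear (the region matches the example above):
eksctl get cluster --region us-west-2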