Setup
The TAO Toolkit API service can run on any Kubernetes platform. This section describes how to set up the TAO Toolkit API service on the following platforms: a bare-metal server, AWS (Amazon Web Services) EKS, and Azure AKS.
Bare-Metal Setup
Hardware
Minimum Requirements
1 or more GPU node(s) where all GPUs within a given node match.
32 GB system RAM
32 GB of GPU RAM
8 core CPU
1 NVIDIA discrete GPU: Volta, Turing, Ampere, or Hopper architecture
16 GB of SSD space
Software
OS Support
Ubuntu 18.04 (fresh install)
Ubuntu 20.04 (fresh install)
Deployment Steps
Download the resource using the NGC CLI.
ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.1"
Change current directory.
cd tao-getting-started_v4.0.1/setup/quickstart_api_bare_metal
Add the contents of your hosts inventory file.
Notes:
The user accounts must have sudo privileges.
You can use either a password (ansible_ssh_pass) or an SSH private key file (ansible_ssh_private_key_file) for credentials.
For a single node cluster, you can list only the master node.
vi hosts
Below is an example with user/password.
[master]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
[nodes]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
Below is an example with an SSH key.
Make sure the user account on the node(s) has sudo access.
Generate an SSH key using ssh-keygen.
Copy your public key to the remote node(s) using ssh-copy-id (see the example commands after the inventory below).
[master]
1.1.1.1 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/home/user/.ssh/id_rsa'
[nodes]
1.1.1.2 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/home/user/.ssh/id_rsa'
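For reference, the key can be generated and copied as follows. This is a minimal sketch using the example user and addresses from the inventory above; adjust them to your environment.
# Generate an SSH key pair (accept the default path ~/.ssh/id_rsa):
ssh-keygen -t rsa -b 4096
# Copy the public key to each node:
ssh-copy-id ubuntu@1.1.1.1
ssh-copy-id ubuntu@1.1.1.2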
Optionally, you can validate the SSH credentials for the node(s). The expected output is “root”.
ssh ubuntu@127.0.0.2 'sudo whoami'
Add the contents of your deployment file, such as the chart version and NGC credentials.
vi tao-toolkit-api-ansible-values.yml
Below is an example.
ngc_api_key: YzZtczM5amdtdDcwNjk...
ngc_email: johndoe@mycorp.com
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz
api_values: ./tao-toolkit-api-helm-values.yml
cluster_name: tao-toolkit-api-demo
Optionally, add any values that you would like to override while installing the API chart. Most users won’t need to override any values.
vi tao-toolkit-api-helm-values.yml
Optionally, for Hopper (H100) GPU support, change the gpu_driver_version parameter to 520.61.07.
For Ubuntu 20.04:
vi cnc/cnc_values_6.1.yaml
For Ubuntu 18.04:
vi cnc/cnc_values_3.1.yaml
Proceed with deployment.
bash setup.sh install
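Once the script completes, you can sanity-check the deployment from the master node. This is a minimal check, assuming the chart was installed into the default namespace; exact pod names vary with the chart version.
# All pods should reach Running or Completed state:
kubectl get pods
# Confirm the cluster nodes are Ready:
kubectl get nodes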
AWS EKS Setup
Pre-Requisites
AWS Account
If your organization has an AWS (Amazon Web Services) account that can be used to host the TAO Toolkit API service, contact your AWS account administrator to perform the next steps.
If you do not have an AWS account, you can create one yourself.
To complete the following steps, log in to the AWS web console as either the AWS account root user or a user with Admin privileges.
IAM User
Follow these steps to create an AWS IAM user and group, and attach the policies needed for automated deployment of the TAO Toolkit API.
Once logged in to the AWS web console, search for and select the “IAM” service.
Select Users on the left panel and click Add users.
In the Add user wizard, provide an appropriate User name and select Access key - Programmatic access for the AWS credential type.
Navigate to Next: Permissions > Next: Tags > Next: Review and click Create user.
Click the Download button to download the access keys as a .csv file for use in setting up the TAO API using one-click scripts. Note: Once you leave this screen, you will NOT be able to download the same credentials again.
Select User groups on the left panel and click Create group.
In the Create user group wizard, provide an appropriate name.
In the Add users to the group - Optional section, search for and select the user created in the previous step.
In the Attach permission policies - Optional section, search for and select the “AdministratorAccess” policy. Then click Create Group.
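If you prefer the AWS CLI, the same setup can be sketched as follows. The user and group names here are placeholders; the AdministratorAccess policy ARN is the standard AWS-managed one.
# Create the user and its programmatic access keys (save the output securely):
aws iam create-user --user-name tao-deployer
aws iam create-access-key --user-name tao-deployer
# Create the group, add the user, and attach the policy:
aws iam create-group --group-name tao-deployers
aws iam add-user-to-group --group-name tao-deployers --user-name tao-deployer
aws iam attach-group-policy --group-name tao-deployers --policy-arn arn:aws:iam::aws:policy/AdministratorAccess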
S3 Bucket
Follow these steps to create an S3 bucket to store the deployment state.
Search for and select the “S3” service.
Select Buckets on the left panel and click Create bucket.
In the Create bucket wizard, provide an appropriate name for the Bucket and choose the region closest to you.
Ensure ACLs are disabled and all public access is blocked.
Enable Bucket Versioning and Server-side encryption with Amazon S3-managed keys, then click Create bucket.
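Equivalently, the bucket can be created with the AWS CLI. This is a sketch using the example bucket name and region from the deployment steps below; substitute your own.
# Create the bucket in your chosen region:
aws s3api create-bucket --bucket automation-for-tao-api --region us-west-1 \
  --create-bucket-configuration LocationConstraint=us-west-1
# Block all public access:
aws s3api put-public-access-block --bucket automation-for-tao-api \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
# Enable versioning and default SSE-S3 encryption:
aws s3api put-bucket-versioning --bucket automation-for-tao-api \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket automation-for-tao-api \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'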
Software
Deployment Steps
Download the resource using the NGC CLI.
ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.1"
Change current directory.
cd tao-getting-started_v4.0.1/setup/quickstart_api_aws_eks
Optionally, add any values that you would like to override while installing the API chart.
vi tao-toolkit-api-helm-values.yml
Optionally, for Hopper (H100) GPU support, change the gpu_operator_version parameter to v22.9.0.
vi config/main.tf
Proceed with deployment.
bash setup.sh install
You will be asked to enter the following parameters:
S3 bucket name that you created manually from AWS console (e.g. automation-for-tao-api)
S3 bucket region (e.g. us-west-1)
Your choice of cluster name (e.g. automation-for-tao-api)
Your choice of AWS region (e.g. us-west-1)
Your choice of VPC CIDR (e.g. 10.0.0.0/16)
Path to your SSH public key (e.g. ~/.ssh/id_rsa.pub or generated from ssh-keygen command)
Your NGC API key
Your NGC account email address
K8s Cluster Version (defaults to 1.23)
AWS Instance Type (defaults to g4dn.12xlarge)
Number of instances of this type (defaults to 1)
URL of the TAO Toolkit API helm chart (defaults to latest)
Helm values file to override any values of the TAO Toolkit API Helm chart (e.g. tao-toolkit-api-helm-values.yml)
AWS Access Key ID (from pre-requisites section above)
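After the script completes, you can point kubectl at the new cluster and verify the nodes. This sketch uses the example cluster name and region from above; adjust both to match your answers.
aws eks update-kubeconfig --region us-west-1 --name automation-for-tao-api
kubectl get nodes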
Azure AKS Setup
Software
Connect to AKS Cluster
Use the Azure CLI to set your subscription and retrieve the cluster credentials.
az account set --subscription <subscription id>
az aks get-credentials --resource-group <resource group name> --name <aks cluster name>
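To confirm that kubectl is now pointed at the cluster, a quick check:
kubectl get nodes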
Configuring Kubernetes Pods to Access GPU Resources
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
Once the daemon set is running on the GPU-powered worker nodes, use the following command to verify that each node has allocatable GPUs.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
You should see all the nodes in your cluster, along with the number of allocatable GPUs on each node.
vNet Setup
The AKS cluster, the NFS server, and the DSVM must be on the same vNet. You can either create a vNet ahead of time and use it when creating these resources, or create the AKS cluster first, look up the vNet it created, and reuse that vNet when creating the NFS server and DSVM. Finding the vNet of an AKS cluster is not straightforward, so this section explains how.
First, open portal.azure.com and go to the Virtual Machine Scale Sets page.
Find the Virtual Machine Scale Set that belongs to your AKS cluster: in the Resource group column, look for MC_<resource group of your AKS>_<AKS name>_<location> and click it.
There you can see the vNet name, and you can configure the firewall by clicking the Networking tab.
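As a CLI alternative to the portal steps above, a sketch:
# Find the node resource group (the MC_... group) of the AKS cluster:
az aks show --resource-group <resource group name> --name <aks cluster name> --query nodeResourceGroup -o tsv
# List the vNets inside that resource group:
az network vnet list --resource-group <MC_... name from above> -o table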
Software Setup
Install NGINX Ingress Controller
Run the following commands on your local machine:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx
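To confirm the controller came up, a quick check. The service name below follows the chart’s default naming for a release called ingress-nginx.
kubectl get pods -l app.kubernetes.io/name=ingress-nginx
kubectl get service ingress-nginx-controller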
NFS Server
You need an NFS server. It can be a managed offering such as an Azure NFS file share, or a VM on which you install an NFS server yourself. You only need one of the following two options, not both.
Azure NFS Server
Create a storage account using the Azure portal (make sure it uses the same resource group and location as your AKS cluster).
Add the virtual network of the AKS cluster you just created to the storage account.
Create and configure a file share. Do not create a private endpoint; configure a service endpoint instead.
Add the virtual network of the AKS cluster to the file share.
More details: https://learn.microsoft.com/en-us/azure/storage/files/storage-files-quick-create-use-linux
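The same can be sketched with the Azure CLI. Account and share names here are placeholders, NFS file shares require a premium FileStorage account, and flags may vary with your CLI version.
# Create a premium FileStorage account (required for NFS shares):
az storage account create --name <storage account name> --resource-group <resource group name> \
  --location <location> --sku Premium_LRS --kind FileStorage
# Create an NFS file share on that account:
az storage share-rm create --storage-account <storage account name> --name <file share name> \
  --quota 1024 --enabled-protocols NFS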
VM-Based NFS Server
You can set up a VM-based NFS server if you do not set up Azure NFS. Details are described in the Bare-Metal Setup and AWS EKS Setup sections.
Storage Provisioner
Below is an example using the Azure NFS file share created above. Replace the storage account name and file share name with your own.
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
--set nfs.server=<storage account name>.file.core.windows.net \
--set nfs.path=/<storage account name>/<file share name>
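To verify the provisioner, check that its storage class was created; nfs-client is the chart’s default storage class name.
kubectl get storageclass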
Image Pull Secret for nvcr.io
In this example, you must set your NGC API key, NGC email, and deployment namespace.
kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password='ngc-api-key' --docker-email='ngc-email' --namespace='default'
Where:
ngc-api-key can be obtained from https://catalog.ngc.nvidia.com/ after signing in, by selecting Setup Menu :: API KEY :: Get API KEY :: Generate API Key.
ngc-email is the email you used to sign in above.
namespace is the Kubernetes namespace you use for deployment, or “default”.
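To confirm the secret was created in the intended namespace, a quick check:
kubectl get secret imagepullsecret --namespace default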