NVIDIA TAO Toolkit v4.0.1

Setup

The TAO Toolkit API service can run on any Kubernetes platform. This section describes how to set up the TAO Toolkit API service on the following platforms: a bare-metal server, AWS (Amazon Web Services) EKS, and Azure AKS.

Hardware

Minimum Requirements

1 or more GPU node(s) where all GPUs within a given node match.

  • 32 GB system RAM

  • 32 GB of GPU RAM

  • 8 core CPU

  • 1 NVIDIA discrete GPU: Volta, Turing, Ampere, or Hopper architecture

  • 16 GB of SSD space

Software

OS Support

  • Ubuntu 18.04 (fresh install)

  • Ubuntu 20.04 (fresh install)

Deployment Steps

Download the resource using the NGC CLI.


ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.1"

Change current directory.


cd tao-getting-started_v4.0.1/setup/quickstart_api_bare_metal

Add the contents of your inventory to the hosts file.

Notes:

  • The users must have sudo privileges.

  • You can use either a password (ansible_ssh_pass) or an SSH private key file (ansible_ssh_private_key_file) for credentials.

  • For a single node cluster, you can list only the master node.


vi hosts

Below is an example with user/password.


[master]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
[nodes]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

Below is an example with an SSH key.

  • Make sure the user account on the node(s) has sudo access.

  • Generate ssh key using ssh-keygen.

  • Populate your public key to remote node(s) using ssh-copy-id.


[master]
1.1.1.1 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/home/user/.ssh/id_rsa'
[nodes]
1.1.1.2 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/home/user/.ssh/id_rsa'

Optionally, you can validate the SSH credentials for the node(s). The expected output is “root”.


ssh ubuntu@127.0.0.2 'sudo whoami'
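For clusters with more than one node, the same check can be looped over every address in the inventory. The following is a minimal sketch, assuming the hosts file uses the INI layout shown above; the function names and the SSH_USER variable are hypothetical and not part of the TAO Toolkit scripts.

```shell
# List the unique host addresses in an Ansible INI inventory: the first
# field of every line that is not blank, a comment, or a [group] header.
list_inventory_hosts() {
    awk '
        NF == 0 { next }
        {
            first = substr($1, 1, 1)
            if (first == "#" || first == ";" || first == "[") next
            print $1
        }
    ' "$1" | sort -u
}

# Run "sudo whoami" on each host and report whether the answer is "root".
# SSH_USER is a hypothetical variable; it defaults to the "ubuntu" user
# used in the examples above.
check_inventory_sudo() {
    for host in $(list_inventory_hosts "$1"); do
        # -n keeps ssh from reading this script's standard input.
        answer=$(ssh -n "${SSH_USER:-ubuntu}@${host}" 'sudo whoami' 2>/dev/null)
        if [ "$answer" = "root" ]; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}
```

After sourcing these functions, `check_inventory_sudo hosts` prints one line per node. With password authentication, ssh may still prompt for the password on the terminal.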

Add the details of your deployment, such as the chart version and NGC credentials.


vi tao-toolkit-api-ansible-values.yml

Below is an example.


ngc_api_key: YzZtczM5amdtdDcwNjk...
ngc_email: johndoe@mycorp.com
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz
api_values: ./tao-toolkit-api-helm-values.yml
cluster_name: tao-toolkit-api-demo
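Before deploying, it may help to confirm that none of these keys was left out. The sketch below checks only for the five keys that appear in the example above; that list is taken from the example and is an assumption, not an official schema, and the function name is hypothetical.

```shell
# Report any expected top-level key that is missing or has no value.
# Returns 0 when all keys are present, 1 otherwise.
check_ansible_values() {
    missing=0
    for key in ngc_api_key ngc_email api_chart api_values cluster_name; do
        # The key must start the line and be followed by a non-blank value.
        if ! grep -Eq "^${key}:[[:space:]]*[^[:space:]]" "$1"; then
            echo "missing or empty: $key"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_ansible_values tao-toolkit-api-ansible-values.yml` prints nothing and returns 0 when every key has a value.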

Optionally, add any values that you would like to override while installing the API chart. Most users won’t need to override any values.


vi tao-toolkit-api-helm-values.yml

Optionally, for Hopper (H100) GPU support, change the gpu_driver_version parameter to 520.61.07.

For Ubuntu 20.04:


vi cnc/cnc_values_6.1.yaml

For Ubuntu 18.04:


vi cnc/cnc_values_3.1.yaml
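The same change can be made without an editor. This is a sketch only: it assumes the values file contains a single top-level line of the form `gpu_driver_version: <value>`, which you should confirm in your copy of the file, and the function name is hypothetical.

```shell
# Rewrite the gpu_driver_version line of a values file in place.
# Assumes one top-level "gpu_driver_version: <value>" line; every other
# line is copied through unchanged.
set_gpu_driver_version() {
    values_file=$1
    new_version=$2
    scratch=$(mktemp) || return 1
    # Write to a scratch file first so a failed sed leaves the original intact.
    sed "s/^gpu_driver_version:.*/gpu_driver_version: \"${new_version}\"/" \
        "$values_file" > "$scratch" &&
        cat "$scratch" > "$values_file"
    sed_status=$?
    rm -f "$scratch"
    return "$sed_status"
}
```

For Ubuntu 20.04 this would be called as `set_gpu_driver_version cnc/cnc_values_6.1.yaml 520.61.07`.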

Proceed with deployment.


bash setup.sh install

Pre-Requisites

AWS Account

If your organization has an AWS (Amazon Web Services) account that can be used to host the TAO Toolkit API service, contact your AWS account administrator to perform the next steps.

If you do not have an AWS account, you can create one yourself.

To complete the following steps, log in to the AWS web console as either the AWS account root user or a user with Admin privileges.

IAM User

Follow these steps to create an AWS IAM user and group, and to attach policies for automated deployment of the TAO Toolkit API.

  1. Once logged in to the AWS web console, search for and select the “IAM” service.

  2. Select Users on the left panel and click Add users.

  3. In the Add user wizard, provide an appropriate User name and select Access key - Programmatic access for the AWS credential type.

  4. Navigate to Next: Permissions > Next: Tags > Next: Review and click Create user.

  5. Click the Download button to download the access keys as .csv for use in setting up the TAO API using one-click scripts.

    Note

    Once you leave this screen, you will NOT be able to download the same credentials again.


  6. Select User groups on the left panel and click Create group.

  7. In the Create user group wizard, provide an appropriate name.

  8. In the Add users to the group - Optional section, search for and select the user created in the previous step.

  9. In the Attach permission policies - Optional section, search for and select the “AdministratorAccess” policy. Then click Create Group.

S3 Bucket

Follow these steps to create an S3 bucket to store the state.

  1. Search for and select the “S3” service.

  2. Select Buckets on the left-hand panel and click Create bucket.

  3. In the Create bucket wizard, provide an appropriate name for the Bucket and choose the region closest to you.

  4. Ensure ACLs are disabled and all public access is blocked.

  5. Enable Bucket Versioning and Server-side encryption with Amazon S3-managed keys, then click Create bucket.
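The console steps above could also be scripted with the AWS CLI. The following is a sketch under stated assumptions rather than part of the TAO Toolkit scripts: the `run` and `create_state_bucket` helpers and the DRY_RUN variable are hypothetical, the AWS CLI must already be configured with suitable credentials, and ACL settings are left at the account default, so check them in the console afterwards. By default the sketch only prints the commands it would run.

```shell
# Print the command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a private, versioned, encrypted bucket: $1 = name, $2 = region.
create_state_bucket() {
    bucket=$1
    region=$2
    # us-east-1 does not accept a LocationConstraint.
    if [ "$region" = "us-east-1" ]; then
        run aws s3api create-bucket --bucket "$bucket" --region "$region"
    else
        run aws s3api create-bucket --bucket "$bucket" --region "$region" \
            --create-bucket-configuration "LocationConstraint=$region"
    fi
    # Block all public access.
    run aws s3api put-public-access-block --bucket "$bucket" \
        --public-access-block-configuration \
        "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
    # Enable bucket versioning.
    run aws s3api put-bucket-versioning --bucket "$bucket" \
        --versioning-configuration "Status=Enabled"
    # Server-side encryption with Amazon S3-managed keys (SSE-S3).
    run aws s3api put-bucket-encryption --bucket "$bucket" \
        --server-side-encryption-configuration \
        '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
}
```

Running `create_state_bucket automation-for-tao-api us-west-1` prints the four commands for review; setting `DRY_RUN=0` first executes them.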

Software

Deployment Steps

Download the resource using the NGC CLI.


ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.1"

Change current directory.


cd tao-getting-started_v4.0.1/setup/quickstart_api_aws_eks

Optionally add any values you would like to override while installing the API chart.


vi tao-toolkit-api-helm-values.yml

Optionally, for Hopper (H100) GPU support, change the gpu_operator_version parameter to v22.9.0.


vi config/main.tf

Proceed with deployment.


bash setup.sh install

You will be asked to enter the following parameters:

  • S3 bucket name that you created manually from AWS console (e.g. automation-for-tao-api)

  • S3 bucket region (e.g. us-west-1)

  • Your choice of cluster name (e.g. automation-for-tao-api)

  • Your choice of AWS region (e.g. us-west-1)

  • Your choice of VPC CIDR (e.g. 10.0.0.0/16)

  • Path to your SSH public key (e.g. ~/.ssh/id_rsa.pub or generated from ssh-keygen command)

  • Your NGC API key

  • Your NGC account email address

  • K8s Cluster Version (defaults to 1.23)

  • AWS Instance Type (defaults to g4dn.12xlarge)

  • Number of instances of this type (defaults to 1)

  • URL of the TAO Toolkit API helm chart (defaults to latest)

  • Helm values file to override any values of the TAO Toolkit API Helm chart (e.g. tao-toolkit-api-helm-values.yml)

  • AWS Access Key ID (from pre-requisites section above)
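Because the script asks for these values interactively, a mistyped VPC CIDR is easy to miss. The sketch below is a hypothetical pre-check, not something setup.sh provides: it accepts only dotted-quad IPv4 blocks whose prefix length is between /16 and /28, the range AWS allows for a VPC.

```shell
# Return 0 if $1 looks like a valid AWS VPC IPv4 CIDR block, 1 otherwise.
is_valid_vpc_cidr() {
    printf '%s\n' "$1" | awk '
        {
            # Exactly one "/" separating address and prefix length.
            if (split($0, parts, "/") != 2) exit 1
            if (parts[2] !~ /^[0-9]+$/) exit 1
            # AWS VPCs accept prefix lengths from /16 to /28.
            if (parts[2] + 0 < 16 || parts[2] + 0 > 28) exit 1
            # Four numeric octets, each at most 255.
            if (split(parts[1], octets, ".") != 4) exit 1
            for (i = 1; i <= 4; i++) {
                if (octets[i] !~ /^[0-9]+$/ || octets[i] + 0 > 255) exit 1
            }
        }'
}
```

For example, `is_valid_vpc_cidr 10.0.0.0/16 && echo ok` prints `ok`, while `10.0.0.0/33` and `10.0.0.256/16` are rejected.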

Software

Connect to AKS Cluster


az account set --subscription <subscription id>
az aks get-credentials --resource-group <resource group name> --name <aks cluster name>


Configuring Kubernetes Pods to Access GPU Resources


helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

Once the daemon set is running on the GPU-powered worker nodes, use the following command to verify that each node has allocatable GPUs.


kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

You should see every node in your cluster listed with the number of allocatable GPUs for that instance.
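On larger clusters it can be easier to list only the nodes that report no GPUs. The following sketch filters the table printed by the kubectl command above; the function name is hypothetical, and it assumes the two-column NAME/GPU layout, where kubectl prints `<none>` for a node without the nvidia.com/gpu resource.

```shell
# Read the NAME/GPU table on standard input and print the names of
# nodes whose GPU column is "<none>" or 0. The header row is skipped.
nodes_without_gpu() {
    awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}
```

Pipe the kubectl command into `nodes_without_gpu` to use it. Empty output means every node reports at least one allocatable GPU.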

vNet Setup

AKS, NFS, and the DSVM need to be on the same vNet. You can create a vNet in advance and specify it when creating each resource. Alternatively, create the AKS cluster first, find the vNet that was created for it, and use that vNet when creating the NFS server and DSVM. Finding the vNet of an AKS cluster is not straightforward, so this section explains how.

First, open portal.azure.com and go to the Virtual Machine Scale Set page.

Find the Virtual Machine Scale Set of your AKS cluster by its resource group: look for MC_<resource group of your AKS>_<AKS name>_<location> and click it.

You can then see the vNet name, and you can configure the firewall by clicking the network tab.
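The same lookup can be sketched with the Azure CLI. This assumes AKS created the vNet itself in its default node resource group and that you are already logged in with `az login`; the function names are hypothetical.

```shell
# Build the default node resource group name that AKS generates:
# $1 = resource group of the AKS cluster, $2 = AKS name, $3 = location.
aks_node_resource_group() {
    printf 'MC_%s_%s_%s\n' "$1" "$2" "$3"
}

# Ask Azure for the vNet name(s) in the cluster's node resource group:
# $1 = resource group of the AKS cluster, $2 = AKS name.
# Requires a logged-in Azure CLI.
aks_vnet_names() {
    node_rg=$(az aks show --resource-group "$1" --name "$2" \
        --query nodeResourceGroup --output tsv) || return 1
    az network vnet list --resource-group "$node_rg" \
        --query '[].name' --output tsv
}
```

For example, `aks_node_resource_group my-rg my-aks westus2` prints `MC_my-rg_my-aks_westus2`, and `aks_vnet_names my-rg my-aks` prints the vNet to reuse for the NFS server and DSVM.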

Software Setup

Install NGINX Ingress Controller

Run the following commands on your local machine.


helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx


NFS Server

An NFS server is required. It can be a managed service, such as Azure NFS Server, or an NFS server installed on a VM. You only need to follow one of the two options below, not both.

Azure NFS Server

  1. Create a storage account using the Azure portal (make sure it uses the same resource group and the same location).

  2. Add the virtual network of the AKS cluster you just created to the storage account.

  3. Create a file share and configure it. Do not create a private endpoint; configure only a service endpoint.

  4. Add the virtual network of the AKS cluster you just created to the file share.

More details are available at the following link: https://learn.microsoft.com/en-us/azure/storage/files/storage-files-quick-create-use-linux

VM-Based NFS Server

You can set up a VM-based NFS server if you do not set up Azure NFS. The details are described in Bare-Metal Setup and AWS EKS Setup.

Storage Provisioner

Below is an example with local NFS (requires a local NFS server). Replace the NFS server address and exported path with your own.


helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=<storage account name>.file.core.windows.net:/<storage account name>/<file share name> \
  --set nfs.path=/mnt/nfs_share


Image Pull Secret for nvcr.io

In this example, replace ngc-api-key, ngc-email, and the deployment namespace with your own values.


kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password='ngc-api-key' --docker-email='ngc-email' --namespace='default'

Where:

  • ngc-api-key can be obtained from https://catalog.ngc.nvidia.com/ after signing in, by selecting Setup Menu :: API KEY :: Get API KEY :: Generate API Key

  • ngc-email is the email address you used to sign in above

  • namespace is the Kubernetes namespace you use for deployment, or “default”
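To keep the API key out of the command itself, the three values can be read from environment variables. This is a sketch only: the NGC_API_KEY, NGC_EMAIL, and TAO_NAMESPACE variable names and the function name are hypothetical. Note that the user name is the literal string `$oauthtoken`, so it stays in single quotes to stop the shell from expanding it.

```shell
# Create the nvcr.io image pull secret from environment variables.
# Refuses to run when the key or email is unset or empty.
create_pull_secret() {
    if [ -z "${NGC_API_KEY:-}" ] || [ -z "${NGC_EMAIL:-}" ]; then
        echo "NGC_API_KEY and NGC_EMAIL must be set" >&2
        return 1
    fi
    # TAO_NAMESPACE falls back to "default" when unset.
    kubectl create secret docker-registry imagepullsecret \
        --docker-server=nvcr.io \
        --docker-username='$oauthtoken' \
        --docker-password="$NGC_API_KEY" \
        --docker-email="$NGC_EMAIL" \
        --namespace="${TAO_NAMESPACE:-default}"
}
```

Export the variables in your shell session, then call `create_pull_secret`.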

© Copyright 2023, NVIDIA. Last updated on Aug 2, 2023.