Setup

The TAO Toolkit API service can run on any Kubernetes platform. This section describes how to set up the TAO Toolkit API service on the following platforms: a bare-metal server, AWS (Amazon Web Services) EKS, and Azure AKS.

Bare-Metal Setup

Hardware

Minimum Requirements

1 or more GPU node(s) where all GPUs within a given node match.

  • 32 GB system RAM

  • 32 GB of GPU RAM

  • 8 core CPU

  • 1 NVIDIA discrete GPU: Volta, Turing, Ampere, or Hopper architecture

  • 16 GB of SSD space

Software

OS Support

  • Ubuntu 18.04 (fresh install)

  • Ubuntu 20.04 (fresh install)

Deployment Steps

Download the resource using the NGC CLI.

ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.0"

Change current directory.

cd tao-getting-started_v4.0.0/setup/quickstart_api_bare_metal

Add the contents of your inventory to the hosts file.

Notes:

  • The users listed in the inventory must have sudo privileges.

  • For credentials, you can use either a password (ansible_ssh_pass) or an SSH private key file (ansible_ssh_private_key_file).

  • For a single-node cluster, you can list only the master node.

vi hosts

Below is an example with a user name and password.

[master]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
[nodes]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

Below is an example with an SSH key.

  • Make sure the user account on the node(s) has sudo access.

  • Generate an SSH key using ssh-keygen.

  • Copy your public key to the remote node(s) using ssh-copy-id.

[master]
1.1.1.1 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/home/user/.ssh/id_rsa'
[nodes]
1.1.1.2 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/home/user/.ssh/id_rsa'
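With several worker nodes, the key-based inventory above can be generated rather than typed by hand. The following is a minimal sketch and not part of the quickstart package: the function name write_inventory and its argument order are hypothetical, and it assumes that every node shares the same user and private key file.

```shell
# Hypothetical helper: write an Ansible inventory in the format shown above.
# Usage: write_inventory <output file> <ssh user> <private key> <master ip> [node ip...]
write_inventory() (
  out=$1; user=$2; key=$3; master=$4
  shift 4
  {
    echo "[master]"
    echo "$master ansible_ssh_user='$user' ansible_ssh_private_key_file='$key'"
    echo "[nodes]"
    # Remaining arguments are worker nodes; none are needed for a single-node cluster.
    for node in "$@"; do
      echo "$node ansible_ssh_user='$user' ansible_ssh_private_key_file='$key'"
    done
  } > "$out"
)
```

For example, write_inventory hosts ubuntu /home/user/.ssh/id_rsa 1.1.1.1 1.1.1.2 reproduces the inventory shown above.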

Optionally, validate the SSH credentials for the node(s). The expected output is “root”.

ssh ubuntu@127.0.0.2 'sudo whoami'
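The same check can be repeated for every entry in the hosts file. The sketch below is only an illustration: the helper names inventory_targets and check_sudo are hypothetical, and it assumes key-based authentication, because BatchMode makes ssh fail instead of prompting for a password.

```shell
# Hypothetical helper: print one "user@host" per unique inventory entry.
inventory_targets() {
  # Skip group headers ([master], [nodes]) and blank lines, then take the
  # address (first field) and the ansible_ssh_user value from each entry.
  grep -v -e '^\[' -e '^[[:space:]]*$' "$1" |
    sed -n "s/^\([^[:space:]]*\).*ansible_ssh_user='\([^']*\)'.*/\2@\1/p" |
    sort -u
}

# Hypothetical helper: run the sudo check on every target; each should answer "root".
check_sudo() {
  for target in $(inventory_targets "$1"); do
    # -n keeps ssh from reading stdin; BatchMode disables password prompts.
    answer=$(ssh -n -o BatchMode=yes "$target" 'sudo whoami') || answer="unreachable"
    echo "$target: $answer"
  done
}
```

Running check_sudo hosts prints one line per node, which makes a node with missing sudo access easy to spot before starting the installation.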

Add the details of your deployment, such as the chart version and NGC credentials.

vi tao-toolkit-api-ansible-values.yml

Below is an example.

ngc_api_key: YzZtczM5amdtdDcwNjk...
ngc_email: johndoe@mycorp.com
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz
api_values: ./tao-toolkit-api-helm-values.yml
cluster_name: tao-toolkit-api-demo
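If you prefer not to type the NGC API key into the file by hand, the values file can be written from environment variables. This is a sketch under assumptions: the function name write_ansible_values and the variable names NGC_API_KEY and NGC_EMAIL are hypothetical, and the remaining values are copied from the example above.

```shell
# Hypothetical helper: write the Ansible values file from environment variables.
# Usage: write_ansible_values [output file]
write_ansible_values() (
  out=${1:-tao-toolkit-api-ansible-values.yml}
  # Stop before touching the file if either credential is missing.
  : "${NGC_API_KEY:?set NGC_API_KEY first}"
  : "${NGC_EMAIL:?set NGC_EMAIL first}"
  cat > "$out" <<EOF
ngc_api_key: $NGC_API_KEY
ngc_email: $NGC_EMAIL
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz
api_values: ./tao-toolkit-api-helm-values.yml
cluster_name: tao-toolkit-api-demo
EOF
)
```

The function runs in a subshell, so a missing variable aborts only the function and leaves any existing values file untouched.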

Optionally, add any values that you would like to override while installing the API chart. Most users won’t need to override any values.

vi tao-toolkit-api-helm-values.yml

Optionally, for Hopper (H100) GPU support, change the gpu_driver_version parameter to 520.61.07.

For Ubuntu 20.04:

vi cnc/cnc_values_6.1.yaml

For Ubuntu 18.04:

vi cnc/cnc_values_3.1.yaml
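The driver version can also be changed without opening an editor. The following sketch assumes that the parameter sits on its own line in the form gpu_driver_version: <value>; that layout is an assumption, so check the file before relying on it. The function name set_gpu_driver_version is hypothetical.

```shell
# Hypothetical helper: set gpu_driver_version in a values file.
# Usage: set_gpu_driver_version <values file> <driver version>
set_gpu_driver_version() (
  file=$1; version=$2
  # Refuse to continue if the parameter is not where we expect it.
  grep -q '^[[:space:]]*gpu_driver_version:' "$file" || {
    echo "gpu_driver_version not found in $file" >&2
    exit 1
  }
  # Keep a .bak copy, then rewrite only the value and preserve the indentation.
  cp "$file" "$file.bak"
  sed "s/^\([[:space:]]*gpu_driver_version:\).*/\1 \"$version\"/" "$file.bak" > "$file"
)
```

On Ubuntu 20.04, for example, set_gpu_driver_version cnc/cnc_values_6.1.yaml 520.61.07 would apply the Hopper setting described above.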

Proceed with deployment.

bash setup.sh install

AWS EKS Setup

Pre-Requisites

AWS Account

If your organization has an AWS (Amazon Web Services) account that can be used to host the TAO Toolkit API service, contact your AWS account administrator to perform the next steps.

If you do not have an AWS account, you can create one yourself.

To complete the following steps, log in to the AWS web console as either the AWS account root user or a user with Admin privileges.

IAM User

Follow these steps to create an AWS IAM user and group and to attach policies for automated deployment of the TAO Toolkit API.

  1. Once logged in to the AWS web console, search for and select the “IAM” service.

    ../../_images/eks-image001.png
  2. Select Users on the left panel and click Add users.

    ../../_images/eks-image003.png
  3. In the Add user wizard, provide an appropriate User name and select Access key - Programmatic access for the AWS credential type.

    ../../_images/eks-image005.png
  4. Navigate to Next: Permissions > Next: Tags > Next: Review and click Create user.

    ../../_images/eks-image007.png
  5. Click the Download button to download the access keys as .csv for use in setting up the TAO API using one-click scripts.

    Note

    Once you leave this screen, you will NOT be able to download the same credentials again.

    ../../_images/eks-image009.png
  6. Select User groups on the left panel and click Create group.

    ../../_images/eks-image011.png
  7. In the Create user group wizard, provide an appropriate name.

    ../../_images/eks-image013.png
  8. In the Add users to the group - Optional section, search for and select the user created in the previous step.

    ../../_images/eks-image015.png
  9. In the Attach permission policies - Optional section, search for and select the “AdministratorAccess” policy. Then click Create Group.

    ../../_images/eks-image017.png
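The console steps above have AWS CLI equivalents. The sketch below is an alternative to the console, not something the quickstart scripts do: it assumes the AWS CLI is installed and already configured with administrator credentials, and the names create_tao_iam_user, run, and DRY_RUN are hypothetical.

```shell
# Print each command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Hypothetical helper mirroring the console steps with the AWS CLI.
# Usage: create_tao_iam_user <user name> <group name>
create_tao_iam_user() {
  user=$1; group=$2
  run aws iam create-user --user-name "$user"
  # The secret access key is shown only once; store this output securely.
  run aws iam create-access-key --user-name "$user"
  run aws iam create-group --group-name "$group"
  run aws iam add-user-to-group --user-name "$user" --group-name "$group"
  run aws iam attach-group-policy --group-name "$group" \
    --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
}
```

Setting DRY_RUN=1 before calling the function prints the five commands without contacting AWS, which is a convenient way to review them first.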

S3 Bucket

Follow these steps to create an S3 bucket to store the state.

  1. Search for and select the “S3” service.

    ../../_images/eks-image019.png
  2. Select Buckets on the left panel and click Create bucket.

    ../../_images/eks-image021.png
  3. In the Create bucket wizard, provide an appropriate name for the Bucket and choose the region closest to you.

    ../../_images/eks-image023.png
  4. Ensure ACLs are disabled and all public access is blocked.

    ../../_images/eks-image025.png
  5. Enable Bucket Versioning and Server-side encryption with Amazon S3-managed keys, then click Create bucket.

    ../../_images/eks-image027.png
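The bucket settings above can also be applied with the AWS CLI. This is a sketch of an alternative to the console, assuming a configured AWS CLI; the names create_state_bucket, run, and DRY_RUN are hypothetical.

```shell
# Print each command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Hypothetical helper: create a private, versioned, encrypted bucket.
# Usage: create_state_bucket <bucket name> <region>
create_state_bucket() {
  bucket=$1; region=$2
  # BucketOwnerEnforced disables ACLs. us-east-1 rejects an explicit
  # LocationConstraint, while every other region requires one.
  if [ "$region" = "us-east-1" ]; then
    run aws s3api create-bucket --bucket "$bucket" --region "$region" \
      --object-ownership BucketOwnerEnforced
  else
    run aws s3api create-bucket --bucket "$bucket" --region "$region" \
      --object-ownership BucketOwnerEnforced \
      --create-bucket-configuration "LocationConstraint=$region"
  fi
  # Block all public access.
  run aws s3api put-public-access-block --bucket "$bucket" \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
  run aws s3api put-bucket-versioning --bucket "$bucket" \
    --versioning-configuration Status=Enabled
  # AES256 selects server-side encryption with Amazon S3-managed keys.
  run aws s3api put-bucket-encryption --bucket "$bucket" \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
}
```

As with the console steps, note the bucket name and region, because the deployment script asks for both.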

Software

Deployment Steps

Download the resource using the NGC CLI.

ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.0"

Change current directory.

cd tao-getting-started_v4.0.0/setup/quickstart_api_aws_eks

Optionally add any values you would like to override while installing the API chart.

vi tao-toolkit-api-helm-values.yml

Optionally, for Hopper (H100) GPU support, change the gpu_operator_version parameter to v22.9.0.

vi config/main.tf

Proceed with deployment.

bash setup.sh install

You will be asked to enter the following parameters:

  • Name of the S3 bucket that you created manually in the AWS console (e.g. automation-for-tao-api)

  • S3 bucket region (e.g. us-west-1)

  • Your choice of cluster name (e.g. automation-for-tao-api)

  • Your choice of AWS region (e.g. us-west-1)

  • Your choice of VPC CIDR (e.g. 10.0.0.0/16)

  • Path to your SSH public key (e.g. ~/.ssh/id_rsa.pub, or a key generated with the ssh-keygen command)

  • Your NGC API key

  • Your NGC account email address

  • K8s Cluster Version (defaults to 1.23)

  • AWS Instance Type (defaults to g4dn.12xlarge)

  • Number of instances of this type (defaults to 1)

  • URL of the TAO Toolkit API helm chart (defaults to latest)

  • Helm values file to override any values of the TAO Toolkit API Helm chart (e.g. tao-toolkit-api-helm-values.yml)

  • AWS Access Key ID (from pre-requisites section above)
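Because the script prompts for these values one by one, it can help to check the error-prone ones before starting. The following sketch validates the VPC CIDR and the SSH public key path; the helper names is_valid_cidr and is_public_key are hypothetical, and the checks are purely syntactic.

```shell
# Hypothetical helper: succeed only for IPv4 CIDR blocks such as 10.0.0.0/16.
is_valid_cidr() (
  # Shape check: four dot-separated numbers, a slash, and a prefix length.
  echo "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}$' || exit 1
  addr=${1%/*}; prefix=${1#*/}
  # Range check: split the address on dots; each octet must be 0-255.
  old_ifs=$IFS; IFS=.
  set -- $addr
  IFS=$old_ifs
  for octet in "$@"; do
    [ "$octet" -le 255 ] || exit 1
  done
  # The prefix length must be 0-32.
  [ "$prefix" -le 32 ]
)

# Hypothetical helper: succeed if the file is readable and looks like an SSH public key.
is_public_key() {
  [ -r "$1" ] && grep -Eq '^(ssh-(rsa|ed25519)|ecdsa-sha2-[^ ]+) ' "$1"
}
```

For example, is_valid_cidr 10.0.0.0/16 succeeds, while is_public_key fails for a private key file, which is a common mix-up at the SSH key prompt.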

Azure AKS Setup

Software

Connect to AKS Cluster

az account set --subscription <subscription id>
az aks get-credentials --resource-group <resource group name> --name <aks cluster name>

Configuring Kubernetes Pods to Access GPU Resources

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

Once the daemon set is running on the GPU-powered worker nodes, use the following command to verify that each node has allocatable GPUs.

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

You should see every node in your cluster listed with the number of GPUs available on that instance.
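On a larger cluster, the table can be filtered to show only the nodes that expose no GPU. This is a small sketch; the function name nodes_without_gpu is hypothetical, and it assumes the two-column layout produced by the command above, where kubectl prints <none> for a missing value.

```shell
# Hypothetical helper: read the node table on stdin and print the names of
# nodes whose GPU column is empty, "<none>", or 0. The header row is skipped.
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "" || $2 == "<none>" || $2 == "0") { print $1 }'
}
```

Pipe the output of the kubectl get nodes command above into nodes_without_gpu; on a cluster where every node is expected to have a GPU, an empty result means the GPU operator is ready on all of them.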

vNet Setup

AKS, the NFS server, and the DSVM need to be on the same virtual network (vNet). You can create a vNet up front and use it when creating each resource. Alternatively, create the AKS cluster first, look up the vNet it was given, and use that vNet when creating the NFS server and the DSVM. Finding the vNet of an AKS cluster is not straightforward, so this section explains how to do it.

First, open portal.azure.com and go to the Virtual Machine Scale Set page.

../../_images/vmss.png

Find the Virtual Machine Scale Set of your AKS cluster by its resource group: look for MC_<resource group of your AKS>_<AKS name>_<location> and click it.

There you can see the vNet name, and you can configure the firewall by clicking the network tab.

../../_images/vmss_setting.png
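The same lookup can be done from the command line. The sketch below assumes that the Azure CLI is logged in and that the vNet lives in the node resource group of the cluster, which is the case when AKS created the vNet itself. The helper names default_node_rg and aks_vnet_names are hypothetical.

```shell
# Default name of the node resource group, following the pattern described above.
# Usage: default_node_rg <resource group> <AKS name> <location>
default_node_rg() {
  echo "MC_${1}_${2}_${3}"
}

# Hypothetical helper: print the virtual network name(s) used by an AKS cluster.
# Usage: aks_vnet_names <resource group> <AKS name>
aks_vnet_names() {
  # Ask AKS for the node resource group instead of guessing its name,
  # because the name can be customized when the cluster is created.
  node_rg=$(az aks show --resource-group "$1" --name "$2" \
    --query nodeResourceGroup --output tsv) || return 1
  az network vnet list --resource-group "$node_rg" --query "[].name" --output tsv
}
```

If the cluster was attached to an existing vNet in another resource group, the list comes back empty and the portal steps above remain the way to find it.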

Software Setup

Install NGINX Ingress Controller

Run the following commands on your local machine:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx

NFS Server

An NFS server is required. It can be a managed service, such as Azure NFS Server, or an NFS server that you install on a VM. Choose one of the following two options; there is no need to set up both.

Azure NFS Server

  1. Create a storage account using the Azure portal. Make sure it is in the same resource group and the same location as the AKS cluster.

    ../../_images/addfileshare.png
  2. Add the virtual network of the AKS cluster you just created to the storage account.

    ../../_images/addvnet.png
  3. Create a file share and configure it. Do not create a private endpoint; just configure a service endpoint.

    ../../_images/configurenetworksecurity.png
  4. Add the virtual network of the AKS cluster you just created to the file share.

    ../../_images/addvnet2.png

More details are available at the following link: https://learn.microsoft.com/en-us/azure/storage/files/storage-files-quick-create-use-linux

VM-Based NFS Server

You can set up a VM-based NFS server if you do not set up Azure NFS. Details are described in Bare-Metal Setup and AWS EKS Setup.

Storage Provisioner

Below is an example that uses the Azure file share created above. Replace the storage account name and the file share name with your own. For a VM-based NFS server, set nfs.server to the server IP and nfs.path to the exported path (e.g. /mnt/nfs_share) instead.

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=<storage account name>.file.core.windows.net \
  --set nfs.path=/<storage account name>/<file share name>

Image Pull Secret for nvcr.io

In this example, replace ngc-api-key and ngc-email with your own values, and set the namespace to your deployment namespace.

kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password='ngc-api-key' --docker-email='ngc-email' --namespace='default'

Where:

  • ngc-api-key can be obtained from https://catalog.ngc.nvidia.com/ after signing in, by selecting Setup Menu :: API KEY :: Get API KEY :: Generate API Key

  • ngc-email is the email address you used to sign in above

  • namespace is the Kubernetes namespace you use for deployment, or “default”
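To keep the API key out of your shell history, the same command can read its values from environment variables. This is a sketch: the function name create_ngc_pull_secret and the variable names NGC_API_KEY, NGC_EMAIL, and KUBECTL are hypothetical, while the kubectl arguments are the ones shown above.

```shell
# Hypothetical helper: create the nvcr.io image pull secret from environment variables.
# Usage: create_ngc_pull_secret [namespace]
create_ngc_pull_secret() (
  namespace=${1:-default}
  # Stop early if either credential is missing.
  : "${NGC_API_KEY:?set NGC_API_KEY first}"
  : "${NGC_EMAIL:?set NGC_EMAIL first}"
  # '$oauthtoken' is a literal user name, so it stays in single quotes.
  # KUBECTL can be set to "echo" to preview the command without running it.
  ${KUBECTL:-kubectl} create secret docker-registry imagepullsecret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY" \
    --docker-email="$NGC_EMAIL" \
    --namespace="$namespace"
)
```

Called without an argument, the function creates the secret in the “default” namespace, matching the command above.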