Deployment

Use the following steps to deploy or update the TAO API service on an existing Kubernetes cluster. The same steps let you enable HTTPS and enforce user authentication for secure multi-tenancy. You do not need these steps if you followed one of the Platform Setup guides and do not wish to enable secure multi-tenancy.

One can shut down an already deployed TAO service.

helm delete tao-api

One must use the provided Helm chart to deploy TAO services.

helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-api-5.5.0.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-api && tar -zxvf tao-api-5.5.0.tgz -C tao-api
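As a minimal sketch, the same fetch and extract steps can be wrapped in a function that reads the API key from the environment, which keeps the key out of the typed command. The function name fetch_tao_chart and the variable name NGC_API_KEY are assumptions for illustration and are not part of the chart or of Helm.

```shell
# Hypothetical helper: NGC_API_KEY is an assumed variable name, export it first.
fetch_tao_chart() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set" >&2
    return 1
  fi
  # Same fetch command as above, with the key taken from the environment.
  helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-api-5.5.0.tgz \
    --username='$oauthtoken' --password="${NGC_API_KEY}" || return 1
  # mkdir -p does not fail if the directory already exists (for example on an update).
  mkdir -p tao-api && tar -zxvf tao-api-5.5.0.tgz -C tao-api
}
```

After exporting the key, call fetch_tao_chart from the directory where the chart should be extracted.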

If needed, one can customize the deployment by updating the chart’s tao-api/values.yaml.

image is the location of the TAO API container image
host, tlsSecret are for enabling HTTPS and enforcing user authentication, enabling secure multi-tenancy
corsOrigin is for enabling CORS and setting origin
authClientID is reserved for future NVIDIA Starfleet authentication
imagePullSecret is the name of the secret that you set up to access NVIDIA’s nvcr.io registry
imagePullPolicy is set to Always to fetch the image from nvcr.io instead of using a locally cached image
storageClassName is the storage class created by your K8s Storage Provisioner. On a bare-metal deployment it is nfs-client, and on AWS EKS it can be standard. If you do not provide a value, your deployment uses your K8s cluster’s default storage class
storageAccessMode is set to ReadWriteMany to reuse allocated storage between deployments, or ReadWriteOnce to create new storage at every deployment
storageSize is where to set your shared storage size, although many Storage Provisioners ignore it
backend is the platform used for training jobs. Defaults to local-k8s
maxNumGpuPerNode is the number of GPUs assigned to each job. Multi-node training is not supported, so you are limited to the number of GPUs within a single cluster node
telemetryOptOut can be set if you want to opt out of NVIDIA collecting anonymous usage metrics
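One way to apply such settings without editing the chart in place is a separate overrides file. The sketch below writes one using a few of the keys listed above. The file name tao-overrides.yaml, the host name, the secret name, and the GPU count are placeholders, and the flat top-level key layout is an assumption that should be checked against the chart’s tao-api/values.yaml.

```shell
# Write a small overrides file; every value here is a placeholder.
# Assumption: the keys sit at the top level of values.yaml, as listed above.
cat > tao-overrides.yaml <<'EOF'
host: tao.example.com              # hypothetical DNS name of the service
tlsSecret: tls-secret              # name of an existing Kubernetes TLS secret
storageClassName: nfs-client       # bare-metal example from the list above
storageAccessMode: ReadWriteMany   # reuse storage between deployments
maxNumGpuPerNode: 1                # placeholder GPU count per job
EOF
```

The file can then be passed to Helm with its -f option at install time, for example: helm install tao-api tao-api/ --namespace default -f tao-overrides.yaml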

Example for creating a tlsSecret:

openssl req -x509 -sha256 -nodes -days 365 -newkey rsa:2048 -keyout tls.key -out tls.crt -subj "/CN=ec2-34-221-205-157.us-west-2.compute.amazonaws.com/O=ec2-34-221-205-157.us-west-2.compute.amazonaws.com" -addext "subjectAltName = DNS:ec2-34-221-205-157.us-west-2.compute.amazonaws.com"
kubectl create secret tls tls-secret --key tls.key --cert tls.crt --namespace default
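Before creating the secret, it can help to confirm that the certificate carries the expected host name. The sketch below generates a certificate the same way with a hypothetical host name (tao.example.com, held in the assumed variable TAO_HOST) and prints its subject, expiry date, and subject alternative name. It assumes OpenSSL 1.1.1 or newer, which provides -addext and -ext.

```shell
# Hypothetical host name; replace it with the DNS name clients use to reach the cluster.
TAO_HOST="tao.example.com"

# Generate a self-signed certificate and key, following the example above.
openssl req -x509 -sha256 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=${TAO_HOST}/O=${TAO_HOST}" \
  -addext "subjectAltName = DNS:${TAO_HOST}" 2>/dev/null

# Print the subject, expiry date, and SAN for comparison with the intended host name.
openssl x509 -in tls.crt -noout -subject -enddate -ext subjectAltName
```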

Then deploy the API service.

helm install tao-api tao-api/ --namespace default

One can validate the deployment by checking that every pod reaches the Ready or Completed state.

kubectl get pods -n default
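On a cluster with many pods, a small filter can make the pods that still need attention easier to spot. The function below is a sketch with a hypothetical name, pods_not_ready. It reads the default table output of kubectl get pods on standard input and assumes the usual NAME, READY, STATUS column order.

```shell
# Print pods that are neither Completed nor fully Ready.
# Reads `kubectl get pods` table output on stdin (columns: NAME READY STATUS ...).
pods_not_ready() {
  awk 'NR > 1 {
    split($2, ready, "/")                              # READY column, e.g. "1/1"
    if ($3 == "Completed") next                        # finished jobs are fine
    if ($3 == "Running" && ready[1] == ready[2]) next  # all containers ready
    print $1, $2, $3
  }'
}
```

Usage: kubectl get pods -n default | pods_not_ready. No output means every pod is either Completed or Running with all of its containers ready.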

To debug a deployment, look for events toward the bottom of the output of the following command.

kubectl describe pods tao-api -n default
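Because the describe output is long, a filter that keeps only each pod’s name and its Events section can shorten the search. The sketch below uses a hypothetical function name, pod_events, and assumes each Events section ends at a blank line or at the end of the output.

```shell
# Keep only the pod name lines and the Events sections of `kubectl describe pods` output.
# Reads the describe output on stdin.
pod_events() {
  sed -n -e '/^Name:/p' -e '/^Events:/,/^$/p'
}
```

Usage: kubectl describe pods tao-api -n default | pod_events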

Common issues are:

GPU Operator or Storage Provisioner pods not in Ready or Completed states
Missing or invalid imagepullsecret