The following steps deploy the TAO Toolkit API service on an existing Kubernetes cluster. You do not need these steps if you followed the previous Bare-Metal Setup or AWS EKS Setup.

One must use the provided Helm chart to deploy the TAO Toolkit API service.


helm fetch --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-4.0.0.tgz -C tao-toolkit-api

If needed, one can customize the deployment by updating the chart’s tao-toolkit-api/values.yaml.

  • image is the location of the TAO Toolkit API container image

  • host, tlsSecret, corsOrigin and authClientID are for future ingress rules assuring security and privacy

  • imagePullSecret is the name of the secret that you set up to access NVIDIA’s registry

  • imagePullPolicy is set to Always, so the image is fetched from the registry instead of using a locally cached image

  • storageClassName is the storage class created by your K8s Storage Provisioner. On a bare-metal deployment it is nfs-client, and on AWS EKS it can be standard. If you do not provide a value, your deployment uses your K8s cluster’s default storage class

  • storageAccessMode is set to ReadWriteMany to reuse allocated storage between deployments, or to ReadWriteOnce to create new storage at every deployment

  • storageSize is where to set your shared storage size. Note that many Storage Provisioners ignore this value

  • backend is the platform used for training jobs. Defaults to local-k8s

  • numGpu is the number of GPUs assigned to each job. Note that multi-node training is not yet supported, so one is limited to the number of GPUs within a single cluster node for now

  • telemetryOptOut can be set if you want to opt out of NVIDIA collecting anonymous usage metrics
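
As an alternative to editing values.yaml in place, one can keep customizations in a separate override file. The following is a minimal sketch, not the chart's documented workflow: the file name my-values.yaml and the numGpu value are hypothetical, and it assumes the listed keys sit at the top level of tao-toolkit-api/values.yaml, which should be checked against your copy of the chart.

```shell
# Hypothetical file name; any path works.
OVERRIDES="my-values.yaml"

# Write only the settings that differ from the chart defaults.
# Assumption: these keys are top-level entries in tao-toolkit-api/values.yaml.
cat > "$OVERRIDES" <<'EOF'
storageClassName: nfs-client
storageAccessMode: ReadWriteMany
numGpu: 1
imagePullPolicy: Always
EOF

echo "Wrote $OVERRIDES"

# The file can then be passed at install time with Helm's -f/--values flag:
#   helm install tao-toolkit-api tao-toolkit-api/ --namespace default -f "$OVERRIDES"
```

Keeping overrides in their own file leaves the chart's values.yaml untouched, which makes it easier to see what was changed.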

Then deploy the API service.


helm install tao-toolkit-api tao-toolkit-api/ --namespace default

One can validate the deployment by looking for the Ready and Completed states.


kubectl get pods -n default
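
On a busy namespace it can help to list only the pods that still need attention. The sketch below assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...); the function name not_ready_pods is hypothetical. It prints every pod that is neither Completed nor Running with all of its containers ready.

```shell
# Reads `kubectl get pods` output on stdin and prints NAME, READY and STATUS
# for pods that are neither Completed nor Running with all containers ready.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY column looks like "1/1"
    ok = ($3 == "Completed") || ($3 == "Running" && ready[1] == ready[2])
    if (!ok) print $1, $2, $3
  }'
}

# Usage (requires access to the cluster):
#   kubectl get pods -n default | not_ready_pods
```

An empty result suggests that every pod in the namespace has reached a Ready or Completed state.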

To debug a deployment, look for events toward the bottom of the output of the following command.


kubectl describe pods tao-toolkit-api -n default
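
Because the events appear at the end of the describe output, one can trim everything before them. This is a small sketch that assumes the output contains a line starting with "Events:", as kubectl describe normally prints; the helper name events_only is hypothetical.

```shell
# Reads `kubectl describe` output on stdin and prints everything from the
# first line starting with "Events:" to the end of the input.
events_only() {
  sed -n '/^Events:/,$p'
}

# Usage (requires access to the cluster):
#   kubectl describe pods tao-toolkit-api -n default | events_only
```

If the command describes several pods, the output from the first Events section onward is printed, including the descriptions of the pods that follow it.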

Common issues are:

  • GPU Operator or Storage Provisioner pods not in Ready or Completed states

  • Missing or invalid imagePullSecret
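
A missing or invalid pull secret usually shows up as an ErrImagePull or ImagePullBackOff status in the pod listing. The following sketch filters kubectl get pods output for those statuses; the function name image_pull_failures is hypothetical, and it assumes the default column layout with STATUS in the third column.

```shell
# Reads `kubectl get pods` output on stdin and prints the names of pods whose
# STATUS ends in ErrImagePull or ImagePullBackOff (including Init: variants).
image_pull_failures() {
  awk 'NR > 1 && $3 ~ /(ErrImagePull|ImagePullBackOff)$/ { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get pods -n default | image_pull_failures
# To confirm that the secret itself exists, substitute your secret's name:
#   kubectl get secret <YOUR IMAGE PULL SECRET> -n default
```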

© Copyright 2023, NVIDIA. Last updated on Aug 2, 2023.