Deployment

One must use the provided Helm chart to deploy the TAO Toolkit API service.

helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-3.22.05-beta.1.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-3.22.05-beta.1.tgz -C tao-toolkit-api

If needed, one can customize the deployment by updating the chart’s tao-toolkit-api/values.yaml.

  • image is the location of the TAO Toolkit API container image

  • host, tlsSecret, corsOrigin and authClientID are for future ingress rules assuring security and privacy

  • imagePullSecret is the name of the secret that you set up to access NVIDIA’s nvcr.io registry

  • imagePullPolicy is set to Always to fetch the image from nvcr.io instead of using a locally cached image

  • storageClassName is the storage class created by your K8s Storage Provisioner. On a bare-metal deployment it is nfs-client, and on AWS EKS it can be standard. If no value is provided, the deployment uses your K8s cluster’s default storage class

  • storageAccessMode is set to ReadWriteMany to reuse allocated storage between deployments, or to ReadWriteOnce to create new storage at every deployment

  • storageSize is where to set your shared storage size, although many Storage Provisioners ignore it

  • backend is the platform used for training jobs. Defaults to local-k8s

  • numGpu is the number of GPUs assigned to each job. Note that multi-node training is not yet supported, so one is limited to the number of GPUs within a single cluster node for now

Then deploy the API service.

helm install tao-toolkit-api tao-toolkit-api/ --namespace default
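
As an alternative to editing the chart’s values.yaml in place, overrides can be kept in a separate file and passed to Helm with its -f option. The following is a minimal sketch: the file name my-values.yaml is hypothetical, the values are illustrative only, and it assumes the settings listed above are top-level keys in the chart’s values.yaml.

```shell
# Write a small overrides file (hypothetical name: my-values.yaml).
# Assumption: these keys sit at the top level of tao-toolkit-api/values.yaml.
cat > my-values.yaml <<'EOF'
storageClassName: nfs-client
storageAccessMode: ReadWriteMany
numGpu: 1
EOF

# The file would then be passed at install time, for example:
# helm install tao-toolkit-api tao-toolkit-api/ --namespace default -f my-values.yaml
```

Keeping overrides in their own file leaves the unpacked chart untouched, which can make it easier to move to a newer chart version later.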

One can validate the deployment by looking for the Ready and Completed states.

kubectl get pods -n default
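
When the namespace holds many pods, it may help to list only those that are not yet healthy. The helper below is a sketch, not part of the TAO Toolkit tooling: unhealthy_pods is a hypothetical function name, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print pods that are neither fully Ready nor Completed.
# Reads the output of `kubectl get pods --no-headers` on stdin.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")                               # READY column looks like "1/1"
    if ($3 == "Completed") next                         # finished jobs are fine
    if ($3 == "Running" && ready[1] == ready[2]) next   # all containers ready
    print $1, $2, $3                                    # name, ready count, status
  }'
}

# Usage against a live cluster (requires kubectl access):
# kubectl get pods -n default --no-headers | unhealthy_pods
```

An empty result suggests that every pod is either Running with all containers ready or Completed.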

To debug a deployment, look for events toward the bottom of the output of the following command.

kubectl describe pods tao-toolkit-api -n default
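
Because the events appear at the end of a long description, a small filter can trim the output down to each pod’s name and its Events section. This is a sketch only: events_section is a hypothetical helper, and it assumes the usual kubectl describe layout in which the pod name and the events start with unindented Name: and Events: lines.

```shell
# Keep only the "Name:" line and the "Events:" section of each described pod.
# Reads `kubectl describe pods ...` output on stdin.
events_section() {
  awk '
    /^Name:/   { show = 0; print; next }   # new pod: print its name, stop showing
    /^Events:/ { show = 1 }                # start of the events block
    show                                   # print lines while inside the block
  '
}

# Usage against a live cluster (requires kubectl access):
# kubectl describe pods tao-toolkit-api -n default | events_section
```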

Common issues are:

  • GPU Operator or Storage Provisioner pods not in Ready or Completed states

  • Missing or invalid imagePullSecret