Deployment
Use the provided Helm chart to deploy the TAO Toolkit API service. First, fetch and extract the chart.
helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-3.22.05-beta.1.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-3.22.05-beta.1.tgz -C tao-toolkit-api
If needed, you can customize the deployment by updating the chart's tao-toolkit-api/values.yaml. The configurable parameters are:
image is the location of the TAO Toolkit API container image
host, tlsSecret, corsOrigin and authClientID are for future ingress rules assuring security and privacy
imagePullSecret is the secret name that you set up to access NVIDIA's nvcr.io registry
imagePullPolicy is set to Always to fetch from nvcr.io instead of using a locally cached image
storageClassName is the storage class created by your K8s Storage Provisioner. On a bare-metal deployment it is nfs-client, and on AWS EKS it can be standard. If no value is provided, your deployment uses the K8s cluster's default storage class
storageAccessMode is set to ReadWriteMany to reuse allocated storage between deployments, or ReadWriteOnce to create new storage at every deployment
storageSize is ignored by many Storage Provisioners, but this is where you would set your shared storage size
backend is the platform used for training jobs. Defaults to local-k8s
numGpu is the number of GPUs assigned to each job. Note that multi-node training is not yet supported, so for now you are limited to the number of GPUs within a single cluster node
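For reference, a customized values.yaml might look like the following sketch. The keys match the parameters above; the values shown are illustrative placeholders rather than defaults (in particular, the image tag and secret name are assumptions), so adjust them for your environment.

image: nvcr.io/nvidia/tao/tao-toolkit-api:3.22.05-beta.1   # container image location; tag assumed
imagePullSecret: imagepullsecret                           # secret created for nvcr.io access
imagePullPolicy: Always
storageClassName: nfs-client                               # standard on AWS EKS; omit to use the cluster default
storageAccessMode: ReadWriteMany                           # reuse allocated storage between deployments
storageSize: 100Gi                                         # ignored by many Storage Provisioners
backend: local-k8s
numGpu: 1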
Then deploy the API service.
helm install tao-toolkit-api tao-toolkit-api/ --namespace default
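If you maintain multiple configurations, Helm's standard --values flag lets you point the install at an alternate file instead of editing the chart in place (the filename here is hypothetical).
helm install tao-toolkit-api tao-toolkit-api/ --namespace default --values my-values.yaml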
Validate the deployment by confirming that all pods reach the Ready or Completed state.
kubectl get pods -n default
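A healthy deployment produces output along these lines (the pod names, counts, and ages below are illustrative, not actual service output):
NAME                               READY   STATUS      RESTARTS   AGE
tao-toolkit-api-app-pod-...        1/1     Running     0          6m
tao-toolkit-api-workflow-pod-...   1/1     Running     0          6m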
To debug a deployment, look for events toward the bottom of the output of the following command.
kubectl describe pods tao-toolkit-api -n default
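Alternatively, assuming the service runs in the default namespace, you can list recent events across the whole namespace in time order:
kubectl get events -n default --sort-by='.lastTimestamp'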
Common issues are:
GPU Operator or Storage Provisioner pods not in Ready or Completed states
Missing or invalid imagePullSecret
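For the second issue, you can verify that the registry secret exists and recreate it if needed. The secret name imagepullsecret and the default namespace below are assumptions; use whatever your values.yaml specifies.
kubectl get secret imagepullsecret -n default
kubectl create secret docker-registry imagepullsecret --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password=<YOUR API KEY> -n default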