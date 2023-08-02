The following is used to deploy the TAO Toolkit API service on an existing Kubernetes cluster. You do not need these steps if you followed the previous Bare-Metal Setup or AWS EKS Setup.

Use the provided Helm chart to deploy the TAO Toolkit API service.

helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-4.0.0.tgz -C tao-toolkit-api

If needed, you can customize the deployment by updating the following values in the chart’s tao-toolkit-api/values.yaml.

image is the location of the TAO Toolkit API container image

host, tlsSecret, corsOrigin, and authClientID are for future ingress rules that ensure security and privacy

imagePullSecret is the name of the secret that you set up to access NVIDIA’s nvcr.io registry

imagePullPolicy is set to Always, so the image is fetched from nvcr.io instead of using a locally cached image

storageClassName is the storage class created by your K8s Storage Provisioner. On a bare-metal deployment it is nfs-client, and on AWS EKS it can be standard. If you do not provide a value, your deployment uses your K8s cluster’s default storage class

storageAccessMode is set to ReadWriteMany to reuse allocated storage between deployments, or ReadWriteOnce to create new storage at every deployment

storageSize is where you set your shared storage size, although many Storage Provisioners ignore it

backend is the platform used for training jobs. It defaults to local-k8s

numGpu is the number of GPUs assigned to each job. Note that multi-node training is not yet supported, so for now you are limited to the number of GPUs within a single cluster node

telemetryOptOut can be set if you want to opt out of NVIDIA collecting anonymous usage metrics
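
As a rough sketch of the customization described above, you can also keep your changes in a separate override file rather than editing the chart in place. The key names are taken from the list above. The file name my-values.yaml, the chosen values, and the assumption that these keys sit at the top level of values.yaml are illustrative and should be checked against the chart you downloaded.

```shell
# Write a small override file. Keys come from the list above; the values
# and the file name (my-values.yaml) are illustrative assumptions.
cat > my-values.yaml <<'EOF'
storageClassName: nfs-client
storageAccessMode: ReadWriteMany
numGpu: 1
EOF

# The override file can then be supplied with Helm's -f/--values flag when
# the chart is installed. Left as a comment so this snippet runs without a cluster:
# helm install tao-toolkit-api tao-toolkit-api/ --namespace default -f my-values.yaml
```

Helm merges the file on top of the chart defaults, so any key you leave out keeps its value from tao-toolkit-api/values.yaml.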

Then deploy the API service.

helm install tao-toolkit-api tao-toolkit-api/ --namespace default

You can validate the deployment by checking that the pods reach the Ready or Completed state.

kubectl get pods -n default
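
If you want to script this check, the sketch below reads the output of kubectl get pods and succeeds only when every pod is either Completed or Running with all of its containers ready. The function name pods_ready is hypothetical, and treating "Ready" as STATUS Running with a full READY column is an interpretation, not something the chart documents.

```shell
# pods_ready: read `kubectl get pods -n <namespace>` output on stdin.
# Returns 0 only if at least one pod is listed and every pod is either
# Completed, or Running with all containers ready (READY column n/n).
# Assumes the default column layout: NAME READY STATUS RESTARTS AGE.
pods_ready() {
  awk '
    NR == 1 { next }                          # skip the header row
    {
      seen = 1
      split($2, ready, "/")                   # READY column, e.g. 1/1
      if ($3 == "Completed") next
      if ($3 == "Running" && ready[1] == ready[2]) next
      bad = 1
      print "not ready: " $1 " (" $3 ", " $2 ")"
    }
    END { exit (bad || !seen) ? 1 : 0 }
  '
}
```

For example, kubectl get pods -n default | pods_ready && echo "deployment looks healthy". Pods that are not yet in one of the expected states are printed, and the function returns a non-zero status.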

To debug a deployment, look for events toward the bottom of the output of the following command.

kubectl describe pods tao-toolkit-api -n default
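
Because the describe output is long, a small filter can help surface just the events. The following sketch keeps each pod's Name line and its Events section and drops everything in between. The function name events_only is hypothetical, and it assumes the usual kubectl describe layout in which Name: and Events: start at the beginning of a line.

```shell
# events_only: read `kubectl describe pods ...` output on stdin and print,
# for each pod, its Name line followed by its Events section.
events_only() {
  awk '
    /^Name:/   { show = 0; print }   # new pod: print its name, stop printing
    /^Events:/ { show = 1 }          # events section: start printing
    show                             # print lines while inside Events
  '
}
```

For example, kubectl describe pods tao-toolkit-api -n default | events_only.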

Common issues are: