TAO Toolkit API is a Kubernetes service for building end-to-end AI models from custom datasets. In addition to exposing TAO Toolkit functionality through APIs, the service lets a client build end-to-end workflows: creating datasets and models, obtaining pretrained models from NGC, obtaining default specs, and training, evaluating, optimizing, and exporting models for deployment at the edge. It can be installed on a Kubernetes cluster (local or AWS EKS) using a Helm chart with minimal dependencies. TAO Toolkit jobs run on the GPUs available in the cluster and can scale to a multi-node setting.
One can develop client applications on top of the provided API, such as a Web-UI application, or use the provided TAO remote client CLI.
The API allows users to create datasets and upload their data to the service. Users then create models and can create experiments by linking models to training, evaluation, and inference datasets. Actions such as train, evaluate, prune, retrain, export, and inference can be spawned through simple API calls. For each action, the user can obtain the default spec using an HTTP GET and then POST the spec they prefer for that action. Specs are in JSON format. Another distinctive feature of the service is the ability to chain jobs. For example, a user can run train and evaluate using a single API call. This abstracts away complex directory manipulations and dependency checks. The service exposes a Job API that allows a user to cancel, download, and monitor jobs. The Job API also provides useful information such as epoch number, accuracy, loss values, and ETA. Further, the service separates the users inside a cluster from one another and can protect read-write access.
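The GET-then-POST spec flow and job chaining described above can be sketched as follows. This is a minimal illustration, not the service's documented interface: the base URL, the route layout, the payload shape for chained actions, and the helper names are all hypothetical placeholders, and the requests are only built, never sent. Consult your deployment's API reference for the actual routes and authentication.

```python
import json
from urllib.request import Request

# Hypothetical host: replace with the address of your own deployment.
BASE_URL = "https://tao.example.com/api"


def default_spec_request(model_id, action):
    """Build (but do not send) a GET request for an action's default spec."""
    # Placeholder route, assumed for illustration only.
    url = f"{BASE_URL}/models/{model_id}/specs/{action}/default"
    return Request(url, method="GET")


def save_spec_request(model_id, action, spec):
    """Build (but do not send) a POST request carrying the edited JSON spec."""
    # Placeholder route, assumed for illustration only.
    url = f"{BASE_URL}/models/{model_id}/specs/{action}"
    body = json.dumps(spec).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def chained_job_payload(actions):
    """Describe several actions to run in order with one call.

    The payload shape is an assumption made for this sketch.
    """
    return {"actions": list(actions)}
```

A client would typically fetch the default spec, change only the fields it cares about, post the result, and then submit one job request listing the chained actions, for example `chained_job_payload(["train", "evaluate"])`.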
The TAO remote client is an easy-to-use command-line interface that uses API calls to expose an interface similar to the TAO Launcher CLI.
Use cases for the REST API include third-party web UI cloud services. Use cases for the remote client include training farms, internal model production systems, and research projects.
The API service can run on any Kubernetes platform. The two platforms officially supported are Bare-Metal and AWS EKS.
TAO API launches a CronJob every day at midnight (the time zone is determined by the node where the pod is scheduled to run). The CronJob updates the list of available pretrained models (PTMs), allowing you to use new PTMs without rebuilding and redeploying the TAO service. TAO API performs version compatibility checks while listing new PTMs. For example, if you are using a 5.0 release and a model can only be used with the 5.2 release of TAO Toolkit, the PTM list endpoint is not updated to include that model.
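The compatibility check in the example above can be pictured with a short sketch. It assumes a simple rule: a PTM is listed only when the deployed release is at least the release the model requires. The actual rule used by the service is not stated here, and the function names and model names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '5.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def is_listable(deployed_version, required_version):
    """Assumed rule: list a PTM only if the deployed release is new enough."""
    return parse_version(deployed_version) >= parse_version(required_version)


def filter_ptms(deployed_version, candidates):
    """Keep the names of PTMs compatible with the deployed release.

    `candidates` is a list of (name, required_version) pairs.
    """
    return [
        name
        for name, required in candidates
        if is_listable(deployed_version, required)
    ]


# Hypothetical model names, for illustration only.
candidates = [("model_a", "5.0"), ("model_b", "5.2")]
visible_on_5_0 = filter_ptms("5.0", candidates)
```

Under this assumed rule, a 5.0 deployment keeps `model_a` and leaves out `model_b`, which matches the behavior described above.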