Welcome to the trial of Triton Management Service on NVIDIA LaunchPad.
Triton Management Service (TMS) is a Kubernetes microservice for managing the deployment of AI models on Triton Inference Servers (TIS). The benefit of using TMS over manual or custom deployment solutions comes from TMS’s built-in understanding of Triton Inference Server, GPU hardware, and how these components interact with industry-standard model frameworks such as PyTorch, TensorFlow, and ONNX. Triton Management Service strives to balance inference performance against deploying the minimum number of Triton server instances, thereby minimizing GPU idle time and maximizing resource utilization.
Within this LaunchPad lab, you will gain experience with TMS that will accelerate your path to scaling AI outcomes. This lab walks you through configuring and deploying TMS on a GPU-enabled Kubernetes cluster, demonstrates example usage of TMS, and shows you how to leverage TMS to maximize resource efficiency.
Critical Benefits of TMS:
- Timed Leases
A crucial aspect of TMS’s capabilities is the ability to set a duration for each AI model to be served. This lets you specify a model to unload from the GPU after a certain amount of time, with the option to automatically extend the model’s lifetime if it’s still being used.
- Autoscaling
TMS enables you to serve AI models and automatically scale out the number of Triton instances deployed for a model based on hardware utilization or latency. With autoscaling, you can maintain performance when receiving a high volume of inference requests, and TMS will handle scaling back down once the traffic decreases.
- Bin Packing
Bin packing with TMS allows you to easily serve multiple models on the same Triton server to maximize memory efficiency and performance. TMS lifts the burden of deciding which models to load on which GPU, and lets you move models in and out of GPU memory with simple API calls.
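To make these benefits concrete, the following is a hedged sketch of creating a lease from the TMS command-line client. The client name (tmsctl), flag spellings, model names, and the model URI scheme shown here are illustrative assumptions, not the exact syntax; the lab exercises that follow use the real commands.

```shell
# Illustrative sketch only: create a timed lease that serves two models
# together on one Triton instance (bin packing) and expires after 30 minutes
# unless automatically renewed while still in use (timed lease).
# Client name, flags, and model URIs are assumptions for illustration.
tmsctl lease create \
  --duration 30m \
  -m "name=model_a,version=1,uri=model://example-model-repo/model_a" \
  -m "name=model_b,version=1,uri=model://example-model-repo/model_b"
```

When a lease like this expires, TMS unloads the models and, if no other leases need it, tears down the Triton instance, freeing the GPU for other work.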
Triton Management Service is a Kubernetes microservice and expects to be deployed into a Kubernetes-managed cluster. To simplify this deployment, NVIDIA provides a Helm chart for installing TMS into your cluster.
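A Helm-based installation typically follows the shape below. The release name, namespace, chart reference, and values file are placeholders, not the actual chart location or required settings; the lab provides the exact command and a pre-populated values file.

```shell
# Illustrative sketch only: install TMS from its Helm chart.
# <tms-chart> is a placeholder for the actual chart reference, and
# values.yaml is a placeholder for required settings (e.g. image pull
# secrets and model repository configuration).
helm install tms <tms-chart> \
  --namespace tms --create-namespace \
  -f values.yaml
```

As with any Helm release, the chart's values file is where you customize TMS for your environment before installing.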
TMS uses the concept of “leases” to specify which models to deploy Triton pods for. When you create a lease via the CLI, TMS will automatically deploy Triton pods to serve your specified models, which are stored in your model repository.
A typical workflow may look like the following diagram:
As shown in this diagram, TMS allows you to create both CPU and GPU leases for your AI models. In this lab, we will only be creating and working with GPU leases.
The components and instructions used in this lab are intended as examples for integration and will need to be adapted, since every environment differs on the path to production-ready use. Be aware that a TMS deployment should be customized for and integrated into your own infrastructure, using this lab as a reference. For example, all of the instructions in this lab assume a single-node cluster, whereas production deployments should run in a high-availability (HA) environment.