Autoscaling is a feature you can enable for your leases in TMS. When enabled, TMS monitors the metrics you specify (CPU utilization, GPU utilization, or request queue time) for your Triton instance(s), and if a metric crosses its threshold, TMS deploys more Triton instances to balance the load.
For example, if you set a threshold at 70% GPU utilization, TMS will detect when your GPU utilization passes 70%, and scale the model(s) out across other GPUs in your system by automatically deploying more Triton instances. Here is a diagram demonstrating autoscaling with TMS:

You can autoscale Triton instances up to the number of GPU instances (GPUs or MIG slices) in your system, which means there will be at most a 1:1 ratio of Triton instances to GPU instances. In this lab, we have 1 x NVIDIA A30 GPU configured with 2 MIG slices, giving us 2 GPU instances. Therefore, we can autoscale up to 2 Triton instances. We already specified this autoscaling setting during TMS installation.
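If you would like to confirm how many GPU instances are available, you can list the GPU and its MIG devices from the SSH console. This is an optional check and assumes nvidia-smi is available on the host:
# Lists the physical GPU and any MIG devices configured on it
nvidia-smi -L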
Autoscaling across MIG slices using GPU utilization is coming soon, so in this lab we will only auto-scale based on the queue time metric.
A good use case for autoscaling is a model that sees a high volume or spike of requests at certain times and lower traffic at others. In this lab, we will demonstrate autoscaling using a Hugging Face transformer model for generating text embeddings. In a production environment, this example might mimic a workflow where large documents are sent to the model periodically, spiking traffic at those times, while request volume drops when the model is not in use.
Open the SSH Console from the left pane and ensure we have no active leases before we get started.
./tmsctl lease list
If you see any active leases, you can remove them with the command
./tmsctl lease release <lease ID>
Since the model we’re going to autoscale doesn’t use the default Triton server container, we have to add our specific Triton container to TMS’s list of allowed images.
./tmsctl allowlist add nvcr.io/nvlp-aienterprise/nvlp-customers/triton-transformers:1.0
We’re going to autoscale the text embeddings ensemble model used for sensitive information detection, which is made up of three directories:
transformer_tensorrt_inference
transformer_tensorrt_model
transformer_tensorrt_tokenize
TMS supports ensemble models by allowing you to specify multiple models in your lease. Let’s take a look at some of the other tmsctl flags we will use for this lease:
-m : specifies one model for the lease; it can be listed multiple times for multi-model leases
--triton-image : specifies which Triton image to use for the lease
--enable-autoscaling : turns on autoscaling for the lease
--autoscaling-metric-queue-time : enables queue-time-based autoscaling
--autoscaling-metric-queue-time-threshold : specifies the queue time threshold at which TMS should deploy more Triton instances
Create the autoscaling-enabled lease. Since this is an ensemble model made up of three components, we use the -m flag three times to specify each component. Again, this may take a few minutes, so be patient.
./tmsctl lease create -m "name=transformer_tensorrt_model,uri=model://volume-models/text-embeddings/transformer_tensorrt_model/" -m "name=transformer_tensorrt_tokenize,uri=model://volume-models/text-embeddings/transformer_tensorrt_tokenize/" -m "name=transformer_tensorrt_inference,uri=model://volume-models/text-embeddings/transformer_tensorrt_inference/" --triton-image nvcr.io/nvlp-aienterprise/nvlp-customers/triton-transformers:1.0 --enable-autoscaling --autoscaling-metric-queue-time enable --autoscaling-metric-queue-time-threshold 20us
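Once the lease is created, you can optionally confirm that TMS deployed a Triton instance for it by listing the pods in the cluster. The exact pod name will include your lease’s ID, so it will differ from run to run:
kubectl get pods | grep triton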
Again, be sure to take note of the Triton URL output from the lease creation command, and this time set it as an environment variable with the port 8000 attached for convenience. For an example of where the URL will appear in the output, refer to the previous section.
Note: This time we are using port 8000 to perform inference, so don’t forget to include it in your environment variable.
export triton_url=<URL from lease creation>:8000
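If you want to sanity-check the URL and port before continuing, you can query Triton’s standard HTTP readiness endpoint. This optional check should print an HTTP 200 status code once the server is up:
# Prints the HTTP status code returned by Triton's readiness endpoint
curl -s -o /dev/null -w "%{http_code}\n" $triton_url/v2/health/ready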
TMS will take a few minutes to start collecting autoscaling metrics, so we need to wait before stress testing the model. You can check to see if the metrics are being collected by running the following command:
kubectl get hpa
If the metrics aren’t ready yet, you will see something like this (note the <unknown> value under TARGETS):
NAME              REFERENCE                    TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
triton-8c7a2289   Deployment/triton-8c7a2289   <unknown>/20   1         2         1          45s
Once the metrics are successfully being collected for the Triton instance, you will see something like this:
NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
triton-8c7a2289   Deployment/triton-8c7a2289   0/20      1         2         1          120s
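If you prefer not to re-run the command by hand, kubectl can watch the autoscaler for you and print a new line whenever its status changes (press Ctrl+C to stop):
kubectl get hpa -w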
Once the metrics are collected, we are ready to perform some inference testing.
We will now demonstrate the autoscaling feature by using Triton’s perf_analyzer tool to send a high volume of requests to our model and analyze the performance increase from autoscaling.
The perf_analyzer tool is included in the NVIDIA Triton client container, which we deployed earlier. Start a shell in our triton-client pod.
kubectl exec -it triton-client -- env triton_url=$triton_url bash
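Once inside the pod, you can quickly confirm that the Triton URL was passed through to the shell before building the test script:
echo $triton_url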
Create a short bash script to set up multiple perf_analyzer tests. The script launches two perf_analyzer clients in parallel, waits for them to finish, and then launches two more; by the time the second round starts, TMS has had a chance to scale out.
cat << 'EOF' > inference-tests.sh
# Require the Triton URL as the first argument
if [ $# -eq 0 ]; then
  echo "Error: Argument not provided. Usage: $0 <argument>"
  exit 1
fi
echo "Running stress tests"
# First round: two concurrent perf_analyzer clients
perf_analyzer -b 1 -m transformer_tensorrt_inference -u $1 --shape TEXT:16 --measurement-mode=time_windows --measurement-interval=50000 > output1.txt &
perf_analyzer -b 1 -m transformer_tensorrt_inference -u $1 --shape TEXT:16 --measurement-mode=time_windows --measurement-interval=50000 > output2.txt &
wait
# Second round: two more concurrent clients
perf_analyzer -b 1 -m transformer_tensorrt_inference -u $1 --shape TEXT:16 --measurement-mode=time_windows --measurement-interval=50000 > output3.txt &
perf_analyzer -b 1 -m transformer_tensorrt_inference -u $1 --shape TEXT:16 --measurement-mode=time_windows --measurement-interval=50000 > output4.txt &
wait
echo "Done running stress tests"
EOF
Run the stress test on the model. This will take 5-10 minutes, so be patient.
bash inference-tests.sh $triton_url
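While the stress test runs, if your environment provides a second console on the host (outside the triton-client pod), you can optionally watch TMS scale out. The deployment name will include your lease’s ID; its READY column should go from 1/1 to 2/2 once the queue time threshold is exceeded:
kubectl get deployments | grep triton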
Examine the outputs of the perf_analyzer runs. Your exact results may vary, but notice that the inferences per second increase in the last two stress tests. This is because TMS recognized the high volume of inference requests and scaled the model out across both of our GPU instances.
tail -n 1 output1.txt output2.txt output3.txt output4.txt
Here is an example of what your output may look like. Notice how the performance increases in the last two stress tests:
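As an alternative to tail, you can also pull the throughput summary out of all four output files at once with grep, since perf_analyzer’s final summary line contains the string infer/sec:
grep "infer/sec" output*.txt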
Now let’s stress test the model without autoscaling. First, exit out of the triton-client pod.
exit
Once outside of the triton-client pod, remove the autoscaling lease and recreate it without autoscaling. You can get the lease ID by using:
./tmsctl lease list
./tmsctl lease release <lease-ID>
./tmsctl lease create -m "name=transformer_tensorrt_model,uri=model://volume-models/text-embeddings/transformer_tensorrt_model/" -m "name=transformer_tensorrt_tokenize,uri=model://volume-models/text-embeddings/transformer_tensorrt_tokenize/" -m "name=transformer_tensorrt_inference,uri=model://volume-models/text-embeddings/transformer_tensorrt_inference/" --triton-image nvcr.io/nvlp-aienterprise/nvlp-customers/triton-transformers:1.0
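Because this lease was created without the autoscaling flags, TMS should not create a horizontal pod autoscaler for it. You can optionally verify this; the command should report that no resources were found:
kubectl get hpa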
Once again, be sure to take note of the Triton URL output from the lease creation command, and set it as an environment variable for convenience. Attach port 8000 again, since we will rerun the same perf_analyzer tests.
export triton_url=<URL from lease creation command>:8000
Re-enter the triton-client pod to rerun the stress tests without autoscaling
kubectl exec -it triton-client -- env triton_url=$triton_url bash
The bash script we created should still exist in the pod, so we can reuse it with our new Triton URL
bash inference-tests.sh $triton_url
Now examine the outputs of the perf_analyzer runs without autoscaling. Notice that we no longer see the inferences per second increase in the later stress tests, since autoscaling is disabled.
tail -n 1 output1.txt output2.txt output3.txt output4.txt