Step #5: Autoscaling with TMS

AI Model Orchestration with Triton Management Service (Latest)

Autoscaling is a feature you can enable for your leases in TMS. When enabled, TMS monitors the metrics you specify (CPU utilization, GPU utilization, or request queue time) for your Triton instance(s), and when a metric exceeds its threshold, TMS deploys more Triton instances to balance the load.

For example, if you set a threshold at 70% GPU utilization, TMS will detect when your GPU utilization passes 70%, and scale the model(s) out across other GPUs in your system by automatically deploying more Triton instances. Here is a diagram demonstrating autoscaling with TMS:

autoscaling-diagram.png

Note

You can autoscale Triton instances up to as many GPU instances (GPUs or MIG slices) as there are in your system, meaning there will be at most a 1:1 ratio of Triton instances to GPU instances. In this lab, we have 1 x NVIDIA A30 GPU with 2 MIG slices configured, which means we have 2 GPU instances. Therefore, we can autoscale up to 2 Triton instances. We already specified this autoscaling setting during TMS installation.

Autoscaling across MIG slices using GPU utilization is coming soon, so in this lab we will only autoscale based on the queue time metric.

Important

TMS only supports a MIG strategy of single.

A good use case for autoscaling is a model that sees a high volume or spike of requests at certain times and lower traffic at others. In this lab, we will demonstrate autoscaling using a Hugging Face transformer model for generating text embeddings. In a production environment, this example might mimic a workflow where large documents are periodically sent to the model, spiking traffic at those times, while request volume drops when the model is not in use.

  1. Open the SSH Console from the left pane and ensure we have no active leases before we get started.

    ./tmsctl lease list

    If you see any active leases, you can remove them with the command ./tmsctl lease release <lease ID>

  2. Since the model we’re going to autoscale doesn’t use the default Triton server container, we have to add our specific Triton container to TMS’s list of allowed images.

    ./tmsctl allowlist add nvcr.io/nvlp-aienterprise/nvlp-customers/triton-transformers:1.0


  3. We’re going to autoscale the text embeddings ensemble model used for sensitive information detection, which is made up of three directories:

    • transformer_tensorrt_inference

    • transformer_tensorrt_model

    • transformer_tensorrt_tokenize

    TMS supports ensemble models by allowing you to specify multiple models in your lease. Let’s take a look at some of the other tmsctl flags we will use for this lease:

    • -m: specifies one model for the lease; can be repeated for multi-model leases

    • --triton-image: specifies which Triton image to use for the lease

    • --enable-autoscaling: turns on autoscaling for the lease

    • --autoscaling-metric-queue-time: enables queue-time-based autoscaling

    • --autoscaling-metric-queue-time-threshold: sets the queue-time threshold above which TMS deploys more Triton instances

    Create the autoscaling-enabled lease. Since this is an ensemble model made up of three components, we use the -m flag three times to specify each component:

    Again, this may take a few minutes, so be patient.

    ./tmsctl lease create \
        -m "name=transformer_tensorrt_model,uri=model://volume-models/text-embeddings/transformer_tensorrt_model/" \
        -m "name=transformer_tensorrt_tokenize,uri=model://volume-models/text-embeddings/transformer_tensorrt_tokenize/" \
        -m "name=transformer_tensorrt_inference,uri=model://volume-models/text-embeddings/transformer_tensorrt_inference/" \
        --triton-image nvcr.io/nvlp-aienterprise/nvlp-customers/triton-transformers:1.0 \
        --enable-autoscaling \
        --autoscaling-metric-queue-time enable \
        --autoscaling-metric-queue-time-threshold 20us

    Again, be sure to take note of the Triton URL output from the lease creation command, and this time set it as an environment variable with port 8000 attached for convenience. For an example of where the URL appears in the output, refer to the previous section.

    Note

    This time we are using port 8000 to perform inference, so don't forget to include it in your environment variable.

    export triton_url=<URL from lease creation>:8000
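    Before waiting on metrics, you can optionally sanity-check that the URL is reachable. Below is a hedged sketch (the check_triton helper name is ours, not part of TMS or Triton) that probes Triton's standard HTTP readiness endpoint on port 8000:

```shell
# check_triton is a hypothetical helper; /v2/health/ready is Triton's
# standard HTTP readiness endpoint, served on the HTTP port (8000).
check_triton() {
  if curl -sf --max-time 5 "http://$1/v2/health/ready" > /dev/null; then
    echo "ready"
  else
    echo "not ready"
  fi
}

check_triton "$triton_url"
```

    If the check prints "not ready", give the lease another minute to finish deploying before moving on.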

    TMS will take a few minutes to start collecting autoscaling metrics, so we need to wait before stress testing the model. You can check to see if the metrics are being collected by running the following command:

    kubectl get hpa

    If the metrics aren’t ready yet, you will see something like this (note the unknown label under targets):

    NAME              REFERENCE                    TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
    triton-8c7a2289   Deployment/triton-8c7a2289   <unknown>/20   1         2         1          45s

    Once the metrics are successfully being collected for the Triton instance, you will see something like this:

    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    triton-8c7a2289   Deployment/triton-8c7a2289   0/20      1         2         1          120s

    Once the metrics are collected, we are ready to perform some inference testing.
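    Rather than re-running kubectl get hpa by hand, the wait can be scripted as a small polling loop. Here is a minimal sketch (the wait_for_metrics name is ours, not a TMS or kubectl feature) that repeats any command until its output no longer contains the <unknown> marker shown above:

```shell
# wait_for_metrics is a hypothetical helper: it re-runs the given command
# until its output no longer contains "<unknown>", or gives up after
# max_tries attempts (checking every 10 seconds).
wait_for_metrics() {
  tries=0
  max_tries=30
  while "$@" | grep -q '<unknown>'; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "metrics still not ready after $max_tries checks" >&2
      return 1
    fi
    sleep 10
  done
  echo "metrics ready"
}

# Usage: wait_for_metrics kubectl get hpa
```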

We will now demonstrate the autoscaling feature by using Triton’s perf_analyzer tool to send a high volume of requests to our model and analyze the performance increase from autoscaling.

  1. The perf_analyzer tool is included in the NVIDIA Triton client container, which we deployed earlier. Start a shell in our triton-client pod.

    kubectl exec -it triton-client -- env triton_url=$triton_url bash


  2. Create a short bash script that runs multiple perf_analyzer stress tests.

    cat << 'EOF' > inference-tests.sh
    if [ $# -eq 0 ]; then
      echo "Error: Argument not provided. Usage: $0 <argument>"
      exit 1
    fi
    echo "Running stress tests"
    perf_analyzer -b 1 -m transformer_tensorrt_inference -u $1 --shape TEXT:16 --measurement-mode=time_windows --measurement-interval=50000 > output1.txt &
    perf_analyzer -b 1 -m transformer_tensorrt_inference -u $1 --shape TEXT:16 --measurement-mode=time_windows --measurement-interval=50000 > output2.txt &
    wait
    perf_analyzer -b 1 -m transformer_tensorrt_inference -u $1 --shape TEXT:16 --measurement-mode=time_windows --measurement-interval=50000 > output3.txt &
    perf_analyzer -b 1 -m transformer_tensorrt_inference -u $1 --shape TEXT:16 --measurement-mode=time_windows --measurement-interval=50000 > output4.txt &
    wait
    echo "Done running stress tests"
    EOF


  3. Run the stress test on the model. This will take 5-10 minutes, so be patient.

    bash inference-tests.sh $triton_url


  4. Examine the outputs of the perf_analyzer runs. Your exact results may vary, but notice that the inferences per second increase in the last two stress tests. This is because TMS recognized the high volume of inference requests and scaled the model out across both of our GPU instances.

    tail -n 1 output1.txt output2.txt output3.txt output4.txt

    Here is an example of what your output may look like. Notice how the performance increases in the last two stress tests:

    autoscaling-results.png
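    Since each pair of perf_analyzer runs executes concurrently, the aggregate throughput for each phase is the sum of its two runs. Here is a small sketch of that arithmetic; the summary lines below are fabricated in the shape of perf_analyzer's final output line, and the throughput numbers are illustrative only, not results from this lab:

```shell
# Fabricated summary lines mimicking perf_analyzer's final output line;
# the throughput numbers are illustrative, not measured results.
printf 'Concurrency: 1, throughput: 85.2 infer/sec, latency 11720 usec\n' > output1.txt
printf 'Concurrency: 1, throughput: 84.9 infer/sec, latency 11760 usec\n' > output2.txt
printf 'Concurrency: 1, throughput: 158.6 infer/sec, latency 6310 usec\n' > output3.txt
printf 'Concurrency: 1, throughput: 159.1 infer/sec, latency 6290 usec\n' > output4.txt

# Sum the infer/sec values for each concurrent pair: runs 1-2 ran before
# scale-out, runs 3-4 after.
awk '{for (i = 1; i <= NF; i++) if ($i == "throughput:") t += $(i + 1)} END {print t}' output1.txt output2.txt
awk '{for (i = 1; i <= NF; i++) if ($i == "throughput:") t += $(i + 1)} END {print t}' output3.txt output4.txt
```

    With autoscaling on, the second sum should be noticeably higher than the first; without autoscaling, the two sums should stay roughly equal.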


  5. Now let’s stress test the model without autoscaling. First, exit out of the triton-client pod.

    exit


  6. Once outside of the triton-client pod, remove the autoscaling lease and recreate it without autoscaling. You can get the lease ID by using ./tmsctl lease list

    ./tmsctl lease release <lease-ID>

    ./tmsctl lease create \
        -m "name=transformer_tensorrt_model,uri=model://volume-models/text-embeddings/transformer_tensorrt_model/" \
        -m "name=transformer_tensorrt_tokenize,uri=model://volume-models/text-embeddings/transformer_tensorrt_tokenize/" \
        -m "name=transformer_tensorrt_inference,uri=model://volume-models/text-embeddings/transformer_tensorrt_inference/" \
        --triton-image nvcr.io/nvlp-aienterprise/nvlp-customers/triton-transformers:1.0

    Once again, be sure to take note of the Triton URL output from the lease creation command, and set it as an environment variable with port 8000 attached for convenience.

    export triton_url=<URL from lease creation command>:8000


  7. Re-enter the triton-client pod to rerun the stress tests without autoscaling.

    kubectl exec -it triton-client -- env triton_url=$triton_url bash

    The bash script we created should still exist in the pod, so we can reuse it with our new Triton URL.

    bash inference-tests.sh $triton_url


  8. Now examine the outputs of the perf_analyzer runs without autoscaling. Notice that the inferences per second no longer increase in the later stress tests, since autoscaling is disabled.

    tail -n 1 output1.txt output2.txt output3.txt output4.txt


© Copyright 2022-2024, NVIDIA. Last updated on May 2, 2024.