Autoscaling Leases - NVIDIA Docs

TMS can automatically scale the number of Triton instances associated with a lease based on utilization. As a lease becomes heavily utilized, TMS transparently adds more Triton instances to service inference requests, and as demand decreases, it automatically removes unneeded instances.

TMS users can leverage autoscaling to speed up inference.

To enable and configure autoscaling you must:

Install necessary third-party tools.
Configure the TMS server.
Request autoscaling for a lease.

Installing Prerequisites

To make autoscaling work, TMS needs to be able to collect performance metrics and make them available to Kubernetes for determining when to automatically scale leases. This requires two third-party tools to be installed in Kubernetes: Prometheus and the Prometheus Adapter.

You must follow the latest instructions for installing, configuring, and securing these tools as provided by the developers of the tools. The instructions here are provided as an example.

Both of the tools have Helm charts available for a basic installation in Kubernetes. The basic installation can be used for testing purposes.

Note

If your cluster is already using Prometheus and the Prometheus Adapter and you can monitor pods in the namespace, you do not need to install separate copies for TMS.

Installing Prometheus

For the most up-to-date instructions for installing Prometheus, see their installation guide.

For production clusters, work with your system administrator to make sure that you properly configure and secure Prometheus. For testing purposes, Prometheus can be installed in Kubernetes using a Helm chart that is available on GitHub.

Note

This Helm chart is in beta and is subject to change.

To install Prometheus using Helm:

$ TARGET_NAMESPACE=... # put your namespace name here
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm install -n $TARGET_NAMESPACE prometheus prometheus-community/kube-prometheus-stack

To verify installation:

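Run kubectl get pods in the namespace where you installed Prometheus (for example, kubectl get pods -n $TARGET_NAMESPACE).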
Verify that the Prometheus pods are running and healthy. It can take several minutes for the pods to start.
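
Instead of polling kubectl get pods by hand, the check can be scripted. The following is a minimal sketch, not part of TMS: the function name wait_for_pods_ready and the 300-second default timeout are illustrative choices. It relies on kubectl wait, which blocks until the named condition holds or the timeout expires. Note that a pod that has already completed (for example, a one-off job) never becomes Ready and would cause a timeout.

```shell
# Hypothetical helper: block until every pod in the namespace reports Ready,
# or fail once the timeout (default 300s) expires.
wait_for_pods_ready() {
  local namespace="$1"
  local timeout="${2:-300s}"
  kubectl wait --for=condition=Ready pods --all \
    --namespace "$namespace" --timeout="$timeout"
}
```

For example, wait_for_pods_ready "$TARGET_NAMESPACE" 600s waits up to ten minutes and returns a non-zero status if any pod is still not Ready.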

Installing the Prometheus Metrics Adapter

After Prometheus is installed, you can install the Prometheus metrics adapter. For production clusters, work with your system administrator to ensure that security concerns are properly addressed.

To install Prometheus Adapter using Helm:

Find the name of the Prometheus service by running kubectl get svc.

Typically, a service named prometheus-kube-prometheus-prometheus is returned.

If you do not see a service named prometheus-kube-prometheus-prometheus, review the Prometheus installation, or see if an update to the Prometheus Helm chart has changed the name of the service.

```shell
$ kubectl get svc
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
prometheus-kube-prometheus-prometheus     ClusterIP   10.152.183.39    <none>        9090/TCP                     6h33m
prometheus-grafana                        ClusterIP   10.152.183.204   <none>        80/TCP                       6h33m
prometheus-kube-prometheus-operator       ClusterIP   10.152.183.197   <none>        443/TCP                      6h33m
prometheus-prometheus-node-exporter       ClusterIP   10.152.183.154   <none>        9100/TCP                     6h33m
prometheus-kube-prometheus-alertmanager   ClusterIP   10.152.183.155   <none>        9093/TCP                     6h33m
prometheus-kube-state-metrics             ClusterIP   10.152.183.117   <none>        8080/TCP                     6h33m
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   6h33m
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     6h33m
```

With the name of the Prometheus service, you can now install the Prometheus adapter using the following syntax:

$ TARGET_NAMESPACE=... # put your namespace name here
$ helm install -n $TARGET_NAMESPACE prometheus-adapter prometheus-community/prometheus-adapter --set=prometheus.url=http://prometheus-kube-prometheus-prometheus

If everything installed successfully, Prometheus starts collecting metrics from the cluster within a few minutes.

Verify metrics collection by querying the Kubernetes custom metrics API.

Note

If jq is not installed on your system, drop it from the pipeline; the output is then unformatted and on a single line.

$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq . | less

Validate that your output is similar to the following. The actual entries don’t matter, so long as there are some entries.

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "services/node_memory_KReclaimable_bytes",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "services/prometheus_remote_storage_string_interner_zero_reference_releases",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

Server Configuration

Because it requires the installation of additional third-party components, lease autoscaling is disabled by default. To enable it, you must set the appropriate values in values.yaml. These options are in the server.autoscaling section of the file. The TMS administrator must set these values based on the hardware available in the cluster, the expected workloads for this particular installation, and the configuration options of Prometheus. For example, a typical values YAML file is similar to the following:

server:
  autoscaling:
    enable: false
    replicas:
      default:
        minimum: 1
        maximum: 5
      limits:
        maximum: 10
        minimum:
          lowerBound: 1
          upperBound: 2
    metrics:
      cpuUtilization:
        allowed: false
        enabled: false
        threshold:
          default: 90
          minimum: 50
          maximum: 100
      gpuUtilization:
        allowed: false
        enabled: false
        threshold:
          default: 90
          minimum: 50
          maximum: 100
      queueTime:
        allowed: false
        enabled: false
        threshold:
          default: 10000
          minimum: 10000
          maximum: 0
    prometheus:
      podMonitorLabels:
        release: prometheus
      ruleLabels:
        release: prometheus

The autoscaling options are:

enable (default false): Controls whether autoscaling is enabled. Valid values are true and false.
replicas (dictionary): Controls the number of replicas allowed for leases. See values.yaml for further details.
metrics (dictionary): A set of metrics that can be used for autoscaling. For more information, see Configuring Autoscaling Metrics.
prometheus (dictionary): Options that specify how Prometheus finds Kubernetes objects created by TMS that are used in autoscaling. For more information, see Configuring Prometheus Objects.

In the example, if server.autoscaling.enable is switched to true, the following happens:

If a user does not request autoscaling for a lease, their lease is not automatically scaled.
If a user requests autoscaling for a lease but does not specify a maximum number of replicas, at most 5 replicas (server.autoscaling.replicas.default.maximum) are created.
If a user requests autoscaling for a lease and they specify the maximum number of replicas, they can request up to 10 replicas (server.autoscaling.replicas.limits.maximum).

The section on requesting autoscaling leases describes how to make these requests.
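
As an illustration of the scenario above, the following sketch writes a small values override that switches autoscaling on while keeping the replica defaults and limits from the example. The file name autoscaling-values.yaml is an arbitrary choice made here; the keys are the ones shown in the example values file.

```shell
# Write a Helm values override that enables lease autoscaling.
# The file name is hypothetical; the keys mirror the example values file.
cat > autoscaling-values.yaml <<'EOF'
server:
  autoscaling:
    enable: true
    replicas:
      default:
        maximum: 5   # used when a lease does not specify a maximum
      limits:
        maximum: 10  # largest maximum a lease is allowed to request
EOF
```

The file can then be passed with Helm's -f option when you install or upgrade TMS, alongside any other values you already use.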

Configuring Autoscaling Metrics

The server.autoscaling.metrics dictionary defines a series of metrics that can trigger autoscaling. Each metric consists of a threshold along with a Boolean flag indicating whether or not the metric is configured. Based on this, each metric is used to calculate a target number of replicas. The largest number is then used. The details of how each target number is calculated can be found in the Kubernetes documentation.
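
To make the calculation concrete, here is a minimal sketch of the core formula from the Kubernetes HorizontalPodAutoscaler documentation, desired = ceil(currentReplicas * currentValue / targetValue), followed by taking the largest per-metric result. It assumes that TMS relies on that standard algorithm. The function names are illustrative, and the sketch deliberately ignores the tolerance band, stabilization windows, and the minimum and maximum replica bounds that Kubernetes also applies.

```shell
# Per-metric target: ceil(current_replicas * current_value / target_value),
# computed with integer arithmetic. All arguments are positive integers.
desired_replicas() {
  local current_replicas="$1" current_value="$2" target_value="$3"
  echo $(( (current_replicas * current_value + target_value - 1) / target_value ))
}

# Print the largest of the per-metric targets passed as arguments.
largest_target() {
  local max="$1" n
  for n in "$@"; do
    if [ "$n" -gt "$max" ]; then
      max="$n"
    fi
  done
  echo "$max"
}
```

For a lease with 2 replicas, CPU utilization of 45 against a threshold of 90 gives desired_replicas 2 45 90, which is 1, while a queue time of 20000 against a threshold of 10000 gives desired_replicas 2 20000 10000, which is 4. The larger value, 4, is the one that would be used.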

The metrics are as follows:

cpuUtilization (dictionary): Scale based on high CPU utilization. Values are expressed as a percentage.
gpuUtilization (dictionary): Scale based on high GPU utilization. Values are expressed as a percentage.
queueTime (dictionary): Scale based on inference requests spending a long time in the queue before they are executed. Values are expressed in microseconds.

Each metric has the following entries:

enable (default false): Whether to enable this metric by default.
allowed (default false): Whether this metric can be enabled on a per-lease basis.
threshold (dictionary): Values that determine when to scale a lease up or down.
threshold.default (integer): The default value, if not specified on a per-lease basis.
threshold.minimum (integer): The minimum value allowed when specified on a per-lease basis.
threshold.maximum (integer): The maximum value allowed when specified on a per-lease basis.

Configuring Prometheus Objects

When autoscaling is enabled, TMS creates a number of Kubernetes objects related to Prometheus and Prometheus must be able to detect these objects. Use the server.autoscaling.prometheus entry in values.yaml to configure this.

The server.autoscaling.prometheus entry has the following entries:

podMonitorLabels (dictionary): A set of labels that are added to PodMonitor objects so that Prometheus can monitor the metrics of the Triton pods. This must match the value of .spec.podMonitorSelector in your Prometheus configuration.
ruleLabels (dictionary): A set of labels that are added to PrometheusRule objects so that Prometheus can detect rules used by TMS to define new metrics. This must match the value of .spec.ruleSelector in your Prometheus configuration.

If your Prometheus installation has specified values for .spec.podMonitorNamespaceSelector or .spec.ruleNamespaceSelector, you must ensure that the namespace into which you install TMS has matching labels applied to it.
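
To find the values that podMonitorLabels and ruleLabels need to match, you can read the selectors from the Prometheus object itself. The following sketch assumes Prometheus was installed through the Prometheus Operator (as the kube-prometheus-stack chart does), so that a Prometheus custom resource exists in the namespace; the function name is illustrative.

```shell
# Print, for each Prometheus object in the namespace, the selectors it uses
# to discover PodMonitor and PrometheusRule objects.
show_prometheus_selectors() {
  local namespace="$1"
  kubectl get prometheus --namespace "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}: podMonitorSelector={.spec.podMonitorSelector} ruleSelector={.spec.ruleSelector}{"\n"}{end}'
}
```

Run it as show_prometheus_selectors "$TARGET_NAMESPACE" and compare the matchLabels it prints with the labels set under server.autoscaling.prometheus.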

Verifying That Autoscaling Leases Are Working Properly

Requesting Autoscaling Leases

On a properly configured server, you can request autoscaling for a lease using the programmatic gRPC API or the tmsctl command-line tool. The documentation for the API and the tool describes the available flags and their usage.

Note

This section only provides an example of how to request autoscaling using tmsctl.

To request autoscaling with the default parameters, add the --enable-autoscaling flag. In the following example, $MODEL_OPTIONS is a stand-in for whatever model you want to load:

$ tmsctl lease create -m $MODEL_OPTIONS --enable-autoscaling

To specify the maximum number of replicas, use the --autoscaling-max-replicas option. For example, to request a maximum of four replicas:

$ tmsctl lease create -m $MODEL_OPTIONS --enable-autoscaling --autoscaling-max-replicas 4

In both cases, the lease starts with a single replica of Triton. As inference requests increase, the number of Triton instances for the lease increases until it reaches the maximum.
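
One way to observe this from the Kubernetes side is to list the autoscaler objects and pods in the TMS namespace while load is applied. This is a sketch under an assumption that the text here does not confirm: that TMS implements lease autoscaling with Kubernetes HorizontalPodAutoscaler objects. The function name is illustrative.

```shell
# List autoscalers and pods in the given namespace.
# Assumption: TMS creates HorizontalPodAutoscaler objects for autoscaling
# leases; if it does not, the first command reports that nothing was found.
show_lease_scaling() {
  local namespace="$1"
  kubectl get hpa --namespace "$namespace"
  kubectl get pods --namespace "$namespace"
}
```

Calling show_lease_scaling with the TMS namespace before and during a load test should show the pod count for the lease growing toward its maximum.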

Troubleshooting

To adjust its behavior dynamically, Prometheus has many rules that determine how it searches for different Kubernetes objects. If these don’t match how you configure TMS, you can experience autoscaling failures. The Prometheus documentation provides detailed information on all the options. This section covers some of the more common issues.

Symptom: You are not seeing any metrics collected for your Triton pods.

Things to Check:

Verify that you set .server.autoscaling.prometheus.podMonitorLabels in values.yaml to match the labels defined by .spec.podMonitorSelector in your Prometheus installation.
If your Prometheus installation has .spec.podMonitorNamespaceSelector set, verify that your namespace has matching labels (for example, run kubectl label ns tms_namespace someLabel=someValue).

Symptom: The metric for autoscaling based on queue time (tms_avg_request_queue_duration) is not being collected.

Things to Check:

Validate that you set .server.autoscaling.prometheus.ruleLabels in values.yaml to match the labels defined by .spec.ruleSelector in your Prometheus installation.
If your Prometheus installation has .spec.ruleNamespaceSelector set, make sure that your namespace has matching labels (for example, run kubectl label ns tms_namespace someLabel=someValue).
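
Both checklists come down to comparing labels, which can be gathered in one step. The following sketch prints the labels on the namespace and on the PodMonitor and PrometheusRule objects in it, so they can be compared with the selectors in your Prometheus installation. The function name is illustrative, and the sketch assumes the Prometheus Operator custom resource definitions are installed.

```shell
# Show the labels on the namespace and on the PodMonitor and PrometheusRule
# objects inside it.
show_monitoring_labels() {
  local namespace="$1"
  kubectl get namespace "$namespace" --show-labels
  kubectl get podmonitors,prometheusrules --namespace "$namespace" --show-labels
}
```

If an object listed here lacks a label that the corresponding selector requires, Prometheus ignores it, which matches the symptoms described above.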