TMS can automatically scale the number of Triton instances associated with a lease based on utilization. This means that as a lease becomes heavily utilized, TMS can transparently add more Triton instances to service inference requests, and as demand decreases, it automatically removes unneeded instances.
TMS users can leverage autoscaling to speed up inference.
To enable and configure autoscaling, you must:

1. Install the necessary third-party tools.
2. Configure the TMS server.
3. Request autoscaling for a lease.
To make autoscaling work, TMS needs to be able to collect performance metrics and make them available to Kubernetes for determining when to automatically scale leases. This requires two third-party tools to be installed in Kubernetes:

- Prometheus
- The Prometheus Adapter
You must follow the latest instructions for installing, configuring, and securing these tools as provided by the developers of the tools. The instructions here are provided as an example.
Both of the tools have Helm charts available for a basic installation in Kubernetes. The basic installation can be used for testing purposes.
If your cluster is already using Prometheus and the Prometheus Adapter and you can monitor pods in the namespace, you do not need to install separate copies for TMS.
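If you are not sure whether these tools are already present, a quick check such as the following can help. The exact pod and release names depend on how the tools were installed, so treat this only as a starting point:

```shell
# Look for existing Prometheus components anywhere in the cluster.
$ kubectl get pods -A | grep -i prometheus
# Check whether a custom metrics API is already registered (the Prometheus Adapter provides this).
$ kubectl get apiservice v1beta1.custom.metrics.k8s.io
```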
Installing Prometheus
For the most up-to-date instructions for installing Prometheus, see their installation guide.
For production clusters, work with your system administrator to make sure that you properly configure and secure Prometheus. For testing purposes, Prometheus can be installed in Kubernetes using a Helm chart that is available on GitHub.
This Helm chart is in beta and is subject to change.
To install Prometheus using Helm:

```shell
$ TARGET_NAMESPACE=...   # put your namespace name here
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm install -n $TARGET_NAMESPACE prometheus prometheus-community/kube-prometheus-stack
```
To verify installation, run `kubectl get pods` and verify that the Prometheus pods are running and healthy. It can take several minutes for the pods to start.
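For example, assuming you installed the chart into `$TARGET_NAMESPACE` as shown above, a simple check might look like the following sketch:

```shell
# List the pods created by the kube-prometheus-stack release.
$ kubectl get pods -n $TARGET_NAMESPACE
# Optionally block until everything in the namespace reports Ready
# (adjust if the namespace contains other workloads).
$ kubectl wait --for=condition=Ready pods --all -n $TARGET_NAMESPACE --timeout=300s
```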
Installing the Prometheus Metrics Adapter
After Prometheus is installed, you can install the Prometheus metrics adapter. For production clusters, work with your system administrator to ensure that security concerns are properly addressed.
To install Prometheus Adapter using Helm:
Find the name of the Prometheus service by running `kubectl get svc`. Typically, a service named `prometheus-kube-prometheus-prometheus` is returned. If you do not see a service with that name, review the Prometheus installation, or check whether an update to the Prometheus Helm chart has changed the name of the service.
```shell
$ kubectl get svc
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
prometheus-kube-prometheus-prometheus     ClusterIP   10.152.183.39    <none>        9090/TCP                     6h33m
prometheus-grafana                        ClusterIP   10.152.183.204   <none>        80/TCP                       6h33m
prometheus-kube-prometheus-operator       ClusterIP   10.152.183.197   <none>        443/TCP                      6h33m
prometheus-prometheus-node-exporter       ClusterIP   10.152.183.154   <none>        9100/TCP                     6h33m
prometheus-kube-prometheus-alertmanager   ClusterIP   10.152.183.155   <none>        9093/TCP                     6h33m
prometheus-kube-state-metrics             ClusterIP   10.152.183.117   <none>        8080/TCP                     6h33m
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   6h33m
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     6h33m
```
With the name of the Prometheus service, you can now install the Prometheus adapter using the following syntax:
```shell
$ TARGET_NAMESPACE=...   # put your namespace name here
$ helm install -n $TARGET_NAMESPACE prometheus-adapter prometheus-community/prometheus-adapter --set=prometheus.url=http://prometheus-kube-prometheus-prometheus
```
If everything installed successfully, Prometheus starts collecting metrics from the cluster within a few minutes.
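One way to confirm that the adapter is running and has registered the custom metrics API with Kubernetes is shown below; this is a general Kubernetes check rather than a TMS-specific one:

```shell
# The adapter pod should be running in the target namespace.
$ kubectl get pods -n $TARGET_NAMESPACE | grep prometheus-adapter
# The custom metrics APIService should report AVAILABLE as True.
$ kubectl get apiservice v1beta1.custom.metrics.k8s.io
```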
Verify metrics collection by getting metrics from the Kubernetes custom metrics API.
Note: If you don't have `jq` installed on your system, your output is unformatted and in a single line.

```shell
$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | less
```
Validate that your output is similar to the following. The actual entries don’t matter, so long as there are some entries.
```json
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "services/node_memory_KReclaimable_bytes",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [ "get" ]
    },
    {
      "name": "services/prometheus_remote_storage_string_interner_zero_reference_releases",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [ "get" ]
    }
  ]
}
```
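If you only want a quick confirmation that metrics are being exposed, you can filter the same response with `jq`; this is purely a convenience:

```shell
# Print the first few metric names and the total number of entries.
$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r '.resources[].name' | head
$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq '.resources | length'
```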
Because it requires the installation of additional third-party components, lease autoscaling is disabled by default. To enable it, you must set the appropriate values in `values.yaml`. These options are in the `server.autoscaling` section of the file. The TMS administrator must set these values based on the hardware available in the cluster, the expected workloads for this particular installation, and the configuration options of Prometheus. For example, a typical values YAML file is similar to the following:
```yaml
server:
  autoscaling:
    enable: false
    replicas:
      default:
        minimum: 1
        maximum: 5
      limits:
        maximum: 10
        minimum:
          lowerBound: 1
          upperBound: 2
    metrics:
      cpuUtilization:
        allowed: false
        enabled: false
        threshold:
          default: 90
          minimum: 50
          maximum: 100
      gpuUtilization:
        allowed: false
        enabled: false
        threshold:
          default: 90
          minimum: 50
          maximum: 100
      queueTime:
        allowed: false
        enabled: false
        threshold:
          default: 10000
          minimum: 10000
          maximum: 0
    prometheus:
      podMonitorLabels:
        release: prometheus
      ruleLabels:
        release: prometheus
```
The autoscaling options are:
- `enable` (default `false`): Controls whether autoscaling is enabled. Valid values are `true` and `false`.
- `replicas` (dictionary): Controls the number of replicas allowed for leases. See `values.yaml` for further details.
- `metrics` (dictionary): A set of metrics that can be used for autoscaling. For more information, see Configuring Autoscaling Metrics.
- `prometheus` (dictionary): Options that specify how Prometheus finds Kubernetes objects created by TMS that are used in autoscaling. For more information, see Configuring Prometheus Objects.
In the example, if `server.autoscaling.enable` is switched to `true`, the following happens:

- If a user does not request autoscaling for a lease, their lease is not automatically scaled.
- If a user requests autoscaling for a lease but does not specify a maximum number of replicas, at most 5 replicas (`server.autoscaling.replicas.default.maximum`) are created.
- If a user requests autoscaling for a lease and specifies a maximum number of replicas, they can request up to 10 replicas (`server.autoscaling.replicas.limits.maximum`).
The section on requesting autoscaling leases describes how to make these requests.
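How you apply these settings depends on how TMS was deployed. As a rough sketch, assuming TMS was installed with Helm (the release name, chart reference, and namespace below are placeholders, not values from this guide), enabling autoscaling with queue-time scaling might look like this:

```shell
# Hypothetical override file; the keys mirror the values.yaml excerpt above.
$ cat > autoscaling-values.yaml <<'EOF'
server:
  autoscaling:
    enable: true
    metrics:
      queueTime:
        allowed: true
        enabled: true
EOF
# Re-deploy TMS with the override. The release name, chart reference, and
# namespace are placeholders for your own TMS installation.
$ helm upgrade -n <tms-namespace> tms <path-to-tms-chart> -f autoscaling-values.yaml
```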
Configuring Autoscaling Metrics
The `server.autoscaling.metrics` dictionary defines a series of metrics that can trigger autoscaling. Each metric consists of a threshold along with Boolean flags indicating whether the metric is allowed and enabled. Based on this, each metric is used to calculate a target number of replicas, and the largest number is then used. The details of how each target number is calculated can be found in the Kubernetes documentation.
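As a concrete illustration (the numbers are hypothetical), the standard Kubernetes Horizontal Pod Autoscaler calculation is roughly `desiredReplicas = ceil(currentReplicas * currentMetricValue / threshold)`:

```shell
# Example: 2 replicas, an average queue time of 25000 microseconds, and a
# threshold of 10000 microseconds. ceil(2 * 25000 / 10000) = 5, so the lease
# would scale toward 5 replicas (subject to the configured maximum).
$ echo $(( (2 * 25000 + 10000 - 1) / 10000 ))
5
```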
The metrics are as follows:
- `cpuUtilization` (dictionary): Scale based on high CPU utilization. Values are expressed as a percentage.
- `gpuUtilization` (dictionary): Scale based on high GPU utilization. Values are expressed as a percentage.
- `queueTime` (dictionary): Scale based on inference requests spending a long time in the queue before they are executed. Values are expressed in microseconds.
Each metric has the following entries:
- `enabled` (default `false`): Whether to enable this metric by default.
- `allowed` (default `false`): Whether this metric can be enabled on a per-lease basis.
- `threshold` (dictionary): Values that determine when to scale a lease up and down.
- `threshold.default` (integer): The default value, if not specified on a per-lease basis.
- `threshold.minimum` (integer): The minimum value allowed when specified on a per-lease basis.
- `threshold.maximum` (integer): The maximum value allowed when specified on a per-lease basis.
Configuring Prometheus Objects
When autoscaling is enabled, TMS creates a number of Kubernetes objects related to Prometheus, and Prometheus must be able to detect these objects. Use the `server.autoscaling.prometheus` entry in `values.yaml` to configure this.

The `server.autoscaling.prometheus` entry has the following entries:
- `podMonitorLabels` (dictionary): A set of labels that are added to `PodMonitor` objects so that Prometheus can monitor the metrics of the Triton pods. This must match the value of `.spec.podMonitorSelector` in your Prometheus configuration.
- `ruleLabels` (dictionary): A set of labels that are added to `PrometheusRule` objects so that Prometheus can detect rules used by TMS to define new metrics. This must match the value of `.spec.ruleSelector` in your Prometheus configuration.
If your Prometheus installation has specified values for `.spec.podMonitorNamespaceSelector` or `.spec.ruleNamespaceSelector`, you must ensure that the namespace into which you install TMS has matching labels applied to it.
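A quick way to see which labels your Prometheus instance expects is to read the selectors directly from the Prometheus custom resource. The namespace below assumes the kube-prometheus-stack installation from earlier:

```shell
# Show the selectors that PodMonitor and PrometheusRule objects must match.
$ kubectl get prometheus -n $TARGET_NAMESPACE \
    -o jsonpath='{.items[0].spec.podMonitorSelector}{"\n"}{.items[0].spec.ruleSelector}{"\n"}'
```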
Requesting Autoscaling Leases
On a server that is properly configured, you can request lease support for autoscaling using the programmatic gRPC API or the `tmsctl` command-line tool. The documentation for the API and tool contains details of the different flags and their usage. This section only provides an example of how to request autoscaling using `tmsctl`.
To request autoscaling with the default parameters, add the `--enable-autoscaling` flag. In the following example, `$MODEL_OPTIONS` is a stand-in for whatever model you want to load:
```shell
$ tmsctl lease create -m $MODEL_OPTIONS --enable-autoscaling
```
To specify the maximum number of replicas, use the `--autoscaling-max-replicas` option. For example, to request a maximum of four replicas:
```shell
$ tmsctl lease create -m $MODEL_OPTIONS --enable-autoscaling --autoscaling-max-replicas 4
```
In both cases, the leases start with a single replica of Triton, and as inference requests increase, the number of Triton instances for each lease increases until it reaches its maximum.
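To observe scaling as it happens, standard Kubernetes tooling is usually sufficient. Assuming TMS drives lease scaling through a HorizontalPodAutoscaler backed by the custom metrics API (which the setup above suggests, but verify against your installation), you can watch it like this; `tms_namespace` is a placeholder for the namespace where TMS creates lease pods:

```shell
# Watch the autoscaler and the Triton pods for the lease react to load.
$ kubectl get hpa -n tms_namespace -w
$ kubectl get pods -n tms_namespace -w
```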
Verifying That Autoscaling Leases Are Working Properly
Prometheus uses many rules to determine how it dynamically discovers different Kubernetes objects. If these rules don't match how you configured TMS, you can experience autoscaling failures. The Prometheus documentation provides detailed information on all the options. This section covers some of the more common issues.
Symptom: You are not seeing any metrics collected for your Triton pods.
Things to Check:
- Verify that you set `.server.autoscaling.prometheus.podMonitorLabels` in `values.yaml` to match the labels defined by `.spec.podMonitorSelector` in your Prometheus installation.
- If your Prometheus installation has `.spec.podMonitorNamespaceSelector` set, verify that your namespace has matching labels (for example, run `kubectl label ns tms_namespace someLabel=someValue`).
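Both checks can be done with `kubectl`; for example (`tms_namespace` is a placeholder):

```shell
# PodMonitor objects created by TMS and their labels; compare these against
# the .spec.podMonitorSelector of your Prometheus instance.
$ kubectl get podmonitor -n tms_namespace --show-labels
# Labels on the TMS namespace itself (relevant when podMonitorNamespaceSelector is set).
$ kubectl get namespace tms_namespace --show-labels
```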
Symptom: The metric for autoscaling based on queue time (`tms_avg_request_queue_duration`) is not being collected.
Things to Check:
- Validate that you set `.server.autoscaling.prometheus.ruleLabels` in `values.yaml` to match the labels defined by `.spec.ruleSelector` in your Prometheus installation.
- If your Prometheus installation has `.spec.ruleNamespaceSelector` set, make sure that your namespace has matching labels (for example, run `kubectl label ns tms_namespace someLabel=someValue`).
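The same style of check applies here (`tms_namespace` is again a placeholder):

```shell
# PrometheusRule objects created by TMS and their labels; compare these against
# the .spec.ruleSelector of your Prometheus instance.
$ kubectl get prometheusrule -n tms_namespace --show-labels
$ kubectl get namespace tms_namespace --show-labels
```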