Release Notes for Triton Management Service

Version 1.4

New Features

Added the ability to restrict access to Triton Control Operations and allow only the Triton Sidecar to perform such operations. This feature relies on Limited Endpoint Access feature in Triton. Since it is a BETA feature, it may result in compatability issues in the future.

Bug Fixes

Triton Pools feature can now be disabled by setting triton.pools.enabled to false in helm values.yaml. Previously, even after setting the above value to false, the Triton Pools feature remained enabled.

Version 1.3

New Features

Updated the default Triton image from 23.10 to 23.11.

Bug Fixes

When a model is deleted from a pooled Triton instance, the files are now deleted. Previously, they remained in the pod consuming space, which would lead to the pod eventually running out of storage.

Version 1.2

New Features

Updated the default Triton image from 23.09 to 23.10.
Added the ability to autoscale based on the percentage of total inference request time spent in the queue. This is in addition to the ability to scale on the absolute time spent in the queue. See the autoscaling options of tmsctl for more details.

Version 1.1

New Features

Updated the default Triton image from 23.08 to 23.09. Since 23.09 lowers the memory requirements of Triton, we also lowered the default value for the minimum amount of memory used by Triton pods.
In order to improve visibility on different console color schemes, the output of tmsctl no longer includes colors by default. Colors can be enabled and configured via tmsctl configuration options.
Added a feature to allow the TMS administrator to configure the CPU and memory resources allocated and reserved for the TMS API server and database. See the deployment guide for more information.
Model repository size for Triton instances can be configured during lease acquisition (initially fixed at 2GB prior to v1.1). See the section under --triton-resources in Triton Options and Configuring Triton Containers to learn more.

Bug Fixes

Configured colors for headers in tmsctl output are used properly now.
More informative message returned by LeaseService when a header is missing.
Lease.List() method now works with the verbose option enabled.
Leases are no longer terminated by the kubernetes startup probe when loading larger models (particularly LLMs).

Version 1.0

New Features

Support for Generic S3 Model Repositories (including MinIO and Google Cloud Storage). (See updated documentation for S3 Configuration).
Ability to configure longer timeout for loading larger models in values.yaml through server.lease.timeout.
Finalized our TMS v1.0 API with the intent of retaining forward/backward compatibility with all TMS v1.x releases. Be aware that several v0.x interfaces were changed and are incompatible with the v1.0 API. Please make sure to upgrade to the latest tmsctl and to upgrade any custom clients to avoid compatibility issues.

Bug Fixes

Fixed several small bugs in the Triton pool feature, including in the scheduling algorithm.
Fixed bugs which were causing leases which were cancelled during creation to remain in a pending state. There are still some situations when this may happen, but such leases can now be released via tmsctl lease release.
Fixed a bug which allowed Triton pods to report they were ready before they were fully operational.

Test Environments

This version of TMS was tested with the below versions of different packages. Other versions may work as well, but have not been tested.

Kubernetes:
- CNS v9.0 (Kubernetes 1.26 – see the CNS documentation for more details).
- EKS with Kubernetes 1.26 - 1.27
- AKS with Kubernetes 1.26
Triton: 23.03
Helm: 3.11
Prometheus:
- Prometheus: 2.45 (Prometheus community Helm chart version 48.1.1)
- Prometheus Adapter: 0.10 (Prometheus community Helm chart version 4.2.0)

Version 0.11

New Features

New configuration options were added in values.yaml to allow custom installations of Prometheus to work properly with autoscaling in TMS.
When configuring queue time thresholds for autoscaling in values.yaml, units must now be specified. Previously, the values were specified as a number of microseconds (e.g. 10000). Now, units are required (e.g. “10ms”, “100us”).
Option to configure TMS to persist data in the case of server failure or restart. (See Configuring Persisted Database)
New feature: Triton Pools has been added. Provides improved methods for maximizing utilization of Triton Servers deployed by TMS.

See: Triton Pools & Quota Based Shared Tritons for additional information.
REST service has been removed.

Fixes

Fixed a bug which prevented loading models in an S3 bucket which were in nested directories.

Version 0.10

New Features

Triton Recovery

Triton instances can now recover from failures. If a pod hosting Triton instance is killed, or if the Triton server dies, the pod will be restarted and all the models will be reloaded when the new pod starts.

Lease Names

Leases will now be able to have custom labels associated with them. These names can be used to interact with a lease in place of the default "Triton-####" label for kubernetes services. See the lease-name section under tmsctl to learn more.

Fixes

Fixed a bug which prevented large models from being loaded due to using too much ephemeral storage in the Triton Sidecar containers.

Version 0.9

New Features

Added the ability to set the minimum number of replicas in autoscaling leases. Administrator may configure thresholds and defaults for this value. Users may control it on a per-lease basis via the gRPC API and tmsctl.

Lease Events

Triton Management Service will now record events related to the lifecycle of a lease.

Lease event information will be provided during the creation of a new lease, as well as when requesting the status of a lease.

Lease events provide a mechanism for TMS to report Triton related, model loading errors.
Updated the output of tmsctl lease create, tmsctl lease list, and tmsctl lease status.
- tmsctl lease create now includes lease event information. The pretty and porcelain (-z) output have both been updated to include event information.
  
  The pretty print version of the output now reports the status of all models in the lease and attempt to avoid scrolling by resetting the cursor position for every update received from the server.
- tmsctl lease status now includes lease event information.
- tmsctl lease list no longer includes model information. This was removed to reduce load on the server’s database when large numbers of leases were being returned.

Support for TLS Encrypted Connections

Added support for TLS encrypted connections to TMS and Triton Servers.
With the initial implementation, certificate validation has been disabled for TMS Server, Triton Sidecar, and TMS Control (tmsctl). Certificate validation will be enabled in a future version of TMS.

Support for Public S3 Model Repositories without IAM

Added support for model repositories residing in public S3 buckets.
This provides an alternative option for accessing S3 models without IAM.

Horizontal Pod Autoscaling on Average Queue Time

Modified leases to scale on average inference queue time with Triton Server.
Prior to this change, the queue time metric was processed incorrectly and leases would not scale correctly.

Fixes

Fixed an issue where the URL of a lease’s Triton Server was not provided when reading a lease’s status.
Fixed an issue where the deployment of Triton Servers would fail when services that depend on init-containers were injected into the deployment due the “PodInitializing” container status not being correctly handled.
Fixed an issue where models in a multi-model lease were loaded at random into Triton Server. This issue resulted in failures for ensemble models, but is now fixed to load models in the ordering from the lease request.
Fixed an issue where the TMS deployment was failing due to bugs in parsing helm chart values.

Known Issues

Canceling tmsctl lease create by using Ctrl+C to terminate the process, correctly prevents the deployment of the requested lease, but leaves the lease in a Pending “zombie” state. Leases caught in this state are metadata artifacts that no impact on the behavior of TMS or any deployed Triton Servers.

Version 0.8

New Features

Added a tmsctl lease renew command to renew leases via tmsctl.
Added options to control lease duration on a per-lease basis. Users may specify the duration via the gRPC API and via tmsctl. Administrators can configure limits for these values.
Added options to control autoscaling parameters on a per-lease basis. Users may specify the duration via the gRPC API and via tmsctl. Administrators can configure limits for these values.
Added tmsctl target set and tmsctl target rm commands.
Added tmsctl lease list command.
Added official support for Azure Blob and Azure File model repositories.

Known Issues

Autoscaling on GPU utilization does not function correctly when Triton has been deployed to an Ampere MIG partition. Autoscaling on CPU utilization and queue time does work on such systems.
TMS Clean-up job can hang when uninstalling a TMS deployment using helm delete [tms-instance-name]. This can result in Helm timing out the uninstallation. helm delete [tms-instance-name] --no-hooks can be used as a workaround.

When using the workaround, cluster administrators might have to manually delete abandoned TMS pods, deployments, service, secrets, and/or certificates due to the clean-up job not having been run.

Version 0.7.1

Fixes

Lease Requests with Multiple Models

Fix has been added that enables users to request Triton leases with multiple models in a single request.

Known Issues

Autoscaling on GPU utilization does not function correctly when Triton has been deployed to an Ampere MIG partition. Autoscaling on CPU utilization and queue time does work on such systems.

Version 0.7

New Features

Improvements in Autoscaling Leases

Autoscaling leases now support automatic renewal, just like non-autoscaling leases.
The metrics that control scaling can be configured by the TMS administrator.

Expanded & Improved Metrics

TMS now reports model loading metrics and over 200 runtime metrics.

Model metrics are related to the loading of models into Triton Inference Server. Any reported model metrics will include the URL used to acquire the model.

Collected metrics have a “visibility score” assigned to them based on their importance and utility. When reporting metrics, TMS will only report the collected metrics that have a “visibility score” equal to or greater than the configured “minimum visibility”.

The minimum visibility value can be changed in the values.yaml file of the Helm chart.

Support for AWS S3 Model Repositories

TMS now supports reading models from AWS S3 model repositories. TMS administrators should take a look at the section of the model repository docs for assistance on this.

Known Issues

Autoscaling on GPU utilization does not function correctly when Triton has been deployed to an Ampere MIG partition. Autoscaling on CPU utilization and queue time does work on such systems.

Version 0.6

New Features

Lease Autoscaling

TMS users can now request that lease automatically scale the number of Triton instances servicing them based on utilization. For full details, see the autoscaling configuration and usage instructions.

Persistent Volume Repositories

Administrators can now attach model repositories in persistent volumes to their TMS instance. To learn more, please refer to the Persistent Volume Claims section in the model repository guide.

Additionally, TMS now supports AWS EBS persistent volumes. Refer to the Persistent Volume Claims section in the model repository guide.

Bug Fixes

Known Issues

Version 0.5

New Features

NFS Model Repositories

TMS administrators can now configure TMS with model repositories hosted on NFS servers which Triton instances can load models from.

Unlike http model repositories, NFS hosted repositories provide TMS the benefit of being able to now consume decompressed Triton models in lease requests.

To use an NFS model repository, TMS administrators will have to create a Kubernetes persistent volume with a respective persistent volume claim (in the same namespace as TMS) for the NFS server. The persistent volume claim name should be provided in TMS’s helm charts.

There is a guide in the quickstart guide providing more elaborate instructions. Additionally, see the default values under values.yaml#sidecar.modelRepositories.nfs to learn more.