A lease is the primary organizational unit used by TMS. Leases allow you to describe which models to load, and how to control their lifecycle. Leases provide a convenient means of describing the Triton instance where the models are loaded, without having to manage the Triton instance directly. Additionally, leases allow you to configure features like autoscaling or sharing Triton servers.
A lease consists of the following:
- A model or ensemble of models that are loaded together in a Triton instance.
- A description of the Triton instance in which to load the models. This can be:
  - A bespoke Triton instance that is created just for this lease and is not shared. Bespoke Triton instances support autoscaling.
  - The name of a pre-existing [Triton pool](./triton-pools.md), where the lease can share a Triton instance with other leases.
- Information about the duration of the lease, including whether it supports automatic renewal based on usage.
For example, a lease might have the following characteristics:
- Consists of a single model named `model_A`, along with the URI from which to fetch it.
- Uses a bespoke Triton instance with the default configuration as set by the system administrator.
- Lasts for the default duration as set by the system administrator.
A more complex lease might have the following characteristics:
- Consists of an ensemble of models. Each model (for example, `model_C`) is specified along with the URI from which to fetch it. These models are loaded in the order specified to ensure the ensemble works properly.
- Describes a bespoke Triton instance on which to run, with custom resource requirements that include:
  - The Triton instance supports autoscaling up to four copies, scaling up when inference queue time exceeds 200 ms.
  - Each Triton instance requires 2 GPUs, 4 CPUs, and 32 GB of main memory.
- Remains active for 8 hours and automatically renews for another 4 hours, as long as it has served inference requests in the 30 minutes before the lease is scheduled to expire.
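
As a rough illustration of the more complex lease above, the following Python sketch captures its settings as plain data and applies the renewal rule. The names `TritonResources`, `LeaseSpec`, and `next_expiration`, the field names, and the `placeholder://` URI are hypothetical and are not part of the TMS API; only the numeric values come from the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class TritonResources:
    """Hypothetical field names; the values mirror the example above."""
    gpus: int = 2
    cpus: int = 4
    memory_gb: int = 32
    max_replicas: int = 4              # autoscale up to four copies
    scale_up_queue_time_ms: int = 200  # scale up past this queue time


@dataclass
class LeaseSpec:
    """Illustrative lease description, not a real TMS message type."""
    models: List[Tuple[str, str]]  # (name, URI) pairs, kept in load order
    resources: TritonResources = field(default_factory=TritonResources)
    duration: timedelta = timedelta(hours=8)
    renewal: timedelta = timedelta(hours=4)
    activity_window: timedelta = timedelta(minutes=30)


def next_expiration(spec: LeaseSpec, expires_at: datetime,
                    last_inference_at: Optional[datetime]) -> Optional[datetime]:
    """Return the extended expiration if the lease qualifies for renewal."""
    if last_inference_at is None:
        return None
    window_start = expires_at - spec.activity_window
    if window_start <= last_inference_at <= expires_at:
        return expires_at + spec.renewal
    return None


# Placeholder URI: the real URI scheme depends on the deployment.
spec = LeaseSpec(models=[("model_C", "placeholder://model_C")])
expires_at = datetime(2024, 1, 1, 8, 0)
print(next_expiration(spec, expires_at, datetime(2024, 1, 1, 7, 45)))
```

The sketch treats a request exactly at the start of the 30-minute window as qualifying for renewal; that boundary choice is an assumption.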
The exact lifecycle of a lease varies depending on the application requirements, but a typical lease lifecycle requires the following tasks:
1. Create a lease using a `Lease.Acquire()` API call and get the URL of the Triton instance where the models were loaded.
2. Run inference against the models in the lease using the Triton inference API.
3. Optionally renew the lease using `Lease.Renew()` calls, or let it renew automatically if it is configured to renew while still in use.
4. Either manually release the lease using a `Lease.Release()` call, or let it expire. Either way, this frees the associated Triton instance for bespoke leases or marks the resources as available for leases running on pooled Triton instances.
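
The lifecycle above can be sketched with a toy, in-memory stand-in for the lease service. This is not the TMS client API: the class `InMemoryLeaseService`, its method signatures, the state strings, and the `example.invalid` URL are all hypothetical. The sketch also assumes that a renewal restarts the lease duration from the time of the call.

```python
import itertools
from datetime import datetime, timedelta


class InMemoryLeaseService:
    """Toy stand-in for a lease service; not the TMS client API."""

    def __init__(self, default_duration=timedelta(hours=1)):
        self._default_duration = default_duration
        self._leases = {}
        self._ids = itertools.count(1)

    def acquire(self, models, now):
        """Create a lease and return its ID and a placeholder Triton URL."""
        lease_id = f"lease-{next(self._ids)}"
        url = f"triton-{lease_id}.example.invalid:8001"  # placeholder URL
        self._leases[lease_id] = {
            "models": list(models),
            "expires_at": now + self._default_duration,
            "state": "valid",
            "triton_url": url,
        }
        return lease_id, url

    def status(self, lease_id, now):
        """Report the lease state; expired and released leases can be queried."""
        lease = self._leases[lease_id]
        if lease["state"] == "valid" and now >= lease["expires_at"]:
            lease["state"] = "expired"
        return lease["state"]

    def renew(self, lease_id, now):
        """Assign a new expiration date to a lease that is still valid."""
        if self.status(lease_id, now) != "valid":
            raise ValueError("lease is no longer valid")
        lease = self._leases[lease_id]
        lease["expires_at"] = now + self._default_duration
        return lease["expires_at"]

    def release(self, lease_id):
        """End the lease before its expiration."""
        self._leases[lease_id]["state"] = "released"


service = InMemoryLeaseService()
start = datetime(2024, 1, 1, 0, 0)
lease_id, triton_url = service.acquire(["model_A"], start)
# ... run inference against triton_url with a Triton client here ...
service.renew(lease_id, start + timedelta(minutes=30))
service.release(lease_id)
```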
For more details on the available operations, continue reading. There is also a tutorial that guides you through the basics of working with leases.
The Lease Service exposes a number of RPC end-points including:
The Triton Allowlist Service exposes the following RPC end-points:
Each of the end-points accepts a single structured request and responds with a structured response.
The gRPC protocol supports streaming requests and/or responses. One or both sides of the interaction can stream data to the other. Functionally, this allows the server to begin sending response data before the client has finished sending request data.
The expected order of operations for the Lease Service is:
`Create` to create a new lease with a specified set of AI models.
Assuming the request is successful, the response includes a unique identifier and an expiration date for the new lease.
All models in a lease acquire request are considered bundled. They cannot be loaded or unloaded separately. Additionally, all models in a lease are loaded into the same instance of the Triton Inference Server. If it is impossible to do so (for example, insufficient memory), then the lease is marked as invalid and any successfully loaded models are unloaded after the first model load failure is detected. TMS does not support partially loaded leases.
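The all-or-nothing behavior described above can be sketched as a load loop that rolls back on the first failure. The function `load_bundle` and its `load_model` and `unload_model` callbacks are hypothetical placeholders for whatever actually loads models into Triton; this is not how TMS is implemented, only an illustration of the rule.

```python
def load_bundle(model_names, load_model, unload_model):
    """Load every model in order, or none of them. Returns True on success."""
    loaded = []
    for name in model_names:
        try:
            load_model(name)
        except Exception:
            # First failure: unload everything loaded so far and stop,
            # so the lease is never left partially loaded.
            for done in reversed(loaded):
                unload_model(done)
            return False
        loaded.append(name)
    return True


# Usage with an in-memory stand-in for a Triton instance.
active = set()


def fake_load(name):
    if name == "too_big":  # simulate insufficient memory
        raise MemoryError(name)
    active.add(name)


print(load_bundle(["model_A", "too_big"], fake_load, active.discard))
print(sorted(active))
```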
This RPC begins streaming a response after the request has been received. The server sends a series of model status updates to the caller to show continued progress as the lease’s models are deployed. Model status updates are sparse: they do not include the status of every model every time.
The final response from the server includes the status for every model in the request and data for the lease itself.
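Because the streamed updates are sparse, a client that wants a full picture has to merge each update into the state it already holds. The following sketch shows one way to do that. The function `track_progress`, the dictionary shape of an update, and the status strings (`PENDING`, `LOADING`, `READY`) are assumptions made for illustration and do not reflect the actual TMS message format.

```python
def track_progress(model_names, updates):
    """Merge sparse per-model updates and yield a full snapshot after each."""
    status = {name: "PENDING" for name in model_names}
    for update in updates:
        # Each update mentions only the models whose status changed.
        status.update(update)
        yield dict(status)


# Hypothetical stream of sparse updates for a two-model lease.
stream = [
    {"model_A": "LOADING"},
    {"model_B": "LOADING"},
    {"model_A": "READY", "model_B": "READY"},
]
for snapshot in track_progress(["model_A", "model_B"], stream):
    print(snapshot)
```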
`Renew` to extend the lease’s duration. After a lease is renewed, it is assigned a new expiration date.
After a lease has expired, it is no longer valid and any associated models are unloaded and become unavailable. Any resources consumed by the lease are returned to the hosting Triton Inference Server to be used by future leases. If a Triton Inference Server instance becomes unnecessary, it is deleted and its resources are returned to the cluster.
`Status` to get the current status of a specific lease.
Requesting the status of an expired or released lease is a valid operation.
`Release` to terminate a lease before its expiration is reached.
After a lease has been released, it is no longer valid and any associated models are unloaded and become unavailable. Any resources consumed by the lease are returned to the hosting Triton Inference Server for use by future leases. If a Triton Inference Server instance becomes unnecessary, it is deleted and its resources are returned to the cluster.
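The cleanup described above can be pictured as simple bookkeeping of which leases use which Triton instance. In the sketch below, an instance is treated as unnecessary once no leases remain on it; that interpretation, the class `InstanceTracker`, and the instance names are assumptions for illustration and may not match how TMS decides when to delete an instance.

```python
class InstanceTracker:
    """Track which leases are attached to which Triton instance."""

    def __init__(self):
        self._leases_by_instance = {}

    def attach(self, instance, lease_id):
        """Record that a lease is using an instance."""
        self._leases_by_instance.setdefault(instance, set()).add(lease_id)

    def detach(self, instance, lease_id):
        """Remove a lease; return True if the instance was then deleted."""
        leases = self._leases_by_instance.get(instance, set())
        leases.discard(lease_id)
        if not leases:
            # No leases left: drop the instance from the tracked set.
            self._leases_by_instance.pop(instance, None)
            return True
        return False

    def instances(self):
        """Return the names of instances that still host a lease."""
        return sorted(self._leases_by_instance)


tracker = InstanceTracker()
tracker.attach("triton-1", "lease-1")
tracker.attach("triton-1", "lease-2")
tracker.detach("triton-1", "lease-1")  # instance is still in use
tracker.detach("triton-1", "lease-2")  # last lease gone, instance removed
print(tracker.instances())
```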