Triton Pools and Quota Based Shared Tritons

Deployment Guide (1.1.0)

Triton Pools enable TMS administrators to create a set of Triton instances that can be shared by any leases created and assigned to the pool. Multiple pools can exist simultaneously with each pool having its own definition and purpose.

Pool definitions allow for the specification of:

  • the container image used to deploy Triton instances

  • the resources reserved for and assigned to each Triton instance in the pool

A pool definition includes a minimum and maximum pool size, which bound the number of concurrent Triton instances the pool supports. The definition also contains a per-instance quota value, which TMS uses to determine the best candidate Triton instances when assigning new leases.
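
The exact pool definition format depends on how TMS is deployed and configured; purely as a sketch, with hypothetical field names rather than the TMS schema, the sizing-related portion of a definition can be pictured as follows.

    from dataclasses import dataclass

    @dataclass
    class PoolSizing:
        # Hypothetical sketch of the sizing fields in a pool definition (not the TMS schema).
        name: str                # unique among all existing pools
        min_instances: int       # the pool tries to keep at least this many Triton instances running
        max_instances: int       # the pool never creates more than this many Triton instances
        per_instance_quota: int  # abstract capacity each Triton instance offers to leases

    # Example: keep one instance warm, allow growth to four, and give each instance 8 quota units.
    pool = PoolSizing(name="shared-pool", min_instances=1, max_instances=4, per_instance_quota=8)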

Note

Triton Pools work best in clusters with homogeneous GPUs. TMS does not take GPU SKU into consideration when determining the capacity of Triton instances.

Name

Triton Pools must be given a name. A pool’s name must be unique among all existing pools; a name can be reused after its previous pool has been deleted. This name identifies the pool when creating leases or otherwise interacting with the pool.

Per-Instance Quota

Triton Pools must have a per-instance quota value. The quota value is used to determine which Triton instances have available “space” to assign new leases.

The meaning of the units used to define the quota is determined by the pool’s creator. TMS applies no specific meaning to the value.

For example, a pool could be created in a cluster with GPUs that all have 40Gi of memory. The TMS administrator could then decide that each Triton instance should be assigned a per-instance quota value of 40, and expect that all leases deployed into the pool specify the amount of GPU memory they require in gigabytes.

TMS only compares a lease’s quota request against any available quota on Triton instances in the pool. TMS does not inspect Triton or the GPU to determine the actual amount of available GPU memory. There is no enforcement of quota values at runtime.

For example, if you know that each Triton instance in a pool is capable of hosting four leases, you can create the pool with a per-instance quota value of 4, with each lease specifying a quota value of 1.

Each unit of per-instance quota is equal to a single “computing slot,” and each lease consumes one or more of them.
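
As a rough illustration of this bookkeeping, and using hypothetical names and a simple first-fit policy rather than the actual TMS selection logic, placement amounts to comparing a lease’s quota request against each instance’s remaining quota.

    def pick_instance(instances, lease_quota):
        # Return the first Triton instance with enough unreserved quota, or None.
        # First-fit is used here only for illustration; TMS's real candidate selection may differ.
        for inst in instances:
            if inst["capacity"] - inst["used"] >= lease_quota:
                return inst
        return None

    pool = [{"id": "triton-0", "capacity": 4, "used": 3},
            {"id": "triton-1", "capacity": 4, "used": 1}]
    print(pick_instance(pool, lease_quota=2))  # triton-1 has 3 free units, so it is selected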

Instance Limits

Triton Pools are defined with a minimum and maximum number of Triton instances that they’re allowed to create and host. When the minimum value is greater than zero, the pool attempts to always have at least that number of Triton instances available.

These limits determine when a pool scales the number of Triton instances it hosts. When a lease is created and assigned to the pool, and no Triton instance in the pool has sufficient available quota to host it, the pool attempts to add a new Triton instance, provided the maximum pool size has not already been reached. When a pool cannot host the lease and cannot add a new Triton instance, the lease creation attempt is rejected.

Note

When attempting to create new Triton instances, pools are limited by the amount of available cluster resources.
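
Putting these pieces together, the decision described above can be sketched as follows; the names and flow are illustrative assumptions, not the TMS implementation.

    def place_lease(instances, lease_quota, max_instances, per_instance_quota):
        # Illustrative decision flow: reuse an existing instance, grow the pool, or reject the lease.
        for inst in instances:
            if inst["capacity"] - inst["used"] >= lease_quota:
                return f"assign lease to {inst['id']}"
        if len(instances) < max_instances and lease_quota <= per_instance_quota:
            # Creating the instance still requires available cluster resources (not modeled here).
            return "scale up: create a new Triton instance and assign the lease to it"
        return "reject: the pool has insufficient capacity"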

Triton Definition

Triton Pool definitions include a specification for each Triton instance in the pool. Triton definitions include the Triton container image used, the number of logical CPU cores, the number of GPUs, and the amount of memory reserved and assigned to each Triton instance created by the pool.
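
The exact schema again depends on the deployment; the sketch below uses hypothetical keys, not the TMS schema, to show the kind of information a Triton definition carries.

    # Hypothetical per-instance specification (illustrative keys, not the TMS schema).
    triton_spec = {
        "image": "nvcr.io/nvidia/tritonserver:<version>-py3",  # container image for every instance in the pool
        "cpu_cores": 4,       # logical CPU cores reserved for each instance
        "gpus": 1,            # GPUs reserved for and assigned to each instance
        "memory_gib": 16,     # system memory reserved for each instance, in GiB
    }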

Enforcement of Triton Backend Uniqueness

Triton Pool definitions include an option to disable enforcement of the rule that restricts each Triton instance to the Triton backends used by the first lease deployed on it. Enforcement is enabled by default because mixing Triton backends on a single instance is discouraged due to issues with memory management.

For example, a Triton instance is deployed with enforcement enabled. The first lease deployed to the instance is an ensemble of two models: the first is a TensorFlow model, the second is a PyTorch model. The TensorFlow and PyTorch backends each allocate as much memory as possible, effectively splitting the available memory between them.

From this point on, because enforcement is enabled, only leases that depend on the TensorFlow and/or PyTorch backends are deployed to this Triton instance. Any subsequent lease deployed to this instance may only contain models whose backends meet this requirement.

The first time TMS encounters a model, its backend is considered unknown and therefore cannot be assigned to any existing Triton instance. After deployment, TMS determines which Triton backend the model depends on and updates its record of the Triton instance to reflect the correct mix of backends active on the instance. Additionally, TMS records the Triton backend information of the model such that all future deployments of the same model are able to correctly select the Triton instances that match the model’s Triton backend requirements.
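
One way to picture this eligibility check, purely as an illustration and not the TMS implementation, is a subset test against the backends recorded for an instance, with any unknown backend forcing the lease onto a new instance.

    def eligible(instance_backends, lease_backends):
        # Illustrative check: with enforcement enabled, a lease may only join a Triton instance
        # whose active backends cover every backend the lease's models require.
        # lease_backends is None when any of the lease's models has a backend TMS has not yet recorded.
        if lease_backends is None:
            return False  # unknown backends: the lease must be placed on a new Triton instance
        return lease_backends <= instance_backends

    print(eligible({"tensorflow", "pytorch"}, {"pytorch"}))      # True: PyTorch is already active
    print(eligible({"tensorflow", "pytorch"}, {"onnxruntime"}))  # False: ONNX Runtime is not active here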

When enforcement is disabled, TMS selects Triton instances without taking into consideration model backends or which Triton backends are active on instances. This simplifies instance selection, but incurs the risk of attempting to load a model with a backend that’s not present on a Triton instance. There is also a risk that the instance has insufficient memory available to load the model.

Enabling enforcement is recommended unless extensive testing with a restricted set of models has been done to ensure Triton instance stability.

Quota Based Shared Triton Leases

Quota Based Shared Triton (QBST) leases are defined as one or more models with a specified quota consumption value, assigned to a Triton Pool. The specified quota consumption value, or quota, is used to determine how the lease’s models are hosted. Leases with multiple models always have all of their models hosted by a single Triton instance.
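
Conceptually, and using hypothetical field names rather than the TMS lease API, a QBST lease carries three pieces of information: the target pool, the models to deploy, and the quota value.

    # Hypothetical sketch of what a QBST lease specifies (not the TMS lease API).
    lease = {
        "pool": "shared-pool",                   # the Triton Pool the lease is assigned to
        "models": ["preprocess", "classifier"],  # all of a lease's models share one Triton instance
        "quota": 2,                              # quota units the lease claims to consume
    }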

Quota

The quota value of a QBST lease defines the amount of “space” or resources that the lease consumes. The units and meaning of the quota value are defined by the pool’s creator. TMS applies no specific meaning to the value.

For example, a Triton Pool might define its units as fractions of a Triton instance’s compute capacity, with the per-instance quota acting as the denominator. In this example, each Triton instance is assigned a quota capacity of 8. Any lease assigned to this pool must specify, through its quota value, what fraction of a Triton instance it will consume.

A lease could be created that expects to consume a quarter of the capacity of a Triton instance and would be assigned a quota value of 2. This lease could share the Triton instance with any combination of other leases whose quota sum is less than or equal to 6 (the remaining quota capacity).
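
Continuing this example with made-up numbers, the bookkeeping reduces to simple arithmetic.

    capacity = 8           # per-instance quota capacity for this example pool
    first_lease = 2        # the lease that expects to consume a quarter of an instance
    other_leases = [3, 3]  # any mix of leases whose quota values sum to at most 6 can share the instance
    assert first_lease + sum(other_leases) <= capacity
    print(capacity - first_lease - sum(other_leases))  # 0 quota units remain on this instance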

It is up to the pool’s administrator to determine the units and meaning of a pool’s quota, and the measures by which a lease is expected to determine the amount of quota it consumes.

TMS uses lease and pool quota values to determine how and where to place leases within a pool. TMS does not enforce any kind of resource utilization after a lease has been assigned to a Triton instance.

Extending the previous example, a lease creator specifies that their lease consumes 2 quota units. In actuality, the lease consumes 8 quota units (that is, an entire Triton instance). Because the lease consumes significantly more resources than advertised, several of the loaded models, including models loaded for other leases, experience significant performance degradation and out of memory errors.

It is important to test the quota consumption values of leases before creating them in a production environment. Undervaluing leases can lead to performance degradation, out of memory errors, and Triton instance instability. Overvaluing leases, while often safer, can leave hardware underutilized and potentially cause capacity issues due to external processes being forced to wait for available AI cycles.
