Step #6: Bin Packing with TMS

Another one of TMS’s key features is GPU bin packing, which co-locates models on the same GPU when sufficient capacity is available. To enable this, we pre-specify the expected GPU memory usage of each model, and TMS loads the model onto an appropriate GPU. This helps prevent issues that arise when models from different frameworks, such as PyTorch and TensorFlow, run on the same server and fail to release GPU memory properly.

For this lab, we will work through a relatively simple bin packing example in which TMS co-locates and serves models on our GPUs. The same approach can be applied at a larger scale in production.

Bin Packing with TMS looks something like this:


For this lab, we are using an NVIDIA A30 GPU with 24GB of memory. To demonstrate bin packing, we have activated Multi-Instance GPU (MIG) on our A30, using two 12GB MIG slices. This way, we can demonstrate bin packing across the two GPU slices.

TMS uses the concept of pools of leases to perform bin packing. Triton Pools enable TMS administrators to create a set of Triton instances which can be shared by any leases created and assigned to the pool. Multiple pools can exist simultaneously with each pool having its own definition and purpose. You can read more about the pool feature in the documentation here.

With TMS pools, each lease is created with a user-specified “quota” value, and each pool of leases specifies how many Triton instances it can deploy and how much quota each instance can consume. The quota value is abstract, so its meaning is left up to the user. For example, in this lab, the quota value will represent MB of GPU memory. We will create a pool that has a Triton instance limit of 2 (since we have 2 GPU instances) and a quota limit of 12000 per instance (since each GPU instance has about 12000 MB of memory).

Create a pool named sens-info using tmsctl:


./tmsctl pool create sens-info --instance-quota 12000 --instances 0,2 --disable-backend-uniqueness

Consider a situation where you have various sensitive information detection AI models that you use for company security, and each model is fine-tuned for a specific employee. In this case, you need to serve many models at the same time since there is one model per employee.

Since all these models have the same backend, we can use TMS to perform bin packing and maximize memory efficiency. In the sid subdirectory of model_repository, you will find 9 sensitive information detection models that we will use in this lab. We know that these models, at maximum, each consume roughly 1550 MB of GPU memory, which we will use when specifying the quota for each lease.
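The quota arithmetic behind this setup can be sketched in a few lines of shell. This is only an illustration under stated assumptions: it fills one instance before opening the next, which may not match the placement policy TMS actually uses, and the variable names are hypothetical. The numbers (12000 per instance, at most 2 instances, 1550 per lease, 9 models) are the ones used in this lab.

```shell
#!/bin/sh
# Hypothetical greedy-fill simulation; TMS's real placement logic may differ.
instance_quota=12000   # per-instance quota limit from the pool definition
max_instances=2        # upper bound on Triton instances in the pool
lease_quota=1550       # quota requested by each SID lease
num_leases=9           # models in the sid subdirectory

# Largest number of leases whose combined quota fits in one instance.
per_instance=$(( instance_quota / lease_quota ))
echo "leases per instance: $per_instance"

instance=0         # index of the instance currently being filled
used=0             # quota already consumed on that instance
placed=0           # leases successfully placed so far
instances_used=0   # instances that hold at least one lease
i=0
while [ "$i" -lt "$num_leases" ]; do
  # Open the next instance when this lease would exceed the quota limit.
  if [ $(( used + lease_quota )) -gt "$instance_quota" ]; then
    instance=$(( instance + 1 ))
    used=0
  fi
  # Stop once the pool's instance limit is exhausted.
  if [ "$instance" -ge "$max_instances" ]; then
    echo "lease $i: no capacity left in the pool"
    break
  fi
  used=$(( used + lease_quota ))
  placed=$(( placed + 1 ))
  instances_used=$(( instance + 1 ))
  echo "lease $i -> instance $instance (used $used of $instance_quota)"
  i=$(( i + 1 ))
done
echo "placed $placed leases on $instances_used instance(s)"
```

Under these assumptions, seven leases fit on the first instance and the remaining two land on the second, which is the split this lab expects to observe.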


In production environments, you can use NVIDIA’s model-analyzer or another tool to determine how much GPU memory your models will consume at maximum.

  1. Create a lease for our first SID model, specifying our pool.

    After each lease creation, you can use kubectl get pods to see the Triton instance(s) that have been deployed. Since each MIG slice has 12 GB of memory (and our pool creation reflects this), TMS will co-locate the first 7 leases on the same GPU: 7 × 1550 = 10850 fits within the 12000 quota limit, whereas an eighth lease would bring the total to 12400.


    ./tmsctl lease create -m "name=sid-minibert-onnx-0,uri=model://volume-models/sid/sid-minibert-onnx-0/" --quota 1550 -p sens-info --duration 30m
    ./tmsctl lease create -m "name=sid-minibert-onnx-1,uri=model://volume-models/sid/sid-minibert-onnx-1/" --quota 1550 -p sens-info --duration 30m
    ./tmsctl lease create -m "name=sid-minibert-onnx-2,uri=model://volume-models/sid/sid-minibert-onnx-2/" --quota 1550 -p sens-info --duration 30m
    ./tmsctl lease create -m "name=sid-minibert-onnx-3,uri=model://volume-models/sid/sid-minibert-onnx-3/" --quota 1550 -p sens-info --duration 30m
    ./tmsctl lease create -m "name=sid-minibert-onnx-4,uri=model://volume-models/sid/sid-minibert-onnx-4/" --quota 1550 -p sens-info --duration 30m
    ./tmsctl lease create -m "name=sid-minibert-onnx-5,uri=model://volume-models/sid/sid-minibert-onnx-5/" --quota 1550 -p sens-info --duration 30m
    ./tmsctl lease create -m "name=sid-minibert-onnx-6,uri=model://volume-models/sid/sid-minibert-onnx-6/" --quota 1550 -p sens-info --duration 30m

    You can now use ./tmsctl pool status sens-info -v to see the status of our pool, which will show one Triton instance deployed and all 7 of these leases located on that instance. Next, we will create two more leases, for which TMS will automatically deploy another Triton instance on the other MIG slice to host these two models.


    ./tmsctl lease create -m "name=sid-minibert-onnx-7,uri=model://volume-models/sid/sid-minibert-onnx-7/" --quota 1550 -p sens-info --duration 30m
    ./tmsctl lease create -m "name=sid-minibert-onnx-8,uri=model://volume-models/sid/sid-minibert-onnx-8/" --quota 1550 -p sens-info --duration 30m

  2. You should now see two Triton pods deployed when running kubectl get pods, and you can use ./tmsctl pool status sens-info -v to see the updated status of the pool.
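Because the nine lease commands differ only in the model index, they can also be generated with a small loop. The sketch below is a dry run that only prints the commands; the helper name lease_commands and the variable names are hypothetical, while the flags and the model URI pattern are copied from the commands in this step.

```shell
#!/bin/sh
# Dry run: prints the nine lease-creation commands instead of running them.
pool="sens-info"
quota=1550
duration="30m"

# Emit one tmsctl command per SID model, indices 0 through 8.
lease_commands() {
  i=0
  while [ "$i" -le 8 ]; do
    model="sid-minibert-onnx-$i"
    echo "./tmsctl lease create -m \"name=$model,uri=model://volume-models/sid/$model/\" --quota $quota -p $pool --duration $duration"
    i=$(( i + 1 ))
  done
}

lease_commands
```

After reviewing the printed commands, one option is to pipe the output to sh to execute them. Note that doing so would create all nine leases in one pass, skipping the intermediate pool status check after the first seven.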

© Copyright 2022-2023, NVIDIA. Last updated on Sep 29, 2023.