Step #4: Creating Leases and Performing Inference
We are now ready to use TMS to create a lease. You can think of a lease as TMS's way of telling Triton to serve specific models for a specific duration of time. The benefit of this is that TMS will serve your models for however long you specify, and then automatically unload them from your GPUs. TMS also recognizes when a model is actively being used, and can automatically renew the model's lease until it is no longer in use. We will be creating and managing leases using tmsctl, which is TMS's command-line tool. You can read the tmsctl documentation here.
You can also use the TMS API to create leases and interact with TMS, but in this lab we will only use tmsctl.
Navigate to the tmsctl directory through the SSH console. (Outside of this lab, you would have to pull tmsctl yourself.)
Since our SSH console is outside of the cluster, we need to start a port-forwarding process to the TMS service on the cluster, which uses port 30345 by default. You can see the TMS service on the cluster:
$ kubectl get svc
NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
kubernetes   ClusterIP   10.96.0.1        <none>        443/TCP     2d
tms          ClusterIP   10.104.242.223   <none>        30345/TCP   5m
Start the port-forward process in the background.
kubectl port-forward svc/tms 30345:30345 &
Note: If you leave and return to the SSH console, this background process might be killed. If you are seeing errors while trying to use tmsctl, it is likely because this port-forwarding process isn't running.
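If you suspect the background process has been killed, a quick check-and-restart sketch (assuming the same svc/tms service and port as above) is:

```shell
# Check whether a kubectl port-forward process is still running
pgrep -af "kubectl port-forward" || echo "port-forward not running"

# Restart it in the background if needed
kubectl port-forward svc/tms 30345:30345 &
```

If `pgrep` prints nothing and falls through to the echo, re-run the port-forward before retrying tmsctl.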
Finally, we have to set tmsctl's target, which is our local cluster running TMS.
./tmsctl target add --force --set-default tms http://127.0.0.1:30345
We are now ready to use tmsctl to interact with TMS.
We'll start by creating a basic lease for image recognition. There are many lease-creation options supported by tmsctl, some of which we will take advantage of to deploy this model:
--duration tells TMS how long to serve the model
--auto-renew tells TMS to automatically renew the lease (serve the model for longer) if it has received an inference request recently
--renewal-duration tells TMS how long to extend the lease by when automatically renewing
--auto-renew-activity-window tells TMS how recently the model needs to have received a request in order for the lease to auto-renew before expiring
Create the lease using tmsctl. This may take a few minutes, so be patient.
./tmsctl lease create -m "name=densenet_onnx,uri=model://volume-models/image-rec/densenet_onnx/" --duration 30m --auto-renew --renewal-duration 10m --auto-renew-activity-window 5m
Be sure to make note of the Triton URL output by the lease creation command. The URL will look something like triton-3eaa065c.default.svc.cluster.local.
Creating a lease will deploy a Triton pod to serve our model, which you can see deployed to our cluster:
$ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
tms-5444f8df5b-5lkws              2/2     Running   0          20m
triton-03923a23-bd6f8cc86-5kz24   2/2     Running   0          2m
triton-client                     1/1     Running   0          20m
You can see the lease we just created, as well as the associated Triton URL, by running the following command:
./tmsctl lease list
To remove a lease with tmsctl, use the ./tmsctl lease release command. For now, you can leave this lease active since we are going to send an inference request to it.
Once the lease is created, the model is being served and you should see the new Triton service on the cluster.
$ kubectl get svc
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                               AGE
kubernetes        ClusterIP   10.96.0.1       <none>        443/TCP                               168m
tms               ClusterIP   10.106.62.75    <none>        30345/TCP                             6m54s
triton-8dd12f38   ClusterIP   10.107.235.96   <none>        9345/TCP,8001/TCP,8000/TCP,8002/TCP   2m51s
Earlier we deployed a Triton SDK client that contains tools to help us interact with our models. Start a shell in the triton-client pod:
kubectl exec -it triton-client -- bash
From the Triton SDK pod, send an inference request to the model using the example image of a coffee mug (mug.jpg).
Note: Be sure to supply the Triton URL from lease creation in the inference command, and do not forget to add :8001 after the URL to specify the gRPC port.
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION -i grpc /workspace/images/mug.jpg -u <triton URL from lease deployment>:8001
Note: You can also send inference requests from outside of the cluster by port-forwarding to the Triton service, or by configuring a secure external URL for your cluster/service.
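As a sketch of the port-forwarding approach, using the Triton service name from the kubectl get svc output above (your triton-XXXXXXXX suffix will differ), you could forward Triton's ports locally and check readiness; the health endpoint is Triton's standard HTTP readiness route:

```shell
# Forward Triton's HTTP (8000) and gRPC (8001) ports to the local machine;
# the service name comes from `kubectl get svc` and will differ per lease
kubectl port-forward svc/triton-8dd12f38 8000:8000 8001:8001 &

# Triton exposes an HTTP readiness endpoint; 200 means the server is ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
```

With the forward in place, clients on your local machine can target localhost:8001 for gRPC inference instead of the in-cluster URL.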
You can see the output from the model:
Request 0, batch size 1
Image '/workspace/images/mug.jpg':
    15.349565 (504) = COFFEE MUG
    13.227464 (968) = CUP
    10.424888 (505) = COFFEEPOT
Once you are done sending inference requests, you can exit out of the Triton client pod:
exit
Once out of the triton-client pod, you can unload the model from the GPU. In TMS terms, you release the lease:
./tmsctl lease release <lease ID>
The lease ID was output by the lease creation command, and can also be retrieved with ./tmsctl lease list.
We’ve seen how to load models in/out of the GPU through the TMS lease system, which can automatically unload your models from your system, freeing up GPU space and reducing idle resource usage. Now let’s dive into more of the features and special capabilities of TMS.