What are the advantages of running a model with TensorRT Inference Server compared to running directly using the model’s framework API?
With TensorRT Inference Server, the inference result is the same as when using the model’s framework directly. The inference server adds benefits such as concurrent model execution (the ability to run multiple models at the same time on the same GPU) and dynamic batching for better throughput. You can also replace or upgrade models while the inference server and client application are running. Another benefit is that the inference server can be deployed as a Docker container anywhere, on premises or in public clouds. TensorRT Inference Server also supports several frameworks, such as TensorRT, TensorFlow, PyTorch, and ONNX, on both GPUs and CPUs, which streamlines deployment.
Can TensorRT Inference Server run on systems that don’t have GPUs?
Yes. The inference server supports running models on CPUs as well as GPUs, and its Docker container can run on CPU-only systems.
Can TensorRT Inference Server be used in non-Docker environments?
Yes. TensorRT Inference Server has a CMake build that allows the inference server to be built from source, which makes it more portable to non-Docker environments. For more details, see Building the Server with CMake. After building, you can run the inference server outside of Docker as described in Running The Inference Server Without Docker.
How would you use TensorRT Inference Server within the AWS environment?
In an AWS environment, the TensorRT Inference Server Docker container can run on CPU-only instances or GPU compute instances. The inference server can run directly on the compute instance or inside Elastic Kubernetes Service (EKS). In addition, other AWS services such as Elastic Load Balancer (ELB) can be used to balance traffic among multiple inference server instances. Elastic Block Store (EBS) or S3 can be used to store the deep-learning models loaded by the inference server.
How do I measure the performance of my model running in the TensorRT Inference Server?
A client application, perf_client, allows you to measure the performance of an individual model using a synthetic load. The perf_client application is designed to show you the tradeoff of latency vs. throughput.
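To make the latency vs. throughput tradeoff concrete, the following is a minimal sketch of how the two numbers relate under a closed-loop synthetic load, where a fixed number of requests is always outstanding. It is not perf_client's implementation; the function name `summarize` and the sample latencies are hypothetical and used only for illustration.

```python
def summarize(latencies_s, concurrency):
    """Return (average latency in seconds, throughput in inferences/second).

    Assumes a closed-loop load: `concurrency` requests are always in flight,
    so throughput is approximately concurrency / average latency.
    """
    avg_latency = sum(latencies_s) / len(latencies_s)
    throughput = concurrency / avg_latency
    return avg_latency, throughput


# Hypothetical measurements: raising concurrency raises throughput here,
# but each request also takes longer to complete.
low = summarize([0.25, 0.25], concurrency=1)
high = summarize([0.25, 0.75], concurrency=4)
print(low)   # (0.25, 4.0)
print(high)  # (0.5, 8.0)
```

In this made-up example, going from one to four outstanding requests doubles throughput while also doubling the average latency, which is the kind of tradeoff a load-measurement tool helps you evaluate.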
How can I fully utilize the GPU with TensorRT Inference Server?
TensorRT Inference Server has several features designed to increase GPU utilization:
The inference server can simultaneously perform inference for multiple models (using either the same or different frameworks) on the same GPU.
The inference server can increase inference throughput by using multiple instances of the same model to handle multiple simultaneous inference requests to that model. The inference server chooses reasonable defaults, but you can also control the exact level of concurrency on a model-by-model basis.
The inference server can batch together multiple inference requests into a single inference execution. Typically, batching inference requests leads to much higher throughput with only a relatively small increase in latency.
As a general rule, batching is the most beneficial way to increase GPU utilization, so you should always try enabling the dynamic batcher with your models. Using multiple instances of a model can also provide some benefit but is typically most useful for models that have small compute requirements. Most models will benefit from using two instances, but more than that is often not useful.
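As an illustration of the advice above, here is a minimal sketch that renders a model configuration fragment enabling the dynamic batcher and two GPU instances. It assumes the `config.pbtxt` model configuration format with `dynamic_batching` and `instance_group` sections; the helper `render_model_config`, the model name, and the specific batch sizes and queue delay are hypothetical values chosen for the example, so check them against the model configuration documentation for your server version.

```python
def render_model_config(name, platform, max_batch_size,
                        preferred_batch_sizes, max_queue_delay_us,
                        instance_count):
    """Render a config.pbtxt-style fragment as a string (illustrative only)."""
    sizes = ", ".join(str(size) for size in preferred_batch_sizes)
    return "\n".join([
        f'name: "{name}"',
        f'platform: "{platform}"',
        f"max_batch_size: {max_batch_size}",
        # Dynamic batcher: combine individual requests into larger batches.
        "dynamic_batching {",
        f"  preferred_batch_size: [ {sizes} ]",
        f"  max_queue_delay_microseconds: {max_queue_delay_us}",
        "}",
        # Instance group: number of copies of the model that run concurrently.
        "instance_group [",
        "  {",
        f"    count: {instance_count}",
        "    kind: KIND_GPU",
        "  }",
        "]",
    ])


config_text = render_model_config(
    name="my_model",            # hypothetical model name
    platform="tensorrt_plan",
    max_batch_size=8,           # assumed value for illustration
    preferred_batch_sizes=[4, 8],
    max_queue_delay_us=100,     # assumed value for illustration
    instance_count=2,           # two instances, per the guidance above
)
print(config_text)
```

The printed text would be saved as the model's configuration file in the model repository. Good values for batch size and queue delay depend on the model, so they are best chosen by measuring rather than copied from this sketch.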
If I have a server with multiple GPUs should I use one TensorRT Inference Server to manage all GPUs or should I use multiple inference servers, one for each GPU?
TensorRT Inference Server will take advantage of all GPUs on the server that it has access to. You can limit the GPUs available to the inference server by using the CUDA_VISIBLE_DEVICES environment variable (or with Docker you can also use NVIDIA_VISIBLE_DEVICES when launching the container). When using multiple GPUs, the inference server will distribute inference requests across the GPUs to keep them all equally utilized. You can also control more explicitly which models are running on which GPUs.
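The environment-variable approach can be sketched as follows. This is a minimal example that only builds the environment for a server process; the helper name `server_env` is hypothetical, and the actual launch command is left to you because it depends on how the server was built and installed.

```python
import os


def server_env(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in gpu_ids)
    return env


# Expose only GPUs 0 and 1 to the process that will be launched.
env = server_env([0, 1], base_env={"PATH": "/usr/bin"})
print(env["CUDA_VISIBLE_DEVICES"])  # 0,1

# To launch the server with this environment, pass it to subprocess, e.g.
# subprocess.Popen(server_command, env=env), where server_command is your
# own launch command (not shown here).
```

Building the environment as a copy keeps the restriction local to the launched process rather than changing the GPUs visible to the calling script.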
In some deployment and orchestration environments (for example, Kubernetes) it may be more desirable to partition a single multi-GPU server into multiple nodes, each with one GPU. In this case, the orchestration environment will run a different inference server for each GPU, and a load balancer will be used to divide inference requests across the available inference server instances.