Benchmarking Triton Inference Server
Triton Inference Server is the default way to deploy deep learning models on any platform, whether CPU or GPU. Deploying Triton on a GPU provides lower inference latency and more parallel request processing than deploying it on a CPU. Before we get to the results, there are a few benchmarking parameters to understand, each of which can be tuned for optimal model performance.
We define:
Dynamic batching to be the server-side batching of incoming queries into larger batches; it significantly improves both latency and throughput.
Concurrency to be the number of simultaneous requests to the server.
Latency to be the average time required by the server for processing a query.
Throughput to be the average number of queries processed by the server per second.
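To make these definitions concrete, below is a minimal Python sketch that measures average latency and throughput at a fixed concurrency using the tritonclient HTTP client. The model name, input names, shapes, and data types are placeholder assumptions for a BERT Large model and must be adjusted to match your deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

# Placeholder settings -- the model name, input names, shapes, and dtypes
# below are assumptions and must match your model's config.pbtxt.
URL = "localhost:8000"
MODEL_NAME = "bert_large"
CONCURRENCY = 16      # simultaneous requests kept in flight
NUM_REQUESTS = 256    # total queries to send
BATCH_SIZE = 8
SEQ_LEN = 384


def make_inputs():
    # Dummy input tensors; a real benchmark should use representative data.
    data = np.zeros((BATCH_SIZE, SEQ_LEN), dtype=np.int32)
    inputs = []
    for name in ("input_ids", "segment_ids", "input_mask"):
        inp = httpclient.InferInput(name, list(data.shape), "INT32")
        inp.set_data_from_numpy(data)
        inputs.append(inp)
    return inputs


def worker(num_queries):
    # One client (and connection) per worker thread.
    client = httpclient.InferenceServerClient(url=URL)
    inputs = make_inputs()
    latencies = []
    for _ in range(num_queries):
        start = time.perf_counter()
        client.infer(MODEL_NAME, inputs)
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    per_worker = NUM_REQUESTS // CONCURRENCY
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(worker, [per_worker] * CONCURRENCY))
    wall_time = time.perf_counter() - start

    latencies = [l for r in results for l in r]
    total = len(latencies)
    print(f"concurrency {CONCURRENCY}, batch size {BATCH_SIZE}")
    print(f"average latency: {1000 * sum(latencies) / total:.1f} ms")
    print(f"throughput:      {total / wall_time:.1f} queries/sec")
```

In practice, Triton's perf_analyzer tool reports the same metrics (latency and throughput swept over a range of concurrency values) without any custom client code; the sketch above only illustrates what those numbers mean.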
These parameters can be set when launching Triton Inference Server. Follow the README steps to set them for your model.
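As a quick sanity check after launch, the model configuration the server actually loaded (including the maximum batch size and any dynamic batching settings) can be read back over the HTTP API. A minimal sketch, assuming the server is reachable on localhost:8000 and the model is named bert_large (both placeholders):

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Confirm the server and model are up before benchmarking.
assert client.is_server_ready()
assert client.is_model_ready("bert_large")   # model name is a placeholder

# The HTTP client returns the model configuration as a dict.
config = client.get_model_config("bert_large")
print("max_batch_size: ", config.get("max_batch_size"))
print("dynamic_batching:", config.get("dynamic_batching", "not enabled"))
```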
The graphs below show the throughput and latency of Triton Inference Server on a T4 GPU compared to a CPU. The batch size and concurrency are tuned to achieve maximum inference performance for BERT Large.
The Total Cost of Ownership (TCO) can be calculated from the cost per hour of inference on Google Cloud instances for the CPU and GPU using the VM configuration above. The GPU-backed VM has a 37x better price-to-performance ratio for the first configuration (batch size 2, concurrency 32) and 37x for the second configuration (batch size 8, concurrency 16).
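For reference, the price-to-performance comparison reduces to a ratio of throughput per dollar. A small sketch with placeholder inputs (the throughput values come from the benchmark runs above and the hourly prices from the cloud provider's pricing page; none are hard-coded here):

```python
def price_performance(throughput_qps: float, cost_per_hour: float) -> float:
    """Queries processed per dollar spent (higher is better)."""
    return throughput_qps * 3600.0 / cost_per_hour


def relative_advantage(gpu_qps, gpu_cost_per_hour, cpu_qps, cpu_cost_per_hour):
    # How many times better the GPU-backed VM is than the CPU-only VM
    # in price-to-performance terms, given measured throughput and VM prices.
    return price_performance(gpu_qps, gpu_cost_per_hour) / price_performance(
        cpu_qps, cpu_cost_per_hour
    )
```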