The following figure shows the Triton Inference Server high-level architecture. The model repository is a file-system-based repository of the models that the inference server will make available for inferencing. Inference requests arrive at the server via HTTP, gRPC, or the C API and are then routed to the appropriate per-model scheduler. The inference server implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s configured scheduler optionally batches inference requests and then passes them to the framework backend corresponding to the model type. The framework backend performs inferencing using the inputs provided in the request to produce the requested outputs. The outputs are then formatted and a response is sent.
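The request flow described above can be illustrated with a small Python sketch. This is a toy model of the idea, not Triton code: the ModelScheduler class, the route function, the request dictionaries, and the lambda "backends" are all hypothetical names introduced for illustration only.

```python
class ModelScheduler:
    """Hypothetical per-model scheduler with optional batching."""

    def __init__(self, backend, max_batch_size=1):
        self.backend = backend                # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size  # 1 means no batching
        self.pending = []

    def enqueue(self, request):
        self.pending.append(request)

    def run_once(self):
        """Take up to max_batch_size queued requests, run them, return responses."""
        batch = self.pending[:self.max_batch_size]
        self.pending = self.pending[self.max_batch_size:]
        if not batch:
            return []
        # The backend sees the whole batch at once and returns one output per input.
        outputs = self.backend([r["inputs"] for r in batch])
        return [
            {"id": r["id"], "model": r["model"], "outputs": out}
            for r, out in zip(batch, outputs)
        ]


def route(schedulers, request):
    """Send a request to the scheduler of the model it names."""
    schedulers[request["model"]].enqueue(request)


# Two toy backends standing in for real framework backends.
schedulers = {
    "model0": ModelScheduler(lambda batch: [x * 2 for x in batch]),
    "model1": ModelScheduler(lambda batch: [x + 1 for x in batch], max_batch_size=2),
}

for req in [
    {"id": 1, "model": "model0", "inputs": 3},
    {"id": 2, "model": "model1", "inputs": 10},
    {"id": 3, "model": "model1", "inputs": 20},
]:
    route(schedulers, req)

responses = schedulers["model0"].run_once() + schedulers["model1"].run_once()
```

In this sketch the two model1 requests are handed to the backend as a single batch, while model0 runs its request alone, mirroring the per-model scheduling and batching configuration.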
Concurrent Model Execution
The Triton Inference Server architecture allows multiple models and/or multiple instances of the same model to execute in parallel on a single GPU. The following figure shows an example with two models: model0 and model1. Assuming the server is not currently processing any request, when two requests arrive simultaneously, one for each model, the server immediately schedules both of them onto the GPU and the GPU’s hardware scheduler begins working on both computations in parallel.
By default, if multiple requests for the same model arrive at the same time, the inference server will serialize their execution by scheduling only one at a time on the GPU, as shown in the following figure.
The Triton Inference Server provides an instance-group feature that allows each model to specify how many parallel executions of that model should be allowed. Each such enabled parallel execution is referred to as an execution instance. By default, the server gives each model a single execution instance, which means that only a single execution of the model is allowed to be in progress at a time, as shown in the above figure. By using instance-group, the number of execution instances for a model can be increased. The following figure shows model execution when model1 is configured to allow three execution instances. As shown in the figure, the first three model1 inference requests are immediately executed in parallel on the GPU. The fourth model1 inference request must wait until one of the first three executions completes before beginning.
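The effect of the execution instance count on request timing can be worked through with a small simulation. The following Python sketch is not part of Triton; the simulate function and the assumption that every request takes 10 time units are hypothetical and chosen only to make the arithmetic easy to follow. It also assumes requests are served in arrival order.

```python
import heapq


def simulate(arrivals, durations, instance_count):
    """Return a (start, finish) pair for each request, in arrival order.

    arrivals: request arrival times, sorted ascending
    durations: execution time of each request
    instance_count: number of execution instances for the model
    """
    # Min-heap holding the time at which each instance next becomes free.
    free_at = [0.0] * instance_count
    heapq.heapify(free_at)

    schedule = []
    for arrival, duration in zip(arrivals, durations):
        earliest_free = heapq.heappop(free_at)
        # A request starts once it has arrived and an instance is available.
        start = max(arrival, earliest_free)
        finish = start + duration
        heapq.heappush(free_at, finish)
        schedule.append((start, finish))
    return schedule


# Four requests for model1 arrive at the same time (illustrative values).
arrivals = [0.0, 0.0, 0.0, 0.0]
durations = [10.0, 10.0, 10.0, 10.0]

default_schedule = simulate(arrivals, durations, instance_count=1)
three_instances = simulate(arrivals, durations, instance_count=3)
```

With a single instance the four requests run back to back and the last one finishes at time 40. With three instances the first three start at time 0 and the fourth starts at time 10, when the first instance becomes free.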
To provide the concurrent model execution capabilities shown in the above figures, the inference server uses CUDA streams to exploit the GPU’s hardware scheduling capabilities. CUDA streams allow the server to communicate independent sequences of memory copies and kernel executions to the GPU. The hardware scheduler in the GPU takes advantage of the independent execution streams to fill the GPU with independent memory copies and kernel executions. For example, using streams allows the GPU to execute a memory copy for one model, a kernel for another model, and a different kernel for yet another model at the same time.
The following figure shows some details of how this works within the Triton Inference Server. Each framework backend (TensorRT, TensorFlow, PyTorch, ONNX, etc.) provides an API for creating an execution context that is used to execute a given model (each framework uses different terminology for this concept but here we refer to them generally as execution contexts). Each framework allows an execution context to be associated with a CUDA stream. This CUDA stream is used by the framework to execute all memory copies and kernels needed for the model associated with the execution context.
For a given model, the inference server creates one execution context for each execution instance specified for the model. When an inference request arrives for a given model, that request is queued in the model scheduler associated with that model. The model scheduler waits for any execution context associated with that model to be idle and then sends the queued request to the context. The execution context then issues all the memory copies and kernel executions required to execute the model to the CUDA stream associated with that execution context. The memory copies and kernels in each CUDA stream are independent of memory copies and kernels in other CUDA streams. The GPU hardware scheduler looks across all CUDA streams to find independent memory copies and kernels to execute on the GPU.
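The relationship between a model scheduler and its execution contexts can be sketched in Python with threads and a queue. This is an analogy under stated assumptions, not the server's implementation: ExecutionContext and ModelScheduler are hypothetical classes, each thread stands in for one execution context and its CUDA stream, and idle contexts pull from a shared queue, which approximates the scheduler handing a queued request to whichever context is idle.

```python
import queue
import threading


class ExecutionContext(threading.Thread):
    """Hypothetical stand-in for a framework execution context and its stream."""

    def __init__(self, name, requests, results):
        super().__init__(name=name, daemon=True)
        self.requests = requests
        self.results = results

    def run(self):
        while True:
            request = self.requests.get()  # blocks while this context is idle
            if request is None:            # sentinel: shut this context down
                break
            # A real backend would issue memory copies and kernels to the
            # CUDA stream associated with this context at this point.
            self.results.put((request, self.name))


class ModelScheduler:
    """Hypothetical per-model scheduler with one context per execution instance."""

    def __init__(self, model_name, instance_count):
        self.requests = queue.Queue()
        self.results = queue.Queue()
        self.contexts = [
            ExecutionContext(f"{model_name}_ctx{i}", self.requests, self.results)
            for i in range(instance_count)
        ]
        for context in self.contexts:
            context.start()

    def submit(self, request):
        self.requests.put(request)

    def shutdown(self):
        # One sentinel per context; the queue is FIFO, so every request
        # submitted earlier is handled before any context exits.
        for _ in self.contexts:
            self.requests.put(None)
        for context in self.contexts:
            context.join()


scheduler = ModelScheduler("model1", instance_count=3)
for i in range(4):
    scheduler.submit(f"request{i}")
scheduler.shutdown()

completed = []
while not scheduler.results.empty():
    completed.append(scheduler.results.get())
```

Each entry in completed pairs a request with the name of the context that handled it. Which context picks up which request depends on thread timing, so the pairing can differ from run to run, but with three contexts at most three requests are in progress at once.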