Deploying your trained model using Triton#

Given a trained model, how do I deploy it at scale with an optimal configuration using Triton Inference Server? This document is here to help answer that question.

For those who like a high-level overview, below is the common flow for most use cases.

For those who wish to jump right in, skip to the end-to-end example.

For additional material, see the Triton Conceptual Guide tutorial.

Overview#

  1. Is my model compatible with Triton?

    • If your model falls under one of Triton’s supported backends, then we can simply try to deploy the model as described in the Quickstart guide. For the ONNXRuntime, TensorFlow SavedModel, and TensorRT backends, the minimal model configuration can be inferred from the model using Triton’s AutoComplete feature. This means that a config.pbtxt may still be provided, but is not required unless you want to explicitly set certain parameters. Additionally, by enabling verbose logging via --log-verbose=1, you can see the complete config that Triton sees internally in the server log output. For other backends, refer to the Minimal Model Configuration required to get started.

    • If your model does not come from a supported backend, you can look into the Python Backend or writing a Custom C++ Backend to support your model. The Python Backend provides a simple interface to execute requests through a generic python script, but may not be as performant as a Custom C++ Backend. Depending on your use case, the Python Backend performance may be a sufficient tradeoff for the simplicity of implementation.

  2. Can I run inference on my served model?

    • Assuming you were able to load your model on Triton, the next step is to verify that we can run inference requests and get a baseline performance benchmark of your model. Triton’s Perf Analyzer tool specifically fits this purpose. Here is a simplified output for demonstration purposes:

    # NOTE: "my_model" represents a model currently being served by Triton
    $ perf_analyzer -m my_model
    ...
    
    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 482.8 infer/sec, latency 12613 usec
    
    • This serves as a sanity test that we can successfully form input requests and receive output responses, confirming that we can communicate with the model backend via Triton APIs.

    • If Perf Analyzer fails to send requests and it is unclear from the error how to proceed, then you may want to sanity check that your model config.pbtxt inputs/outputs match what the model expects. If the config is correct, check that the model runs successfully using its original framework directly. If you don’t have your own script or tool to do so, Polygraphy is a useful tool to run sample inferences on your model via various frameworks. Currently, Polygraphy supports ONNXRuntime, TensorRT, and TensorFlow 1.x.

    • The definition of “performing well” varies for each use case. Some common metrics are throughput, latency, and GPU utilization. There are many variables that can be tweaked just within your model configuration (config.pbtxt) to obtain different results.

    • As your model, config, or use case evolves, Perf Analyzer is a great tool to quickly verify model functionality and performance.

  3. How can I improve my model performance?

    • To identify the best model configuration to provide to Triton for your use case, Triton’s Model Analyzer tool can help. Model Analyzer can automatically or manually search through config combinations to find the optimal Triton configuration that meets your constraints. After running Model Analyzer to find the optimal configurations for your model/use case, you can transfer the generated config files to your Model Repository. Model Analyzer provides a Quickstart guide with some examples to walk through.

    • Upon serving the model with the newly optimized configuration file found by Model Analyzer and running Perf Analyzer again, you should expect to find better performance numbers in most cases compared to a default config.

    • Some parameters that can be tuned for a model may not be exposed to Model Analyzer’s automatic search since they don’t apply to all models. For instance, backends can expose backend-specific configuration options that can be tuned as well. The ONNXRuntime Backend, for example, has several parameters that affect the level of parallelization when executing inference on a model. These backend-specific options may be worth investigating if the defaults are not providing sufficient performance. To tune custom sets of parameters, Model Analyzer supports Manual Configuration Search. A brief config sketch illustrating some of these options follows this list.

    • To learn more about further optimizations for your model configuration, see the Optimization docs.
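
Items 2 and 3 above mention tuning knobs that live in the model’s config.pbtxt. The fragment below is a hedged sketch, not a recommendation for any particular model: the instance_group and dynamic_batching stanzas are standard Model Configuration options, while the parameters entries are ONNXRuntime-backend-specific options whose exact keys should be checked against the ONNXRuntime Backend docs for your Triton version.

# Illustrative config.pbtxt fragment; values are placeholders to tune, not defaults
instance_group [
  { count: 2, kind: KIND_GPU }        # run two instances of the model per GPU
]
dynamic_batching {                    # requires a non-zero max_batch_size
  max_queue_delay_microseconds: 100   # wait briefly so larger batches can form
}
# ONNXRuntime-backend-specific options (backend: "onnxruntime")
parameters { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }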

Other Areas of Interest#

  1. My model performs slowly when it is first loaded by Triton (cold-start penalty), what do I do?

    • Triton exposes the ability to run ModelWarmup requests when first loading the model to ensure that the model is sufficiently warmed up before being marked “READY” for inference (see the sketch after this list).

  2. Why doesn’t my model perform significantly faster on GPU?

    • Most official backends supported by Triton are optimized for GPU inference and should perform well on GPU out of the box.

    • Triton exposes options for you to optimize your model further on the GPU. Triton’s Framework Specific Optimizations goes into further detail on this topic.

    • Complete conversion of your model to a backend fully optimized for GPU inference such as TensorRT may provide even better results. You may find more Triton-specific details about TensorRT in the TensorRT Backend.

    • If none of the above can help get sufficient GPU-accelerated performance for your model, the model may simply be better designed for CPU execution and the OpenVINO Backend may help further optimize your CPU execution.
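
For the cold-start question above, warmup requests are declared directly in the model’s config.pbtxt. The stanza below is a hedged sketch for a model with a single FP32 image input; the input name, dims, and warmup label are placeholders, and the full schema is described in the ModelWarmup section of the Model Configuration docs.

# Illustrative model_warmup stanza for config.pbtxt
model_warmup [
  {
    name: "zero_data_warmup"      # arbitrary label for this warmup sample
    batch_size: 1
    inputs {
      key: "data_0"               # must match an input name in this config
      value: {
        data_type: TYPE_FP32
        dims: [ 1, 3, 224, 224 ]
        zero_data: true           # send all-zero data; random_data is another option
      }
    }
  }
]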

End-to-end Example#

Note If you have never worked with Triton before, you may be interested in first checking out the Quickstart example. Some basic understanding of Triton may be useful for the following section, but this example is meant to be straightforward enough without prior experience.

Let’s take an ONNX model as our example since ONNX is designed to be a format that can be easily exported from most other frameworks.

  1. Create a Model Repository and download our example densenet_onnx model into it.

# Create model repository with placeholder for model and version 1
mkdir -p ./models/densenet_onnx/1

# Download model and place it in model repository
wget -O models/densenet_onnx/1/model.onnx \
  https://contentmamluswest001.blob.core.windows.net/content/14b2744cf8d6418c87ffddc3f3127242/9502630827244d60a1214f250e3bbca7/08aed7327d694b8dbaee2c97b8d0fcba/densenet121-1.2.onnx

  2. Create a minimal Model Configuration for the densenet_onnx model in our Model Repository at ./models/densenet_onnx/config.pbtxt.

Note This is a slightly simplified version of another example config that utilizes other Model Configuration features not necessary for this example.

name: "densenet_onnx"
backend: "onnxruntime"
max_batch_size: 0
input: [
  {
    name: "data_0",
    data_type: TYPE_FP32,
    dims: [ 1, 3, 224, 224]
  }
]
output: [
  {
    name: "prob_1",
    data_type: TYPE_FP32,
    dims: [ 1, 1000, 1, 1 ]
  }
]

Note As of the 22.07 release, both Triton and Model Analyzer support fully auto-completing the config file for backends that support it. So for an ONNX model, for example, this step can be skipped unless you want to explicitly set certain parameters.

  3. Start the server container

To serve our model, we will use the server container, which comes with the tritonserver binary pre-installed.

# Start server container
docker run -ti --rm --gpus=all --network=host -v $PWD:/mnt --name triton-server nvcr.io/nvidia/tritonserver:24.02-py3

# Start serving your models
tritonserver --model-repository=/mnt/models

Note The -v $PWD:/mnt flag mounts your current directory on the host into the /mnt directory inside the container, so if you created your model repository in $PWD/models, you will find it inside the container at /mnt/models. You can change these paths as needed. See the docker volume docs for more information on how this works.

To check if the model loaded successfully, we expect to see our model in a READY state in the output of the previous command:

...
I0802 18:11:47.100537 135 model_repository_manager.cc:1345] successfully loaded 'densenet_onnx' version 1
...
+---------------+---------+--------+
| Model         | Version | Status |
+---------------+---------+--------+
| densenet_onnx | 1       | READY  |
+---------------+---------+--------+
...
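
To double-check the configuration Triton actually settled on, including any fields filled in by auto-complete as mentioned in the note under step 2, you can query the model configuration endpoint from another shell on the host (the port is shared because the container uses --network=host). This assumes Triton’s default HTTP port 8000:

# Inspect the (possibly auto-completed) model configuration from the running server
curl localhost:8000/v2/models/densenet_onnx/config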

  4. Verify the model can run inference

To verify that our model can perform inference, we will use the SDK (client) container, which comes with perf_analyzer pre-installed.

In a separate shell, we use Perf Analyzer to sanity check that we can run inference and get a baseline for the kind of performance we expect from this model.

In the example below, Perf Analyzer sends requests to a model served on the same machine (reachable at localhost, since both the server and client containers use --network=host). However, you may also test models being served remotely at some <IP>:<PORT> by setting the -u flag, such as perf_analyzer -m densenet_onnx -u 127.0.0.1:8000.

# Start the SDK container interactively
docker run -ti --rm --gpus=all --network=host -v $PWD:/mnt --name triton-client nvcr.io/nvidia/tritonserver:24.02-py3-sdk

# Benchmark model being served from step 3
perf_analyzer -m densenet_onnx --concurrency-range 1:4
...
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 265.147 infer/sec, latency 3769 usec
Concurrency: 2, throughput: 890.793 infer/sec, latency 2243 usec
Concurrency: 3, throughput: 937.036 infer/sec, latency 3199 usec
Concurrency: 4, throughput: 965.21 infer/sec, latency 4142 usec

  5. Run Model Analyzer to find the best configurations for our model

While Model Analyzer comes pre-installed in the SDK (client) container and supports various modes of connecting to a Triton server, for simplicity we will install Model Analyzer in our server container and use the local (default) mode. To learn more about other methods of connecting Model Analyzer to a running Triton Server, see the --triton-launch-mode Model Analyzer flag.

# Enter server container interactively
docker exec -ti triton-server bash

# Stop existing tritonserver process if still running
# because model-analyzer will start its own server
SERVER_PID=`ps | grep tritonserver | awk '{ printf $1 }'`
kill ${SERVER_PID}

# Install model analyzer
pip install --upgrade pip
pip install triton-model-analyzer wkhtmltopdf

# Profile the model using local (default) mode
# NOTE: This may take some time, in this example it took ~10 minutes
model-analyzer profile \
  --model-repository=/mnt/models \
  --profile-models=densenet_onnx \
  --output-model-repository-path=results

# Summarize the profiling results
model-analyzer analyze --analysis-models=densenet_onnx

Example Model Analyzer output summary:

In 51 measurements across 6 configurations, densenet_onnx_config_3 provides the best throughput: 323 infer/sec.

This is a 92% gain over the default configuration (168 infer/sec), under the given constraints.

| Model Config Name | Max Batch Size | Dynamic Batching | Instance Count | p99 Latency (ms) | Throughput (infer/sec) | Max GPU Memory Usage (MB) | Average GPU Utilization (%) |
|---|---|---|---|---|---|---|---|
| densenet_onnx_config_3 | 0 | Enabled | 4/GPU | 35.8 | 323.13 | 3695 | 58.6 |
| densenet_onnx_config_2 | 0 | Enabled | 3/GPU | 59.575 | 295.82 | 3615 | 58.9 |
| densenet_onnx_config_4 | 0 | Enabled | 5/GPU | 69.939 | 291.468 | 3966 | 58.2 |
| densenet_onnx_config_default | 0 | Disabled | 1/GPU | 12.658 | 167.549 | 3116 | 51.3 |

In the table above, we see that setting our GPU Instance Count to 4 achieves the highest throughput and nearly the lowest latency on this system.

Also, note that this densenet_onnx model has a fixed batch size that is explicitly specified in the first dimension of its input/output dims, so the max_batch_size parameter is set to 0 as described here. For models that support a dynamic batch size, Model Analyzer would also tune the max_batch_size parameter.
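
For comparison, a hypothetical model whose batch dimension is variable would omit the batch dimension from dims and set max_batch_size to a positive value, for example:

# Hypothetical config fragment for a model with a variable batch dimension
max_batch_size: 8
input: [
  { name: "data_0", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }    # batch dim omitted
]
output: [
  { name: "prob_1", data_type: TYPE_FP32, dims: [ 1000, 1, 1 ] }
]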

Warning These results are specific to the system running the Triton server, so for example, on a smaller GPU we may not see improvement from increasing the GPU instance count. In general, running the same configuration on systems with different hardware (CPU, GPU, RAM, etc.) may provide different results, so it is important to profile your model on a system that accurately reflects where you will deploy your models for your use case.

  6. Extract optimal config from Model Analyzer results

In our example above, densenet_onnx_config_3 was the optimal configuration. So let’s extract that config.pbtxt and put it back in our model repository for future use.

# (optional) Backup our original config.pbtxt (if any) to another directory
cp /mnt/models/densenet_onnx/config.pbtxt /tmp/original_config.pbtxt

# Copy over the optimal config.pbtxt from Model Analyzer results to our model repository
cp ./results/densenet_onnx_config_3/config.pbtxt /mnt/models/densenet_onnx/

Now that we have an optimized Model Configuration, we are ready to take our model to deployment. For further manual tuning, read the Model Configuration and Optimization docs to learn more about Triton’s complete set of capabilities.
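
For orientation only: the exact config.pbtxt that Model Analyzer generated for densenet_onnx_config_3 may differ, but based on the summary table above the main change from the default configuration is the instance count, which is expressed in config.pbtxt as:

# Run four instances of the model on each available GPU
instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]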

In this example, we happened to get both the highest throughput and nearly the lowest latency from the same configuration, but in some cases this is a tradeoff that must be made: certain models or configurations may achieve higher throughput at the cost of higher latency. It is worthwhile to fully inspect the reports generated by Model Analyzer to ensure your model performance meets your requirements.