Deploying Hugging Face Transformer Models in Triton#

The following tutorial demonstrates how to deploy an arbitrary Hugging Face transformer model on the Triton Inference Server using Triton’s Python backend. For the purposes of this example, the following transformer models will be deployed:

  • falcon-7b (tiiuae/falcon-7b)

  • persimmon-8b

These models were selected because of their popularity and consistent response quality. However, this tutorial is also generalizable to any transformer model, provided sufficient infrastructure.

NOTE: The tutorial is intended to be a reference example only. It may not be tuned for optimal performance.

NOTE: Llama 2 models are not specifically mentioned in the steps below, but can be run if tiiuae/falcon-7b is replaced with meta-llama/Llama-2-7b-hf and the falcon7b folder is replaced by the llama7b folder.

Step 1: Create a Model Repository#

The first step is to create a model repository containing the models we want the Triton Inference Server to load and use for inference processing. To accomplish this, create a directory called model_repository and copy the falcon7b model folder into it:

mkdir -p model_repository
cp -r falcon7b/ model_repository/
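
After the copy, the model repository should follow the layout Triton expects, with config.pbtxt at the model’s root and model.py inside a numbered version directory (the provided falcon7b folder is assumed here to use a single version directory named 1):

model_repository/
└── falcon7b/
    ├── config.pbtxt
    └── 1/
        └── model.py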

The falcon7b/ folder we copied is organized in the way Triton expects and contains two important files needed to serve models in Triton:

  • config.pbtxt - Outlines the backend to use, model input/output details, and custom parameters to use for execution. More information on the full range of model configuration properties Triton supports can be found here.

  • model.py - Implements how Triton should handle the model during the initialization, execution, and finalization stages. A minimal sketch of this interface is shown after this list. More information regarding Python backend usage can be found here.
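
The sketch below illustrates the general structure of a Python backend model.py. It is a minimal, hypothetical skeleton of the interface Triton expects, not the provided implementation; the actual falcon7b and persimmon8b files load and run transformer models inside these methods.

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        # Called once when the model is loaded; load the tokenizer/model here.
        pass

    def execute(self, requests):
        # Called for every batch of inference requests.
        responses = []
        for request in requests:
            text = pb_utils.get_input_tensor_by_name(request, "text_input").as_numpy()
            # Placeholder: echo the input back; the real model.py generates text here.
            output = pb_utils.Tensor("text", np.array(text, dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses

    def finalize(self):
        # Called once when the model is unloaded; release resources here.
        pass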

Step 2: Build a Triton Container Image#

The second step is to create an image that includes all the dependencies necessary to deploy Hugging Face transformer models on the Triton Inference Server. This can be done by building an image from the provided Dockerfile:

docker build -t triton_transformer_server .

Step 3: Launch the Triton Inference Server#

Once the triton_transformer_server image is created, you can launch the Triton Inference Server in a container with the following command:

docker run --gpus all -it --rm --net=host --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}/model_repository:/opt/tritonserver/model_repository triton_transformer_server tritonserver --model-repository=model_repository

Note: For private models like Llama 2, you need to request access to the model and pass your Hugging Face access token to the docker command with -e PRIVATE_REPO_TOKEN=<hf_your_huggingface_access_token>:

docker run --gpus all -it --rm --net=host --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -e PRIVATE_REPO_TOKEN=<hf_your_huggingface_access_token> -v ${PWD}/model_repository:/opt/tritonserver/model_repository triton_transformer_server tritonserver --model-repository=model_repository

The server has launched successfully when you see the following outputs in your console:

I0922 23:28:40.351809 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0922 23:28:40.352017 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Step 4: Query the Server#

Now we can query the server using curl, specifying the server address and input details:

curl -X POST localhost:8000/v2/models/falcon7b/infer -d '{"inputs": [{"name":"text_input","datatype":"BYTES","shape":[1],"data":["I am going"]}]}'

In our testing, the server returned the following result (formatted for legibility):

{
  "model_name": "falcon7b",
  "model_version": "1",
  "outputs": [
    {
      "name": "text",
      "datatype": "BYTES",
      "shape": [
        1
      ],
      "data": [
        "I am going to be in the market for a new laptop soon. I"
      ]
    }
  ]
}
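
The same request can also be issued programmatically. Below is a minimal sketch using Triton’s Python HTTP client (installable with pip install tritonclient[http]); the model name, input name, and output name match those used in the curl example above:

import numpy as np
import tritonclient.http as httpclient

# Connect to the locally running Triton server (default HTTP port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the BYTES input tensor expected by the falcon7b model.
text_input = httpclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(np.array(["I am going"], dtype=object))

# Run inference and print the generated text.
result = client.infer(model_name="falcon7b", inputs=[text_input])
print(result.as_numpy("text"))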

Step 5: Host Multiple Models in Triton#

So far in this tutorial, we have only loaded a single model. However, Triton is capable of hosting many models simultaneously. To accomplish this, first ensure you have exited the docker container by invoking Ctrl+C and waiting for the container to exit.

Next, copy the remaining model folder provided into the model repository:

cp -r persimmon8b/ model_repository/

NOTE: The combined size of these two models is large. If your current hardware cannot support hosting both models simultaneously, consider loading a smaller model, such as opt-125m, by creating a folder for it using the templates provided and copying it into model_repository.

Again, launch the server by invoking the docker run command from above and wait for confirmation that the server has launched successfully.

Query the server, making sure to change the model name in the request URL for each model:

curl -X POST localhost:8000/v2/models/falcon7b/infer -d '{"inputs": [{"name":"text_input","datatype":"BYTES","shape":[1],"data":["How can you be"]}]}'
curl -X POST localhost:8000/v2/models/persimmon8b/infer -d '{"inputs": [{"name":"text_input","datatype":"BYTES","shape":[1],"data":["Where is the nearest"]}]}'

In our testing, these queries returned the following parsed results:

# falcon7b
"How can you be sure that you are getting the best deal on your car"

# persimmon8b
"Where is the nearest starbucks?"

Beginning in the 23.10 release, users can interact with large language models (LLMs) hosted by Triton in a simplified fashion by using Triton’s generate endpoint:

curl -X POST localhost:8000/v2/models/falcon7b/generate -d '{"text_input":"How can you be"}'
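
Since the generate endpoint accepts and returns plain JSON, it can also be called from any HTTP client. For example, a minimal sketch using Python’s requests library (assumed to be installed) would look like:

import requests

# Call Triton's generate endpoint for the falcon7b model (default HTTP port 8000).
response = requests.post(
    "http://localhost:8000/v2/models/falcon7b/generate",
    json={"text_input": "How can you be"},
)
print(response.json())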

‘Day Zero’ Support#

The latest transformer models may not always be supported in the most recent official release of the transformers package. In such a case, you should still be able to load these ‘bleeding edge’ models in Triton by building transformers from source. This can be done by replacing the transformers install directive in the provided Dockerfile with:

RUN pip install git+https://github.com/huggingface/transformers.git

Using this technique, you should be able to serve any transformer model supported by Hugging Face with Triton.

Next Steps#

The following sections expand on the base tutorial and provide guidance for future sandboxing.

Loading Cached Models#

In the previous steps, we downloaded the falcon-7b model from Hugging Face when we launched the Triton server. We can avoid this lengthy download process in subsequent runs by loading cached models into Triton. By default, the provided model.py files will cache the falcon and persimmon models in their respective directories within the model_repository folder. This is accomplished by setting the TRANSFORMERS_CACHE environment variable. To set this environment variable for an arbitrary model, include the following lines in your model.py before importing the transformers module, making sure to replace {MODEL} with your target model:

import os
os.environ['TRANSFORMERS_CACHE'] = '/opt/tritonserver/model_repository/{MODEL}/hf_cache'

Alternatively, if your system has already cached a Hugging Face model you wish to deploy in Triton, you can mount it into the Triton container by adding one of the following mount options to the docker run command from earlier (making sure to replace ${HOME} with the path to your home directory):

# Option to mount a specific cached model (falcon-7b in this case)
-v ${HOME}/.cache/huggingface/hub/models--tiiuae--falcon-7b:/root/.cache/huggingface/hub/models--tiiuae--falcon-7b

# Option to mount all cached models on the host system
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface

Triton Tool Ecosystem#

Deploying models in Triton also comes with the benefit of access to a fully-supported suite of deployment analyzers to help you better understand and tailor your systems to fit your needs. Triton currently has two options for deployment analysis:

Performance Analyzer#

To use the performance analyzer, please remove the persimmon8b model from model_repository and restart the Triton server using the docker run command from above.

Once Triton launches successfully, start a Triton SDK container by running the following in a separate window:

docker run -it --net=host nvcr.io/nvidia/tritonserver:23.10-py3-sdk bash

This container comes with all of Triton’s deployment analyzers pre-installed, meaning we can simply enter the following to get feedback on our model’s inference performance:

perf_analyzer -m falcon7b --collect-metrics

This command should run quickly and profile the performance of our falcon7b model. As the analyzer runs, it will output useful metrics such as latency percentiles, latency by stage of inference, and successful request count. A subset of the output data is shown below:

#Avg request latency
46307 usec (overhead 25 usec + queue 25 usec + compute input 26 usec + compute infer 46161 usec + compute output 68 usec)

#Avg GPU Utilization
GPU-57c7b00e-ca04-3876-91e2-c1eae40a0733 : 66.0556%

#Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 21.3841 infer/sec, latency 46783 usec

These metrics tell us that we are not fully utilizing our hardware and that our throughput is low. We can immediately improve these results by batching our requests instead of computing inferences one at a time. The model.py file for the falcon model is already configured to handle batched requests. Enabling batching in Triton is as simple as adding the following to falcon’s config.pbtxt file:

dynamic_batching { }
max_batch_size: 8

The max_batch_size value can be any integer of your choosing; for this example, we select 8. Now let’s re-run the perf_analyzer with increasing levels of concurrency and see how this impacts GPU utilization and throughput by executing:

perf_analyzer -m falcon7b --collect-metrics --concurrency-range=2:16:2

After executing for a few minutes, the performance analyzer should return results similar to these (depending on hardware):

# Concurrency = 4
GPU-57c7b00e-ca04-3876-91e2-c1eae40a0733 : 74.1111%
Throughput: 31.8264 infer/sec, latency 125174 usec

# Concurrency = 8
GPU-57c7b00e-ca04-3876-91e2-c1eae40a0733 : 81.7895%
Throughput: 46.2105 infer/sec, latency 172920 usec

# Concurrency = 16
GPU-57c7b00e-ca04-3876-91e2-c1eae40a0733 : 90.5556%
Throughput: 53.6549 infer/sec, latency 299178 usec

Using the performance analyzer, we were able to quickly profile different model configurations to obtain better throughput and hardware utilization. In this case, we identified a configuration that more than doubles our throughput (from roughly 21 to 54 infer/sec) and raises GPU utilization by roughly 24 percentage points, all in less than 5 minutes.

This is a single, simple use case for the performance analyzer. For more information and a more complete list of performance analyzer parameters and use cases, please see this guide.

For more information regarding dynamic batching in Triton, please see this guide.

Model Analyzer#

In the performance analyzer section, we used intuition to increase our throughput by changing a subset of variables and measuring the difference in performance. However, we only changed a few variables across a wide search space.

To sweep this parameter space in a more robust fashion, we can use Triton’s model analyzer, which not only sweeps a large spectrum of configuration parameters, but also generates visual reports to analyze post-execution.

To use the model analyzer, please terminate your Triton server by invoking Ctrl+C and relaunching it with the following command (ensuring the dynamic_batching parameters from above have been added to the falcon model’s config.pbtxt):

docker run --gpus all -it --rm --net=host --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}/model_repository:/opt/tritonserver/model_repository triton_transformer_server

Next, to get the most accurate GPU metrics from the model analyzer, we will install and launch it from our local server container. To accomplish this, first install the model analyzer:

pip3 install triton-model-analyzer

Once the model analyzer installs successfully, enter the following command (modifying the instance count to something lower for your GPU, if necessary):

model-analyzer profile -m /opt/tritonserver/model_repository/ --profile-models falcon7b --run-config-search-max-instance-count=3 --run-config-search-min-model-batch-size=8

This tool will take longer to execute than the performance analyzer example (~40 minutes). If this execution time is too long, you can also run the analyzer with the --run-config-search-mode quick option. In our experimentation, enabling the quick search option yielded fewer results but took half the time. Regardless, once the model analyzer completes, it will provide you with a full summary of throughput, latency, and hardware utilization in multiple formats. A snippet from the summary report produced by the model analyzer for our run, ranked by performance, is shown below:

| Model Config Name | Max Batch Size | Dynamic Batching | Total Instance Count | p99 Latency (ms) | Throughput (infer/sec) | Max GPU Memory Usage (MB) | Average GPU Utilization (%) |
|---|---|---|---|---|---|---|---|
| falcon7b_config_7 | 16 | Enabled | 3:GPU | 1412.581 | 71.944 | 46226 | 100.0 |
| falcon7b_config_8 | 32 | Enabled | 3:GPU | 2836.225 | 63.9652 | 46268 | 100.0 |
| falcon7b_config_4 | 16 | Enabled | 2:GPU | 7601.437 | 63.9454 | 31331 | 100.0 |
| falcon7b_config_default | 8 | Enabled | 1:GPU | 4151.873 | 63.9384 | 16449 | 89.3 |

We can examine the performance of any of these configurations with more granularity by viewing their detailed reports. This subset of reports focuses on a single configuration’s latency and concurrency metrics as they relate to throughput and hardware utilization. A snippet from the top performing configuration for our tests is shown below (abridged for brevity):

| Request Concurrency | p99 Latency (ms) | Client Response Wait (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Throughput (infer/sec) | Max GPU Memory Usage (MB) | Average GPU Utilization (%) |
|---|---|---|---|---|---|---|---|---|
| 512 | 8689.491 | 8190.506 | 7397.975 | 0.166 | 778.565 | 63.954 | 46230.667264 | 100.0 |
| 128 | 2289.118 | 2049.37 | 1277.34 | 0.159 | 770.771 | 61.2953 | 46230.667264 | 100.0 |
| 64 | 1412.581 | 896.924 | 227.108 | 0.157 | 667.757 | 71.944 | 46226.47296 | 100.0 |
| 32 | 781.362 | 546.35 | 86.078 | 0.103 | 459.257 | 57.7877 | 46226.47296 | 100.0 |
| 1 | 67.12 | 49.707 | 0.049 | 0.024 | 49.121 | 20.0993 | 46207.598592 | 54.9 |

Similarly, this is a single use case for the model analyzer. For more information and a more complete list of model analyzer parameters and run options, please see this guide.

Please note that both the performance analyzer and model analyzer experiments were conducted on a system with an Intel i9 CPU and an NVIDIA A6000 GPU. Your results may vary depending on your hardware.

Customization#

The model.py files have been kept minimal in order to maximize generalizability. Should you wish to modify the behavior of the transformer models, such as increasing the number of generated sequences to return, be sure to modify the corresponding config.pbtxt and model.py files and copy them into the model_repository.
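
For example, if your model.py builds a transformers text-generation pipeline, returning multiple sequences per prompt might look like the sketch below. The pipeline construction and parameter values are illustrative, not the tutorial’s exact implementation, and the corresponding config.pbtxt output shape would need to be adjusted to match.

from transformers import pipeline

# Illustrative only: build a text-generation pipeline and request multiple
# candidate continuations per prompt.
generator = pipeline("text-generation", model="tiiuae/falcon-7b")
outputs = generator(
    "I am going",
    max_new_tokens=16,
    num_return_sequences=2,  # return two candidate continuations
    do_sample=True,          # sampling is required for distinct sequences
)
for candidate in outputs:
    print(candidate["generated_text"])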

The transformers used in this tutorial were all suited for text-generation tasks; however, this is not a limitation. The principles of this tutorial can be applied to serve models suited for any other transformer task.

Triton offers a rich variety of server configuration options not covered in this tutorial. For a more customized deployment, please see our model configuration guide to learn how the scope of this tutorial can be expanded to fit your needs.