Deploying Hugging Face Transformer Models in Triton#
The following tutorial demonstrates how to deploy an arbitrary Hugging Face transformer model on the Triton Inference Server using Triton’s Python backend. For the purposes of this example, two transformer models will be deployed: falcon-7b and persimmon-8b.
These models were selected because of their popularity and consistent response quality. However, this tutorial also generalizes to any transformer model, provided sufficient infrastructure.
NOTE: The tutorial is intended to be a reference example only. It may not be tuned for optimal performance.
NOTE: Llama 2 models are not specifically mentioned in the steps below, but can be run if tiiuae/falcon-7b is replaced with meta-llama/Llama-2-7b-hf and the falcon7b folder is replaced with a llama7b folder.
Step 1: Create a Model Repository#
The first step is to create a model repository containing the models we want the Triton Inference Server to load and use for inference processing. To accomplish this, create a directory called model_repository and copy the falcon7b model folder into it:
mkdir -p model_repository
cp -r falcon7b/ model_repository/
The falcon7b/ folder we copied is organized in the way Triton expects and contains two important files needed to serve models in Triton:
config.pbtxt - Outlines the backend to use, model input/output details, and custom parameters to use for execution. More information on the full range of model configuration properties Triton supports can be found here.
model.py - Implements how Triton should handle the model during the initialization, execution, and finalization stages. More information regarding Python backend usage can be found here.
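For reference, the following is a minimal sketch of the structure a Python-backend model.py for a text-generation model follows. It is a hypothetical illustration only; the files shipped with this tutorial contain additional handling (caching, batching, parameters) not shown here, and device_map="auto" assumes the accelerate package is installed.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import pipeline

class TritonPythonModel:
    def initialize(self, args):
        # Load the Hugging Face pipeline once, when Triton loads the model.
        self.generator = pipeline("text-generation", model="tiiuae/falcon-7b",
                                  device_map="auto")

    def execute(self, requests):
        # Triton passes a list of requests; return one response per request.
        responses = []
        for request in requests:
            text_input = pb_utils.get_input_tensor_by_name(request, "text_input")
            prompts = [p.decode("utf-8") for p in text_input.as_numpy().flatten()]
            generated = self.generator(prompts, max_new_tokens=15)
            outputs = np.array([g[0]["generated_text"] for g in generated],
                               dtype=object)
            out_tensor = pb_utils.Tensor("text", outputs)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        # Called when the model is unloaded; release resources here if needed.
        self.generator = None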
Step 2: Build a Triton Container Image#
The second step is to create an image that includes all the dependencies necessary to deploy Hugging Face transformer models on the Triton Inference Server. This can be done by building an image from the provided Dockerfile:
docker build -t triton_transformer_server .
Step 3: Launch the Triton Inference Server#
Once the triton_transformer_server image is created, you can launch the Triton Inference Server in a container with the following command:
docker run --gpus all -it --rm --net=host --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}/model_repository:/opt/tritonserver/model_repository triton_transformer_server tritonserver --model-repository=model_repository
Note: For private models like Llama 2, you need to request access to the model and add your access token to the docker command via -e PRIVATE_REPO_TOKEN=<hf_your_huggingface_access_token>.
docker run --gpus all -it --rm --net=host --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -e PRIVATE_REPO_TOKEN=<hf_your_huggingface_access_token> -v ${PWD}/model_repository:/opt/tritonserver/model_repository triton_transformer_server tritonserver --model-repository=model_repository
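Inside model.py, one way this token might be consumed (a hypothetical sketch, not necessarily how the tutorial’s files handle it) is to read the environment variable and pass it to the Hugging Face pipeline:
import os
from transformers import pipeline

# Hypothetical sketch: read the token injected via `docker run -e ...` and pass
# it to Hugging Face so gated models such as Llama 2 can be downloaded.
# Older transformers releases use the use_auth_token argument instead of token.
hf_token = os.environ.get("PRIVATE_REPO_TOKEN")
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf",
                     token=hf_token)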
The server has launched successfully when you see the following outputs in your console:
I0922 23:28:40.351809 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0922 23:28:40.352017 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
Step 4: Query the Server#
Now we can query the server using curl, specifying the server address and input details:
curl -X POST localhost:8000/v2/models/falcon7b/infer -d '{"inputs": [{"name":"text_input","datatype":"BYTES","shape":[1],"data":["I am going"]}]}'
In our testing, the server returned the following result (formatted for legibility):
{
  "model_name": "falcon7b",
  "model_version": "1",
  "outputs": [
    {
      "name": "text",
      "datatype": "BYTES",
      "shape": [1],
      "data": ["I am going to be in the market for a new laptop soon. I"]
    }
  ]
}
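If you prefer a Python client over curl, the same request can be issued with Triton’s client library (pip install tritonclient[http]). This is a minimal sketch that assumes the tensor names shown above:
import numpy as np
import tritonclient.http as httpclient

# Connect to the local Triton HTTP endpoint started above.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the BYTES input tensor named "text_input" with a single prompt.
text_input = httpclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(np.array(["I am going"], dtype=object))

# Run inference against the falcon7b model and read back the "text" output.
result = client.infer(model_name="falcon7b", inputs=[text_input])
print(result.as_numpy("text"))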
Step 5: Host Multiple Models in Triton#
So far in this tutorial, we have only loaded a single model. However, Triton is capable of hosting many models simultaneously. To accomplish this, first ensure you have exited the docker container by invoking Ctrl+C and waiting for the container to exit. Next, copy the remaining model provided into the model repository:
cp -r persimmon8b/ model_repository/
NOTE: The combined size of these two models is large. If your current hardware cannot support hosting both models simultaneously, consider loading a smaller model, such as opt-125m, by creating a folder for it using the templates provided and copying it into model_repository.
Again, launch the server by invoking the docker run command from above and wait for confirmation that the server has launched successfully.
Query the server, making sure to change the model name in the request path for each model:
curl -X POST localhost:8000/v2/models/falcon7b/infer -d '{"inputs": [{"name":"text_input","datatype":"BYTES","shape":[1],"data":["How can you be"]}]}'
curl -X POST localhost:8000/v2/models/persimmon8b/infer -d '{"inputs": [{"name":"text_input","datatype":"BYTES","shape":[1],"data":["Where is the nearest"]}]}'
In our testing, these queries returned the following parsed results:
# falcon7b
"How can you be sure that you are getting the best deal on your car"
# persimmon8b
"Where is the nearest starbucks?"
Beginning in the 23.10 release, users can now interact with large language models (LLMs) hosted by Triton in a simplified fashion by using Triton’s generate endpoint:
curl -X POST localhost:8000/v2/models/falcon7b/generate -d '{"text_input":"How can you be"}'
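The same call can be made from Python with any HTTP client; for example, a short sketch using the requests package:
import requests

# The generate endpoint accepts the prompt as "text_input" and returns the
# completion in the response JSON.
response = requests.post(
    "http://localhost:8000/v2/models/falcon7b/generate",
    json={"text_input": "How can you be"},
)
print(response.json())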
‘Day Zero’ Support#
The latest transformer models may not always be supported in the most recent, official release of the transformers package. In such a case, you should still be able to load these ‘bleeding edge’ models in Triton by building transformers from source. This can be done by replacing the transformers install directive in the provided Dockerfile with:
RUN pip install git+https://github.com/huggingface/transformers.git
Using this technique, you should be able to serve any transformer model supported by Hugging Face with Triton.
Next Steps#
The following sections expand on the base tutorial and provide guidance for future sandboxing.
Loading Cached Models#
In the previous steps, we downloaded the falcon-7b model from Hugging Face when we launched the Triton server. We can avoid this lengthy download process in subsequent runs by loading cached models into Triton. By default, the provided model.py files will cache the falcon and persimmon models in their respective directories within the model_repository folder. This is accomplished by setting the TRANSFORMERS_CACHE environment variable.
To set this environment variable for an arbitrary model, include the following lines in your model.py before importing the ‘transformers’ module, making sure to replace {MODEL} with your target model.
import os
os.environ['TRANSFORMERS_CACHE'] = '/opt/tritonserver/model_repository/{MODEL}/hf_cache'
Alternatively, if your system has already cached a Hugging Face model you wish to deploy in Triton, you can mount it into the Triton container by adding the following mount option to the docker run command from earlier (making sure to replace ${HOME} with the path to your associated username’s home directory):
# Option to mount a specific cached model (falcon-7b in this case)
-v ${HOME}/.cache/huggingface/hub/models--tiiuae--falcon-7b:/root/.cache/huggingface/hub/models--tiiuae--falcon-7b
# Option to mount all cached models on the host system
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface
Triton Tool Ecosystem#
Deploying models in Triton also comes with the benefit of access to a fully-supported suite of deployment analyzers to help you better understand and tailor your systems to fit your needs. Triton currently has two options for deployment analysis:
Performance Analyzer: An inference performance optimizer.
Model Analyzer: A GPU memory and compute utilization optimizer.
Performance Analyzer#
To use the performance analyzer, please remove the persimmon8b model from model_repository and restart the Triton server using the docker run command from above.
Once Triton launches successfully, start a Triton SDK container by running the following in a separate window:
docker run -it --net=host nvcr.io/nvidia/tritonserver:23.10-py3-sdk bash
This container comes with all of Triton’s deployment analyzers pre-installed, meaning we can simply enter the following to get feedback on our model’s inference performance:
perf_analyzer -m falcon7b --collect-metrics
This command should run quickly and profile the performance of our falcon7b model. As the analyzer runs, it will output useful metrics such as latency percentiles, latency by stage of inference, and successful request count. A subset of the output data is shown below:
#Avg request latency
46307 usec (overhead 25 usec + queue 25 usec + compute input 26 usec + compute infer 46161 usec + compute output 68 usec)
#Avg GPU Utilization
GPU-57c7b00e-ca04-3876-91e2-c1eae40a0733 : 66.0556%
#Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 21.3841 infer/sec, latency 46783 usec
These metrics tell us that we are not fully utilizing our hardware and that our throughput is low. We can immediately improve these results by batching our requests instead of computing inferences one at a time. The model.py file for the falcon model is already configured to handle batched requests. Enabling batching in Triton is as simple as adding the following to falcon’s config.pbtxt file:
dynamic_batching { }
max_batch_size: 8
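With dynamic batching enabled, Triton may deliver several requests to a single execute() call. The following is a hypothetical sketch of how a Python-backend model could gather those requests into one pipeline call; the tutorial’s actual model.py may structure this differently.
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    # initialize() and finalize() omitted; self.generator is assumed to be the
    # Hugging Face text-generation pipeline created during initialize().
    def execute(self, requests):
        # Gather the prompts from every request delivered in this batch,
        # run a single pipeline call, then split the results back per request.
        prompts, counts = [], []
        for request in requests:
            batch = pb_utils.get_input_tensor_by_name(request, "text_input").as_numpy()
            prompts.extend(p.decode("utf-8") for p in batch.flatten())
            counts.append(batch.size)

        generated = self.generator(prompts, max_new_tokens=15)

        responses, offset = [], 0
        for count in counts:
            texts = [g[0]["generated_text"] for g in generated[offset:offset + count]]
            offset += count
            out = pb_utils.Tensor("text", np.array(texts, dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses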
The value of max_batch_size can be any integer of your choosing; for this example, we select 8. Now let’s re-run the perf_analyzer with increasing levels of concurrency and see how it impacts GPU utilization and throughput by executing:
perf_analyzer -m falcon7b --collect-metrics --concurrency-range=2:16:2
After executing for a few minutes, the performance analyzer should return results similar to these (depending on hardware):
# Concurrency = 4
GPU-57c7b00e-ca04-3876-91e2-c1eae40a0733 : 74.1111%
Throughput: 31.8264 infer/sec, latency 125174 usec
# Concurrency = 8
GPU-57c7b00e-ca04-3876-91e2-c1eae40a0733 : 81.7895%
Throughput: 46.2105 infer/sec, latency 172920 usec
# Concurrency = 16
GPU-57c7b00e-ca04-3876-91e2-c1eae40a0733 : 90.5556%
Throughput: 53.6549 infer/sec, latency 299178 usec
Using the performance analyzer we were able to quickly profile different model configurations to obtain better throughput and hardware utilization. In this case, we were able to identify a configuration that nearly triples our throughput and increases GPU utilization by ~24% in less than 5 minutes.
This is a single, simple use case for the performance analyzer. For more information and a more complete list of performance analyzer parameters and use cases, please see this guide.
For more information regarding dynamic batching in Triton, please see this guide.
Model Analyzer#
In the performance analyzer section, we used intuition to increase our throughput by changing a subset of variables and measuring the difference in performance. However, we only changed a few variables across a wide search space.
To sweep this parameter space in a more robust fashion, we can use Triton’s model analyzer, which not only sweeps a large spectrum of configuration parameters, but also generates visual reports to analyze post-execution.
To use the model analyzer, please terminate your Triton server by invoking Ctrl+C, then relaunch it with the following command (ensuring the dynamic_batching parameters from above have been added to the falcon model’s config.pbtxt):
docker run --gpus all -it --rm --net=host --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}/model_repository:/opt/tritonserver/model_repository triton_transformer_server
Next, to get the most accurate GPU metrics from the model analyzer, we will install and launch it from our local server container. To accomplish this, first install the model analyzer:
pip3 install triton-model-analyzer
Once the model analyzer installs successfully, enter the following command (modifying the instance count to something lower for your GPU, if necessary):
model-analyzer profile -m /opt/tritonserver/model_repository/ --profile-models falcon7b --run-config-search-max-instance-count=3 --run-config-search-min-model-batch-size=8
This tool will take longer to execute than the performance analyzer example (~40 minutes). If this execution time is too long, you can also run the analyzer with the --run-config-search-mode quick option. In our experimentation, enabling the quick search option yielded fewer results but took half the time. Regardless, once the model analyzer is complete, it will provide you a full summary relating to throughput, latency, and hardware utilization in multiple formats. A snippet from the summary report produced by the model analyzer for our run is ranked by performance and shown below:
| Model Config Name | Max Batch Size | Dynamic Batching | Total Instance Count | p99 Latency (ms) | Throughput (infer/sec) | Max GPU Memory Usage (MB) | Average GPU Utilization (%) |
|---|---|---|---|---|---|---|---|
| falcon7b_config_7 | 16 | Enabled | 3:GPU | 1412.581 | 71.944 | 46226 | 100.0 |
| falcon7b_config_8 | 32 | Enabled | 3:GPU | 2836.225 | 63.9652 | 46268 | 100.0 |
| falcon7b_config_4 | 16 | Enabled | 2:GPU | 7601.437 | 63.9454 | 31331 | 100.0 |
| falcon7b_config_default | 8 | Enabled | 1:GPU | 4151.873 | 63.9384 | 16449 | 89.3 |
We can examine the performance of any of these configurations with more granularity by viewing their detailed reports. This subset of reports focuses on a single configuration’s latency and concurrency metrics as they relate to throughput and hardware utilization. A snippet from the top performing configuration for our tests is shown below (abridged for brevity):
| Request Concurrency | p99 Latency (ms) | Client Response Wait (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Throughput (infer/sec) | Max GPU Memory Usage (MB) | Average GPU Utilization (%) |
|---|---|---|---|---|---|---|---|---|
| 512 | 8689.491 | 8190.506 | 7397.975 | 0.166 | 778.565 | 63.954 | 46230.667264 | 100.0 |
| … | | | | | | | | |
| 128 | 2289.118 | 2049.37 | 1277.34 | 0.159 | 770.771 | 61.2953 | 46230.667264 | 100.0 |
| 64 | 1412.581 | 896.924 | 227.108 | 0.157 | 667.757 | 71.944 | 46226.47296 | 100.0 |
| 32 | 781.362 | 546.35 | 86.078 | 0.103 | 459.257 | 57.7877 | 46226.47296 | 100.0 |
| … | | | | | | | | |
| 1 | 67.12 | 49.707 | 0.049 | 0.024 | 49.121 | 20.0993 | 46207.598592 | 54.9 |
Similarly, this is a single use case for the model analyzer. For more information and a more complete list of model analyzer parameters and run options, please see this guide.
Please note that both the performance and model analyzer experiments were conducted on a system with an Intel i9 and NVIDIA A6000 GPU. Your results may vary depending on your hardware.
Customization#
The model.py files have been kept minimal in order to maximize generalizability. Should you wish to modify the behavior of the transformer models, such as increasing the number of generated sequences to return, be sure to modify the corresponding config.pbtxt and model.py files and copy them into the model_repository.
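For example, increasing the number of generated sequences comes down to the arguments passed to the Hugging Face pipeline inside model.py. The following is a hypothetical sketch of the relevant call (the output dims in config.pbtxt would need to grow to carry the extra sequences):
from transformers import pipeline

# Hypothetical sketch: return three candidate completions per prompt instead
# of one. Sampling must be enabled to obtain distinct sequences.
generator = pipeline("text-generation", model="tiiuae/falcon-7b")
results = generator(
    "I am going",
    max_new_tokens=15,
    num_return_sequences=3,
    do_sample=True,
)
for candidate in results:
    print(candidate["generated_text"])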
The transformers used in this tutorial were all suited for text-generation tasks; however, this is not a limitation. The principles of this tutorial can be applied to serve models suited for any other transformer task.
Triton offers a rich variety of available server configuration options not mentioned in this tutorial. For a more custom deployment, please see our model configuration guide to see how the scope of this tutorial can be expanded to fit your needs.