Deploying Hugging Face Transformer Models in Triton#

There are multiple ways to serve Llama2 with Triton Inference Server:

  1. Infer with TensorRT-LLM Backend

  2. Infer with vLLM Backend

  3. Infer with Python-based Backends as a HuggingFace model

Pre-build instructions#

These tutorials assume that the Llama2 model weights and tokenizer files have been cloned from the Hugging Face Llama2 repository. To run the tutorials, you will need to request access to the Llama2 repository and authenticate with the Hugging Face CLI. The CLI authenticates using a User Access Token, which can be created at huggingface.co/settings/tokens.
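The setup described above can be sketched as the following shell commands. This is a minimal sketch: the repository id `meta-llama/Llama-2-7b-hf` is an assumed example variant — substitute the Llama2 variant you have been granted access to.

```shell
# Install the Hugging Face CLI (ships with the huggingface_hub package)
pip install -U "huggingface_hub[cli]"

# Authenticate with a User Access Token from huggingface.co/settings/tokens
huggingface-cli login

# Clone the gated Llama2 repository over Git LFS
# (7B variant assumed here; use the variant you have access to)
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
```

Cloning a gated repository will fail with an authorization error until your access request for the Llama2 repository has been approved and you have logged in.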