Deploy Hugging Face Models by Exporting to TensorRT-LLM#
This section shows how to use scripts and APIs to export a Hugging Face model to TensorRT-LLM, and deploy it with the NVIDIA Triton Inference Server.
Quick Example#
Pull down and run the Docker container image using the commands shown below. Change the `vr` tag to the version of the container you want to use:

```bash
docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
    -w /opt/NeMo \
    nvcr.io/nvidia/nemo:vr
```
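Once you are inside the container, you can optionally confirm that the GPUs are visible before continuing (a quick sanity check, not a required step):

```bash
nvidia-smi
```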
Run the following deployment script to verify that everything is working correctly. The script exports the Hugging Face model to TensorRT-LLM and subsequently serves it on the Triton server:
```bash
python scripts/deploy/nlp/deploy_triton.py \
    --hf_model_id_path meta-llama/Meta-Llama-3-8B-Instruct \
    --model_type LlamaForCausalLM \
    --triton_model_name llama \
    --tensor_parallelism_size 1
```
If the test yields a shared-memory-related error, relaunch the container with a larger shared memory size via the `--shm-size` Docker option (for example, increase it gradually in 50% steps).

In a separate terminal, run the following command to get the container ID of the running container:

```bash
docker ps
```
Access the running container, replacing `container_id` with the actual container ID:

```bash
docker exec -it container_id bash
```
To send a query to the Triton server, run the following script:
```bash
python scripts/deploy/nlp/query.py -mn llama -p "What is the color of a banana?" -mol 5
```
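In this command, `-mn` is the Triton model name, `-p` is the prompt, and `-mol` is the maximum output length, matching the long-form options used elsewhere in this guide. If you prefer to query the server from Python, NeMo ships a query client in its deploy module; the snippet below is a minimal sketch that assumes the `NemoQueryLLM` class from `nemo.deploy.nlp` (check its docstrings for the exact signature in your NeMo version):

```python
# Hypothetical Python equivalent of the query.py call above,
# assuming NeMo's NemoQueryLLM client API.
from nemo.deploy.nlp import NemoQueryLLM

nq = NemoQueryLLM(url="localhost:8000", model_name="llama")

output = nq.query_llm(
    prompts=["What is the color of a banana?"],
    max_output_len=5,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
print(output)
```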
Use a Script to Deploy Hugging Face Models on a Triton Server#
You can deploy a Hugging Face model on Triton using the provided script.
Export and Deploy a Hugging Face Model#
The script exports the model to TensorRT-LLM and then starts the service on Triton.
Start the container using the steps described in the Quick Example section.
If you want to deploy a model that needs to be downloaded from Hugging Face, you must generate a Hugging Face token that has access to that model. Visit Hugging Face for more information. After you have the token, perform one of the following steps.
Log in to Hugging Face:
```bash
huggingface-cli login
```
Or, set the `HF_TOKEN` environment variable:
```bash
export HF_TOKEN=your_token_here
```
Note: If you’re using a locally downloaded model, you don’t need to provide a Hugging Face token unless the model requires it for downloading additional resources.
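You can also authenticate from Python instead of the CLI. The sketch below assumes the token has already been exported in the `HF_TOKEN` environment variable and uses the `login` helper from the `huggingface_hub` library:

```python
import os

# Programmatic alternative to `huggingface-cli login`.
from huggingface_hub import login

# Reads the token from the HF_TOKEN environment variable set above.
login(token=os.environ["HF_TOKEN"])
```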
To begin serving a Hugging Face model, you can use either a model ID from the Hugging Face hub or a path to a locally downloaded model:
Using a Hugging Face model ID:
```bash
python scripts/deploy/nlp/deploy_triton.py \
    --hf_model_id_path meta-llama/Meta-Llama-3-8B-Instruct \
    --model_type LlamaForCausalLM \
    --triton_model_name llama \
    --tensor_parallelism_size 1
```
To use a locally downloaded model:
```bash
python scripts/deploy/nlp/deploy_triton.py \
    --hf_model_id_path /path/to/your/local/model \
    --model_type LlamaForCausalLM \
    --triton_model_name llama \
    --tensor_parallelism_size 1
```
The following parameters are defined in the `deploy_triton.py` script (an example that combines several of them follows this list):

- `--hf_model_id_path`: path or identifier of the Hugging Face model. This can be either:
  - a Hugging Face model ID (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`), or
  - a local path to a downloaded model directory (e.g., `/path/to/your/local/model`).
- `--model_type`: type of the model. See the table below for supported model types.
- `--triton_model_name`: name of the model on Triton.
- `--triton_model_version`: version of the model. Default is 1.
- `--triton_port`: port on which the Triton server listens for requests. Default is 8000.
- `--triton_http_address`: HTTP address for the Triton server. Default is 0.0.0.0.
- `--triton_model_repository`: temporary folder for the TensorRT-LLM engine. Default is `/tmp/trt_llm_model_dir/`.
- `--tensor_parallelism_size`: number of GPUs to split the tensors across for tensor parallelism. Default is 1.
- `--pipeline_parallelism_size`: number of GPUs to split the model across for pipeline parallelism. Default is 1.
- `--dtype`: data type of the model on TensorRT-LLM. Default is "bfloat16". Currently, only "bfloat16" is supported.
- `--max_input_len`: maximum input length of the model. Default is 256.
- `--max_output_len`: maximum output length of the model. Default is 256.
- `--max_batch_size`: maximum batch size of the model. Default is 8.
- `--max_num_tokens`: maximum number of tokens. Default is None.
- `--opt_num_tokens`: optimum number of tokens. Default is None.
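For illustration, the following hypothetical invocation combines several of the optional parameters above (the values are placeholders, not tuning recommendations):

```bash
python scripts/deploy/nlp/deploy_triton.py \
    --hf_model_id_path meta-llama/Meta-Llama-3-8B-Instruct \
    --model_type LlamaForCausalLM \
    --triton_model_name llama \
    --triton_model_version 1 \
    --triton_port 8000 \
    --triton_http_address 0.0.0.0 \
    --tensor_parallelism_size 1 \
    --dtype bfloat16 \
    --max_input_len 512 \
    --max_output_len 256 \
    --max_batch_size 8
```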
The following table shows the supported Hugging Face model types and their corresponding `model_type` values:

| Hugging Face Model | model_type |
|---|---|
| GPT2LMHeadModel | GPTForCausalLM |
| GPT2LMHeadCustomModel | GPTForCausalLM |
| GPTBigCodeForCausalLM | GPTForCausalLM |
| Starcoder2ForCausalLM | GPTForCausalLM |
| JAISLMHeadModel | GPTForCausalLM |
| GPTForCausalLM | GPTForCausalLM |
| NemotronForCausalLM | GPTForCausalLM |
| OPTForCausalLM | OPTForCausalLM |
| BloomForCausalLM | BloomForCausalLM |
| RWForCausalLM | FalconForCausalLM |
| FalconForCausalLM | FalconForCausalLM |
| PhiForCausalLM | PhiForCausalLM |
| Phi3ForCausalLM | Phi3ForCausalLM |
| Phi3VForCausalLM | Phi3ForCausalLM |
| Phi3SmallForCausalLM | Phi3ForCausalLM |
| PhiMoEForCausalLM | Phi3ForCausalLM |
| MambaForCausalLM | MambaForCausalLM |
| GPTNeoXForCausalLM | GPTNeoXForCausalLM |
| GPTJForCausalLM | GPTJForCausalLM |
| MptForCausalLM | MPTForCausalLM |
| MPTForCausalLM | MPTForCausalLM |
| GLMModel | ChatGLMForCausalLM |
| ChatGLMModel | ChatGLMForCausalLM |
| ChatGLMForCausalLM | ChatGLMForCausalLM |
| ChatGLMForConditionalGeneration | ChatGLMForCausalLM |
| LlamaForCausalLM | LLaMAForCausalLM |
| LlavaLlamaModel | LLaMAForCausalLM |
| ExaoneForCausalLM | LLaMAForCausalLM |
| MistralForCausalLM | LLaMAForCausalLM |
| MixtralForCausalLM | LLaMAForCausalLM |
| ArcticForCausalLM | LLaMAForCausalLM |
| Grok1ModelForCausalLM | GrokForCausalLM |
| InternLMForCausalLM | LLaMAForCausalLM |
| InternLM2ForCausalLM | LLaMAForCausalLM |
| InternLMXComposer2ForCausalLM | LLaMAForCausalLM |
| GraniteForCausalLM | LLaMAForCausalLM |
| GraniteMoeForCausalLM | LLaMAForCausalLM |
| MedusaForCausalLM | MedusaForCausalLm |
| MedusaLlamaForCausalLM | MedusaForCausalLm |
| ReDrafterForCausalLM | ReDrafterForCausalLM |
| BaichuanForCausalLM | BaichuanForCausalLM |
| BaiChuanForCausalLM | BaichuanForCausalLM |
| SkyworkForCausalLM | LLaMAForCausalLM |
| GEMMA | GemmaForCausalLM |
| GEMMA2 | GemmaForCausalLM |
| QWenLMHeadModel | QWenForCausalLM |
| QWenForCausalLM | QWenForCausalLM |
| Qwen2ForCausalLM | QWenForCausalLM |
| Qwen2MoeForCausalLM | QWenForCausalLM |
| Qwen2ForSequenceClassification | QWenForCausalLM |
| Qwen2VLForConditionalGeneration | QWenForCausalLM |
| Qwen2VLModel | QWenForCausalLM |
| WhisperEncoder | WhisperEncoder |
| EncoderModel | EncoderModel |
| DecoderModel | DecoderModel |
| DbrxForCausalLM | DbrxForCausalLM |
| RecurrentGemmaForCausalLM | RecurrentGemmaForCausalLM |
| CogVLMForCausalLM | CogVLMForCausalLM |
| DiT | DiT |
| DeepseekForCausalLM | DeepseekForCausalLM |
| DeciLMForCausalLM | DeciLMForCausalLM |
| DeepseekV2ForCausalLM | DeepseekV2ForCausalLM |
| EagleForCausalLM | EagleForCausalLM |
| CohereForCausalLM | CohereForCausalLM |
| MLLaMAModel | MLLaMAForCausalLM |
| MllamaForConditionalGeneration | MLLaMAForCausalLM |
| BertForQuestionAnswering | BertForQuestionAnswering |
| BertForSequenceClassification | BertForSequenceClassification |
| BertModel | BertModel |
| RobertaModel | RobertaModel |
| RobertaForQuestionAnswering | RobertaForQuestionAnswering |
| RobertaForSequenceClassification | RobertaForSequenceClassification |
Each time the script is executed, it exports the Hugging Face model to TensorRT-LLM and then starts the service. If you want to skip the export step on later runs of the optimized inference option, you can specify an empty directory in which to save the generated TensorRT-LLM engine. Stop the running container, and then run the following commands to create that directory, mount it into the container, and deploy with it:
```bash
mkdir tmp_triton_model_repository

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
    -v ${PWD}:/opt/checkpoints/ \
    -w /opt/NeMo \
    nvcr.io/nvidia/nemo:vr

python scripts/deploy/nlp/deploy_triton.py \
    --hf_model_id_path /path/to/your/local/model \
    --model_type LlamaForCausalLM \
    --triton_model_name llama \
    --triton_model_repository /opt/checkpoints/tmp_triton_model_repository \
    --tensor_parallelism_size 1
```
After the script above runs, the model is exported to the specified folder so that it can be reused later.
To load the exported model directly, run the following script within the container:
```bash
python scripts/deploy/nlp/deploy_triton.py \
    --triton_model_name llama \
    --triton_model_repository /opt/checkpoints/tmp_triton_model_repository \
    --model_type LlamaForCausalLM
```
Use NeMo Export and Deploy Module APIs to Run Inference#
Up until now, we have used scripts for exporting and deploying Hugging Face models. However, NeMo’s deploy and export modules offer straightforward APIs for deploying models to Triton and exporting Hugging Face models to TensorRT-LLM.
Export a Hugging Face Model to TensorRT-LLM#
You can use the APIs in the export module to export a Hugging Face model to TensorRT-LLM. The following code example assumes that the `/opt/checkpoints/tmp_trt_llm` path exists.
Run the following command:
```python
from nemo.export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")

# Using a Hugging Face model ID
trt_llm_exporter.export_hf_model(
    hf_model_path="meta-llama/Meta-Llama-3-8B-Instruct",
    model_type="LlamaForCausalLM",
    tensor_parallelism_size=1,
)

# Or using a local model path
trt_llm_exporter.export_hf_model(
    hf_model_path="/path/to/your/local/model",
    model_type="LlamaForCausalLM",
    tensor_parallelism_size=1,
)

trt_llm_exporter.forward(
    ["What is the best city in the world?"],
    max_output_token=15,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
```
Be sure to check the TensorRTLLM class docstrings for details.
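Because the engine files are written to `model_dir`, a later session can reuse them without repeating the export, mirroring the script-based reuse described earlier. The following is a minimal sketch under the assumption that the exporter loads an existing engine from `model_dir` when one is present:

```python
from nemo.export.tensorrt_llm import TensorRTLLM

# Assumption: constructing the exporter on a directory that already
# contains an exported engine loads that engine, so no
# export_hf_model() call is needed in this session.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")

output = trt_llm_exporter.forward(
    ["What is the best city in the world?"],
    max_output_token=15,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
print(output)
```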
Deploy a Hugging Face Model to TensorRT-LLM#
You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton. The following code example assumes that the `/opt/checkpoints/tmp_trt_llm` path exists.
Run the following command:
```python
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")

# Using a Hugging Face model ID
trt_llm_exporter.export_hf_model(
    hf_model_path="meta-llama/Llama-2-7b-hf",
    model_type="LlamaForCausalLM",
    tensor_parallelism_size=1,
)

# Or using a local model path
trt_llm_exporter.export_hf_model(
    hf_model_path="/path/to/your/local/model",
    model_type="LlamaForCausalLM",
    tensor_parallelism_size=1,
)

nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="llama", http_port=8000)
nm.deploy()
nm.serve()
```
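Once `nm.serve()` is running, the deployed model can be queried in the same way as in the Quick Example, for instance from another terminal inside the container:

```bash
python scripts/deploy/nlp/query.py -mn llama -p "What is the color of a banana?" -mol 5
```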