Deploying Hermes-2-Pro-Llama-3-8B Model with Triton Inference Server#
The Hermes-2-Pro-Llama-3-8B is an advanced language model developed by NousResearch. It is an enhanced version of Meta-Llama-3-8B, fine-tuned in-house on the OpenHermes 2.5 dataset as well as a newly introduced Function Calling and JSON Mode dataset developed by NousResearch. These advancements enable the model to excel both in general conversational tasks and in specialized functions such as structured JSON output and function calling, making it a versatile tool for various applications.
The model is available for download through Hugging Face.
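For instance, the weights can be fetched with the huggingface_hub Python package. The sketch below is only an illustration: it assumes the package is installed (pip3 install huggingface_hub) and that the repo id matches the one on the model card.
# Download the Hermes checkpoint locally; the repo id and target folder are
# assumptions, adjust them to match the model card and your filesystem.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="NousResearch/Hermes-2-Pro-Llama-3-8B",
    local_dir="Hermes-2-Pro-Llama-3-8B",
)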
TensorRT-LLM is Nvidia’s recommended solution for running Large Language Models (LLMs) on Nvidia GPUs. Read more about TensorRT-LLM here and Triton’s TensorRT-LLM Backend here.
NOTE: If some parts of this tutorial don’t work, it is possible that there
are some version mismatches between the tutorials
and the tensorrtllm_backend
repository. Refer to llama.md
for more detailed modifications if necessary. If you are familiar with
Python, you can also try using the
High-level API
for the LLM workflow.
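As a rough illustration of that alternative, here is a minimal sketch of the high-level LLM API workflow. It is not part of the Triton deployment flow below, and the exact import path and argument names may differ between TensorRT-LLM versions, so verify them against your installed version.
# Minimal sketch of TensorRT-LLM's high-level LLM API (assumed names; check
# them against the TensorRT-LLM version installed in your container).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/Hermes-2-Pro-Llama-3-8B")   # Hugging Face checkpoint directory
sampling = SamplingParams(max_tokens=28)      # cap the generated length
for output in llm.generate(["What is ML?"], sampling):
    print(output.outputs[0].text)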
Prerequisite: TensorRT-LLM backend#
This tutorial requires the TensorRT-LLM Backend repository. Please note
that for the best user experience we recommend using the latest
release tag
of tensorrtllm_backend
and the latest Triton Server container.
To clone the TensorRT-LLM Backend repository, run the following commands:
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git --branch <release branch>
# Update the submodules
cd tensorrtllm_backend
# Install git-lfs if needed
apt-get update && apt-get install git-lfs -y --no-install-recommends
git lfs install
git submodule update --init --recursive
Launch Triton TensorRT-LLM container#
Launch the Triton docker container with the TensorRT-LLM backend.
Note that we’re mounting tensorrtllm_backend to /tensorrtllm_backend
and the Hermes model to /Hermes-2-Pro-Llama-3-8B in the docker container for simplicity.
Create an engines folder outside the container so you can reuse the engines for future runs.
Please make sure to replace <xx.yy> with the version of Triton that you want to use.
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v </path/to/tensorrtllm_backend>:/tensorrtllm_backend \
-v </path/to/Hermes/repo>:/Hermes-2-Pro-Llama-3-8B \
-v </path/to/engines>:/engines \
nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3
Alternatively, you can follow the instructions here to build Triton Server with the TensorRT-LLM Backend if you want to build a specialized container.
Don’t forget to allow GPU usage when you launch the container.
Create Engines for each model [skip this step if you already have an engine]#
TensorRT-LLM requires each model to be compiled for the configuration you need before running. Before you run your model on Triton Server for the first time, you will need to create a TensorRT-LLM engine.
The Triton Server TensorRT-LLM container comes with a pre-installed TensorRT-LLM package, which allows users to build engines inside the Triton container. Simply follow the next steps:
HF_LLAMA_MODEL=/Hermes-2-Pro-Llama-3-8B
UNIFIED_CKPT_PATH=/tmp/ckpt/hermes/8b/
ENGINE_DIR=/engines
CONVERT_CHKPT_SCRIPT=/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py
python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${HF_LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
--remove_input_padding enable \
--gpt_attention_plugin float16 \
--context_fmha enable \
--gemm_plugin float16 \
--output_dir ${ENGINE_DIR} \
--paged_kv_cache enable \
--max_batch_size 4
Optional: You can test the output of the model with
run.py
located in the same llama examples folder.
python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=${ENGINE_DIR} --max_output_len 28 --tokenizer_dir ${HF_LLAMA_MODEL} --input_text "What is ML?"
You should expect the following response:
Input [Text 0]: "<|begin_of_text|>What is ML?"
Output [Text 0 Beam 0]: " Machine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate in predicting outcomes without being explicitly programmed."
Serving with Triton#
The last step is to create a Triton-readable model repository. You can find a template of a model that uses inflight batching in tensorrtllm_backend/all_models/inflight_batcher_llm. To run our model, you will need to:
Copy over the inflight batcher models repository
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
Modify
config.pbtxt
for the preprocessing, postprocessing and processing steps. The following script applies a minimal configuration to run tritonserver, but if you want optimal performance or custom parameters, read the details in the documentation and perf_best_practices:
# preprocessing
TOKENIZER_DIR=/Hermes-2-Pro-Llama-3-8B/
TOKENIZER_TYPE=auto
DECOUPLED_MODE=false
MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=10000
TRTLLM_BACKEND=python
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRTLLM_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching
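Optionally, you can sanity-check that every template parameter in the copied configs was substituted. The short sketch below is not part of the official tutorial; it simply scans the filled config.pbtxt files for any remaining ${...} placeholders.
# Optional sanity check: report config.pbtxt files that still contain
# unsubstituted ${...} template parameters after running fill_template.py.
import glob
import re

for path in glob.glob("/opt/tritonserver/inflight_batcher_llm/*/config.pbtxt"):
    with open(path) as f:
        leftovers = sorted(set(re.findall(r"\$\{\w+\}", f.read())))
    if leftovers:
        print(f"{path} still contains placeholders: {leftovers}")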
Launch Tritonserver
[!NOTE] This tutorial was prepared for serving a TensorRT-LLM model on a single GPU. Thus, in the following command use
--world_size=1
if the engine was built for a single GPU. Alternatively, if the engine requires multiple GPUs, make sure to specify the exact number of GPUs required by the engine in
--world_size.
Use the launch_triton_server.py script. This launches multiple instances of tritonserver
with MPI.
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=<world size of the engine> --model_repo=/opt/tritonserver/inflight_batcher_llm
You should expect the following response:
...
I0503 22:01:25.210518 1175 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:8001
I0503 22:01:25.211612 1175 http_server.cc:4692] Started HTTPService at 0.0.0.0:8000
I0503 22:01:25.254914 1175 http_server.cc:362] Started Metrics Service at 0.0.0.0:8002
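Before sending requests, you can optionally verify that the server and the ensemble model are ready. This is a minimal sketch using the tritonclient HTTP package (pip3 install "tritonclient[http]"), run from any environment that can reach the server, such as the SDK container used in the next section; the default HTTP port is assumed.
# Optional readiness check against Triton's HTTP endpoint.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("ensemble ready:", client.is_model_ready("ensemble"))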
To stop Triton Server inside the container, run:
pkill tritonserver
Send an inference request#
You can test the results of the run with:
1. The inflight_batcher_llm_client.py script.
First, let’s start the Triton SDK container:
# Using the SDK container as an example
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v /path/to/tensorrtllm_backend/inflight_batcher_llm/client:/tensorrtllm_client \
-v /path/to/Hermes-2-Pro-Llama-3-8B/repo:/Hermes-2-Pro-Llama-3-8B \
nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
Additionally, please install extra dependencies for the script:
pip3 install transformers sentencepiece
python3 /tensorrtllm_client/inflight_batcher_llm_client.py --request-output-len 28 --tokenizer-dir /Hermes-2-Pro-Llama-3-8B --text "What is ML?"
You should expect the following response:
...
Input: What is ML?
Output beam 0: ML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation.
...
2. The generate endpoint.
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
You should expect the following response:
{"context_logits":0.0,...,"text_output":"What is ML?\nMachine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate in predicting outcomes without being explicitly programmed."}
References#
For more examples, feel free to refer to End to end workflow to run llama.