# Deploying Llama2-7B Model with Triton and vLLM
The vLLM Backend uses vLLM to do inference. Read more about vLLM [here](https://github.com/vllm-project/vllm) and the vLLM Backend [here](https://github.com/triton-inference-server/vllm_backend).
## Pre-build instructions
For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights. Please follow the README.md for pre-build instructions and links for how to run Llama with other backends.
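For reference only (the authoritative layout and contents come from that README.md), the vLLM backend loads the HuggingFace model named in a `model.json` file inside the model directory. The snippet below is a sketch: the field names follow vLLM's engine arguments, and the exact values used by the tutorial may differ.

```json
{
    "model": "meta-llama/Llama-2-7b-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5
}
```

This file typically sits at `llama2vllm/1/model.json`, alongside a `config.pbtxt` that selects the vLLM backend for the model.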
## Installation
The Triton vLLM container can be pulled from NGC and run with:
```bash
docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v $PWD/llama2vllm:/opt/tritonserver/model_repository/llama2vllm \
    nvcr.io/nvidia/tritonserver:23.11-vllm-python-py3
```
This will create an `/opt/tritonserver/model_repository` folder inside the container that contains the `llama2vllm` model. The model weights themselves will be pulled from HuggingFace when the model is loaded.
Once in the container, install the `huggingface-cli` and log in with your own credentials:
```bash
pip install --upgrade huggingface_hub
huggingface-cli login --token <your huggingface access token>
```
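To confirm that the token was accepted before moving on, you can query your identity with the CLI:

```bash
# Prints the account associated with the stored token
huggingface-cli whoami
```

Note that the Llama2 weights on HuggingFace are gated, so the account you log in with must already have been granted access to the model.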
## Serving with Triton
Then you can run `tritonserver` as usual:
```bash
tritonserver --model-repository model_repository
```
The server has launched successfully when you see the following output in your console:
```
I0922 23:28:40.351809 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0922 23:28:40.352017 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
```
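You can also verify readiness over Triton's standard HTTP API before sending any prompts; the endpoints below are part of the KServe-style protocol Triton exposes on port 8000:

```bash
# Returns HTTP 200 once the server and its models are ready
curl -v localhost:8000/v2/health/ready

# Returns metadata for the deployed model, including its input and output names
curl localhost:8000/v2/models/llama2vllm
```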
## Sending requests via the `generate` endpoint
As a simple example to make sure the server works, you can use the `generate` endpoint to test. More about the generate endpoint can be found [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).
```bash
$ curl -X POST localhost:8000/v2/models/llama2vllm/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'

# returns (formatted for better visualization)
> {
>   "model_name": "llama2vllm",
>   "model_version": "1",
>   "text_output": "What is Triton Inference Server?\nTriton Inference Server is a lightweight, high-performance"
> }
```
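The generate extension also exposes a streaming variant. The request below is a sketch that sends the same prompt to the `generate_stream` endpoint, which returns the completion incrementally as server-sent events:

```bash
curl -X POST localhost:8000/v2/models/llama2vllm/generate_stream \
    -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": true, "temperature": 0}}'
```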
## Sending requests via the Triton client
The Triton vLLM Backend repository includes a `samples` folder with an example `client.py` that can be used to test the Llama2 model.
```bash
pip3 install "tritonclient[all]"

# Assuming the Triton server is already running
git clone https://github.com/triton-inference-server/vllm_backend.git
cd vllm_backend/samples
python3 client.py -m llama2vllm
```
The above steps should produce a `results.txt` file with the following content:
```
Hello, my name is
I am a 20 year old student from the Netherlands. I am currently

=========

The most dangerous animal is
The most dangerous animal is the one that is not there.
The most dangerous

=========

The capital of France is
The capital of France is Paris.
The capital of France is Paris. The

=========

The future of AI is
The future of AI is in the hands of the people who use it.

=========
```
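If you want different generation behavior than the defaults used by `client.py`, one option is to pass additional sampling parameters through the `generate` endpoint, as in the earlier example. The sketch below assumes that fields such as `max_tokens` and `top_p` are forwarded to vLLM's `SamplingParams` in the same way the `temperature` parameter was above; the values are illustrative:

```bash
curl -X POST localhost:8000/v2/models/llama2vllm/generate \
    -d '{"text_input": "The future of AI is", "parameters": {"stream": false, "temperature": 0.7, "top_p": 0.9, "max_tokens": 64}}'
```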