Llama 2 13B Chat NIM
Important
NVIDIA NIM is currently in limited availability; sign up here to be notified when the latest NIMs are available to download.
Important
This quickstart guide covers the deployment of the Llama 2 13B Chat model. To learn more about NIM for LLMs, see the Overview and Understanding NIM for LLMs pages.
Llama 2 is a collection of large language models capable of generating text and code in response to prompts. A more detailed description of the model can be found in the Model Card.
Model Specific Requirements
Hardware
2 A100 or H100 GPUs with a minimum of 80 GB of GPU memory (VRAM)
Software
Minimum NVIDIA Driver Version: 535
Quickstart
This page assumes Prerequisite Software (Docker, NGC CLI, NGC registry access) is installed and set up. This quickstart guide is for deploying on 2 A100 GPU(s). For other configurations, see Available Models.
Pull the NIM container
docker pull nvcr.io/nvidia/nim/nim_llm:24.02
Pull the model from NGC. This model requires 2 A100 GPUs and 25GB of free disk space.
mkdir ~/nim_model
ngc registry model download-version nvidia/nim/llama2-13b-chat:a100x2_fp16_24.02 --dest ~/nim_model
Make the model readable within the container
chmod -R 755 ~/nim_model
Run NIM
# It may take several minutes to load the models to the GPU and initialize the service completely.
CUDA_VISIBLE_DEVICES=0,1 NUM_GPUS=2 && docker run --rm -it --runtime=nvidia \
  -e CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} \
  --shm-size=8G \
  --name "llama2-13b-chat" \
  -v ~/nim_model/llama2-13b-chat_va100x2_fp16_24.02/ensemble:/model-store/ensemble:ro \
  -v ~/nim_model/llama2-13b-chat_va100x2_fp16_24.02/trt_llm_0.0.1_trtllm:/model-store/trt_llm_0.0.1_trtllm:ro \
  -p 9999:9999 -p 9998:9998 -p 8080:8080 \
  nvcr.io/nvidia/nim/nim_llm:24.02 \
  nemollm_inference_ms --model llama2-13b-chat --openai_port="9999" --nemo_port="9998" --num_gpus=${NUM_GPUS}
Wait until the health check indicates the service is ready before proceeding.
curl localhost:8080/v1/health/ready
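If you want to block until the service reports ready, a minimal polling sketch such as the following works; this loop is an illustration, not part of the NIM tooling:
# Poll the readiness endpoint until it returns HTTP 200; adjust the sleep interval as needed.
until curl -sf localhost:8080/v1/health/ready > /dev/null; do
  echo "Waiting for NIM to become ready..."
  sleep 10
done
echo "NIM is ready."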
Request Inference from the local NIM instance
curl -X 'POST' 'http://0.0.0.0:9999/v1/completions' \
  -H 'accept: application/json' -H 'Content-Type: application/json' \
  -d '{ "model": "llama2-13b-chat", "prompt": "Once upon a time", "max_tokens": 16,
  "top_p": 1, "n": 1, "stream": false, "stop": "string", "frequency_penalty": 0.0}'
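The response follows the OpenAI completions schema, so if you have jq installed you can print just the generated text (assuming the standard choices[0].text field):
# Same request as above, piped through jq to print only the completion text.
curl -s -X 'POST' 'http://0.0.0.0:9999/v1/completions' \
  -H 'accept: application/json' -H 'Content-Type: application/json' \
  -d '{ "model": "llama2-13b-chat", "prompt": "Once upon a time", "max_tokens": 16,
  "top_p": 1, "n": 1, "stream": false, "frequency_penalty": 0.0}' \
  | jq -r '.choices[0].text'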
Available Models
Version | GPU Model | Number of GPUs | Precision | Memory Footprint | File Size
---|---|---|---|---|---
h100x2_fp16_24.02 | H100 | 2 | FP16 | 80GB | 25GB
a100x2_fp16_24.02 | A100 | 2 | FP16 | 80GB | 25GB
Note
If your desired configuration is not present in the above table, you should skip ahead to the vLLM Deployment section.
Detailed Instructions
Throughout these instructions, we will define bash variables that we will reuse:
MODEL_NAME="llama2-13b-chat"
MODEL_DIRECTORY=~/nim_model
mkdir ${MODEL_DIRECTORY}
Pull Container Image
Container image tags follow the YY.MM versioning scheme, similar to other container images on NGC. You may see different values under “Tags:”; these docs were written against the latest tag available at the time.
ngc registry image info nvcr.io/nvidia/nim/nim_llm
Image Repository Information
  Name: nim_llm
  Display Name: nim_llm
  Short Description: LLM NIM
  Built By:
  Publisher:
  Multinode Support: False
  Multi-Arch Support: False
  Logo: https://assets.nvidiagrid.net/ngc/logos/Nemo.png
  Labels: NVIDIA AI Enterprise Supported
  Public: No
  Access Type:
  Associated Products: []
  Last Updated: Mar 14, 2024
  Latest Image Size: 7.97 GB
  Signed Tag?: False
  Latest Tag: 24.02
  Tags:
    24.02
    24.02.rc3
    24.02.rc2
Pull the container image using either docker or the NGC CLI:
docker pull nvcr.io/nvidia/nim/nim_llm:24.02
ngc registry image pull nvcr.io/nvidia/nim/nim_llm:24.02
Pull Model
Model tags follow the versioning of repository:version. The model is called llama2-13b-chat, and the version follows the naming pattern <GPU_TYPE>x<NUM_GPUS>_<precision>_YY.MM.x. Additional versions are available and can be listed with the following NGC command:
ngc registry model list nvidia/nim/llama2-13b-chat:*
Pull the selected model:
ngc registry model download-version nvidia/nim/llama2-13b-chat:a100x2_fp16_24.02 --dest ${MODEL_DIRECTORY}
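After the download completes, the model files land in a versioned subdirectory of ${MODEL_DIRECTORY}; the directory name below is inferred from the volume mounts used in the next step, so adjust it if your version differs:
# Expect the ensemble and trt_llm_0.0.1_trtllm subdirectories referenced by the docker run command below.
ls ${MODEL_DIRECTORY}/llama2-13b-chat_va100x2_fp16_24.02
# The download is roughly 25GB when complete.
du -sh ${MODEL_DIRECTORY}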
Launch Microservice
Launch the container. Start-up may take a couple of minutes until the service is available. In this example, we host the OpenAI API compatible endpoint on port 9999, the NemoLLM API compatible endpoint on port 9998, and the health check endpoint on port 8080. These three ports must be free on the host so Docker can bind them correctly. After you start the Docker command below, you can open another terminal session on the same host and proceed to the next step.
CUDA_VISIBLE_DEVICES=0,1 NUM_GPUS=2 && docker run --rm -it --runtime=nvidia \
  -e CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} \
  --shm-size=8G \
  --name "llama2-13b-chat" \
  -v ${MODEL_DIRECTORY}/llama2-13b-chat_va100x2_fp16_24.02/ensemble:/model-store/ensemble:ro \
  -v ${MODEL_DIRECTORY}/llama2-13b-chat_va100x2_fp16_24.02/trt_llm_0.0.1_trtllm:/model-store/trt_llm_0.0.1_trtllm:ro \
  -p 9999:9999 -p 9998:9998 -p 8080:8080 \
  nvcr.io/nvidia/nim/nim_llm:24.02 \
  nemollm_inference_ms --model llama2-13b-chat --openai_port="9999" --nemo_port="9998" --num_gpus=${NUM_GPUS}
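If you prefer not to keep an interactive terminal attached, swapping -it for -d in the command above runs the container detached; this is a standard Docker pattern rather than anything NIM-specific:
# With the container running detached, follow the start-up logs to watch model loading progress.
docker logs -f llama2-13b-chat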
Health and Liveness Checks
The container exposes health and liveness endpoints at /v1/health/ready and /v1/health/live for both the NemoLLM API and the OpenAI API, allowing integration into existing systems such as Kubernetes. These endpoints return an HTTP 200 OK status code only if the service is ready or live, respectively. Run these in a new terminal. Remember, it may take a few minutes to load the models to the GPU and initialize the service completely.
curl localhost:8080/v1/health/ready
...
{"object":"health-response","message":"Service is ready."}

curl localhost:8080/v1/health/live
...
{"object":"health-response","message":"Service is ready."}
Run Inference
OpenAI Completion Request. To stream the result, set "stream": true. Run these in a new terminal. You can find the OpenAI API documentation here. The completions endpoint is generally used for base models. With the completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completion subject to the other parameters chosen.
curl -X 'POST' \
  'http://0.0.0.0:9999/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "prompt": "Once upon a time",
    "max_tokens": 16,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "stop": "string",
    "frequency_penalty": 0.0
  }'
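As an illustration, a streaming variant of the same request could look like the following; curl's -N flag disables output buffering so chunks appear as they arrive (the flag is a suggestion, not an API requirement):
# Stream the completion ("stream": true); -N keeps curl from buffering the streamed response.
curl -N -X 'POST' 'http://0.0.0.0:9999/v1/completions' \
  -H 'accept: application/json' -H 'Content-Type: application/json' \
  -d '{ "model": "'"${MODEL_NAME}"'", "prompt": "Once upon a time", "max_tokens": 64,
  "top_p": 1, "n": 1, "stream": true, "frequency_penalty": 0.0}'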
OpenAI Chat Completion Request. To stream the result, set "stream": true. The chat completions endpoint is generally used for chat or instruct tuned models that are designed to be used through a conversational approach. With the chat completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation.
curl -X 'POST' \
  'http://0.0.0.0:9999/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "messages": [{"role":"user", "content":"Hello there how are you?"},{"role":"assistant", "content":"Good and you?"}, {"role":"user", "content":"Whats your name?"}],
    "max_tokens": 16,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "stop": "string",
    "frequency_penalty": 0.0
  }'
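As with completions, the chat response follows the OpenAI schema, so jq can extract only the assistant's reply (assuming the standard choices[0].message.content field):
# Minimal chat request piped through jq to print only the assistant's reply.
curl -s -X 'POST' 'http://0.0.0.0:9999/v1/chat/completions' \
  -H 'accept: application/json' -H 'Content-Type: application/json' \
  -d '{ "model": "'"${MODEL_NAME}"'", "max_tokens": 16,
  "messages": [{"role":"user", "content":"Whats your name?"}]}' \
  | jq -r '.choices[0].message.content'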
Stopping the Container
When you’re done testing the endpoint, you can bring down the container by running docker stop ${MODEL_NAME} in a new terminal.
vLLM Deployment
This example shows how to deploy the Llama 2 13B Chat model on 2 GPUs and is compatible with A100 80 GB GPUs. For the list of supported models, see the vLLM documentation.
Download the Model
Community models are made available through a variety of channels, with the HuggingFace Hub being one of the most popular. If you do not already have a local copy of Llama 2 13B Chat, clone it with git using the following command:
Note
To download the Llama 2 13B Chat model from HuggingFace, you’ll need to create a HuggingFace account and apply for access to the model through the HuggingFace Hub.
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

# When prompted for a password, use an access token with write permissions.
# Generate one from your settings: https://huggingface.co/settings/tokens

git clone https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
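Because the weights are stored in Git LFS, it is worth confirming that the large files were actually fetched rather than left as pointer files; one way to check (sizes here are approximate):
cd Llama-2-13b-chat-hf
# git lfs ls-files marks files whose content is present with '*' and pointer-only files with '-'.
git lfs ls-files
# The checkout should total tens of gigabytes once the weights are present.
du -sh .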
Note that this is a large download, and may take a significant amount of time to complete.
Prepare the NIM config file
The NIM service provides the following functionalities:
- Ingests a (HF) model, including weights, tokenizer, etc.
- Configures a vLLM engine according to the specified model requirements.
- Deploys the vLLM engine for usage with OpenAI's completions/ and chat/ endpoints.
To construct the HF model and prepare the vLLM engine, the service requires a vLLM model configuration file. For a detailed description of its fields, refer to the OpenAPI vLLM model config. Note that the vLLMEngine section is designed to align with the official vLLM engine arguments outlined in the vLLM start guide.
An example model_config.yaml for llama2-13b-chat is as follows:
engine:
  model: /Llama-2-13b-chat-hf/
  tensor_parallel_size: 2
  dtype: float16
This example config file specifies a 2 GPU deployment – depending on your model and GPU, you may be able to modify the config file and deploy with more or fewer GPUs.
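As an illustration only (not an officially published configuration), a single-GPU variant would reduce tensor_parallel_size; whether it fits depends on your GPU's memory:
# Hypothetical single-GPU config written from the shell; requires enough GPU memory for the FP16 weights.
cat > model_config.yaml <<'EOF'
engine:
  model: /Llama-2-13b-chat-hf/
  tensor_parallel_size: 1
  dtype: float16
EOF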
Launch Microservice
Similar to the TensorRT deployment above, NIM offers a CLI to start the service for handling inference requests. All methods for submitting a request and running a health check are the same.
docker run --rm -it --name llama2-13b-chat \
  --runtime=nvidia -e CUDA_VISIBLE_DEVICES=0,1 \
  --shm-size=8G \
  -v $(pwd)/Llama-2-13b-chat-hf:/Llama-2-13b-chat-hf \
  -v $(pwd)/model_config.yaml:/model_config.yaml \
  -p 9999:9999 -p 8080:8080 \
  nvcr.io/nvidia/nim/nim_llm:24.02-day0 \
  nim_vllm --model_name llama2-13b-chat --openai_port="9999" --health_port="8080" --model_config /model_config.yaml
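As with the TensorRT deployment, the health check endpoint is served on port 8080 (see --health_port above), so the same readiness check applies before sending inference requests:
# Wait for the vLLM-backed service to report ready.
curl localhost:8080/v1/health/ready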
Note
You may need to add --user=root if you receive an error about .cache not being accessible.
Run Inference
OpenAI Completion Request. To stream the result, set "stream": true. Run these in a new terminal. You can find the OpenAI API documentation here. The completions endpoint is generally used for base models. With the completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completion subject to the other parameters chosen.
curl -X 'POST' \
  'http://0.0.0.0:9999/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama2-13b-chat",
    "prompt": "Once upon a time",
    "max_tokens": 16,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "stop": "string",
    "frequency_penalty": 0.0
  }'
OpenAI Chat Completion Request. To stream the result, set "stream": true. The chat completions endpoint is generally used for chat or instruct tuned models that are designed to be used through a conversational approach. With the chat completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation.
curl -X 'POST' \
  'http://0.0.0.0:9999/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama2-13b-chat",
    "messages": [{"role":"user", "content":"Hello there how are you?"},{"role":"assistant", "content":"Good and you?"}, {"role":"user", "content":"Whats your name?"}],
    "max_tokens": 16,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "stop": "string",
    "frequency_penalty": 0.0
  }'
When you’re done testing the endpoint, you can bring down the container by running docker stop llama2-13b-chat in a new terminal.