Llama 2 13B Chat NIM

Important

NVIDIA NIM is currently in limited availability. Sign up here to get notified when the latest NIMs are available to download.

Important

This quickstart guide covers the deployment of the Llama 2 13B Chat model. To learn more about NIM for LLMs, see the Overview and Understanding NIM for LLMs pages.

Llama 2 is a collection of large language AI models capable of generating text and code in response to prompts. A more detailed description of the model can be found in the Model Card.

Model Specific Requirements

Hardware

  • 2 A100 or H100 GPUs with a minimum of 80 GB of GPU memory (VRAM)

Software

  • Minimum NVIDIA Driver Version: 535

Quickstart

This page assumes the Prerequisite Software (Docker, NGC CLI, NGC registry access) is installed and set up. This quickstart guide is for deploying on 2 A100 GPUs. For other configurations, see Available Models.
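Before starting, you can optionally verify the prerequisites from the host. The following is a minimal sketch; it assumes the NGC CLI is installed as ngc and that the NVIDIA driver is already loaded.

# Confirm Docker is installed and the daemon is reachable
docker --version
docker info > /dev/null && echo "Docker daemon is reachable"

# Confirm the NGC CLI is installed and configured with your API key
ngc config current

# Confirm the driver version (535 or newer) and that two GPUs with sufficient memory are visible
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv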

  1. Pull the NIM container

    docker pull nvcr.io/nvidia/nim/nim_llm:24.02
    
  2. Pull the model from NGC. This model requires 2 A100 GPUs and 25 GB of free disk space.

    mkdir ~/nim_model
    ngc registry model download-version nvidia/nim/llama2-13b-chat:a100x2_fp16_24.02 --dest ~/nim_model
    
  3. Make the model readable within the container

    chmod -R 755 ~/nim_model
    
  4. Run NIM

     # It may take several minutes to load the models to the GPU and initialize the service completely.
     CUDA_VISIBLE_DEVICES=0,1 NUM_GPUS=2 && docker run --rm -it --runtime=nvidia \
     -e CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} \
     --shm-size=8G \
     --name "llama2-13b-chat" \
     -v ~/nim_model/llama2-13b-chat_va100x2_fp16_24.02/ensemble:/model-store/ensemble:ro \
     -v ~/nim_model/llama2-13b-chat_va100x2_fp16_24.02/trt_llm_0.0.1_trtllm:/model-store/trt_llm_0.0.1_trtllm:ro \
     -p 9999:9999 -p 9998:9998 -p 8080:8080 \
     nvcr.io/nvidia/nim/nim_llm:24.02 \
     nemollm_inference_ms --model llama2-13b-chat --openai_port="9999" --nemo_port="9998" --num_gpus=${NUM_GPUS}
    
  5. Wait until the health check reports that the service is ready before proceeding.

    curl localhost:8080/v1/health/ready
    
  6. Request inference from the local NIM instance. A sketch for extracting just the generated text from the response follows this list.

    curl -X 'POST' 'http://0.0.0.0:9999/v1/completions' \
    -H 'accept: application/json' -H 'Content-Type: application/json' \
    -d '{ "model": "llama2-13b-chat", "prompt": "Once upon a time", "max_tokens": 16,
    "top_p": 1, "n": 1, "stream": false, "stop": "string", "frequency_penalty": 0.0}'
    

Available Models

Version           | GPU Model | Number of GPUs | Precision | Memory Footprint | File Size
------------------|-----------|----------------|-----------|------------------|----------
h100x2_fp16_24.02 | H100      | 2              | FP16      | 80GB             | 25GB
a100x2_fp16_24.02 | A100      | 2              | FP16      | 80GB             | 25GB

Note

If your desired configuration is not present in the table above, skip ahead to the vLLM Deployment section.

Detailed Instructions

Throughout these instructions, we will define bash variables that we will reuse:

MODEL_NAME="llama2-13b-chat"
MODEL_DIRECTORY=~/nim_model
mkdir ${MODEL_DIRECTORY}

Pull Container Image

  1. Container image tags follow the YY.MM versioning scheme used by other container images on NGC. You may see different values under “Tags:”; these docs were written against the latest tag available at the time. You can inspect the image repository with:

     ngc registry image info nvcr.io/nvidia/nim/nim_llm

     Image Repository Information
        Name: nim_llm
        Display Name: nim_llm
        Short Description: LLM NIM
        Built By:
        Publisher:
        Multinode Support: False
        Multi-Arch Support: False
        Logo: https://assets.nvidiagrid.net/ngc/logos/Nemo.png
        Labels: NVIDIA AI Enterprise Supported
        Public: No
        Access Type:
        Associated Products: []
        Last Updated: Mar 14, 2024
        Latest Image Size: 7.97 GB
        Signed Tag?: False
        Latest Tag: 24.02
        Tags:
           24.02
           24.02.rc3
           24.02.rc2
    
  2. Pull the container image, using either Docker or the NGC CLI:

    docker pull nvcr.io/nvidia/nim/nim_llm:24.02
    
    ngc registry image pull nvcr.io/nvidia/nim/nim_llm:24.02
    

Pull Model

  1. Model references follow the repository:version convention. The model is called llama2-13b-chat and the version follows the naming pattern <GPU_TYPE>x<NUM_GPUS>_<precision>_YY.MM.x. Additional versions are available and can be listed with the following NGC CLI command:

    ngc registry model list nvidia/nim/llama2-13b-chat:*
    
  2. Pull the selected model:

    ngc registry model download-version nvidia/nim/llama2-13b-chat:a100x2_fp16_24.02 --dest ${MODEL_DIRECTORY}
    

Launch Microservice

Launch the container. Start-up may take a couple of minutes before the service is available. In this example, we host the OpenAI-API-compatible endpoint on port 9999 and the NemoLLM-API-compatible endpoint on port 9998. The health check endpoint is hosted on port 8080. These three ports must be open and available so that Docker can bind them. After you start the Docker command below, you can open another terminal session on the same host and proceed to the next step.
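If you are not sure whether ports 9999, 9998, and 8080 are free on the host, a quick check is sketched below; it assumes the iproute2 ss utility is available.

# List any process already listening on the ports NIM needs; no matches means the ports are free
ss -ltn | grep -E ':(9999|9998|8080) ' || echo "Ports 9999, 9998, and 8080 appear to be free"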

CUDA_VISIBLE_DEVICES=0,1 NUM_GPUS=2 && docker run --rm -it --runtime=nvidia \
-e CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} \
--shm-size=8G \
--name "llama2-13b-chat" \
-v ${MODEL_DIRECTORY}/llama2-13b-chat_va100x2_fp16_24.02/ensemble:/model-store/ensemble:ro \
-v ${MODEL_DIRECTORY}/llama2-13b-chat_va100x2_fp16_24.02/trt_llm_0.0.1_trtllm:/model-store/trt_llm_0.0.1_trtllm:ro \
-p 9999:9999 -p 9998:9998 -p 8080:8080 \
nvcr.io/nvidia/nim/nim_llm:24.02 \
nemollm_inference_ms --model llama2-13b-chat --openai_port="9999" --nemo_port="9998" --num_gpus=${NUM_GPUS}

Health and Liveness Checks

The container exposes health and liveness endpoints at /v1/health/ready and /v1/health/live for integration into existing systems such as Kubernetes, and they are available through both the NemoLLM API and the OpenAI API. These endpoints return an HTTP 200 OK status code only if the service is ready or live, respectively. Run these checks in a new terminal. Remember, it may take a few minutes to load the models to the GPU and initialize the service completely.

curl localhost:8080/v1/health/ready
...
{"object":"health-response","message":"Service is ready."}

curl localhost:8080/v1/health/live
...
{"object":"health-response","message":"Service is ready."}

Run Inference

  1. OpenAI Completion Request. To stream the result, set "stream": true in the request body. Run these in a new terminal. You can find the OpenAI API documentation here. The completions endpoint is generally used for base models. With the completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completion subject to the other parameters chosen.

     curl -X 'POST' \
     'http://0.0.0.0:9999/v1/completions' \
     -H 'accept: application/json' \
     -H 'Content-Type: application/json' \
     -d '{
     "model": "'"${MODEL_NAME}"'",
     "prompt": "Once upon a time",
     "max_tokens": 16,
     "top_p": 1,
     "n": 1,
     "stream": false,
     "stop": "string",
     "frequency_penalty": 0.0
     }'
    
  2. OpenAI Chat Completion Request. To stream the result, set "stream": true in the request body. The chat completions endpoint is generally used for chat- or instruct-tuned models that are designed to be used conversationally. With the chat completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. A streaming variant is sketched after this list.

     curl -X 'POST' \
     'http://0.0.0.0:9999/v1/chat/completions' \
     -H 'accept: application/json' \
     -H 'Content-Type: application/json' \
     -d '{
     "model": "'"${MODEL_NAME}"'",
     "messages": [{"role":"user", "content":"Hello there how are you?"},{"role":"assistant", "content":"Good and you?"}, {"role":"user", "content":"Whats your name?"}],
     "max_tokens": 16,
     "top_p": 1,
     "n": 1,
     "stream": false,
     "stop": "string",
     "frequency_penalty": 0.0
     }'
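For reference, here is a sketch of a streaming variant of the chat request above. Only "stream" is changed to true; the OpenAI-compatible endpoint then returns the completion incrementally as server-sent events ("data: ..." chunks) rather than a single JSON object, and curl's -N flag disables output buffering so chunks appear as they arrive. The prompt and token count are illustrative.

curl -N -X 'POST' \
'http://0.0.0.0:9999/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "'"${MODEL_NAME}"'",
"messages": [{"role":"user", "content":"Tell me a short story."}],
"max_tokens": 64,
"stream": true
}'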
    

Stopping the Container

When you’re done testing the endpoint, you can bring down the container by running docker stop ${MODEL_NAME} in a new terminal.

vLLM Deployment

This example shows how to deploy the Llama 2 13B Chat model on 2 GPUs and is compatible with A100 80 GB GPUs. For the list of supported models, see the vLLM documentation.

Download the Model

Community models are made available through a variety of channels, with the HuggingFace Hub being one of the most popular. If you do not already have a local copy of Llama 2 13B Chat, clone it using git with the following command:

Note

To download the Llama 2 13B Chat model from HuggingFace, you’ll need to create a HuggingFace account and apply for access to the model through the HuggingFace Hub.

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

# When prompted for a password, use an access token with write permissions.
# Generate one from your settings: https://huggingface.co/settings/tokens

git clone https://huggingface.co/meta-llama/Llama-2-13b-chat-hf

Note that this is a large download and may take a significant amount of time to complete.
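Once the clone completes, a quick sanity check is sketched below; the exact file names vary by model revision, so treat the listing as illustrative. The repository should be tens of gigabytes on disk; if it is only a few hundred kilobytes, git-lfs likely did not fetch the weight files.

# Check the on-disk size and list the files to confirm the weights were fetched by git-lfs
du -sh Llama-2-13b-chat-hf
ls -lh Llama-2-13b-chat-hf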

Prepare the NIM config file

The NIM service provides the following functionalities:

  1. Ingests a Hugging Face (HF) model, including weights, tokenizer, etc.

  2. Configures a vLLM engine according to the specified model requirements.

  3. Deploys the vLLM engine for usage with OpenAI’s completions/ and chat/ endpoints.

To facilitate construction of the HF model and preparation of the vLLM engine, the service requires a vLLM model configuration file. For a detailed description of the fields within this file, refer to the OpenAPI vLLM model config. Note that the vLLMEngine section is designed to align with the official vLLM engine arguments outlined in the vLLM start guide.

An example of model_config.yaml for llama2-13b-chat is as follows:

engine:
  model: /Llama-2-13b-chat-hf/
  tensor_parallel_size: 2
  dtype: float16

This example config file specifies a 2-GPU deployment. Depending on your model and GPUs, you may be able to modify the config file and deploy with more or fewer GPUs.
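One way to create this file is with a heredoc, as sketched below. The path under model mirrors the container mount used in the launch command that follows, and tensor_parallel_size is the field you would change for a different GPU count.

# Write the example vLLM model config shown above to the current directory
cat > model_config.yaml <<'EOF'
engine:
  model: /Llama-2-13b-chat-hf/
  tensor_parallel_size: 2
  dtype: float16
EOF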

Launch Microservice

Similar to the TensorRT deployment, NIM offers a CLI to start the service for handling inference requests. All methods for submitting a request and running a health check are the same.

docker run --rm -it --name llama2-13b-chat \
--runtime=nvidia -e CUDA_VISIBLE_DEVICES=0,1 \
--shm-size=8G \
-v $(pwd)/Llama-2-13b-chat-hf:/Llama-2-13b-chat-hf \
-v $(pwd)/model_config.yaml:/model_config.yaml \
-p 9999:9999 -p 8080:8080 \
nvcr.io/nvidia/nim/nim_llm:24.02-day0 \
nim_vllm --model_name llama2-13b-chat --openai_port="9999" --health_port="8080" --model_config /model_config.yaml

Note

You may need to add --user=root if you receive an error about .cache not being accessible.

Run Inference

  1. OpenAI Completion Request. To stream the result, set "stream": true in the request body. Run these in a new terminal. You can find the OpenAI API documentation here. The completions endpoint is generally used for base models. With the completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completion subject to the other parameters chosen.

     curl -X 'POST' \
     'http://0.0.0.0:9999/v1/completions' \
     -H 'accept: application/json' \
     -H 'Content-Type: application/json' \
     -d '{
     "model": "llama2-13b-chat",
     "prompt": "Once upon a time",
     "max_tokens": 16,
     "top_p": 1,
     "n": 1,
     "stream": false,
     "stop": "string",
     "frequency_penalty": 0.0
     }'
    
  2. OpenAI Chat Completion Request. To stream the result, set "stream": true in the request body. The chat completions endpoint is generally used for chat- or instruct-tuned models that are designed to be used conversationally. With the chat completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation.

     curl -X 'POST' \
     'http://0.0.0.0:9999/v1/chat/completions' \
     -H 'accept: application/json' \
     -H 'Content-Type: application/json' \
     -d '{
     "model": "llama2-13b-chat",
     "messages": [{"role":"user", "content":"Hello there how are you?"},{"role":"assistant", "content":"Good and you?"}, {"role":"user", "content":"Whats your name?"}],
     "max_tokens": 16,
     "top_p": 1,
     "n": 1,
     "stream": false,
     "stop": "string",
     "frequency_penalty": 0.0
     }'
    

When you’re done testing the endpoint, you can bring down the container by running docker stop llama2-13b-chat in a new terminal.