Deploying with Docker

NVIDIA NIM for LLMs is intended to be run on a system with NVIDIA Datacenter GPUs, with the exact requirements depending on the specific models and deployment options.

For full system hardware and software requirements, see the Support Matrix topic. For information about the models supported by the different containers, and the GPUs needed to run them, see Models.

Dedicated quickstart guides are available for the following models:

Prerequisite software

To run LLM NIMs, you’ll need a container runtime with support for NVIDIA GPUs, such as Docker with the NVIDIA Container Toolkit installed; a sketch of the setup is shown below.
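On an Ubuntu host that already has Docker installed, the toolkit setup typically looks like the following (a minimal sketch, assuming the NVIDIA apt repository is already configured; see the NVIDIA Container Toolkit installation guide for the repository setup and for other distributions):

# Install the NVIDIA Container Toolkit (assumes the NVIDIA apt repository is already configured)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker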

Setting up the environment

Download and install the NGC CLI following the NGC Setup steps.

Next, retrieve your NGC API key, which authenticates you and allows you to download the NIM models and containers. Follow the steps on that page to configure the NGC CLI and Docker client appropriately.
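As a rough sketch, the configuration usually amounts to the two commands below; the NGC CLI prompts for your API key, and docker login expects the literal username $oauthtoken with the API key as the password:

# Configure the NGC CLI (prompts for your NGC API key and preferences)
ngc config set

# Authenticate the Docker client against the NGC registry
docker login nvcr.io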

Docker GPU Enumeration

You can specify which GPUs a container uses with the Docker CLI, either with the --gpus option (available starting with Docker 19.03) or with the NVIDIA_VISIBLE_DEVICES environment variable.

Possible values:

  • 0,1,2,... or GPU-fef8089b: a comma-separated list of GPU index(es) or UUID(s).

  • all: all GPUs are accessible; this is the default value in base CUDA container images.

  • none: no GPUs are accessible, but driver capabilities are enabled.

  • void, empty, or unset: nvidia-container-runtime has the same behavior as runc (neither GPUs nor driver capabilities are exposed).

Note

Use the device parameter with the --gpus flag to specify which GPUs to use. Enclose the device parameter in single quotes, with double quotes around the devices you want to assign to the container. For example, --gpus '"device=2,3"' assigns GPUs 2 and 3 to the container. When using the NVIDIA_VISIBLE_DEVICES variable, you may need to set --runtime to nvidia, unless it is already set as the default.

See the NVIDIA Container Toolkit documentation for an example of how to launch a GPU-enabled container.
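For instance, a quick way to confirm that containers can see the GPUs is to run nvidia-smi inside a throwaway container (the ubuntu image here is illustrative; the toolkit mounts nvidia-smi from the host driver):

# Expose all GPUs to the container and print the device list
docker run --rm --gpus all ubuntu nvidia-smi

# Restrict the container to GPUs 2 and 3 with the device parameter
docker run --rm --gpus '"device=2,3"' ubuntu nvidia-smi

# Equivalent selection with the environment variable; --runtime=nvidia may be
# needed if the NVIDIA runtime is not configured as the default
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=2,3 ubuntu nvidia-smi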

Deployment Options

NVIDIA NIM for LLMs uses two different compute backends:

  • Triton + TensorRT LLM

  • vLLM

The Triton + TensorRT LLM path is our performance-optimized deployment option. vLLM is our community backend, which we use to support the latest models or new features that may not yet be incorporated into the TensorRT LLM optimized stack.

You can list the available NVIDIA NIM for LLMs container images using the following command, which returns a list of the available containers in a comma-separated format:

ngc registry image list --format_type csv nvcr.io/nvidia/nim/*

This returns a list with nine fields:

Name,Repository,Latest Tag,Image Size,Updated Date,Permission,Signed Tag?,Access Type,Associated Products

You can get information about the available versions of NVIDIA NIM for LLMs with the following command:

ngc registry image info nvcr.io/nvidia/nim/nim_llm

The version tags follow a YY.MM-detail pattern, where YY.MM refers to the year and month the container was released, and the optional detail refers to a particular variant of the image. For example, 24.02 is the latest container with the Triton + TensorRT-LLM backend, while 24.02-day0 is the latest container with the vLLM backend.

To download a container image, run the following command with the appropriate container version:

docker pull nvcr.io/nvidia/nim/nim_llm:<VERSION>

You can pull the latest Triton + TensorRT-LLM container with:

docker pull nvcr.io/nvidia/nim/nim_llm:24.02

And the latest vLLM container with:

docker pull nvcr.io/nvidia/nim/nim_llm:24.02-day0

Deploy on TensorRT LLM

This section describes how to deploy a model to the TensorRT LLM (TRT-LLM) backend/container. To generate a model store from a supported model architecture for a different GPU type or configuration, see the Model Repo Generator topic.

The TRT-LLM backend supports the following models:

  • Llama-2-13b

  • Llama-2-13b-chat

  • Mixtral-8x7B-Instruct-v0.1

  • Mixtral-8x7B-v0.1

  • Nemotron-3-8B-Base-4k

  • Nemotron-3-8B-Chat-4k-SteerLM

  • Nemotron-3-8B-QA-4k

  • StarCoder

Use the workflow in the previous section to get the nim_llm container. Then follow the instructions in the following section to get a model.

Pull Model

You can retrieve a list of models using the following command:

ngc registry model list nvidia/nim/*

This returns a table with ten columns:

Name,Repository,Latest Version,Application,Framework,Precision,Last Modified,Permission,Access Type,Associated Products

You can get information about an individual model using the following command, where <MODEL_REPO> is the value of the Repository field in the response to the command above:

ngc registry model info nvidia/nim/<MODEL_REPO>

To see what versions of a model are available, you can use the following command:

ngc registry model list nvidia/nim/<MODEL_REPO>:*

The versions follow the naming pattern <GPU_TYPE>x<NUM_GPUS>_<precision>_YY.MM; for example, a version string such as a100x2_fp16_24.02 would correspond to a build for two A100 GPUs in FP16 targeting the 24.02 container. To deploy a model, make sure that the NVIDIA NIM for LLMs container version and your GPU type and count match the information in the version string of the downloaded model.

To download the desired model version, execute the following command:

ngc registry model download-version nvidia/nim/<MODEL_REPO>:<VERSION>

This downloads a folder named <MODEL_REPO>_v<VERSION>, which you’ll need to mount into the container when you launch it.

Once your model is downloaded, you’re ready to launch the NVIDIA NIM for LLMs container. Include the location of the downloaded model in the -v flag, the number of GPUs required in the --gpus and --num_gpus flags, and the name of the model in the --name and --model flags.

docker run --rm -it --name <MODEL_NAME> \
--gpus <NUM_GPUS> \
--shm-size=8G \
-v $(pwd)/<MODEL_REPO>_v<VERSION>:/model-store \
-p 9999:9999 -p 9998:9998 -p 8080:8080 \
nvcr.io/nvidia/nim/nim_llm:24.02 \
nemollm_inference_ms --model <MODEL_NAME> --openai_port="9999" --nemo_port="9998" --num_gpus=<NUM_GPUS> --num_workers=2
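As a concrete illustration, assuming a downloaded model folder named llama-2-13b-chat_va100x2_fp16_24.02 built for two GPUs (the model and version names here are placeholders, not exact NGC identifiers), the launch command would look similar to this:

docker run --rm -it --name llama-2-13b-chat \
--gpus 2 \
--shm-size=8G \
-v $(pwd)/llama-2-13b-chat_va100x2_fp16_24.02:/model-store \
-p 9999:9999 -p 9998:9998 -p 8080:8080 \
nvcr.io/nvidia/nim/nim_llm:24.02 \
nemollm_inference_ms --model llama-2-13b-chat --openai_port="9999" --nemo_port="9998" --num_gpus=2 --num_workers=2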

Note

You can run multiple worker processes by setting the value of --num_workers in the preceding command. For more information about configuring the number of workers, see the Advanced Topics documentation.

Once you’ve started the container, you can check whether it’s running and ready to accept requests, as explained in Validating Deployments.

vLLM Deployment

This section describes how to deploy a model with the vLLM backend container.

The vLLM backend supports the following models:

  • CodeLlama-13b-Instruct-hf

  • CodeLlama-34b-Instruct-hf

  • CodeLlama-70b-Instruct-hf

  • Falcon-40B-Instruct

  • Gemma 2B Instruct

  • Gemma 7B Instruct

  • Llama-2-70b

  • Llama-2-70b-chat

  • Llama-2-7b

  • Llama-2-7b-chat

  • Llama2-70B-SteerLM-Chat

  • Phi-2

  • StarCoder2-15B

  • StarCoderPlus

Download the Model

Community models are available from many sources, such as the HuggingFace Hub. For example, you can clone a copy of Mistral 7B Instruct v0.2 using the following git command.

Note

To get access to the HuggingFace models, see their Requirements page.

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 mistral-7b-instruct-v0.2
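As an alternative sketch (not part of the NIM workflow itself), you can download the same checkpoint with the Hugging Face CLI, assuming a recent huggingface_hub release is installed:

# Hypothetical alternative: download without git-lfs (pip install -U huggingface_hub)
# Run huggingface-cli login first if the model requires accepting a license
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 --local-dir mistral-7b-instruct-v0.2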

Prepare the NIM config file

The NIM service provides the following functionalities:

  1. Ingests a HuggingFace (HF) model, including its weights and tokenizer.

  2. Configures a vLLM engine according to the specified model requirements.

  3. Deploys the vLLM engine for usage with OpenAI’s completions/ and chat/ endpoints.

To ingest the HF model and prepare the vLLM engine, the service requires a vLLM model configuration file. For a detailed description of the fields within this file, see Model Configuration Values for vLLM. For information about the settings in the engine section, see the vLLM engine arguments section in the vLLM start guide.

An example of model_config.yaml is as follows:

engine:
  model: <LOCATION OF MODEL IN CONTAINER>
  tensor_parallel_size: <NUM_GPUS>
  dtype: float16
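For instance, for the Mistral 7B Instruct v0.2 checkpoint cloned above, mounted at /model-store/mistral-7b-instruct-v0.2 inside the container (the mount path and single-GPU tensor parallel size are assumptions for illustration), the config could look like:

engine:
  model: /model-store/mistral-7b-instruct-v0.2
  tensor_parallel_size: 1
  dtype: float16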

Launch Microservice

Similar to the TensorRT-LLM deployment, NVIDIA NIM for LLMs offers a CLI to start the service that handles inference requests. Include the locations of your downloaded model and model_config.yaml in the -v flags, the number of GPUs required in the --gpus flag, and the name of the model in the --name and --model_name flags.

Note

Be sure that the location you mount your model to matches the model field in the model_config.yaml file.

docker run --rm -it --name <MODEL_NAME> \
   --gpus <NUM_GPUS>\
   --shm-size=8G \
   -v <LOCATION OF MODEL ON HOST>:<LOCATION OF MODEL IN CONTAINER> \
   -v <LOCATION OF MODEL CONFIG ON HOST>:/model_config.yaml \
   -p 9999:9999 -p 8080:8080 \
   nvcr.io/nvidia/nim/nim_llm:24.02-day0 \
   nim_vllm --model_name <MODEL_NAME> --openai_port="9999" --health_port="9998" --model_config /model_config.yaml
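Continuing the illustrative Mistral 7B example (the host paths, container mount point, and container name are assumptions chosen to match the sample model_config.yaml above):

docker run --rm -it --name mistral-7b-instruct-v0.2 \
   --gpus 1 \
   --shm-size=8G \
   -v $(pwd)/mistral-7b-instruct-v0.2:/model-store/mistral-7b-instruct-v0.2 \
   -v $(pwd)/model_config.yaml:/model_config.yaml \
   -p 9999:9999 -p 8080:8080 \
   nvcr.io/nvidia/nim/nim_llm:24.02-day0 \
   nim_vllm --model_name mistral-7b-instruct-v0.2 --openai_port="9999" --health_port="9998" --model_config /model_config.yaml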

Note

You may need to add a --user=root flag to your docker run command if you receive an error about .cache not being accessible.

Validating Deployments

Once either the TensorRT LLM or vLLM backed microservice has been launched, you can validate the deployment by checking the health endpoints and executing inference requests against the service.

Health and Liveness checks

The container exposes health and liveness endpoints, for integration into existing systems such as Kubernetes, with both the NemoLLM API and OpenAI API at /v1/health/ready and /v1/health/live. These endpoints return an HTTP 200 OK status code only if the service is ready or live, respectively.

curl localhost:8080/v1/health/ready
...
{"object":"health-response","message":"Service is ready."}
curl localhost:8080/v1/health/live
...
{"object":"health-response","message":"Service is ready."}

Run inference

Note

You can find further information about completion parameters in the OpenAI API and NemoLLM API documentation.

OpenAI Completion Request. To stream the result, set "stream": true. The completions endpoint is generally used for base models. With the completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completions subject to the other parameters chosen.

curl -X 'POST' \
'http://0.0.0.0:9999/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
  "model": "<MODEL NAME>",
  "prompt": "Once upon a time",
  "max_tokens": 16,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "stop": "string",
  "frequency_penalty": 0.0
}'

OpenAI Chat Completion Request. To stream the result, set "stream": true. The chat completions endpoint is generally used for chat or instruct-tuned models that are designed to be used through a conversational approach. With the chat completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation.

curl -X 'POST' \
'http://0.0.0.0:9999/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
  "model": "<MODEL NAME>",
  "messages": [{"role":"user", "content":"Hello there how are you?"},{"role":"assistant", "content":"Good and you?"}, {"role":"user", "content":"Whats your name?"}],
  "max_tokens": 16,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "stop": "string",
  "frequency_penalty": 0.0
}'

Stopping the container

If you launched the Docker container with the --name command line option, you can execute the docker stop and docker rm commands using that name, as shown in the following example. Note that the docker run commands above include the --rm flag, so the container is removed automatically once it stops and a separate docker rm is not needed.

docker stop <MODEL NAME>