Deploying with Docker
NVIDIA NIM for LLMs is intended to be run on a system with NVIDIA Datacenter GPUs, with the exact requirements depending on the specific models and deployment options.
For full hardware and software requirements, see the Support Matrix. For information about the models supported by the different containers, and the GPUs needed to run them, see Models.
Dedicated quickstart guides are available for the following models:
Prerequisite software
To run LLM NIMs, you’ll need a container runtime with support for NVIDIA GPUs. You can set this up with the following steps:
Install Docker
Install the NVIDIA Container Toolkit
Setting up the environment
Download and install the NGC CLI following the NGC Setup steps.
Next, retrieve your NGC API Key, which authenticates you and allows you to download the NIM models and containers. Follow the steps on that page to configure the NGC CLI and docker client appropriately.
Docker GPU Enumeration
You can specify the GPUs to use with the Docker CLI using either the --gpus option (available starting with Docker 19.03) or the NVIDIA_VISIBLE_DEVICES environment variable.
Possible Values | Description
---|---
0,1,2 or GPU-fef8089b | A comma-separated list of GPU UUID(s) or index(es).
all | All GPUs are accessible; this is the default value in base CUDA container images.
none | No GPUs are accessible, but driver capabilities are enabled.
void or empty or unset | nvidia-container-runtime behaves the same as runc; neither GPUs nor driver capabilities are exposed.
Note
Use the device parameter with the --gpus flag to specify which GPUs to use. Enclose the device parameter in single quotes with the device list in double quotes; for example, --gpus '"device=2,3"' assigns GPUs 2 and 3 to the container. When using the NVIDIA_VISIBLE_DEVICES variable, you may need to set --runtime to nvidia unless it is already the default.
See the NVIDIA Container Toolkit documentation for an example of how to launch a GPU-enabled container.
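As a sketch, the two equivalent forms look like the following; the commands are composed into variables and printed rather than executed, and the image name is a placeholder you must supply:

```shell
# Sketch: two equivalent ways to expose GPUs 2 and 3 to a container.
# Run the printed commands on a host with Docker and the NVIDIA
# Container Toolkit installed.
image="<CUDA_IMAGE>"   # placeholder for any CUDA-capable image
cmd_gpus="docker run --rm --gpus '\"device=2,3\"' $image nvidia-smi -L"
cmd_env="docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=2,3 $image nvidia-smi -L"
printf '%s\n%s\n' "$cmd_gpus" "$cmd_env"
```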
Deployment Options
NVIDIA NIM for LLMs uses two different compute backends:
Triton + TensorRT LLM
vLLM
The Triton + TensorRT LLM path is our performance-optimized deployment option. vLLM is our community backend, used to support the latest models or new features that may not yet be incorporated into the TensorRT LLM optimized stack.
You can list the available NVIDIA NIM for LLMs container images using the following command, which returns a list of the available containers in a comma-separated format:
ngc registry image list --format_type csv nvcr.io/nvidia/nim/*
This returns a list with nine fields:
Name,Repository,Latest Tag,Image Size,Updated Date,Permission,Signed Tag?,Access Type,Associated Products
You can get information about the available versions of NVIDIA NIM for LLMs with the following command:
ngc registry image info nvcr.io/nvidia/nim/nim_llm
The version tags follow a YY.MM-detail
pattern, where YY.MM
refers to the year and month the container was released, and the optional detail
refers to a particular variant of the image. For example, 24.02
is the latest container with the Triton + TensorRT-LLM backend, while 24.02-day0
is the latest container with the vLLM backend.
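As an illustration of the tag pattern, shell parameter expansion can split a tag into its release and variant parts (the tag value here is just an example):

```shell
# Split a NIM container tag of the form YY.MM or YY.MM-detail.
tag="24.02-day0"
release="${tag%%-*}"            # YY.MM part -> 24.02
case "$tag" in
  *-*) variant="${tag#*-}" ;;   # optional detail -> day0
  *)   variant="" ;;            # no variant suffix present
esac
echo "release=$release variant=$variant"
```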
To download a container image, run the following command with the appropriate container version:
docker pull nvcr.io/nvidia/nim/nim_llm:<VERSION>
You can pull the latest Triton + TensorRT-LLM container with
docker pull nvcr.io/nvidia/nim/nim_llm:24.02
And the latest vLLM container with
docker pull nvcr.io/nvidia/nim/nim_llm:24.02-day0
Deploy on TensorRT LLM
This section describes how to deploy a model to the TensorRT LLM (TRT-LLM) backend/container. To generate a model store from a supported model architecture for a different GPU type or configuration, see the Model Repo Generator topic.
The TRT-LLM backend supports the following models:
Llama-2-13b
Llama-2-13b-chat
Mixtral-8x7B-Instruct-v0.1
Mixtral-8x7B-v0.1
Nemotron-3-8B-Base-4k
Nemotron-3-8B-Chat-4k-SteerLM
Nemotron-3-8B-QA-4k
StarCoder
Use the workflow in the previous section to get the nim_llm container. Then follow the instructions in the following section to get a model.
Pull Model
You can retrieve a list of models using the following command:
ngc registry model list nvidia/nim/*
This returns a table with ten columns:
Name,Repository,Latest Version,Application,Framework,Precision,Last Modified,Permission,Access Type,Associated Products
You can get information about an individual model using the following command,
where <MODEL_REPO>
is the value of the Repository field in the response to the command above:
ngc registry model info nvidia/nim/<MODEL_REPO>
To see what versions of a model are available, you can use the following command:
ngc registry model list nvidia/nim/<MODEL_REPO>:*
The versions follow the naming pattern <GPU_TYPE>x<NUM_GPUS>_<precision>_YY.MM. To deploy a model, make sure that the NVIDIA NIM for LLMs container version and GPU type/count match the information in the version string of your downloaded model.
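As a sketch, the version string can be decomposed with shell parameter expansion to check it against your hardware before deploying (the example value below is hypothetical):

```shell
# Decompose a model version string of the assumed form
# <GPU_TYPE>x<NUM_GPUS>_<precision>_YY.MM (example value is made up).
version="a100x2_fp16_24.02"
gpu_spec="${version%%_*}"   # a100x2
gpu_type="${gpu_spec%x*}"   # a100
num_gpus="${gpu_spec##*x}"  # 2
rest="${version#*_}"        # fp16_24.02
precision="${rest%%_*}"     # fp16
release="${rest#*_}"        # 24.02
echo "$gpu_type $num_gpus $precision $release"
```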
To download the desired model version, execute the following command:
ngc registry model download-version nvidia/nim/<MODEL_REPO>:<VERSION>
This downloads a folder named <MODEL_REPO>_v<VERSION>, which you'll need to mount into the container when you launch it.
Once your model is downloaded, you’re ready to launch the NVIDIA NIM for LLMs container. You’ll need to include the information about where your downloaded model is located in the -v
flag, the number of GPUs required in the --gpus
and num_gpus
flags, and the name of the model for the --name
and --model
flags.
docker run --rm -it --name <MODEL_NAME> \
--gpus <NUM_GPUS> \
--shm-size=8G \
-v $(pwd)/<MODEL_REPO>_v<VERSION>:/model-store \
-p 9999:9999 -p 9998:9998 -p 8080:8080 \
nvcr.io/nvidia/nim/nim_llm:24.02 \
nemollm_inference_ms --model <MODEL_NAME> --openai_port="9999" --nemo_port="9998" --num_gpus=<NUM_GPUS> --num_workers=2
Note
You can run with multiple process workers by setting the value of --num_workers in the preceding command. For further information on configuring the number of workers, see the Advanced Topics documentation.
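One way to keep the model name, model folder, and GPU count consistent across the flags is to set them once as shell variables and build the launch command from them. This sketch only prints the command; the model name and download folder are placeholders:

```shell
# Compose the TRT-LLM launch command from one set of variables.
# Printed, not executed; replace the placeholders before running.
MODEL_NAME="<MODEL_NAME>"
MODEL_DIR="$(pwd)/<MODEL_REPO>_v<VERSION>"
NUM_GPUS=2
cmd="docker run --rm -it --name $MODEL_NAME \
  --gpus $NUM_GPUS --shm-size=8G \
  -v $MODEL_DIR:/model-store \
  -p 9999:9999 -p 9998:9998 -p 8080:8080 \
  nvcr.io/nvidia/nim/nim_llm:24.02 \
  nemollm_inference_ms --model $MODEL_NAME --openai_port=9999 \
  --nemo_port=9998 --num_gpus=$NUM_GPUS --num_workers=2"
echo "$cmd"
```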
Once you’ve started the container you can check whether it’s running and ready to accept requests, as explained in Validating Deployments.
vLLM Deployment
This section describes how to deploy a model with the vLLM backend container.
The vLLM backend supports the following models:
CodeLlama-13b-Instruct-hf
CodeLlama-34b-Instruct-hf
CodeLlama-70b-Instruct-hf
Falcon-40B-Instruct
Gemma 2B Instruct
Gemma 7B Instruct
Llama-2-70b
Llama-2-70b-chat
Llama-2-7b
Llama-2-7b-chat
Llama2-70B-SteerLM-Chat
Phi-2
StarCoder2-15B
StarCoderPlus
Download the Model
Community models are available from many sources, such as the HuggingFace Hub. For example, you can clone a copy of Mistral 7B Instruct v2, using the following git
command.
Note
To get access to the HuggingFace models, see their Requirements page.
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 mistral-7b-instruct-v0.2
Prepare the NIM config file
The NIM service provides the following functionalities:
Ingests a (HF) model, including weights and tokenizer.
Configures a vLLM engine according to the specified model requirements.
Deploys the vLLM engine for use with OpenAI's completions/ and chat/ endpoints.
To load the HF model and prepare the vLLM engine, the service requires a vLLM model configuration file. For a detailed description of the fields within the vLLM model configuration file, see Model Configuration Values for vLLM. For information about the settings in the vLLM Engine section, see the vLLM engine arguments section in the vLLM start guide.
An example of model_config.yaml
is as follows:
engine:
model: <LOCATION OF MODEL IN CONTAINER>
tensor_parallel_size: <NUM_GPUS>
dtype: float16
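One way to generate this file is with a small heredoc, so that the model path and GPU count come from the same variables you pass to docker run; the path and count below are placeholders you must set:

```shell
# Write model_config.yaml from shell variables. The mount path and GPU
# count are placeholders that must match your docker run command.
MODEL_PATH="/model-store/<MODEL_NAME>"
NUM_GPUS=2
cat > model_config.yaml <<EOF
engine:
  model: $MODEL_PATH
  tensor_parallel_size: $NUM_GPUS
  dtype: float16
EOF
cat model_config.yaml
```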
Launch Microservice
Similar to the TensorRT LLM deployment, NVIDIA NIM for LLMs offers a CLI to start the service for handling inference requests. You'll need to include the information about where your downloaded model and model_config.yaml
are located in the -v
flag, the number of GPUs required in the --gpus
flag, and the name of the model for the --name
and --model
flags.
Note
Be sure that the location you mount your model to matches the model field in the model_config.yaml file.
docker run --rm -it --name <MODEL_NAME> \
--gpus <NUM_GPUS>\
--shm-size=8G \
-v <LOCATION OF MODEL ON HOST>:<LOCATION OF MODEL IN CONTAINER> \
-v <LOCATION OF MODEL CONFIG ON HOST>:/model_config.yaml \
-p 9999:9999 -p 8080:8080 \
nvcr.io/nvidia/nim/nim_llm:24.02-day0 \
nim_vllm --model_name <MODEL_NAME> --openai_port="9999" --health_port="9998" --model_config /model_config.yaml
Note
You may need to add a --user=root flag to your docker run command if you receive an error about .cache not being accessible.
Validating Deployments
Once either the TensorRT LLM-backed or vLLM-backed microservice has been launched, you can validate the deployment by checking the health endpoints and executing inference requests against the service.
Health and Liveness checks
The container exposes health and liveness endpoints for integration into existing systems such as Kubernetes, with both the NemoLLM API and OpenAI API, at /v1/health/ready and /v1/health/live. These endpoints return an HTTP 200 OK status code only if the service is ready or live, respectively.
curl localhost:8080/v1/health/ready
...
{"object":"health-response","message":"Service is ready."}
curl localhost:8080/v1/health/live
...
{"object":"health-response","message":"Service is ready."}
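In scripts, a simple polling loop against the readiness endpoint can gate subsequent steps. This is a sketch that assumes curl is installed and the port mapping shown above:

```shell
# Poll a readiness endpoint until it returns success, or give up after
# a fixed number of tries (sleeping briefly between attempts).
wait_ready() {
  url=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    curl -sf "$url" > /dev/null 2>&1 && return 0
    i=$((i + 1))
    sleep 2
  done
  return 1
}
# Example: wait_ready http://localhost:8080/v1/health/ready 60
```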
Run inference
Note
You can find further information about completion parameters in the OpenAI API and NemoLLM API documentation.
OpenAI Completion Request. To stream the result, set "stream": true. The completions endpoint is generally used for base models. With the completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completions subject to the other parameters chosen.
curl -X 'POST' \
'http://0.0.0.0:9999/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "<MODEL NAME>",
"prompt": "Once upon a time",
"max_tokens": 16,
"top_p": 1,
"n": 1,
"stream": false,
"stop": "string",
"frequency_penalty": 0.0
}'
OpenAI Chat Completion Request. To stream the result, set "stream": true. The chat completions endpoint is generally used for chat or instruct tuned models that are designed to be used through a conversational approach. With the chat completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation.
curl -X 'POST' \
'http://0.0.0.0:9999/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "<MODEL NAME>",
"messages": [{"role":"user", "content":"Hello there how are you?"},{"role":"assistant", "content":"Good and you?"}, {"role":"user", "content":"Whats your name?"}],
"max_tokens": 16,
"top_p": 1,
"n": 1,
"stream": false,
"stop": "string",
"frequency_penalty": 0.0
}'
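As an illustration, the assistant text can be extracted from a chat completion response with python3 for JSON parsing. The JSON below is a hand-written, abbreviated example, not captured service output:

```shell
# Pull the generated text out of a chat completion response.
# The response string here is a made-up, abbreviated example.
response='{"choices":[{"message":{"role":"assistant","content":"I am doing well."}}]}'
content=$(printf '%s' "$response" |
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])')
echo "$content"
```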
Stopping the container
If you launch a Docker container with the --name command line option, you can run the docker stop and docker rm commands using that name, as shown in the following examples.
docker stop <MODEL NAME>
docker rm <MODEL NAME>