Getting Started
Setup
NVIDIA GPU(s): NVIDIA NIM for LLMs runs on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. Homogeneous multi-GPU systems with tensor parallelism enabled are also supported. See the Support Matrix for more information.
CPU: x86_64 architecture only for this release
OS: any Linux distribution with glibc >= 2.35 (see the output of ld -v; a quick check is shown after the driver table below)
CUDA Drivers: Follow the installation guide. We recommend:
Using a network repository as part of a package manager installation, skipping the CUDA toolkit installation because the libraries are available within the NIM container, then
Installing the open kernel modules for a specific version:
| Major Version | EOL | Data Center & RTX/Quadro GPUs | GeForce GPUs |
|---|---|---|---|
| > 550 | TBD | X | X |
| 550 | Feb. 2025 | X | X |
| 545 | Oct. 2023 | X | X |
| 535 | June 2026 | X | |
| 525 | Nov. 2023 | X | |
| 470 | Sept. 2024 | X | |
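To verify the glibc requirement listed above, one quick check (assuming a standard glibc-based distribution) is to print the library version directly:
# Print the glibc version; it should report 2.35 or newer
ldd --version | head -n 1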
Install Docker
Install the NVIDIA Container Toolkit
After installing the toolkit, follow the instructions in the Configure Docker section in the NVIDIA Container Toolkit documentation.
To ensure things are working, run the following command:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
This produces output similar to the following for your system, where you can confirm the CUDA driver version and available GPUs:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:1B:00.0 Off | 0 |
| N/A 36C P0 112W / 700W | 78489MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
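As a convenience, you can also query just the fields relevant to the requirements above instead of reading the full table, using standard nvidia-smi query options:
# Report only the driver version, GPU name, and total GPU memory
docker run --rm --runtime=nvidia --gpus all ubuntu \
  nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv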
NGC Authentication
Generate an API key
An NGC API key is required to access NGC resources. You can generate a key at https://org.ngc.nvidia.com/setup/personal-keys.
When creating an NGC API key, ensure that at least “NGC Catalog” is selected from the “Services Included” dropdown. Additional services can be included if this key will be reused for other purposes.
Export the API key
This key will need to be passed to docker run in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.
If you’re not familiar with how to do this, the simplest way is to export it in your terminal:
export NGC_API_KEY=<value>
Run one of the following to make it available at startup:
# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc
# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc
Other, more secure options include saving the value in a file, so that you can retrieve it with cat $NGC_API_KEY_FILE, or using a password manager.
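As a sketch of the file-based approach (the path below and the NGC_API_KEY_FILE variable are illustrative choices, not something NIM itself reads):
# Store the key in a file readable only by your user
export NGC_API_KEY_FILE=~/.ngc/api_key
mkdir -p "$(dirname "$NGC_API_KEY_FILE")"
echo "<value>" > "$NGC_API_KEY_FILE"
chmod 600 "$NGC_API_KEY_FILE"
# Load it into the environment variable the NIM container expects
export NGC_API_KEY="$(cat "$NGC_API_KEY_FILE")"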
Docker login to NGC
To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates you will authenticate with an API key rather than a username and password.
NGC CLI Tool
A few times throughout this documentation, the ngc CLI tool will be used. Before continuing, refer to the NGC CLI documentation for information on how to download and configure the tool.
Note that the ngc CLI previously used the environment variable NGC_API_KEY but has deprecated it in favor of NGC_CLI_API_KEY. In the previous section, you set NGC_API_KEY, and it will be used in future commands. If you run ngc with this variable set, you will get a warning saying it is deprecated in favor of NGC_CLI_API_KEY. This warning can be safely ignored for now. You can set NGC_CLI_API_KEY, but as long as NGC_API_KEY is set, you will still see the warning.
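If you would like to set the CLI's preferred variable as well, you can derive it from the value you already exported; as noted above, the warning persists as long as NGC_API_KEY itself remains set:
# Mirror the existing key into the variable the ngc CLI now prefers
export NGC_CLI_API_KEY="$NGC_API_KEY"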
Launching the NIM
The following command launches a Docker container for the meta/llama3-8b-instruct model.
# Choose a container name for bookkeeping
export CONTAINER_NAME=meta-llama3-8b-instruct
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nvidian/nim-llm-dev/${CONTAINER_NAME}:24.05.rc15"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
| Flags | Description |
|---|---|
| -it | --interactive + --tty (see Docker docs) |
| --rm | Delete the container after it stops (see Docker docs) |
| --name=meta-llama3-8b-instruct | Give a name to the NIM container for bookkeeping (here meta-llama3-8b-instruct). Use any preferred value. |
| --runtime=nvidia | Ensure NVIDIA drivers are accessible in the container. |
| --gpus all | Expose all NVIDIA GPUs inside the container. See the configuration page for mounting specific GPUs. |
| --shm-size=16GB | Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink enabled. |
| -e NGC_API_KEY | Provide the container with the token necessary to download the appropriate models and resources from NGC. See above. |
| -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" | Mount a cache directory from your system (~/.cache/nim here) inside the NIM container (defaults to /opt/nim/.cache), allowing downloaded models and artifacts to be reused by follow-up runs. |
| -u $(id -u) | Use the same user as your system user inside the NIM container to avoid permission mismatches when downloading models to your local cache directory. |
| -p 8000:8000 | Forward the port where the NIM server is published inside the container so it can be accessed from the host system. The left-hand side of : is the host port (8000 here), while the right-hand side is the container port where the NIM server is published (defaults to 8000). |
| $IMG_NAME | Name and version of the LLM NIM container from NGC. The LLM NIM server starts automatically if no argument is provided after this. |
See the Configuring a NIM topic for information about additional configuration settings.
NIM automatically selects the most suitable profile based on your system specification. For more details, see the section on Automatic Profile Selection.
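If you need to pin a particular profile instead of relying on automatic selection, a profile can typically be supplied as an environment variable at launch. The variable name NIM_MODEL_PROFILE and the placeholder profile ID below are assumptions made for illustration; confirm both against the Automatic Profile Selection section for your NIM version:
# Assumed variable name and placeholder profile ID; verify against your NIM version
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile-id> \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  $IMG_NAME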
During startup the NIM container downloads the required resources and begins serving the model behind an API endpoint. The following message indicates a successful startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Once you see this message you can validate the deployment of NIM by executing an inference request. In a new terminal, run the following command to show a list of models available for inference:
curl -X GET 'http://0.0.0.0:8000/v1/models'
Pipe the results of curl commands into a tool like jq or python -m json.tool to make the output of the API easier to read. For example:
curl -s http://0.0.0.0:8000/v1/models | jq
This command should produce output similar to the following:
{
  "object": "list",
  "data": [
    {
      "id": "meta-llama3-8b-instruct",
      "object": "model",
      "created": 1715659875,
      "owned_by": "vllm",
      "root": "meta-llama3-8b-instruct",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-e39aaffe7015444eba964fa7736ae653",
          "object": "model_permission",
          "created": 1715659875,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
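As a small convenience (MODEL_ID is just a local shell variable introduced here, not something NIM requires), you can capture the served model name from this listing for reuse in later requests:
# Extract the first model id from the listing and keep it for later requests
export MODEL_ID="$(curl -s http://0.0.0.0:8000/v1/models | jq -r '.data[0].id')"
echo "$MODEL_ID"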
OpenAI Completion Request
The Completions endpoint is generally used for base models. With the Completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completions subject to the other parameters chosen. To stream the result, set "stream": true.
Update the model name to match the model you deployed (for example, meta/llama3-70b-instruct if that is the model you are serving). For the meta/llama3-8b-instruct model started above, you might use the following command:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64
  }'
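To see the streaming behavior mentioned above, send the same request with "stream": true; the server then returns the completion incrementally as a stream of data: chunks rather than a single JSON body:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "stream": true
  }'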
You can also use the OpenAI Python API library:
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

prompt = "Once upon a time"

response = client.completions.create(
    model="meta/llama3-8b-instruct",
    prompt=prompt,
    max_tokens=16,
    stream=False
)

completion = response.choices[0].text
print(completion)

# Prints:
# , there was a young man named Jack who lived in a small village at the
OpenAI Chat Completion Request
The Chat Completions endpoint is typically used with chat or instruct tuned models that are designed to be used through a conversational approach. With the Chat Completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.
Update the model name to match the model you deployed (for example, meta/llama3-70b-instruct if that is the model you are serving). For the meta/llama3-8b-instruct model started above, you might use the following command:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "Hello! How are you?"
      },
      {
        "role": "assistant",
        "content": "Hi! I am quite well, how can I help you today?"
      },
      {
        "role": "user",
        "content": "Can you write me a song?"
      }
    ],
    "max_tokens": 32
  }'
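As with the Completions endpoint, setting "stream": true returns the reply incrementally instead of as a single JSON body, for example:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Can you write me a song?"}],
    "max_tokens": 32,
    "stream": true
  }'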
You can also use the OpenAI Python API library:
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Write a short limerick about the wonders of GPU computing."}
]

chat_response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    max_tokens=32,
    stream=False
)

assistant_message = chat_response.choices[0].message
print(assistant_message)

# Prints:
# ChatCompletionMessage(content='There once was a GPU so fine,\nProcessed data in parallel so divine,\nIt crunched with great zest,\nAnd computational quest,\nUnleashing speed, a true wonder sublime!', role='assistant', function_call=None, tool_calls=None)
If you encounter a BadRequestError with an error message indicating that you are missing the messages or prompt field, you might inadvertently be using the wrong endpoint.
For instance, if you make a Completions request with a request body intended for Chat Completions, you will see the following error:
{
  "object": "error",
  "message": "[{'type': 'missing', 'loc': ('body', 'prompt'), 'msg': 'Field required', ...",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}
Conversely, if you make a Chat Completions request with a request body intended for Completions, you will see the following error:
{
  "object": "error",
  "message": "[{'type': 'missing', 'loc': ('body', 'messages'), 'msg': 'Field required', ...",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}
Verify that the endpoint you are using (i.e., /v1/completions or /v1/chat/completions) is correctly configured for your request.
Parameter-Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models. Currently NIM only supports LoRA PEFT. See Parameter-Efficient Fine-Tuning for details.
Stopping the Container
If you launched the Docker container with the --name command line option, you can use the following command to stop the running container.
# In the previous sections, the environment variable CONTAINER_NAME was
# defined using `export CONTAINER_NAME=meta-llama3-8b-instruct`
docker stop $CONTAINER_NAME
Use docker kill if docker stop is not responsive. If you do not intend to restart this container as-is (with docker start $CONTAINER_NAME), follow either command with docker rm $CONTAINER_NAME; in that case, you will need to reuse the docker run ... instructions from the top of this section to start a new container for your NIM.
If you did not start the container with --name, look at the output of docker ps to get a container ID for the image you used.
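For example (the ancestor filter simply narrows the list to containers created from your NIM image):
# List running containers started from the NIM image, then stop one by its ID
docker ps --filter "ancestor=$IMG_NAME"
docker stop <container-id>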