Date: 2025-01-13

Refer to the Support Matrix to make sure hardware and software requirements are met.

Enable WSL2 on your Windows computer by following the steps listed in Install WSL command. By default, these steps install the Ubuntu distribution of Linux. For a list of alternative installations, see Change the default Linux distribution installed.

Be sure your computer is capable of running WSL2 as described in the Prerequisites section of the WSL2 documentation.

Certain downloadable NIMs can be used on an RTX Windows system with Windows Subsystem for Linux (WSL). To enable WSL2, perform the following steps.

This command should produce output similar to one of the following, where you can confirm the CUDA driver version and the available GPUs.

To ensure that your setup is correct, run the following command (refer to the GPU Selection section for a note on using --gpus all):

After installing the toolkit, follow the instructions in the Configure Docker section in the NVIDIA Container Toolkit documentation.

Using a network repository as part of a package manager installation, skipping the CUDA toolkit installation because the libraries are available within the NIM container, then

Have glibc >= 2.35 (see output of ldd --version)

Are supported by the NVIDIA Container Toolkit

CPU: x86_64 architecture only for this release

NVIDIA GPU(s): NVIDIA NIM for LLMs (NIM for LLMs) runs on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. Homogeneous multi-GPU systems with tensor parallelism enabled are also supported. See the Support Matrix for more information.

NVIDIA AI Enterprise License: NVIDIA NIM for LLMs is available for self-hosting under the NVIDIA AI Enterprise License. Sign up for NVIDIA AI Enterprise license.
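The CPU and glibc requirements above can be checked from a script. The following is a minimal sketch, not part of the NIM tooling: the helper names (`parse_version`, `glibc_ok`, `cpu_ok`) are hypothetical, and it relies only on Python's standard `platform` module. It does not check GPU, driver, or license requirements.

```python
import platform

MIN_GLIBC = (2, 35)  # minimum glibc version stated in the prerequisites


def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '2.35' into (2, 35)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_ok(version: str, minimum: tuple = MIN_GLIBC) -> bool:
    """Return True when a glibc version string meets the minimum."""
    parsed = parse_version(version)
    return bool(parsed) and parsed >= minimum


def cpu_ok(machine: str) -> bool:
    """Return True for x86_64 hosts (some platforms report 'AMD64')."""
    return machine.lower() in {"x86_64", "amd64"}


if __name__ == "__main__":
    # libc_ver() returns ('glibc', '<version>') on glibc-based Linux systems.
    libc_name, libc_version = platform.libc_ver()
    print("CPU architecture OK:", cpu_ok(platform.machine()))
    print("glibc >= 2.35:", libc_name == "glibc" and glibc_ok(libc_version))
```

Run it inside the Linux environment that will host the container (for example, the WSL2 distribution), not on the Windows side.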

# Choose a path on your system to cache the downloaded models

# The container name from the previous ngc registry image list command

To deploy models that don’t fit on a single node, see Multi-node Deployments.

This command should produce output like the following:

You can tell that you have the correct Repository value by getting information about the model with the following command:

The following command launches a Docker container for the llama3-8b-instruct model. To launch a container for a different NIM, replace the value of Repository with the value from the previous image list command and change the value of CONTAINER_NAME to something appropriate.

Use the Repository field when you call the docker run command, as shown in the following section.
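If you capture the CSV output of the image list command, the Repository values can be pulled out with a few lines of Python. This is a sketch under stated assumptions: the sample rows and the `Name` and `Latest Tag` columns are hypothetical placeholders, and only the `Repository` column name comes from the text above. Real output may contain more columns.

```python
import csv
import io

# Hypothetical sample; real output from the image list command may differ.
# Only the "Repository" column name is taken from the documentation.
SAMPLE_CSV = """Name,Repository,Latest Tag
Example Model A,nim/example/model-a,1.0.0
Example Model B,nim/example/model-b,1.0.0
"""


def repositories(csv_text: str) -> list:
    """Return the Repository value of every row in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Repository"] for row in reader if row.get("Repository")]


print(repositories(SAMPLE_CSV))
```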

This command should produce output in the following format:

Use the following command to list the available NIMs, in CSV format.

This documentation uses the ngc CLI tool in a number of examples. See the NGC CLI documentation for information on downloading and configuring the tool.

Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key and not a user name and password.

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:

Other, more secure options include saving the value in a file, so that you can retrieve it with something like cat $NGC_API_KEY_FILE, or using a password manager.
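For scripts that launch the NIM, the same idea can be sketched in Python: read the key from the NGC_API_KEY environment variable and fall back to the file named by NGC_API_KEY_FILE. The function name `load_ngc_api_key` and the fallback order are assumptions made for illustration, not part of any NVIDIA tool.

```python
import os
from pathlib import Path


def load_ngc_api_key(env=None) -> str:
    """Return the NGC API key from NGC_API_KEY, else from NGC_API_KEY_FILE."""
    env = os.environ if env is None else env
    key = env.get("NGC_API_KEY", "").strip()
    if key:
        return key
    key_file = env.get("NGC_API_KEY_FILE", "")
    if key_file:
        # The file is expected to contain only the key; whitespace is trimmed.
        return Path(key_file).expanduser().read_text(encoding="utf-8").strip()
    raise RuntimeError("Set NGC_API_KEY or NGC_API_KEY_FILE before starting the NIM")
```

Avoid printing or logging the returned value, because the key grants access to your NGC resources.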

Run one of the following commands to make the key available at startup:

If you’re not familiar with how to create the NGC_API_KEY environment variable, the simplest way is to export it in your terminal, as shown in the following example, where VALUE is the value of your API key:

Pass the value of the API key to the docker run command in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.

When creating an NGC API key, ensure that at least “NGC Catalog” is selected from the “Services Included” dropdown. More services can be included if this key is to be reused for other purposes.

An NGC API key is required to access NGC resources. You can generate a key at https://org.ngc.nvidia.com/setup/personal-keys.

Now, you can jump to running inference.

To modify the docker run parameters, see Docker Run Parameters.

Use the following command to pull and run the NIM using Docker.

Use the docker login command, as shown in the following screenshot, to log in to Docker. Replace the placeholders for Username and Password with your values.

Copy your key and store it in a secure place. Do not share it.

Select an Input option. The following example is of a model that offers a Docker option. Not all of the models offer this option, but all include a “Get API Key” link.

Check out this video, which illustrates the following steps.

You can download and run the NIM of your choice from either the API catalog or NGC.

NIM automatically selects the most suitable profile based on your system specification. For details, see Automatic Profile Selection.

If you have an issue with permission mismatches when downloading models in your local cache directory, add the -u $(id -u) option to the docker run call.

See the Configuring a NIM topic for information about additional configuration settings.

Name and version of the LLM NIM container from NGC. The LLM NIM server automatically starts if no argument is provided after this.

Forward the port where the NIM server is published inside the container to access from the host system. The left-hand side of : is the host system ip:port (8000 here), while the right-hand side is the container port where the NIM server is published (defaults to 8000).

Use the same user as your system user inside the NIM container to avoid permission mismatches when downloading models in your local cache directory.

Mount a cache directory from your system (~/.cache/nim here) inside the NIM container (defaults to /opt/nim/.cache), allowing downloaded models and artifacts to be reused by follow-up runs.

Provide the container with the token necessary to download the appropriate models and resources from NGC. See Export the API key.

Allocate host memory for multi-GPU communication. Not required for single GPU models or GPUs with NVLink enabled.

Expose all NVIDIA GPUs inside the container. See the configuration page for mounting specific GPUs.

Ensure NVIDIA drivers are accessible in the container.

Give a name to the NIM container for bookkeeping (here llama3-8b-instruct). Use any preferred value.

Delete the container after it stops (refer to Docker --rm container command).
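To see how the options described above fit together, the following sketch assembles a docker run argument list in Python without executing it. The flags are standard Docker CLI options chosen to match each description. The function name, the shared-memory size, and the image reference are assumptions, so substitute the Repository and tag reported for your NIM.

```python
import os
import shlex


def docker_run_args(image, container_name, cache_dir="~/.cache/nim",
                    host_port=8000, uid=None):
    """Build (but do not run) a docker run command for a NIM container."""
    cache = os.path.expanduser(cache_dir)
    args = [
        "docker", "run",
        "--rm",                            # delete the container after it stops
        f"--name={container_name}",        # container name for bookkeeping
        "--runtime=nvidia",                # make NVIDIA drivers accessible
        "--gpus", "all",                   # expose all NVIDIA GPUs
        "--shm-size=16GB",                 # multi-GPU host memory (size is an assumption)
        "-e", "NGC_API_KEY",               # pass the key through from the host environment
        "-v", f"{cache}:/opt/nim/.cache",  # reuse downloaded models across runs
        "-p", f"{host_port}:8000",         # host port : container port
    ]
    if uid is not None:
        args += ["-u", str(uid)]           # run as this user to avoid permission mismatches
    args.append(image)                     # the image must come after all options
    return args


# "<tag>" is a placeholder; the real tag is not given in this section.
print(shlex.join(docker_run_args("nvcr.io/nim/meta/llama3-8b-instruct:<tag>",
                                 "llama3-8b-instruct", uid=1000)))
```

Passing the list to `subprocess.run` would start the container; printing it first makes it easy to review the command before running it.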

Run Inference#

During startup the NIM container downloads the required resources and begins serving the model behind an API endpoint. The following message indicates a successful startup.

INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Once you see this message you can validate the deployment of NIM by executing an inference request. In a new terminal, run the following command to show a list of models available for inference:

curl -X GET 'http://0.0.0.0:8000/v1/models'

To make the output easier to read, pipe the results of curl commands into a tool like jq or python -m json.tool . For example: curl -s http://0.0.0.0:8000/v1/models | jq .

This command should produce output similar to the following:

{
  "object": "list",
  "data": [
    {
      "id": "meta/llama3-8b-instruct",
      "object": "model",
      "created": 1715659875,
      "owned_by": "vllm",
      "root": "meta/llama3-8b-instruct",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-e39aaffe7015444eba964fa7736ae653",
          "object": "model_permission",
          "created": 1715659875,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
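A script can extract the model names from this response to decide which value to pass as "model" in later requests. The sketch below parses an abbreviated copy of the output shown above. The helper name `model_ids` is hypothetical, and fetching the response from the server is left to the caller.

```python
import json

# Abbreviated copy of the /v1/models response shown above.
SAMPLE_RESPONSE = """
{"object": "list",
 "data": [{"id": "meta/llama3-8b-instruct", "object": "model", "owned_by": "vllm"}]}
"""


def model_ids(response_text: str) -> list:
    """Return the id of every model entry in a /v1/models response body."""
    payload = json.loads(response_text)
    if payload.get("object") != "list":
        raise ValueError("unexpected response: expected an object of type 'list'")
    return [entry["id"] for entry in payload.get("data", [])
            if entry.get("object") == "model"]


print(model_ids(SAMPLE_RESPONSE))  # ['meta/llama3-8b-instruct']
```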

OpenAI Completion Request#

The Completions endpoint is typically used for base models. With the Completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completions subject to the other parameters chosen. To stream the result, set "stream": true.

Set the model name to match your deployment. For example, for a llama3-8b-instruct model, use the following command:

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64
  }'

You can also use the OpenAI Python API library.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
prompt = "Once upon a time"
response = client.completions.create(
    model="meta/llama3-8b-instruct",
    prompt=prompt,
    max_tokens=16,
    stream=False
)
completion = response.choices[0].text
print(completion)
# Prints:
# , there was a young man named Jack who lived in a small village at the
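When "stream": true is set, the response arrives in pieces rather than as one JSON document. The sketch below assumes the server follows the OpenAI streaming convention of server-sent events, where each event is a line of the form `data: {json}` and the stream ends with `data: [DONE]`. The sample lines are hypothetical, and `stream_text` is an illustrative helper rather than part of any library.

```python
import json


def stream_text(lines):
    """Yield the text fragments carried by a streamed Completions response."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("text")
            if text:
                yield text


# Hypothetical stream, shaped like OpenAI-style completion chunks.
SAMPLE_STREAM = [
    'data: {"choices": [{"index": 0, "text": ", there"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": " was"}]}',
    "",
    "data: [DONE]",
]

print("".join(stream_text(SAMPLE_STREAM)))  # , there was
```

With the OpenAI Python library, passing `stream=True` returns an iterator of chunk objects, so this manual parsing is only needed when reading the raw HTTP response.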

OpenAI Chat Completion Request#

The Chat Completions endpoint is typically used with chat or instruct tuned models that are designed to be used through a conversational approach. With the Chat Completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.

Set the model name to match your deployment. For example, for a llama3-8b-instruct model, use the following command:

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      },
      {
        "role":"assistant",
        "content":"Hi! I am quite well, how can I help you today?"
      },
      {
        "role":"user",
        "content":"Can you write me a song?"
      }
    ],
    "max_tokens": 32
  }'

You can also use the OpenAI Python API library.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Write a short limerick about the wonders of GPU computing."}
]
chat_response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    max_tokens=32,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Which prints:

ChatCompletionMessage(content='There once was a GPU so fine,
Processed data in parallel so divine,
It crunched with great zest,
And computational quest,
Unleashing speed, a true wonder sublime!', role='assistant', function_call=None, tool_calls=None)

If you encounter a BadRequestError with an error message indicating that you are missing the messages or prompt field, you might inadvertently be using the wrong endpoint.

For example, if you make a Completions request with a request body intended for Chat Completions, you get the following error:

{
  "object": "error",
  "message": "[{'type': 'missing', 'loc': ('body', 'prompt'), 'msg': 'Field required', ...",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}

Conversely, if you make a Chat Completions request with a request body intended for Completions, you get the following error:

{
  "object": "error",
  "message": "[{'type': 'missing', 'loc': ('body', 'messages'), 'msg': 'Field required', ...",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}

Verify that the endpoint you are using, such as /v1/completions or /v1/chat/completions, is correctly configured for your request.
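One way to avoid this mismatch in client code is to derive the endpoint from the request body before sending it. The following sketch is a hypothetical helper, not part of the NIM or OpenAI libraries: it maps a body containing "messages" to /v1/chat/completions and a body containing "prompt" to /v1/completions, and rejects anything else.

```python
def endpoint_for(body: dict) -> str:
    """Pick the endpoint path that matches the fields in a request body."""
    has_messages = "messages" in body
    has_prompt = "prompt" in body
    if has_messages == has_prompt:
        # Either both fields or neither field is present.
        raise ValueError("request body needs exactly one of 'messages' or 'prompt'")
    return "/v1/chat/completions" if has_messages else "/v1/completions"


chat_body = {
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "max_tokens": 32,
}
print(endpoint_for(chat_body))  # /v1/chat/completions
```

Checking the body locally surfaces the mistake before the request is sent, instead of as a 400 response from the server.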

Improving TRT-LLM Performance#

TRT-LLM, which is the runtime for the model configurations listed as optimized in the Support Matrix, has a number of parameters you can tune to improve performance. Refer to Best Practices for Tuning the Performance of TensorRT-LLM for details.