Large Language Models (Latest)

Getting Started

Setup

  1. NVIDIA GPU(s): NVIDIA NIM for LLMs (NIM for LLMs) runs on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. Homogeneous multi-GPU systems with tensor parallelism enabled are also supported. See the Support Matrix for more information.

  2. CPU: x86_64 architecture only for this release

  3. OS: any Linux distribution that:

  4. CUDA Drivers: Follow the installation guide. We recommend:

    • Using a network repository as part of a package manager installation, skipping the CUDA toolkit installation as the libraries are available within the NIM container, then

    • Installing the open kernel modules for a specific version:

      Major Version   EOL          Data Center & RTX/Quadro GPUs   GeForce GPUs
      > 550           TBD          X                               X
      550             Feb. 2025    X                               X
      545             Oct. 2023    X                               X
      535             June 2026    X
      525             Nov. 2023    X
      470             Sept. 2024   X

  5. Install Docker

  6. Install the NVIDIA Container Toolkit

Note

After installing the toolkit, follow the instructions in the Configure Docker section in the NVIDIA Container Toolkit documentation.
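
For reference, the Configure Docker step typically amounts to registering the NVIDIA runtime with Docker and restarting the daemon. A minimal sketch, assuming a systemd-based distribution:

# Register the NVIDIA runtime with Docker
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker so the change takes effect
sudo systemctl restart docker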


To ensure that your setup is correct, run the following command:


docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

This command should produce output similar to the following, where you can confirm the CUDA driver version and the available GPUs.


+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   36C    P0           112W /  700W  |   78489MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Installing WSL2 for Windows

Certain downloadable NIMs can be used on an RTX Windows system with Windows Subsystem for Linux (WSL). To enable WSL2, perform the following steps.

  1. Be sure your computer is capable of running WSL2 as described in the Prerequisites section of the WSL2 documentation.

  2. Enable WSL2 on your Windows computer by following the steps listed in Install WSL command. By default these steps install the Ubuntu distribution of Linux. For a list of alternative installations, see Change the default Linux distribution installed.
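
As a reference for step 2, the default installation is a single command run from an elevated PowerShell or Command Prompt, followed by a reboot if prompted. This is a minimal sketch; see the linked WSL documentation for the authoritative steps:

# Installs WSL2 with the default Ubuntu distribution
wsl --install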

NGC Authentication

Generate an API key

An NGC API key is required to access NGC resources. You can generate a key at https://org.ngc.nvidia.com/setup/personal-keys.

When creating an NGC API key, ensure that at least “NGC Catalog” is selected from the “Services Included” dropdown. More Services can be included if this key is to be reused for other purposes.


Export the API key

Pass the value of the API key to the docker run command in the next section as the NGC_CLI_API_KEY environment variable to download the appropriate models and resources when starting the NIM.

If you’re not familiar with how to create the NGC_CLI_API_KEY environment variable, the simplest way is to export it in your terminal:


export NGC_CLI_API_KEY=<value>

Run one of the following commands to make the key available at startup:


# If using bash
echo "export NGC_CLI_API_KEY=<value>" >> ~/.bashrc

# If using zsh
echo "export NGC_CLI_API_KEY=<value>" >> ~/.zshrc

Note

Other, more secure options include saving the value in a file, so that you can retrieve it with cat $NGC_CLI_API_KEY_FILE, or using a password manager.
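
For example, a minimal sketch of the file-based approach (the file path is illustrative):

# Store the key in a file readable only by your user (path is an example)
mkdir -p ~/.ngc && echo "<value>" > ~/.ngc/api_key
chmod 600 ~/.ngc/api_key

# Load the key into the environment when needed
export NGC_CLI_API_KEY_FILE=~/.ngc/api_key
export NGC_CLI_API_KEY="$(cat $NGC_CLI_API_KEY_FILE)"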

Docker login to NGC

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:


echo "$NGC_CLI_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Use $oauthtoken as the username and NGC_CLI_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key and not a user name and password.

NGC CLI Tool

This documentation uses the ngc CLI tool in a number of examples. See the NGC CLI documentation for information on downloading and configuring the tool.
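
For example, after downloading the binary, a typical first-time setup is interactive; you are prompted for your API key and defaults such as org and team. A sketch, assuming ngc is on your PATH:

# Configure the NGC CLI once; paste your NGC API key when prompted
ngc config set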

Use the following command to list the available NIMs, in CSV format.


ngc registry image list --format_type csv nvcr.io/nim/meta/*

This command should produce output in the following format:


Name,Repository,Latest Tag,Image Size,Updated Date,Permission,Signed Tag?,Access Type,Associated Products
<name1>,<repository1>,<latest tag1>,<image size1>,<updated date1>,<permission1>,<signed tag?1>,<access type1>,<associated products1>
...
<nameN>,<repositoryN>,<latest tagN>,<image sizeN>,<updated dateN>,<permissionN>,<signed tag?N>,<access typeN>,<associated productsN>

Use the Repository and Latest Tag fields when you call the docker run command, as shown in the following section.

The following command launches a Docker container for the llama3-8b-instruct model. To launch a container for a different NIM, replace the values of Repository and Latest_Tag with values from the previous image list command and change the value of CONTAINER_NAME to something appropriate.

You can tell that you have the correct Repository and Latest_Tag values by getting information about the model with the following command:


ngc registry model info --format_type ascii ${Repository}:${Latest_Tag}

This command should produce output like the following:


----------------------------------------------------------
  Model Version Information
    Id: 0.10.0+e6f46027-h100x1-fp16-balanced.24.06.15839955
    Batch Size:
    Memory Footprint:
    Number Of Epochs:
    Accuracy Reached:
    GPU Model:
    Access Type:
    Associated Products:
    Created Date: 2024-06-14T22:28:17.604Z
    Description:
    Status: UPLOAD_COMPLETE
    Total File Count: 11
    Total Size: 14.96 GB
----------------------------------------------------------

Note

To deploy models that don’t fit on a single node, see Multi-node Deployments.


# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama3-8B-Instruct

# The Repository and Latest_Tag values from the previous ngc registry image list command
Repository=nim/meta/llama3-8b-instruct
Latest_Tag=1.0.0

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_CLI_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Flags and their descriptions:

-it
    --interactive + --tty (see Docker docs).
--rm
    Delete the container after it stops (see Docker docs).
--name=llama3-8b-instruct
    Give a name to the NIM container for bookkeeping (here llama3-8b-instruct). Use any preferred value.
--runtime=nvidia
    Ensure NVIDIA drivers are accessible in the container.
--gpus all
    Expose all NVIDIA GPUs inside the container. See the configuration page for mounting specific GPUs.
--shm-size=16GB
    Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink enabled.
-e NGC_API_KEY=$NGC_CLI_API_KEY
    Provide the container with the API key required to download the appropriate models and resources from NGC. See above.
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache"
    Mount a cache directory from your system (~/.cache/nim here) inside the NIM container (defaults to /opt/nim/.cache), allowing downloaded models and artifacts to be reused by follow-up runs.
-u $(id -u)
    Use the same user as your system user inside the NIM container to avoid permission mismatches when downloading models into your local cache directory.
-p 8000:8000
    Forward the port where the NIM server is published inside the container so it can be accessed from the host system. The left-hand side of : is the host system port (8000 here), while the right-hand side is the container port where the NIM server is published (defaults to 8000).
$IMG_NAME
    Name and version of the LLM NIM container from NGC. The LLM NIM server automatically starts if no argument is provided after this.
Note

See the Configuring a NIM topic for information about additional configuration settings.

Note

If you have an issue with permission mismatches when downloading models in your local cache directory, add the -u $(id -u) option to the docker run call.


Note

NIM automatically selects the most suitable profile based on your system specification. For details, see Automatic Profile Selection.

During startup the NIM container downloads the required resources and begins serving the model behind an API endpoint. The following message indicates a successful startup.


INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Once you see this message you can validate the deployment of NIM by executing an inference request. In a new terminal, run the following command to show a list of models available for inference:


curl -X GET 'http://0.0.0.0:8000/v1/models'

Tip

Pipe the results of curl commands into a tool like jq or python -m json.tool to make the output of the API easier to read. For example: curl -s http://0.0.0.0:8000/v1/models | jq.

This command should produce output similar to the following:


{ "object": "list", "data": [ { "id": "meta/llama3-8b-instruct", "object": "model", "created": 1715659875, "owned_by": "vllm", "root": "meta/llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-e39aaffe7015444eba964fa7736ae653", "object": "model_permission", "created": 1715659875, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] } ] }

OpenAI Completion Request

The Completions endpoint is typically used for base models. With the Completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completions subject to the other parameters chosen. To stream the result, set "stream": true.

Important

Update the model name to match the model you deployed. The following example uses meta/llama3-8b-instruct; for a llama3-70b-instruct deployment, for example, set "model": "meta/llama3-70b-instruct" instead.


curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64
  }'
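
To stream the result instead, add "stream": true to the same request body; tokens then arrive incrementally as they are generated. A sketch of the streaming variant:

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "stream": true
  }'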

You can also use the OpenAI Python API library.


from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

prompt = "Once upon a time"

response = client.completions.create(
    model="meta/llama3-8b-instruct",
    prompt=prompt,
    max_tokens=16,
    stream=False
)

completion = response.choices[0].text
print(completion)

# Prints:
# , there was a young man named Jack who lived in a small village at the

OpenAI Chat Completion Request

The Chat Completions endpoint is typically used with chat or instruct tuned models that are designed to be used through a conversational approach. With the Chat Completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.

Important

Update the model name to match the model you deployed. The following example uses meta/llama3-8b-instruct; for a llama3-70b-instruct deployment, for example, set "model": "meta/llama3-70b-instruct" instead.


curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"},
      {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
      {"role": "user", "content": "Can you write me a song?"}
    ],
    "max_tokens": 32
  }'

You can also use the OpenAI Python API library.


from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Write a short limerick about the wonders of GPU computing."}
]

chat_response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    max_tokens=32,
    stream=False
)

assistant_message = chat_response.choices[0].message
print(assistant_message)

# Prints:
# ChatCompletionMessage(content='There once was a GPU so fine,\nProcessed data in parallel so divine,\nIt crunched with great zest,\nAnd computational quest,\nUnleashing speed, a true wonder sublime!', role='assistant', function_call=None, tool_calls=None)

Attention

If you encounter a BadRequestError with an error message indicating that you are missing the messages or prompt field, you might inadvertently be using the wrong endpoint.

For example, if you make a Completions request with a request body intended for Chat Completions, you get the following error:


{ "object": "error", "message": "[{'type': 'missing', 'loc': ('body', 'prompt'), 'msg': 'Field required', ...", "type": "BadRequestError", "param": null, "code": 400 }

Conversely, if you make a Chat Completions request with a request body intended for Completions, you get the following error:


{ "object": "error", "message": "[{'type': 'missing', 'loc': ('body', 'messages'), 'msg': 'Field required', ...", "type": "BadRequestError", "param": null, "code": 400 }

Verify that the endpoint you are using, such as /v1/completions or /v1/chat/completions, is correctly configured for your request.

Parameter-Efficient Fine-Tuning

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models. Currently, NIM supports only LoRA PEFT. See Parameter-Efficient Fine-Tuning for details.

If a Docker container is launched with the --name command line option, you can use the following command to stop the running container.


docker stop $CONTAINER_NAME

Use docker kill if the container does not respond to docker stop. Follow either command with docker rm $CONTAINER_NAME if you do not intend to restart the container as-is (with docker start $CONTAINER_NAME); in that case, reuse the docker run ... instructions from the beginning of this section to start a new container for your NIM.

If you did not start a container with --name, examine the output of the docker ps command to get a container ID for the given image you used.
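
For example, a sketch of stopping a container by ID when no name was given (the filter assumes the $IMG_NAME variable from the earlier docker run example):

# List running containers started from the NIM image and note the CONTAINER ID
docker ps --filter "ancestor=$IMG_NAME"

# Stop, and optionally remove, the container by its ID
docker stop <container-id>
docker rm <container-id>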

The nim-deploy repository showcases several reference implementations for Kubernetes installations. These examples are experimental and might require modification to run in your particular cluster setup.

NIM for LLMs provides utilities that enable downloading models to a local directory, either as a model repository or to the NIM cache. See the Utilities section for details.

From a NIM container you can view and download models locally.

  1. Use the following commands to launch a NIM container for the model you want to deploy.


export NIM_IMAGE=nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
export NIM_CACHE_PATH=/path/to/cache
mkdir -p "$NIM_CACHE_PATH"

docker run -it --rm --name=llama3-8b-instruct \
  -e LOG_LEVEL=$LOG_LEVEL \
  -v $NIM_CACHE_PATH:/opt/nim/.cache \
  $NIM_IMAGE \
  bash -i

  2. Use the list-model-profiles command to list the available profiles.


list-model-profiles
# SYSTEM INFO
# - Free GPUs:
#   - [26b3:10de] (0) NVIDIA RTX 5880 Ada Generation (RTX A6000 Ada) [current utilization: 1%]
#   - [26b3:10de] (1) NVIDIA RTX 5880 Ada Generation (RTX A6000 Ada) [current utilization: 1%]
#   - [1d01:10de] (2) NVIDIA GeForce GT 1030 [current utilization: 2%]
# MODEL PROFILES
# - Compatible with system and runnable:
#   - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
#   - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
#   - With LoRA support:
#     - c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)
#     - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
# - Incompatible with system:
#   - dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
#   - f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
#   - 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput)
#   - 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b (tensorrt_llm-l40s-fp8-tp1-throughput)
#   - a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-a100-fp16-tp2-latency)
#   - e0f4a47844733eb57f9f9c3566432acb8d20482a1d06ec1c0d71ece448e21086 (tensorrt_llm-a10g-fp16-tp2-latency)
#   - 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency)
#   - 24199f79a562b187c52e644489177b6a4eae0c9fdad6f7d0a8cb3677f5b1bc89 (tensorrt_llm-l40s-fp16-tp2-latency)
#   - 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-a100-fp16-tp1-throughput)
#   - c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput)
#   - cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput)
#   - d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80 (tensorrt_llm-l40s-fp16-tp1-throughput)
#   - 9137f4d51dadb93c6b5864a19fd7c035bf0b718f3e15ae9474233ebd6468c359 (tensorrt_llm-a10g-fp16-tp2-throughput-lora)
#   - cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
#   - 3bdf6456ff21c19d5c7cc37010790448a4be613a1fd12916655dfab5a0dd9b8e (tensorrt_llm-h100-fp16-tp1-throughput-lora)
#   - 388140213ee9615e643bda09d85082a21f51622c07bde3d0811d7c6998873a0b (tensorrt_llm-l40s-fp16-tp1-throughput-lora)

You can download any of these profiles to the NIM cache using the download-to-cache command. The following example downloads the tensorrt_llm-l40s-fp8-tp1-throughput profile to the NIM cache.


download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b

You can also let download-to-cache choose the most suitable profile for your hardware by not specifying any profile, as shown in the following example.


download-to-cache

For further information on the download-to-cache tool, run it with the -h flag:


download-to-cache -h
# Downloads selected or default model profiles to NIM cache. Can be used to pre-
# cache profiles prior to deployment.
#
# options:
#   -h, --help            show this help message and exit
#   --profiles [PROFILES ...], -p [PROFILES ...]
#                         Profile hashes to download. If none are provided, the
#                         optimal profile is downloaded. Multiple profiles can
#                         be specified separated by spaces.
#   --all                 Set this to download all profiles to cache
#   --lora                Set this to download default lora profile. This
#                         expects --profiles and --all arguments are not
#                         specified.

Once you have downloaded a profile, use the create-model-store command to create a repository for a single model, as shown in the following example.


create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /path/to/model-repository

NIM supports serving models on an air-gapped system (also known as air wall, air-gapping, or disconnected network). If NIM detects a previously downloaded profile in the cache, it serves that profile from the cache. After downloading the profiles to the cache using download-to-cache, you can transfer the cache to an air-gapped system to run a NIM without any internet connection and with no connection to the NGC registry.

To see this in action, do not provide the NGC_API_KEY, as shown in the following example.


# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama3-8B-Instruct

# The Repository and Latest_Tag values from the previous ngc registry image list command
Repository=nim/meta/llama3-8b-instruct
Latest_Tag=1.0.0

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

export MODEL_REPO=/path/to/model-repository
export NIM_SERVED_MODEL_NAME=my-model

# Assuming the command run prior was `download-to-cache`, downloading the optimal profile
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

# Assuming the command run prior was `download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b`
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_PROFILE=09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Another option for air-gapped deployment is to deploy the model repository created using the create-model-store command, as shown in the following example.


# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama3-8B-Instruct

# The Repository and Latest_Tag values from the previous ngc registry image list command
Repository=nim/meta/llama3-8b-instruct
Latest_Tag=1.0.0

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

export MODEL_REPO=/path/to/model-repository
export NIM_SERVED_MODEL_NAME=my-model

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME=/model-repo \
  -e NIM_SERVED_MODEL_NAME \
  -v $MODEL_REPO:/model-repo \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

You can also populate the NIM cache or a model repository on an internet-connected system and transport it to the air-gapped system.
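
One possible way to do the transfer, assuming the cache path used earlier in this section:

# On the connected system: archive the populated cache
tar -czf nim-cache.tar.gz -C "$LOCAL_NIM_CACHE" .

# Copy nim-cache.tar.gz to the air-gapped system, then restore it there:
mkdir -p "$LOCAL_NIM_CACHE"
tar -xzf nim-cache.tar.gz -C "$LOCAL_NIM_CACHE"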
