Large Language Models (RC15)

Getting Started

Setup

  1. NVIDIA GPU(s): NVIDIA NIM for LLMs runs on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. Homogeneous multi-GPU systems with tensor parallelism enabled are also supported. See the Support Matrix for more information.

  2. CPU: x86_64 architecture only for this release

  3. OS: any Linux distribution supported by the NVIDIA Container Toolkit

  4. CUDA Drivers: Follow the installation guide. We recommend:

    • Using a network repository as part of a package manager installation, skipping the CUDA Toolkit installation because the libraries are available within the NIM container, then

    • Installing the open kernel modules for a specific version:

      Major Version   EOL          Data Center & RTX/Quadro GPUs   GeForce GPUs
      > 550           TBD          X                               X
      550             Feb. 2025    X                               X
      545             Oct. 2023    X                               X
      535             June 2026    X
      525             Nov. 2023    X
      470             Sept. 2024   X

  5. Install Docker

  6. Install the NVIDIA Container Toolkit

Note

After installing the toolkit, follow the instructions in the Configure Docker section in the NVIDIA Container Toolkit documentation.
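
For reference, the Configure Docker step typically amounts to registering the NVIDIA runtime with Docker and restarting the daemon. A minimal sketch, assuming a systemd-based host (consult the toolkit documentation for the authoritative steps):

# Register the NVIDIA runtime with Docker
sudo nvidia-ctk runtime configure --runtime=docker

# Restart the Docker daemon so it picks up the new runtime
sudo systemctl restart docker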


To ensure things are working, run the following command:


docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

This produces output similar to the following for your system, where you can confirm the CUDA driver version and available GPUs:


+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   36C    P0            112W /  700W |  78489MiB /  81559MiB  |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

NGC Authentication

Generate an API key

An NGC API key is required to access NGC resources. You can generate a key at https://org.ngc.nvidia.com/setup/personal-keys.

When creating an NGC API key, ensure that at least “NGC Catalog” is selected from the “Services Included” dropdown. More Services can be included if this key is to be reused for other purposes.


Export the API key

This key will need to be passed to docker run in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.

If you’re not familiar with how to do this, the simplest way is to export it in your terminal:


export NGC_API_KEY=<value>

Run one of the following to make it available at startup:


# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc

# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc

Note

Other, more secure options include saving the value in a file so that you can retrieve it with cat $NGC_API_KEY_FILE, or using a password manager.
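
For example, a minimal sketch of the file-based approach (the path ~/.ngc/api_key below is only an illustration; choose any location readable only by you):

# Hypothetical example path; restrict permissions so only you can read it
mkdir -p ~/.ngc
echo "<value>" > ~/.ngc/api_key
chmod 600 ~/.ngc/api_key

# Load the key into the environment when needed
export NGC_API_KEY="$(cat ~/.ngc/api_key)"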

Docker login to NGC

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:


echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key and not a user name and password.

NGC CLI Tool

A few times throughout this documentation, the ngc CLI tool will be used. Before continuing, please refer to the NGC CLI documentation for information on how to download and configure the tool.

Note that the ngc tool previously used the NGC_API_KEY environment variable, which has been deprecated in favor of NGC_CLI_API_KEY. In the previous section you set NGC_API_KEY, and it will be used in the commands that follow. If you run ngc with this variable set, you will get a warning saying it is deprecated in favor of NGC_CLI_API_KEY; this can be safely ignored for now. You can set NGC_CLI_API_KEY, but as long as NGC_API_KEY is set, you will still get the warning.
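
If you want to set the newer variable as well (the warning will still appear while NGC_API_KEY is exported), one simple option is to derive it from the key you exported earlier:

# Reuse the key exported earlier under the NGC CLI's preferred variable name
export NGC_CLI_API_KEY="$NGC_API_KEY"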

Launching the NIM

The following command launches a Docker container for the meta/llama3-8b-instruct model.


# Choose a container name for bookkeeping
export CONTAINER_NAME=meta-llama3-8b-instruct

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nvidian/nim-llm-dev/${CONTAINER_NAME}:24.05.rc15"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Flags                                    Description

-it                                      --interactive + --tty (see Docker docs)
--rm                                     Delete the container after it stops (see Docker docs)
--name=meta-llama3-8b-instruct           Give a name to the NIM container for bookkeeping (here meta-llama3-8b-instruct). Use any preferred value.
--runtime=nvidia                         Ensure NVIDIA drivers are accessible in the container.
--gpus all                               Expose all NVIDIA GPUs inside the container. See the configuration page for mounting specific GPUs.
--shm-size=16GB                          Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink enabled.
-e NGC_API_KEY                           Provide the container with the token necessary to download the appropriate models and resources from NGC. See above.
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache"    Mount a cache directory from your system (~/.cache/nim here) inside the NIM (defaults to /opt/nim/.cache), allowing downloaded models and artifacts to be reused by follow-up runs.
-u $(id -u)                              Use the same user as your system user inside the NIM container to avoid permission mismatches when downloading models into your local cache directory.
-p 8000:8000                             Forward the port where the NIM server is published inside the container so it can be accessed from the host system. The left-hand side of the colon is the host port (8000 here); the right-hand side is the container port where the NIM server is published (defaults to 8000).
$IMG_NAME                                Name and version of the LLM NIM container from NGC. The LLM NIM server starts automatically if no argument is provided after this.

Note

See the Configuring a NIM topic for information about additional configuration settings.


Note

NIM will automatically select the most suitable profile based on your system specification. For more details, see the section on Automatic Profile Selection.

During startup the NIM container downloads the required resources and begins serving the model behind an API endpoint. The following message indicates a successful startup.


INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Once you see this message, you can validate the deployment of NIM by executing an inference request. In a new terminal, run the following command to show a list of models available for inference:


curl -X GET 'http://0.0.0.0:8000/v1/models'

Tip

Pipe the results of curl commands into a tool like jq or python -m json.tool to make the output of the API easier to read. For example: curl -s http://0.0.0.0:8000/v1/models | jq.

This command should produce output similar to the following:


{ "object": "list", "data": [ { "id": "meta-llama3-8b-instruct", "object": "model", "created": 1715659875, "owned_by": "vllm", "root": "meta-llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-e39aaffe7015444eba964fa7736ae653", "object": "model_permission", "created": 1715659875, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] } ] }

OpenAI Completion Request

The Completions endpoint is generally used for base models. With the Completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completions subject to the other parameters chosen. To stream the result, set "stream": true.

Important

Update the model name to match the model you deployed. For example, if you are serving meta/llama3-70b-instruct, replace meta/llama3-8b-instruct with meta/llama3-70b-instruct in the command below.


curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64
  }'
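
As noted above, you can stream the result by setting "stream": true in the request body. A minimal variant of the same request (the response then arrives incrementally rather than as a single JSON body):

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "stream": true
  }'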

You can also use the OpenAI Python API library:


from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

prompt = "Once upon a time"

response = client.completions.create(
    model="meta/llama3-8b-instruct",
    prompt=prompt,
    max_tokens=16,
    stream=False
)

completion = response.choices[0].text
print(completion)

# Prints:
# , there was a young man named Jack who lived in a small village at the

OpenAI Chat Completion Request

The Chat Completions endpoint is typically used with chat or instruct tuned models that are designed to be used through a conversational approach. With the Chat Completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.

Important

Update the model name to match the model you deployed. For example, if you are serving meta/llama3-70b-instruct, replace meta/llama3-8b-instruct with meta/llama3-70b-instruct in the command below.


curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"},
      {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
      {"role": "user", "content": "Can you write me a song?"}
    ],
    "max_tokens": 32
  }'

You can also use the OpenAI Python API library:


from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Write a short limerick about the wonders of GPU computing."}
]

chat_response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    max_tokens=32,
    stream=False
)

assistant_message = chat_response.choices[0].message
print(assistant_message)

# Prints:
# ChatCompletionMessage(content='There once was a GPU so fine,\nProcessed data in parallel so divine,\nIt crunched with great zest,\nAnd computational quest,\nUnleashing speed, a true wonder sublime!', role='assistant', function_call=None, tool_calls=None)
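
Combining the jq tip from earlier with the Chat Completions endpoint, a hedged one-liner to pull out just the assistant's reply text from the curl response (the field names follow the OpenAI-compatible schema shown above):

# Extract only the assistant's reply from the chat completion response
curl -s 'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Can you write me a song?"}],
    "max_tokens": 32
  }' | jq -r '.choices[0].message.content'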

Attention

If you encounter a BadRequestError with an error message indicating that you are missing the messages or prompt field, you might inadvertently be using the wrong endpoint.

For instance, if you make a Completions request with a request body intended for Chat Completions, you will see the following error:


{ "object": "error", "message": "[{'type': 'missing', 'loc': ('body', 'prompt'), 'msg': 'Field required', ...", "type": "BadRequestError", "param": null, "code": 400 }

Conversely, if you make a Chat Completions request with a request body intended for Completions, you will see the following error:


{ "object": "error", "message": "[{'type': 'missing', 'loc': ('body', 'messages'), 'msg': 'Field required', ...", "type": "BadRequestError", "param": null, "code": 400 }

Verify that the endpoint you are using (that is, /v1/completions or /v1/chat/completions) is correctly configured for your request.

Parameter-Efficient Fine-Tuning

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models. Currently NIM only supports LoRA PEFT. See Parameter-Efficient Fine-Tuning for details.

Stopping the Container

If a Docker container was launched with the --name command line option, you can use the following command to stop the running container.


# In the previous sections, the environment variable CONTAINER_NAME was
# defined using `export CONTAINER_NAME=meta-llama3-8b-instruct`
docker stop $CONTAINER_NAME

Use docker kill if docker stop is not responsive. Follow either command with docker rm $CONTAINER_NAME if you do not intend to restart this container as-is (with docker start $CONTAINER_NAME); in that case, you will need to reuse the docker run ... instructions from the top of this section to start a new container for your NIM.

If you did not start a container with --name, look at the output of docker ps to get a container ID for the given image you used.
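
For example, assuming IMG_NAME is still set from the launch step, a hedged way to narrow the docker ps output to containers created from that image:

# Show only containers started from the NIM image
docker ps --filter "ancestor=$IMG_NAME" --format "{{.ID}}  {{.Names}}  {{.Status}}"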
