Getting Started#

Launch NVIDIA NIM for LLMs#

You can download and run the NIM of your choice from either the API Catalog or NGC.

Option 1: From API Catalog#

Check out this video, which illustrates the following steps.

Generate an API Key#

1. Navigate to the LLM you wish to deploy from the API Catalog.
2. Select “Docker” in the upper right pane.
3. Select “Get API Key” and log in if prompted.
4. Select “Generate Key”.
5. Copy your key and store it in a secure place. Do not share it.

Login to Docker#

Use the docker login command, as shown in the following screenshot, to log in to Docker. Replace the placeholders for Username and Password with your values.

Download and Launch NVIDIA NIM for LLMs#

Use the following command to pull and run the NIM using Docker. To modify the docker run parameters, see Docker Run Parameters. You can now skip ahead to Run Inference.

Option 2: From NGC#

Generate an API key#

An NGC API key is required to access NGC resources. You can generate a key here: https://org.ngc.nvidia.com/setup/personal-keys.

When creating an NGC API key, ensure that at least “NGC Catalog” is selected from the “Services Included” dropdown. You can include more services if you plan to reuse this key for other purposes.

Export the API key#

Pass the value of the API key to the docker run command in the next section as the NGC_API_KEY environment variable. The NIM uses it at startup to download the appropriate models and resources.

If you’re not familiar with how to create the NGC_API_KEY environment variable, the simplest way is to export it in your terminal (note that the shell does not allow spaces around =):

    export NGC_API_KEY=<value>

Run one of the following commands to make the key available at startup:

    # If using bash
    echo "export NGC_API_KEY=<value>" >> ~/.bashrc

    # If using zsh
    echo "export NGC_API_KEY=<value>" >> ~/.zshrc

Note: Other, more secure options include saving the value in a file, so that you can retrieve it with cat $NGC_API_KEY_FILE, or using a password manager.

Docker login to NGC#

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:

    echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key and not a user name and password.

List available NIMs#

This documentation uses the ngc CLI tool in a number of examples. See the NGC CLI documentation for information on downloading and configuring the tool.

Use the following command to list the available NIMs in CSV format.

    ngc registry image list --format_type csv nvcr.io/nim/meta/*

This command should produce output in the following format:

    Name,Repository,Latest Tag,Image Size,Updated Date,Permission,Signed Tag?,Access Type,Associated Products
    <name1>,<repository1>,<latest tag1>,<image size1>,<updated date1>,<permission1>,<signed tag?1>,<access type1>,<associated products1>
    ...
    <nameN>,<repositoryN>,<latest tagN>,<image sizeN>,<updated dateN>,<permissionN>,<signed tag?N>,<access typeN>,<associated productsN>

Use the Repository and Latest Tag fields when you call the docker run command, as shown in the following section.

Launch NIM#

The following command launches a Docker container for the llama3-8b-instruct model. To launch a container for a different NIM, replace the values of Repository and Latest_Tag with values from the previous image list command and change the value of CONTAINER_NAME to something appropriate.

You can confirm that you have the correct Repository and Latest_Tag values by getting information about the model with the following command:

    ngc registry model info --format_type ascii ${Repository}:${Latest_Tag}

This should produce output like the following:

    ----------------------------------------------------------
      Model Version Information
        Id: 0.10.0+e6f46027-h100x1-fp16-balanced.24.06.15839955
        Batch Size:
        Memory Footprint:
        Number Of Epochs:
        Accuracy Reached:
        GPU Model:
        Access Type:
        Associated Products:
        Created Date: 2024-06-14T22:28:17.604Z
        Description:
        Status: UPLOAD_COMPLETE
        Total File Count: 11
        Total Size: 14.96 GB
    ----------------------------------------------------------

Note: To deploy models that don’t fit on a single node, see Multi-node Deployments.

    # Choose a container name for bookkeeping
    export CONTAINER_NAME=Llama3-8B-Instruct

    # The repository and tag from the previous ngc registry image list command
    Repository=nim/meta/llama3-8b-instruct
    Latest_Tag=1.0.0

    # Choose an LLM NIM image from NGC
    export IMG_NAME="nvcr.io/${Repository}:${Latest_Tag}"

    # Choose a path on your system to cache the downloaded models
    export LOCAL_NIM_CACHE=~/.cache/nim
    mkdir -p "$LOCAL_NIM_CACHE"

    # Start the LLM NIM
    docker run -it --rm --name=$CONTAINER_NAME \
      --runtime=nvidia \
      --gpus all \
      --shm-size=16GB \
      -e NGC_API_KEY=$NGC_API_KEY \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
      -u $(id -u) \
      -p 8000:8000 \
      $IMG_NAME
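
The Repository and Latest Tag columns of the CSV listing can also be turned into image names programmatically. The following is a minimal sketch, assuming the CSV header shown in List available NIMs; the helper name image_names and the sample row are illustrative only (the placeholder columns are not real values) and are not part of the ngc CLI.

```python
import csv
import io


def image_names(csv_text, registry="nvcr.io"):
    """Build '<registry>/<Repository>:<Latest Tag>' for each row of the CSV listing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = []
    for row in reader:
        repository = (row.get("Repository") or "").strip()
        tag = (row.get("Latest Tag") or "").strip()
        # Skip rows that lack either field instead of emitting a broken name.
        if repository and tag:
            names.append(f"{registry}/{repository}:{tag}")
    return names


# Hypothetical sample: only the Repository and Latest Tag values matter here.
SAMPLE_CSV = (
    "Name,Repository,Latest Tag,Image Size,Updated Date,Permission,"
    "Signed Tag?,Access Type,Associated Products\n"
    "<name>,nim/meta/llama3-8b-instruct,1.0.0,<image size>,<updated date>,"
    "<permission>,<signed tag?>,<access type>,<associated products>\n"
)

if __name__ == "__main__":
    for name in image_names(SAMPLE_CSV):
        print(name)
```

Each resulting string has the same shape as the IMG_NAME value used in the docker run command.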

Docker Run Parameters#

Flags and their descriptions:

-it
    --interactive + --tty (see Docker docs)

--rm
    Delete the container after it stops (see Docker docs)

--name=llama3-8b-instruct
    Give a name to the NIM container for bookkeeping (here llama3-8b-instruct). Use any preferred value.

--runtime=nvidia
    Ensure NVIDIA drivers are accessible in the container.

--gpus all
    Expose all NVIDIA GPUs inside the container. See the configuration page for mounting specific GPUs.

--shm-size=16GB
    Allocate host memory for multi-GPU communication. Not required for single GPU models or GPUs with NVLink enabled.

-e NGC_API_KEY
    Provide the container with the token necessary to download adequate models and resources from NGC. See Export the API key.

-v "$LOCAL_NIM_CACHE:/opt/nim/.cache"
    Mount a cache directory from your system (~/.cache/nim here) inside the NIM (defaults to /opt/nim/.cache), allowing downloaded models and artifacts to be reused by follow-up runs.

-u $(id -u)
    Use the same user as your system user inside the NIM container to avoid permission mismatches when downloading models in your local cache directory.

-p 8000:8000
    Forward the port where the NIM server is published inside the container to access from the host system. The left-hand side of : is the host system ip:port (8000 here), while the right-hand side is the container port where the NIM server is published (defaults to 8000).

$IMG_NAME
    Name and version of the LLM NIM container from NGC. The LLM NIM server automatically starts if no argument is provided after this.

Note: See the Configuring a NIM topic for information about additional configuration settings.

Note: If you have an issue with permission mismatches when downloading models in your local cache directory, add the -u $(id -u) option to the docker run call.

Note: NIM automatically selects the most suitable profile based on your system specification. For details, see Automatic Profile Selection.
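
To make the role of each flag concrete, the sketch below assembles the same docker run argument list in Python. It is illustrative only: build_docker_run and its parameter names are hypothetical, and it assumes a Unix-like host (os.getuid). It passes -e NGC_API_KEY without a value, which asks Docker to forward the variable from the host environment so the key does not appear in the argument list.

```python
import os
import shlex


def build_docker_run(image, container_name, cache_dir,
                     host_port=8000, container_port=8000, uid=None):
    """Return the docker run argument list described in the flags above."""
    if uid is None:
        uid = os.getuid()  # same effect as $(id -u); Unix-like hosts only
    return [
        "docker", "run", "-it", "--rm",
        f"--name={container_name}",
        "--runtime=nvidia",
        "--gpus", "all",
        "--shm-size=16GB",
        "-e", "NGC_API_KEY",                    # forwarded from the host environment
        "-v", f"{cache_dir}:/opt/nim/.cache",   # host cache : container cache
        "-u", str(uid),
        "-p", f"{host_port}:{container_port}",  # host port : container port
        image,
    ]


if __name__ == "__main__":
    argv = build_docker_run(
        image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
        container_name="Llama3-8B-Instruct",
        cache_dir=os.path.expanduser("~/.cache/nim"),
    )
    # Print a shell-safe command line for review instead of running it.
    print(shlex.join(argv))
```

Changing host_port (for example to 9000) moves the endpoint on the host to that port while the server inside the container keeps listening on 8000.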

Run Inference#

During startup the NIM container downloads the required resources and begins serving the model behind an API endpoint. The following message indicates a successful startup.

    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Once you see this message you can validate the deployment of NIM by executing an inference request. In a new terminal, run the following command to show a list of models available for inference:

    curl -X GET 'http://0.0.0.0:8000/v1/models'

Tip: Pipe the results of curl commands into a tool like jq or python -m json.tool to make the output of the API easier to read. For example:

    curl -s http://0.0.0.0:8000/v1/models | jq .

This command should produce output similar to the following:

    {
      "object": "list",
      "data": [
        {
          "id": "meta/llama3-8b-instruct",
          "object": "model",
          "created": 1715659875,
          "owned_by": "vllm",
          "root": "meta/llama3-8b-instruct",
          "parent": null,
          "permission": [
            {
              "id": "modelperm-e39aaffe7015444eba964fa7736ae653",
              "object": "model_permission",
              "created": 1715659875,
              "allow_create_engine": false,
              "allow_sampling": true,
              "allow_logprobs": true,
              "allow_search_indices": false,
              "allow_view": true,
              "allow_fine_tuning": false,
              "organization": "*",
              "group": null,
              "is_blocking": false
            }
          ]
        }
      ]
    }

OpenAI Completion Request#

The Completions endpoint is typically used for base models. With the Completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completions subject to the other parameters chosen. To stream the result, set "stream": true.

Important: Update the model name to suit your requirements.
For example, for a llama3-8b-instruct model, you might use the following command:

    curl -X 'POST' \
      'http://0.0.0.0:8000/v1/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "meta/llama3-8b-instruct",
        "prompt": "Once upon a time",
        "max_tokens": 64
      }'

You can also use the OpenAI Python API library.

    from openai import OpenAI

    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
    prompt = "Once upon a time"
    response = client.completions.create(
        model="meta/llama3-8b-instruct",
        prompt=prompt,
        max_tokens=16,
        stream=False
    )
    completion = response.choices[0].text
    print(completion)
    # Prints:
    # , there was a young man named Jack who lived in a small village at the

OpenAI Chat Completion Request#

The Chat Completions endpoint is typically used with chat or instruct tuned models that are designed to be used through a conversational approach. With the Chat Completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.

Important: Update the model name according to your requirements.

For example, for a llama3-8b-instruct model, you might use the following command:

    curl -X 'POST' \
      'http://0.0.0.0:8000/v1/chat/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "meta/llama3-8b-instruct",
        "messages": [
          {
            "role":"user",
            "content":"Hello! How are you?"
          },
          {
            "role":"assistant",
            "content":"Hi! I am quite well, how can I help you today?"
          },
          {
            "role":"user",
            "content":"Can you write me a song?"
          }
        ],
        "max_tokens": 32
      }'

You can also use the OpenAI Python API library.

    from openai import OpenAI

    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
    messages = [
        {"role": "user", "content": "Hello! How are you?"},
        {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
        {"role": "user", "content": "Write a short limerick about the wonders of GPU computing."}
    ]
    chat_response = client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=messages,
        max_tokens=32,
        stream=False
    )
    assistant_message = chat_response.choices[0].message
    print(assistant_message)
    # Prints:
    # ChatCompletionMessage(content='There once was a GPU so fine,\nProcessed data in parallel so divine,\nIt crunched with great zest,\nAnd computational quest,\nUnleashing speed, a true wonder sublime!', role='assistant', function_call=None, tool_calls=None)

Attention: If you encounter a BadRequestError with an error message indicating that you are missing the messages or prompt field, you might inadvertently be using the wrong endpoint.

For example, if you make a Completions request with a request body intended for Chat Completions, you get the following error:

    {
      "object": "error",
      "message": "[{'type': 'missing', 'loc': ('body', 'prompt'), 'msg': 'Field required', ...",
      "type": "BadRequestError",
      "param": null,
      "code": 400
    }

Conversely, if you make a Chat Completions request with a request body intended for Completions, you get the following error:

    {
      "object": "error",
      "message": "[{'type': 'missing', 'loc': ('body', 'messages'), 'msg': 'Field required', ...",
      "type": "BadRequestError",
      "param": null,
      "code": 400
    }

Verify that the endpoint you are using, such as /v1/completions or /v1/chat/completions, is correctly configured for your request.

Parameter-Efficient Fine-Tuning#

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models. Currently NIM only supports LoRA PEFT. See Parameter-Efficient Fine-Tuning for details.
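
Returning to the endpoint-mismatch errors described under Run Inference: a client can check the error body for the missing field and report which endpoint the request body was probably meant for. This is a rough sketch that assumes the error bodies look like the two examples in this guide; suggest_endpoint is a hypothetical helper, not part of NIM or the OpenAI library, and it relies on simple substring matching.

```python
def suggest_endpoint(error_body):
    """Guess the endpoint that matches the request body, or return None.

    Assumes the BadRequestError shape shown in this guide, where the message
    names the missing field as ('body', 'prompt') or ('body', 'messages').
    """
    if error_body.get("type") != "BadRequestError":
        return None
    message = error_body.get("message") or ""
    if "('body', 'prompt')" in message:
        # Completions wanted "prompt": the body was probably a chat body.
        return "/v1/chat/completions"
    if "('body', 'messages')" in message:
        # Chat Completions wanted "messages": the body was probably a plain prompt.
        return "/v1/completions"
    return None


if __name__ == "__main__":
    example = {
        "object": "error",
        "message": "[{'type': 'missing', 'loc': ('body', 'prompt'), 'msg': 'Field required', ...",
        "type": "BadRequestError",
        "param": None,
        "code": 400,
    }
    print(suggest_endpoint(example))
```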

Stopping the container#

If a Docker container is launched with the --name command line option, you can use the following command to stop the running container.

    docker stop $CONTAINER_NAME

Use docker kill if stop is not responsive. Follow that command with docker rm $CONTAINER_NAME if you do not intend to restart the container as-is (with docker start $CONTAINER_NAME). In that case, re-use the docker run ... instructions from the beginning of this section to start a new container for your NIM.

If you did not start a container with --name, examine the output of the docker ps command to get the container ID for the image you used.
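
The stop, kill, and rm sequence above can be scripted. The following sketch is one possible arrangement, not an official tool: stop_container and its parameters are hypothetical names, and the run argument exists only so the command runner can be swapped out (for example in a dry run).

```python
import subprocess


def stop_container(name, remove=False, timeout=30, run=subprocess.run):
    """Try docker stop, fall back to docker kill, and optionally docker rm.

    Returns the list of docker subcommands that were issued, in order.
    """
    issued = ["stop"]
    try:
        # Give up on a graceful stop if the command itself does not return in time.
        stopped = run(["docker", "stop", name], timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        stopped = False
    if not stopped:
        run(["docker", "kill", name])
        issued.append("kill")
    if remove:
        run(["docker", "rm", name])
        issued.append("rm")
    return issued


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    def echo(args, **kwargs):
        print(" ".join(args))
        return subprocess.CompletedProcess(args, 0)

    stop_container("Llama3-8B-Instruct", run=echo)
```

If the container was started with --rm, as in the launch command in this guide, Docker deletes it when it stops, so remove=True is typically unnecessary.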

Kubernetes Installation#

The nim-deploy GitHub repository showcases several reference implementations for Kubernetes installations. These examples are experimental and might require modification to run in your particular cluster setup.