NIM CLI
The NIM CLI is a downloadable binary that makes it easy to experiment with NVIDIA NIM for LLMs.
This is in an early release state. Some known limitations:
- Does not work on Ubuntu < 20
- Has not been tested thoroughly on Ubuntu < 22
- Cannot use the chat subcommand with LoRAs
- Is currently hard-wired to 24.05-rc3
- Can only serve traffic on port 8000 on the host machine
You can download the CLI from the nvidian/nim-llm-dev registry on NGC via the browser or with this NGC CLI command:
ngc registry resource download-version nvidian/nim-llm-dev/nim-cli
It is a single binary built for the x86 architecture. After downloading, run chmod +x nim to make it executable and add it to your PATH.
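For example, assuming you downloaded the binary to the current directory (the /usr/local/bin destination below is just one common choice; any directory on your PATH works):
chmod +x nim                  # make the binary executable
sudo mv nim /usr/local/bin/   # move it onto your PATH (destination is an assumption)
nim -V                        # verify the install by printing the version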
If you do not have the ngc CLI tool, refer to the NGC CLI documentation for information on downloading and configuring it.
Inspect the commands available using the -h help flag.
nim -h
Output:
NVIDIA LLM Inference Command Line Inference (CLI)
Usage: nim <COMMAND>
Commands:
list List models (see subcommands)
run Run a NIM
status Check the status of running NIMs
logs Tail the logs for a running NIM
stop Stop a NIM
benchmark Run benchmarks against a running NIM
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
The command nim list all can be used to list the NIMs available to run.
nim list all
Output:
+--------------------------+
| Model |
+--------------------------+
| meta/llama3-70b-instruct |
+--------------------------+
| meta/llama3-8b-instruct |
+--------------------------+
To run a NIM, use the command nim run <model-id>. The following NIM CLI command downloads the meta-llama3-8b-instruct model if it isn't already downloaded and prepares it so you can run inference against it.
nim run meta-llama3-8b-instruct
Output:
[NIM] Docker Run Command:
#!/bin/bash
docker run --gpus='"device=1"' --rm -ti --label name=nim__meta-llama3-8b-instruct__port-8000 \
-v ~/.cache/ngc:/home/nvs/.cache \
-e "MODEL_NAME=meta-llama3-8b-instruct" \
-e NGC_API_KEY=$NGC_API_KEY \
-p "8000:8000" \
nvcr.io/nvidian/nim-llm-dev/meta-llama3-8b-instruct:24.05.rc3
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 24.05
Model: /model-store/meta-llama3-8b-instruct
meta-llama3-8b-instruct
Downloading model from NGC...
Checking meta-llama3-8b-instruct versions...
Downloading hf version of meta-llama3-8b-instruct...
...
... <logging omitted>
...
INFO: Started server process [115]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 05-14 01:14:20 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
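Note that the generated docker run command passes your NGC API key through the NGC_API_KEY environment variable, so make sure it is exported before you launch the NIM (the value below is a placeholder):
export NGC_API_KEY=<your-ngc-api-key>   # placeholder; use your own key
nim run meta-llama3-8b-instruct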
Our NIM is now running! You can check the models deployed in that NIM by sending a GET request to the v1/models endpoint:
curl -X GET localhost:8000/v1/models
Output:
{
"object": "list",
"data": [
{
"id": "meta-llama3-8b-instruct",
"object": "model",
"created": 1715649304,
"owned_by": "vllm",
"root": "meta-llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-2a8eb3aa92fb42f0b22525f2faec0b23",
"object": "model_permission",
"created": 1715649304,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
}
]
}
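If you have jq installed, you can pull just the model IDs out of that response:
# print only the "id" field of each entry in the "data" array
curl -s localhost:8000/v1/models | jq -r '.data[].id'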
Running a NIM with LoRAs
First, download some meta-llama3-8b-instruct LoRAs to ~/loras on your local machine:
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY && cd $LOCAL_PEFT_DIRECTORY
# downloading .nemo loras
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:nemo-squad-v1"
# downloading HF loras
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:hf-math-v1"
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:hf-squad-v1"
chmod -R 777 $LOCAL_PEFT_DIRECTORY
Then run the NIM:
nim run meta-llama3-8b-instruct --peft-path $LOCAL_PEFT_DIRECTORY
Note that as of 24.05.rc3, the LoRAs in the directory passed to --peft-path must be for the base model. If you have LoRAs for other base models, you will encounter errors.
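Once the NIM is up, you can confirm the adapters were picked up by listing the models again; the loaded LoRAs should appear alongside the base model (the exact entry names depend on your LoRA directory names, so treat this as a sanity check rather than exact expected output):
curl -X GET localhost:8000/v1/models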
You can send a Completion request to a deployed NIM:
nim completion --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct
Output:
"Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?"
Enable streaming using the --streaming
flag.
nim completion --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct --streaming
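The NIM exposes an OpenAI-compatible API, so the same request can typically be sent straight to the v1/completions endpoint. The body below follows the standard OpenAI completions format and is an illustration rather than NIM-specific documentation:
curl -X POST localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama3-8b-instruct", "prompt": "Hi!", "max_tokens": 1000}'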
You can chat with a NIM by sending a Chat Completion request with:
nim chat --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct
Output:
"Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?"
Enable streaming using the --streaming
flag.
nim chat --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct --streaming
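The chat subcommand corresponds to the v1/chat/completions endpoint (the same endpoint the benchmark command below targets), so an equivalent raw request looks roughly like the following. The request body is a generic OpenAI-style example, not taken from the NIM CLI itself:
curl -X POST localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama3-8b-instruct",
        "messages": [{"role": "user", "content": "Hi!"}],
        "max_tokens": 1000
      }'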
The status of currently running NIMs can be checked with:
nim status
Output:
+--------------------------+-------------------+------------------------+
| NIM NAME | STATUS | PORTS |
+--------------------------+-------------------+------------------------+
| /meta-llama3-8b-instruct | Up About a minute | 8000->8000,8000->8000, |
+--------------------------+-------------------+------------------------+
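Because the CLI launches each NIM as a Docker container with a name label (visible in the generated docker run command earlier), you can also inspect it with plain Docker if you prefer:
# filter running containers by the label the CLI sets
docker ps --filter "label=name=nim__meta-llama3-8b-instruct__port-8000"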
The logs of a running NIM can be checked by using nim logs <model-id>. To inspect the logs of the NIM that was launched in the previous section, run the following command:
nim logs meta-llama3-8b-instruct
nim benchmark <model-id> runs a benchmark against a deployed NIM using the GenAI-Perf tool from the Triton Inference Server SDK. To benchmark the NIM that was launched in the previous section, run the following command:
nim benchmark meta-llama3-8b-instruct
Output:
Directories created successfully.
=================================
== Triton Inference Server SDK ==
=================================
NVIDIA Release 24.04 (build 90085241)
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
genai-perf - INFO - Running Perf Analyzer : 'perf_analyzer -m nvidian/nim-llm-dev/meta-llama3-8b-instruct@hf --async --input-data llm_inputs.json --endpoint v1/chat/completions --service-kind openai -u http://localhost:8000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file profile_export.json --concurrency-range 5 -i http'