Utilities for NVIDIA NIM for LLMs#
NIM includes a set of utility scripts to assist with NIM operation.
Launch utilities by adding the utility name to the docker run command.
For example, you can execute the list-model-profiles utility with the following command:
docker run --rm --runtime=nvidia --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles
You can get more information about each utility with the -h flag:
docker run --rm --runtime=nvidia --gpus=all $IMG_NAME download-to-cache -h
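If you call several utilities against the same image, a small shell wrapper can save retyping the common flags. The following is a minimal sketch and is not part of NIM: the function name nim_util is hypothetical, and it assumes that IMG_NAME and NGC_API_KEY are already exported in your shell.

```shell
# Hypothetical convenience wrapper (not shipped with NIM).
# Assumes IMG_NAME and NGC_API_KEY are already exported in the calling shell.
nim_util() {
  if [ -z "${IMG_NAME:-}" ]; then
    echo "nim_util: IMG_NAME is not set" >&2
    return 1
  fi
  # "-e NGC_API_KEY" with no value forwards the variable from the host environment.
  # "$@" is the utility name followed by any arguments for that utility.
  docker run --rm --runtime=nvidia --gpus=all -e NGC_API_KEY "$IMG_NAME" "$@"
}
```

With this in place, nim_util list-model-profiles and nim_util download-to-cache -h correspond to the two commands shown above.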
List Available Model Profiles for the Multi-LLM NIM#
- list-model-profiles
Prints to the console the system information detected by NIM and the list of all profiles for the chosen NIM. Profiles are categorized by whether they are compatible with the current system, based on the detected system information, and are filtered according to the profile selection strategies discussed in automatic profile selection.
Use this tool to verify whether your model is suitable for deployment. Refer to Troubleshooting for common failure cases and fixes.
Example#
export NIM_MODEL_NAME=<HF/NGC_or_local_path>
docker run -it --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY -e HF_TOKEN=$HF_TOKEN -e NIM_MODEL_NAME=$NIM_MODEL_NAME $IMG_NAME list-model-profiles
MODEL PROFILES
- Compatible with system and runnable:
- e2f00b2cbfb168f907c8d6d4d40406f7261111fbab8b3417a485dcd19d10cc98 (vllm)
- 668b575f1701fa70a97cfeeae998b5d70b048a9b917682291bb82b67f308f80c (tensorrt_llm)
- 50e138f94d85b97117e484660d13b6b54234e60c20584b1de6ed55d109ca4f21 (sglang)
- With LoRA support:
- 93c5e281d6616f45e2ef801abf4ed82fc65e38ec5f46e0664f340bad4f92d551 (vllm-lora)
- cdcd22d151713c8b91fcd279a4b5e021153e72ff5cf6ad5498aac96974f5b7d7 (tensorrt_llm-lora)
- Compilable to TRT-LLM using just-in-time compilation of HF models to TRTLLM engines: <None>
The filtered profiles are available for the user to opt into.
List Available Model Profiles for LLM-Specific NIMs#
- list-model-profiles
Prints the system information detected by NIM and the list of profiles for the chosen NIM. Profiles are categorized by compatibility with the current system.
Example#
docker run -it --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles
SYSTEM INFO
- Free GPUs:
- [20b2:10de] (0) NVIDIA A100-SXM4-80GB (A100 80GB) [current utilization: 0%]
- [20b2:10de] (1) NVIDIA A100-SXM4-80GB (A100 80GB) [current utilization: 0%]
- [20b2:10de] (2) NVIDIA A100-SXM4-80GB (A100 80GB) [current utilization: 0%]
- [20b2:10de] (3) NVIDIA A100-SXM4-80GB (A100 80GB) [current utilization: 0%]
- [20b2:10de] (4) NVIDIA A100-SXM4-80GB (A100 80GB) [current utilization: 0%]
- [20b2:10de] (5) NVIDIA A100-SXM4-80GB (A100 80GB) [current utilization: 0%]
- [20b2:10de] (6) NVIDIA A100-SXM4-80GB (A100 80GB) [current utilization: 0%]
- [20b2:10de] (7) NVIDIA A100-SXM4-80GB (A100 80GB) [current utilization: 0%]
MODEL PROFILES
- Compatible with system and runnable:
- d86754a6413430bf502ece62fdcc8137d4ed24d6062e93c23c1090f0623d535f (tensorrt_llm-a100-bf16-tp8-latency)
- 6f437946f8efbca34997428528d69b08974197de157460cbe36c34939dc99edb (tensorrt_llm-a100-bf16-tp4-throughput)
- 7283d5adcddeeab03996f61a33c51552d9bcff16c38e4a52f1204210caeb393c (vllm-fp16-tp8)
- cdcbc486dd076bc287cca6262c59fe90057d76ae18a407882075f65a99f5f038 (vllm-fp16-tp4)
- With LoRA support:
- 4cac7d500b9ed35bc51cb7845e637288c682f4a644f0b4e6a4f71d3b8b188101 (tensorrt_llm-a100-bf16-tp4-throughput-lora)
- 7096ab12e70abc4ac0e125a90a8e40b296891603fad45d2b208d655ac1dea9d8 (vllm-fp16-tp8-lora)
- d4bc4be4167c103b45d9375c9a907c11339f59235dfc5de321a9e13d8132aba6 (vllm-fp16-tp4-lora)
- Incompatible with system:
- 5296eed82c6309b64b13da03fbb843d99c3276effd6a0c51e28ad5bb29f56017 (tensorrt_llm-h100-fp8-tp8-latency)
- 4e0aeeefd4dfeae46ad40f16238bbde8858850ce0cf56c26449f447a02a9ac8f (tensorrt_llm-h100-fp8-tp4-throughput)
- ...
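When scripting a deployment, it can be useful to pull the runnable profile hashes out of this output. The following is a minimal sketch, not a NIM feature: the function name compatible_profiles is hypothetical, and it assumes each profile line has the form "- <64-character hex hash> (<name>)" as in the sample output above. It prints only the hashes listed directly under "Compatible with system and runnable:" and stops at the next line that is not a profile line, so the LoRA and incompatible profiles are skipped.

```shell
# Hypothetical helper: read list-model-profiles output on stdin and print the
# hashes listed directly under "Compatible with system and runnable:".
# Assumes profile lines look like "- <64-hex-hash> (<name>)"; indentation is ignored.
compatible_profiles() {
  awk '
    # Start of the section we care about.
    /Compatible with system and runnable:/ { in_section = 1; next }
    # A profile line: print its hash only while inside the section.
    $1 == "-" && length($2) == 64 && $2 ~ /^[0-9a-f]+$/ {
      if (in_section) print $2
      next
    }
    # Any other line (a new header, "- ...", and so on) ends the section.
    { in_section = 0 }
  '
}
```

One way to use it is to pipe the utility output into the function, for example docker run --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles | compatible_profiles. Omitting -it when piping avoids TTY-related carriage returns in the captured text.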
Download Model Profiles to the NIM Cache#
- download-to-cache
Downloads selected or default model profiles to the NIM cache. Use this to precache profiles before deployment. Requires NGC_API_KEY in the environment.
- --profiles [PROFILES ...], -p [PROFILES ...]#
Profile hashes to download. If none are provided, the optimal profile is downloaded. To download multiple profiles, separate the hashes with spaces.
- --all#
Downloads all profiles to the cache.
- --lora#
Downloads the default LoRA profile. You cannot use this option with either the --profiles or the --all option.
Example#
docker run -it --rm --gpus all -e NGC_API_KEY -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
$IMG_NAME download-to-cache -p 6f437946f8efbca34997428528d69b08974197de157460cbe36c34939dc99edb
INFO 08-12 18:44:07.810 pre_download.py:80] Fetching contents for profile 6f437946f8efbca34997428528d69b08974197de157460cbe36c34939dc99edb
INFO 08-12 18:44:07.810 pre_download.py:86] {
"feat_lora": "false",
"gpu": "A100",
"gpu_device": "20b2:10de",
"llm_engine": "tensorrt_llm",
"pp": "1",
"precision": "bf16",
"profile": "throughput",
"tp": "4"
}
...
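Because --lora cannot be combined with --profiles or --all, a wrapper script may want to reject that combination before starting a container. The following is a minimal sketch under that assumption. The function name check_download_flags is hypothetical, and it only inspects the flags described above.

```shell
# Hypothetical pre-flight check (not part of NIM): fail early on a flag
# combination that the option descriptions above rule out.
check_download_flags() {
  _has_lora=0
  _has_selector=0
  for _arg in "$@"; do
    case "$_arg" in
      --lora) _has_lora=1 ;;
      --all|--profiles|--profiles=*|-p) _has_selector=1 ;;
    esac
  done
  if [ "$_has_lora" -eq 1 ] && [ "$_has_selector" -eq 1 ]; then
    echo "error: --lora cannot be combined with --profiles/-p or --all" >&2
    return 1
  fi
  return 0
}
```

A script could call check_download_flags "$@" first and run download-to-cache with the same arguments only if the check succeeds.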
Create a Model Store#
- create-model-store
Extracts files from a cached model profile and creates a properly formatted directory. If the profile is not already cached, the command downloads it to the model cache. Downloading requires NGC_API_KEY in the environment.
- --profile <PROFILE>, -p <PROFILE>#
Hash of the profile to create a model directory for. The profile is downloaded if it is not already cached.
- --model-store <MODEL_STORE>, -m <MODEL_STORE>#
The directory path to which the contents of the profile given by --profile are extracted and copied.
Example#
docker run -it --rm --gpus all -e NGC_API_KEY -v $LOCAL_NIM_CACHE:/opt/nim/.cache $IMG_NAME create-model-store -p 6f437946f8efbca34997428528d69b08974197de157460cbe36c34939dc99edb -m /tmp
INFO 08-12 19:49:47.629 pre_download.py:128] Fetching contents for profile 6f437946f8efbca34997428528d69b08974197de157460cbe36c34939dc99edb
INFO 08-12 19:49:47.629 pre_download.py:135] Copying contents for profile 6f437946f8efbca34997428528d69b08974197de157460cbe36c34939dc99edb to /tmp
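In the example above, the model store is written to /tmp inside a container started with --rm, so the files are discarded when the container exits. To keep the model store, one option is to bind-mount a host directory and point -m at it. The following is a minimal sketch of that idea. The function name create_model_store_on_host and the container path /model-store are hypothetical choices for this example, and the container user must have write permission on the mounted directory.

```shell
# Sketch only: the function name and the container path /model-store are
# hypothetical. Assumes IMG_NAME, LOCAL_NIM_CACHE, and NGC_API_KEY are set.
create_model_store_on_host() {
  # $1 = profile hash, $2 = absolute host directory that receives the model store
  if [ "$#" -ne 2 ]; then
    echo "usage: create_model_store_on_host <profile> <host_dir>" >&2
    return 1
  fi
  # Create the host directory first so that it is owned by the current user.
  mkdir -p "$2" || return 1
  docker run --rm --gpus all -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -v "$2:/model-store" \
    "$IMG_NAME" create-model-store -p "$1" -m /model-store
}
```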
Check the NIM Cache#
- nim-llm-check-cache-env
Checks whether the NIM cache directory is present and writable.
Example#
docker run -it --rm --gpus all -v /bad_path:/opt/nim/.cache $IMG_NAME nim-llm-check-cache-env
WARNING 08-12 19:54:06.347 caches.py:30] /opt/nim/.cache is read-only, application may fail if model is not already present in cache
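A similar check can be run on the host before mounting a directory as the cache. The following is a minimal sketch and does not reproduce the logic of nim-llm-check-cache-env. The function name check_host_cache_dir is hypothetical, and it tests only whether the current host user can write to the directory, which may differ from what the container user can do.

```shell
# Hypothetical host-side pre-check; it only tests the directory you plan to
# mount at /opt/nim/.cache, from the point of view of the current host user.
check_host_cache_dir() {
  if [ ! -d "$1" ]; then
    echo "missing: $1 does not exist or is not a directory" >&2
    return 1
  fi
  if [ ! -w "$1" ]; then
    echo "read-only: $1 is not writable by $(id -un)" >&2
    return 2
  fi
  echo "ok: $1"
}
```

For example, check_host_cache_dir "$LOCAL_NIM_CACHE" prints an ok line when the directory exists and is writable, and returns a nonzero status otherwise.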
Set Cache Environment Variables#
- nim-llm-set-cache-env
Prints to the console the commands for setting cache environment variables.
Example#
docker run -it --rm --gpus all -v $LOCAL_NIM_CACHE:/opt/nim/.cache $IMG_NAME nim-llm-set-cache-env
export NUMBA_CACHE_DIR=/tmp/numba
export NGC_HOME=/opt/nim/.cache/ngc
export HF_HOME=/opt/nim/.cache/huggingface
export VLLM_CONFIG_ROOT=/opt/nim/.cache/vllm/config
export VLLM_CACHE_ROOT=/opt/nim/.cache/vllm/cache
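If you want to apply the printed commands in a shell rather than copy them by hand, it is safer to filter the output first. The following is a minimal sketch. The function name only_exports is hypothetical. It removes the carriage returns that a TTY-attached container (-t) adds to each line and keeps only lines of the form export NAME=value whose value consists of plain path characters, so that log lines or unexpected text are never evaluated.

```shell
# Hypothetical filter: keep only well-formed "export NAME=value" lines with
# plain path-like values, after stripping TTY carriage returns.
only_exports() {
  tr -d '\r' | grep -E '^export [A-Za-z_][A-Za-z0-9_]*=[A-Za-z0-9_./:-]*$'
}
```

The filtered text can then be passed to eval in the shell that should receive the variables, for example eval "$(docker run --rm --gpus all -v $LOCAL_NIM_CACHE:/opt/nim/.cache $IMG_NAME nim-llm-set-cache-env | only_exports)". Note that the printed values are paths inside the container.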