Model Profiles#
A NIM model profile defines two things: which model engines NIM can use, and the criteria NIM should use to choose among them. Each profile is identified by a unique string based on a hash of its contents.
NVIDIA provides optimized model profiles for popular data-center GPU models, different GPU counts, and specific numeric precisions. The optimized model profiles have model-specific and hardware-specific optimizations to improve the performance of the model. NIM for LLMs downloads pre-compiled TensorRT-LLM engines for optimized profiles. There may be several valid optimized model profiles for a given system, in which case you can select a profile at deployment time by following the steps in Profile Selection. If you do not select a profile at deployment time, NIM chooses one automatically according to the rules laid out in Automatic Profile Selection.
NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. Generic model profiles can be identified by the presence of local_build or vllm in the profile name. On systems where there are no compatible optimized profiles, a generic profile is chosen automatically. Optimized profiles are preferred over generic profiles when available, but you can choose to deploy a generic profile on any system by following the steps in Profile Selection.
Model profiles are embedded within the NIM container in a Model Manifest file, which by default is placed at /opt/nim/etc/default/model_manifest.yaml within the container filesystem.
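To see which profiles a given NIM container ships with, you can print the manifest directly. The following is a minimal sketch that overrides the container entrypoint, assuming $IMG_NAME is a valid NIM container image name and a standard cat binary is available inside the image:
# Print the embedded model manifest without starting the server.
docker run --rm --entrypoint cat $IMG_NAME /opt/nim/etc/default/model_manifest.yaml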
Profile Selection#
To select a profile for deployment, set a specific profile ID with -e NIM_MODEL_PROFILE=ID, where ID is a profile ID returned by the list-model-profiles utility, as shown in the following example:
# $IMG_NAME must be a valid NIM container image name.
# The command lists profile IDs such as
# 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c,
# shown in the following example output.
docker run --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles
Output:
MODEL PROFILES
- Compatible with system and runnable:
- a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-A100-fp16-tp2-latency)
- 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
- 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
- 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
To select a profile, you can set -e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp1-throughput" or -e NIM_MODEL_PROFILE="751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c" to run the A100 TP1 throughput profile.
docker run -it --rm --name=$CONTAINER_NAME \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp1-throughput \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
Automatic Profile Selection#
If no profile is explicitly chosen, NIM automatically selects the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile has parameters that influence the selection process; the sorting logic based on those parameters is outlined below:
Custom Profiles: Before anything else, if a cached fine-tuned custom profile exists, it is selected.
Optimized Profiles: If any optimized profiles exist for the detected GPUs, one will be chosen according to the following:
Precision: Lower precision profiles are preferred when available. For example, NIM will automatically select FP8 profiles over FP16. Floating-point precisions are also preferred over other numeric types with the same number of bits. For example, FP8 is preferred over INT8, and FP16 is preferred over BF16. See Quantization for more details.
Optimization Target: Latency-optimized profiles are selected over throughput-optimized profiles by default.
Tensor Parallelism: Profiles with higher tensor parallelism (TP) values are preferred. For example, a profile with a TP value of 8 (which requires 8 GPUs to run) will be selected over one with a TP value of 4, assuming enough GPUs are available to run both.
Generic Profiles: If no optimized profiles are compatible, a generic profile will be chosen based on the following criteria:
Backend: NIM will default to using a tensorrt_llm-local_build profile over a vllm profile.
Precision: Lower precision profiles are preferred when available. See the details under Optimized Profiles.
Tensor Parallelism: Profiles with higher tensor parallelism (TP) values are preferred. See the details under Optimized Profiles.
This selection will be logged at startup. For example:
Detected 2 compatible profile(s).
Valid profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput) on GPUs [0]
Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
Selected profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
Profile metadata: precision: fp16
Profile metadata: feat_lora: false
Profile metadata: gpu: A100
Profile metadata: gpu_device: 20b2:10de
Profile metadata: tp: 1
Profile metadata: llm_engine: tensorrt_llm
Profile metadata: pp: 1
Profile metadata: profile: throughput
Profile Details#
Optimized Profiles vs. Local Build Profiles#
Both optimized and generic local build profiles use TensorRT-LLM, but they differ in a few ways.
Optimized profiles leverage GPU-specific TensorRT-LLM options to achieve optimal throughput and latency, and using an optimized profile causes NIM to download a pre-compiled TensorRT-LLM engine.
Local build profiles use heuristics to choose a balanced set of options, and using a local build profile causes NIM to download the raw model weights and perform compilation on the local system. This can lead to longer startup times when deploying with a local build profile that has not been previously deployed and cached.
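If you plan to use a local build profile, keeping the cache mount from the Profile Selection example lets the compiled engine be reused on later launches. A minimal sketch, where the NIM_MODEL_PROFILE value is a placeholder for whichever local_build profile ID list-model-profiles reports on your system:
# First launch compiles the engine and stores it under the mounted cache;
# later launches with the same cache skip recompilation.
docker run -it --rm --gpus all --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_MODEL_PROFILE="<local_build profile ID>" \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 $IMG_NAME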
Optimization Targets#
optimized profiles can have different optimization targets, tailored either to minimize latency or to maximize throughput.
These engine profiles are tagged as latency and throughput, respectively, in the profile name and in the model manifest file included with the NIM.
latency profiles are designed to minimize:
Time to First Token (TTFT): The latency between the initial inference request to the model and the return of the first token.
Inter-Token Latency (ITL): The latency between each token after the first.
throughput profiles are designed to maximize:
Total Throughput per GPU: The total number of tokens generated per second by the NIM, divided by the number of GPUs used.
While there can be many differences between the throughput and latency variants to meet these different criteria, one of the most significant is that throughput variants utilize the minimum number of GPUs required to host a model (typically constrained by memory utilization). Latency variants use additional GPUs to decrease request latency at the cost of decreased total throughput per GPU relative to the throughput variant.
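For example, on the two-GPU A100 system from the Profile Selection output above, you could deploy the latency variant explicitly. This sketch reuses the tensorrt_llm-A100-fp16-tp2-latency profile name from that example listing:
# The TP2 latency variant uses both GPUs; other arguments match the
# Profile Selection example.
docker run -it --rm --gpus all --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp2-latency" \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 $IMG_NAME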
Quantization#
For some models and GPU configurations, quantized engines with reduced numerical precision are available.
These can be identified by the numeric format included in the profile name: for example, fp8 for models that have been quantized to 8-bit floating-point values, as opposed to fp16 for non-quantized 16-bit floating-point values.
All quantized engines are rigorously tested to meet the same accuracy criteria as the default fp16 engines.
Because of this accuracy testing, and because quantization reduces memory requirements and therefore significantly improves both latency and throughput, fp8 models are chosen by default where available.
Quantized profiles fall into three main categories, similar to the full-precision profiles: optimized, trtllm-buildable, and generic. The trtllm-buildable profiles with a quantization tag enable on-the-fly quantization of fp16 or bf16 models during the engine build process. This step may temporarily require extra GPU memory to hold the original fp16 weights and perform the necessary calibration. However, the resulting quantized engine reduces GPU memory usage during inference due to its lower precision. Importantly, once the quantized engine is cached after the first launch of the inference server, subsequent launches only need enough GPU memory for inference, eliminating the need for the additional memory used during the initial quantization. NIM currently supports fp8 quantization of HF and Nemotron models.
For the calibration step in trtllm-buildable profiles, users need to download the cnn_dailymail dataset or any custom dataset for better accuracy. We recommend using cnn_dailymail for the models currently supported.
Note
The user is responsible for checking whether the dataset license is fit for the intended purpose.
Run
git clone https://huggingface.co/datasets/abisee/cnn_dailymail
Mount the dataset and launch NIM with NIM_QUANTIZATION_CALIBRATION_DATASET set. Note that the mounted path in the container should contain the string cnn_dailymail, and that the cnn_dailymail repository root folder name should not be changed.
docker run ... \
-v /path/to/downloaded/cnn_dailymail:/datasets/cnn_dailymail \
-e NIM_QUANTIZATION_CALIBRATION_DATASET=/datasets/cnn_dailymail \
...
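Put together with the launch command from Profile Selection, a complete invocation might look like the following sketch; the NIM_MODEL_PROFILE value is a placeholder for a trtllm-buildable quantized profile ID reported by list-model-profiles on your system:
docker run -it --rm --name=$CONTAINER_NAME \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_MODEL_PROFILE="<trtllm-buildable fp8 profile ID>" \
-e NIM_QUANTIZATION_CALIBRATION_DATASET=/datasets/cnn_dailymail \
-v /path/to/downloaded/cnn_dailymail:/datasets/cnn_dailymail \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME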
To deploy non-quantized fp16 engines, follow the Profile Selection steps.