Model Profiles
A NIM model profile defines two things: which model engines NIM can use, and the criteria NIM uses to choose among those engines. Each profile is identified by a unique string derived from a hash of the profile contents.
Users may select a profile at deployment time by following the Profile Selection steps. If the user does not manually select a profile at deployment time, NIM will choose a profile automatically according to the rules laid out in Automatic Profile Selection. To understand how profiles and their corresponding engines are created, see How Profiles are Created.
Model profiles are embedded within the NIM container in a Model Manifest file, which is placed by default at /etc/nim/config/model_manifest.yaml within the container filesystem.
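NIM's exact hashing scheme is internal, but the idea of deriving a stable profile ID from the profile contents can be sketched as follows. The profile fields below are illustrative assumptions, not the real manifest schema:

```python
import hashlib
import json

def profile_id(profile: dict) -> str:
    # Serialize the profile deterministically (sorted keys) so the same
    # contents always hash to the same ID.
    canonical = json.dumps(profile, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Hypothetical profile contents -- the real manifest schema may differ.
profile = {
    "llm_engine": "tensorrt_llm",
    "gpu": "A100",
    "precision": "fp16",
    "tp": 1,
    "profile": "throughput",
}

pid = profile_id(profile)
print(len(pid))  # 64 -- a 64-character hex string, like the IDs below
```

Because the ID is a function of the contents, any change to the profile (for example a different tensor parallelism) produces a different ID.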
To select a profile for deployment, set a specific profile ID with -e NIM_MODEL_PROFILE=&lt;value&gt;.
You can find the valid profile IDs by using the list-model-profiles utility, as shown in the following example:
docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles
MODEL PROFILES
- Compatible with system and runnable:
- a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-A100-fp16-tp2-latency)
- 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
- 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
- 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
To run the A100 TP1 throughput profile, for example, you can set -e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp1-throughput" or -e NIM_MODEL_PROFILE="751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c".
NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of different parameters that influence the selection process; the selection logic, in priority order, is outlined below:
Compatibility Check: First, NIM filters out the profiles that are not runnable with the detected configuration based on the number and type of GPUs available.
Backend: This can be either TensorRT-LLM or vLLM. The optimized TensorRT-LLM profiles are preferred over vLLM when available.
Precision: Lower-precision profiles are preferred when available. For example, NIM will automatically select FP8 profiles over FP16. See Quantization for more details.
Optimization Profile: Latency-optimized profiles are selected over throughput-optimized profiles by default.
Tensor Parallelism: Profiles with higher tensor parallelism values are preferred. For example, a profile that requires 8 GPUs to run will be selected over one which requires 4 GPUs.
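The priority ordering above can be sketched as a filter followed by a sort over the compatible profiles. This is an illustrative reconstruction, not NIM's actual implementation, and the field names are assumptions:

```python
# Illustrative sketch of the selection order described above; not NIM's
# actual implementation, and the field names are assumptions.
BACKEND_RANK = {"tensorrt_llm": 0, "vllm": 1}  # optimized TensorRT-LLM first
PRECISION_RANK = {"fp8": 0, "fp16": 1}         # lower precision first
TARGET_RANK = {"latency": 0, "throughput": 1}  # latency-optimized first

def select_profile(profiles, num_gpus):
    # 1. Compatibility check: drop profiles the detected GPUs cannot run.
    runnable = [p for p in profiles if p["tp"] <= num_gpus]
    # 2-5. Sort by backend, precision, optimization target, then prefer
    # higher tensor parallelism (hence the negated tp value).
    runnable.sort(key=lambda p: (
        BACKEND_RANK[p["backend"]],
        PRECISION_RANK[p["precision"]],
        TARGET_RANK[p["target"]],
        -p["tp"],
    ))
    return runnable[0] if runnable else None

profiles = [
    {"backend": "vllm", "precision": "fp16", "target": "throughput", "tp": 1},
    {"backend": "tensorrt_llm", "precision": "fp16", "target": "throughput", "tp": 1},
    {"backend": "tensorrt_llm", "precision": "fp16", "target": "latency", "tp": 2},
]
print(select_profile(profiles, num_gpus=2))  # TensorRT-LLM latency, tp=2
print(select_profile(profiles, num_gpus=1))  # TensorRT-LLM throughput, tp=1
```

With two GPUs, the TensorRT-LLM latency profile at tp=2 wins; with one GPU, that profile is filtered out and the TensorRT-LLM tp=1 throughput profile is chosen over the vLLM fallback.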
This selection will be logged at startup. For example:
Detected 2 compatible profile(s).
Valid profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput) on GPUs [0]
Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
Selected profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
Profile metadata: precision: fp16
Profile metadata: feat_lora: false
Profile metadata: gpu: A100
Profile metadata: gpu_device: 20b2:10de
Profile metadata: tp: 1
Profile metadata: llm_engine: tensorrt_llm
Profile metadata: pp: 1
Profile metadata: profile: throughput
NIM microservices have two main categories of profiles: optimized and generic. optimized profiles are created for a subset of GPUs and models, and leverage model- and hardware-specific optimizations intended to improve the performance of large language models.
Over time, the breadth of models and GPUs for which optimized engines exist will increase. However, if an optimized engine does not exist for a particular combination of model and GPU configuration, a generic backend is used as a fallback.
Currently, optimized profiles leverage pre-compiled TensorRT-LLM engines, while generic profiles utilize vLLM.
Optimization Targets
optimized profiles can have different optimization targets, catered to either minimize latency or maximize throughput. These engine profiles are tagged as latency and throughput respectively in the model manifest included with the NIM.
latency profiles are designed to minimize:
Time to First Token (TTFT): The latency between the initial inference request to the model and the return of the first token.
Inter-Token Latency (ITL): The latency between each token after the first.
throughput profiles are designed to maximize:
Total Throughput per GPU: The total number of tokens generated per second by the NIM, divided by the number of GPUs used.
While there can be many differences between the throughput and latency variants to meet these different criteria, one of the most significant is that throughput variants utilize the minimum number of GPUs required to host a model (typically constrained by memory utilization). Latency variants use additional GPUs to decrease request latency at the cost of decreased total throughput per GPU relative to the throughput variant.
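A worked example of the trade-off, using made-up numbers rather than measured results: adding GPUs in a latency variant can raise total throughput, but rarely proportionally, so the per-GPU figure drops.

```python
# Hypothetical numbers to illustrate the metric, not benchmark results.
def throughput_per_gpu(total_tokens_per_sec, num_gpus):
    # Total Throughput per GPU: total tokens/sec divided by GPU count.
    return total_tokens_per_sec / num_gpus

# A throughput variant on the minimum GPU count (say 1 GPU)...
tp1 = throughput_per_gpu(1000, 1)   # 1000 tokens/sec/GPU
# ...versus a latency variant across 4 GPUs that raises total throughput,
# but not fourfold -- so per-GPU throughput drops.
tp4 = throughput_per_gpu(2500, 4)   # 625 tokens/sec/GPU
print(tp1, tp4)  # 1000.0 625.0
```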
Quantization
For some models and GPU configurations, quantized engines with reduced numerical precision are available.
These can be identified by the numeric format included in the profile name: for example, fp8 for models that have been quantized to 8-bit floating-point values, as opposed to fp16 for non-quantized 16-bit floating-point values.
All quantized engines are rigorously tested to meet the same accuracy criteria as the default fp16 engines.
Because of this accuracy testing, and because quantization reduces memory requirements and therefore yields significant improvements in both latency and throughput, fp8 models are chosen by default where available.
To deploy non-quantized fp16 engines, follow the Profile Selection steps.
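The memory savings behind this default can be seen with rough weight-memory arithmetic. This is illustrative only; real engine footprints also include the KV cache, activations, and runtime overhead:

```python
# Rough weight-memory estimate: parameter count times bytes per value.
# Illustrative only -- real footprints also include KV cache, activations,
# and runtime overhead.
def weight_gib(num_params, bytes_per_param):
    return num_params * bytes_per_param / 2**30

params = 8e9                   # e.g. an 8B-parameter model
fp16 = weight_gib(params, 2)   # 16-bit floats: 2 bytes per parameter
fp8 = weight_gib(params, 1)    # 8-bit floats: 1 byte per parameter
print(round(fp16, 1), round(fp8, 1))  # 14.9 7.5
```

Halving the bytes per parameter halves the weight memory, which is what frees capacity for larger batches and improves both latency and throughput.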