Optimization
NVIDIA NIM for LLM automatically leverages model- and hardware-specific optimizations intended to improve the performance of large language models. The core metrics optimized for are:
Time to First Token (TTFT): The latency between the initial inference request to the model and the return of the first token.
Inter-Token Latency (ITL): The latency between each token after the first.
Total Throughput: The total number of tokens generated per second by the NIM.
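These three metrics can be computed from the per-token arrival timestamps of a streaming response. A minimal sketch (the function name is illustrative, not part of the NIM API):

```python
from statistics import mean

def latency_metrics(request_time: float, token_times: list[float]) -> dict:
    """Compute TTFT, mean ITL, and throughput from token arrival timestamps.

    request_time: wall-clock time the inference request was sent.
    token_times:  wall-clock arrival time of each generated token, in order.
    """
    ttft = token_times[0] - request_time  # Time to First Token
    # Inter-Token Latency: gaps between consecutive tokens after the first
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    elapsed = token_times[-1] - request_time
    throughput = len(token_times) / elapsed  # tokens generated per second
    return {"ttft": ttft, "itl": itl, "throughput": throughput}
```

For example, a request sent at t=0 whose four tokens arrive at 0.5 s, 0.6 s, 0.7 s, and 0.8 s has a TTFT of 0.5 s, a mean ITL of 0.1 s, and a throughput of 5 tokens/s.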
The NVIDIA TensorRT-LLM accelerated NIM backend provides support for optimized versions of common models across a number of NVIDIA GPUs. If an optimized engine does not exist for a SKU being used, a generic backend is used instead.
The TensorRT-LLM NIM backend includes multiple optimization profiles, tailored to either minimize latency or maximize throughput.
These engine profiles are tagged as latency and throughput, respectively, in the model manifest included with the NIM.
While there are many differences between the throughput and latency variants, one of the most significant is that throughput variants utilize the minimum number of GPUs required to host a model (typically constrained by memory utilization), while latency variants use additional GPUs to decrease request latency at the cost of decreased total throughput per GPU relative to the throughput variant.
NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of different parameters, which influence the selection process. The sorting logic based on the parameters involved is outlined below:
Compatibility Check: First, NIM filters out the profiles that are not runnable with the detected configuration based on the number and type of GPUs available.
Backend: This can be either TensorRT-LLM or vLLM. The optimized TensorRT-LLM profiles are preferred over vLLM when available.
Precision: Lower-precision profiles are preferred when available. For example, NIM will automatically select FP8 profiles over FP16.
Optimization Profile: Latency-optimized profiles are selected over throughput-optimized profiles by default.
Tensor Parallelism: Profiles with higher tensor parallelism values are preferred. For example, a profile that requires 8 GPUs to run will be selected over one which requires 4 GPUs.
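The ordering above can be expressed as a sort key over the compatible profiles. The following is a simplified illustration only; the field names are assumptions, not the actual manifest schema, and the real selection code may differ:

```python
# Illustrative sketch of NIM's profile-selection ordering (not the real code).
BACKEND_RANK = {"tensorrt_llm": 0, "vllm": 1}    # optimized TRT-LLM preferred
PRECISION_RANK = {"fp8": 0, "fp16": 1}           # lower precision preferred
PROFILE_RANK = {"latency": 0, "throughput": 1}   # latency-optimized preferred

def selection_key(profile: dict) -> tuple:
    return (
        BACKEND_RANK[profile["backend"]],
        PRECISION_RANK[profile["precision"]],
        PROFILE_RANK[profile["profile"]],
        -profile["tp"],   # higher tensor parallelism preferred
    )

def select_profile(compatible: list[dict]) -> dict:
    # `compatible` is assumed to be pre-filtered by the hardware check
    return min(compatible, key=selection_key)
```

Given the two compatible profiles from the startup log below, this ordering would pick the TensorRT-LLM profile over the vLLM one.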
This selection will be logged at startup. For example:
Detected 2 compatible profile(s).
Valid profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput) on GPUs [0]
Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
Selected profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
Profile metadata: precision: fp16
Profile metadata: feat_lora: false
Profile metadata: gpu: A100
Profile metadata: gpu_device: 20b2:10de
Profile metadata: tp: 1
Profile metadata: llm_engine: tensorrt_llm
Profile metadata: pp: 1
Profile metadata: profile: throughput
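Startup lines in this format can be collected into a dictionary for tooling or validation. A hypothetical helper (not part of NIM):

```python
def parse_profile_metadata(log_lines: list[str]) -> dict[str, str]:
    """Collect 'Profile metadata: key: value' lines from NIM startup output."""
    meta = {}
    prefix = "Profile metadata: "
    for line in log_lines:
        line = line.strip()
        if line.startswith(prefix):
            key, _, value = line[len(prefix):].partition(": ")
            meta[key] = value
    return meta
```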
Overriding Profile Selection
To override this behavior, set a specific profile ID with -e NIM_MODEL_PROFILE=<value>. The following list-model-profiles command lists the available profiles for the IMG_NAME LLM NIM:
docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles
MODEL PROFILES
- Compatible with system and runnable:
- a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-A100-fp16-tp2-latency)
- 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
- 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
- 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
In the above example, you can set -e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp1-throughput" or -e NIM_MODEL_PROFILE="751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c" to run the A100 TP1 profile.