Model Profiles in NVIDIA NIM for LLMs#
A NIM model profile defines two things: which runtime engines the NIM uses, and what criteria NIM should use to choose those engines. Each profile is identified by a unique string based on a hash of the profile contents.
NVIDIA provides optimized model profiles for popular data-center GPU models, different GPU counts, and specific numeric precisions. For LLM-specific NIMs with NVIDIA-validated models, the optimized model profiles include model-specific and hardware-specific optimizations to improve the performance of the model. LLM-specific NIMs download pre-compiled TensorRT-LLM engines for optimized profiles. There may be several different valid optimized model profiles for a given system, in which case you can select a profile at deployment time by following the steps in Profile Selection. If you do not manually select a profile at deployment time, NIM will choose a profile automatically according to the rules laid out in Automatic Profile Selection.
NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. Generic model profiles can be identified by the presence of local_build, vllm, or sglang in the profile name. On systems where there are no compatible optimized profiles, generic profiles are chosen automatically. Optimized profiles are preferred over generic profiles when available, but you can choose to deploy a generic profile on any system by following the steps at Profile Selection.
Model profiles are embedded within the NIM container in a Model Manifest file, which is by default placed at /opt/nim/etc/default/model_manifest.yaml within the container filesystem.
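To inspect the embedded manifest directly, you can read this file from the container image. The following is a minimal sketch; depending on the image, you may need to override the entrypoint as shown:
# Print the model manifest bundled in the NIM container image
docker run --rm --entrypoint cat $IMG_NAME /opt/nim/etc/default/model_manifest.yaml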
Important
LLM-specific NIM containers provide profiles that validate accuracy and performance characteristics across different hardware configurations (model/GPU combinations). The LLM-agnostic NIM container provides a set of generic runtime options for profile selection, although not all runtimes are supported by all model architectures.
Listing Profiles#
You can list the profiles using the list-model-profiles utility, as shown in the following example:
# $IMG_NAME must be set to the name of a valid NIM container image
docker run --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles
Output:
MODEL PROFILES
- Compatible with system and runnable:
- a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-A100-fp16-tp2-latency)
- 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
- 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
- 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
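If you want only the profile IDs, for example to feed them into a benchmarking script, you can filter the listing. This is a sketch that assumes the output format shown above, where each profile ID is a 64-character hexadecimal string:
# Extract just the profile IDs from the listing
docker run --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles | grep -oE '[0-9a-f]{64}'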
Profile Selection#
To select a profile for deployment, set a specific profile ID with -e NIM_MODEL_PROFILE=ID, where ID is a profile ID returned by the list-model-profiles utility, as shown in the previous example.
For example, you can set -e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp1-throughput" or -e NIM_MODEL_PROFILE="751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c" to run the A100 TP1 profile.
docker run -it --rm --name=$CONTAINER_NAME \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp1-throughput \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
To list the profiles supported for a specific model (for example, with the LLM-agnostic NIM), pass the --model flag:
docker run --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles --model <path to local model or HF link>
Automatic Profile Selection#
For LLM-specific NIMs, the profile sorting logic is based on the parameters described below. For LLM-agnostic NIMs, automatic profile selection is determined by which engines support the model and is subject to change. If you prefer not to rely on automatic profile selection, you can manually select a profile using the following steps:
Run list-model-profiles for the model of interest by passing the --model $NIM_MODEL_NAME flag. This will list all supported profiles for that model.
Benchmark the listed profiles and evaluate them based on key metrics like Time to First Token (TTFT), Inter-Token Latency (ITL), and Total Throughput per GPU (see the sketch after these steps).
Select the profile which delivers the best performance based on the metric most relevant to your use case.
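The following is a hypothetical sketch of the benchmarking step: it launches each candidate profile in turn, using the same flags as the deployment example above, and leaves the actual benchmark client as a placeholder. The profile IDs shown are the ones from the earlier listing; substitute the IDs reported for your system.
# Candidate profile IDs returned by list-model-profiles (replace with your own)
PROFILES=(
  "751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c"
  "8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d"
)
for PROFILE in "${PROFILES[@]}"; do
  # Start the NIM in the background with the candidate profile
  docker run -d --rm --name=nim-bench \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NIM_MODEL_PROFILE="$PROFILE" \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    $IMG_NAME
  # Wait for the server to become ready, then run your benchmark client
  # (measuring TTFT, ITL, and throughput) against http://localhost:8000 and record the results.
  docker stop nim-bench
done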
If no profile is explicitly chosen, NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of different parameters, which influence the selection process.
Generic Profiles: If the LLM-agnostic NIM is deployed, or if no optimized profiles are compatible with an LLM-specific NIM, a generic profile will be chosen based on the following criteria:
Backend: NIM will default to using a compatible profile following this order of engine types: vllm > tensorrt_llm > sglang.
Precision: Lower precision profiles are preferred when available. See details under Optimized Profiles. Only relevant for LLM-specific NIMs.
Tensor Parallelism: Profiles with higher tensor parallelism (TP) values are preferred. See details under Optimized Profiles. Only relevant for LLM-specific NIMs.
Custom Profiles: Before any other selection logic runs, if a cached fine-tuned custom profile exists, it will be selected.
Optimized Profiles: If any optimized profiles exist for the detected GPUs, one will be chosen according to the following:
Precision: Lower precision profiles are preferred when available. For example, NIM automatically selects FP8 profiles over FP16. Floating-point precisions are also preferred over other numeric types for the same number of bits. For example, FP8 is preferred over INT8, and FP16 is preferred over BF16. For more information, refer to Quantization.
Optimization Target: Latency-optimized profiles are selected over throughput-optimized profiles by default.
Tensor Parallelism: Profiles with higher tensor parallelism (TP) values are preferred. For example, a profile with a TP value of 8 (which requires 8 GPUs to run) will be selected over one with a TP value of 4, assuming enough GPUs are available to run both.
This selection will be logged at startup. For example:
Detected 2 compatible profile(s).
Valid profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput) on GPUs [0]
Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
Selected profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
Profile metadata: precision: fp16
Profile metadata: feat_lora: false
Profile metadata: gpu: A100
Profile metadata: gpu_device: 20b2:10de
Profile metadata: tp: 1
Profile metadata: llm_engine: tensorrt_llm
Profile metadata: pp: 1
Profile metadata: profile: throughput
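To confirm which profile was selected on a running container, you can filter its logs for the line shown above. This assumes the container was started with --name=$CONTAINER_NAME as in the earlier example:
docker logs $CONTAINER_NAME 2>&1 | grep "Selected profile"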
Automatic Profile Selection for the LLM-agnostic NIM#
For the LLM-agnostic NIM container, if no profile is explicitly chosen, LLM NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware and the specified model. Each profile consists of different parameters, which influence the selection process. The sorting logic based on the parameters involved is outlined below:
Profiles: A profile will be chosen based on which backends support the specified model architecture (see Model Architecture below).
This selection will be logged at startup. For example:
Detected 3 compatible profile(s).
Valid profile: tensorrt_llm (tensorrt_llm) on GPUs [0]
Valid profile: vllm (vllm) on GPUs [0]
Valid profile: sglang (sglang) on GPUs [0]
Selected profile: tensorrt_llm (tensorrt_llm)
Profile metadata: llm_engine: tensorrt_llm
Model Architecture#
Model architectures are extracted from the config.json files in each of the supported input formats (see Model Format).
Example of a HuggingFace Llama-3.1-8B model configuration:
"activation_function": "gelu",
"architectures":[LlamaForCausalLM],
"attention_dropout": 0.1,
"residual_dropout": 0.1,
"embedding_dropout": 0.1,
Example of a TRTLLM configuration:
"mapping": {"pp_size": 2, "tp_size": 2, "world_size": 4},
"num_key_value_heads": 2,
"head_size": 64,
"architecture": LlamaForCausalLM,
"dtype": "float32",
"hidden_size": 768,
"num_hidden_layers": 12,
"num_attention_heads": 12,
"quantization": quantization_dict,
"vocab_size": 49152,
"max_position_embeddings": 16384,
LLM NIM parses the model configuration files to check whether your model architecture is supported by each inference backend, and makes only the supported backends available.
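To check which architecture a checkpoint declares before deploying it, you can inspect its configuration file. The following is a minimal sketch, assuming jq is installed and <local-model-dir> is a hypothetical placeholder for the folder containing your model:
# Prints the declared architecture from either a HuggingFace- or TRTLLM-style config.json
jq '.architectures // .architecture' <local-model-dir>/config.json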
Known Architecture#
You can reliably deploy a model with a known architecture to any of the backends, as long as the architecture is present in the corresponding architecture map listed below.
vLLM supported architectures. vLLM relies on Transformers models as a fallback option, so its architecture map is built using the transformers library.
TRTLLM supported architectures
SGLang supported architectures
Unknown Architecture#
If a model with an unknown architecture is deployed, LLM NIM will raise an exception and terminate execution.
Model Format#
You can deploy local models or pull weights from HuggingFace. NIM chooses an optimal backend for inference based on whether the weights are in the HuggingFace format, the TRTLLM checkpoint format, or the TRTLLM engine format.
HuggingFace repository link: requires HF_TOKEN to be set in the environment. NIM downloads the HF repository using nim-lib if the URL is of the form hf://<org>/<model-name>.
Local model folder path
Note
After downloading from the HF repository, NGC repository, or using a local model folder, the cached model must follow specific folder structures. If it does not, NIM will raise an exception: “Unknown weight format detected.” Refer to the list of supported model formats for more details on supported folder structures.
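As an example of the two formats, the following sketch lists profiles for a model pulled from HuggingFace and, as a commented alternative, for a local model folder. It assumes $IMG_NAME is the LLM-agnostic NIM image and that HF_TOKEN grants access to the repository; the mount paths for the local folder are hypothetical placeholders:
# List profiles for a HuggingFace-hosted model (requires HF_TOKEN)
docker run --rm --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  $IMG_NAME list-model-profiles --model "hf://<org>/<model-name>"
# For a local model folder, mount it into the container and pass the in-container path instead, for example:
#   -v /path/on/host/my-model:/opt/models/my-model ... --model /opt/models/my-model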
Model Precision#
Models in full-precision and quantized formats can be reliably deployed in NIM. NIM picks an optimal backend based on the weight format and quantization format detected from the model configuration files.
Refer to quantized model support for more details on supported precision formats.
Profile Details for LLM-specific NIMs#
Optimized Profiles vs. Local Build Profiles#
Both optimized and generic local build profiles use TensorRT-LLM, but they differ in a few ways.
Optimized profiles leverage GPU-specific TensorRT-LLM options in order to achieve optimal throughput and latency, and using an optimized profile causes NIM to download a pre-compiled TensorRT-LLM engine.
Local build profiles use heuristics to choose a balanced set of options, and using a local build profile causes NIM to download the raw model weights and perform compilation on the local system. This can lead to longer startup times when deploying with a local build profile that has not been previously deployed and cached.
Optimization Targets#
optimized profiles can have different optimization targets, tailored to either minimize latency or maximize throughput.
These engine profiles are tagged as latency and throughput, respectively, in the profile name and in the model manifest file included with the NIM.
latency profiles are designed to minimize:
Time to First Token (TTFT): The latency between the initial inference request to the model and the return of the first token.
Inter-Token Latency (ITL): The latency between each token after the first.
throughput profiles are designed to maximize:
Total Throughput per GPU: The total number of tokens generated per second by the NIM, divided by the number of GPUs used.
While there can be many differences between the throughput and latency variants to meet these different criteria, one of the most significant is that throughput variants utilize the minimum number of GPUs required to host a model (typically constrained by memory utilization). Latency variants use additional GPUs to decrease request latency at the cost of decreased total throughput per GPU relative to the throughput variant.
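For example, to deploy the latency-optimized TP2 profile from the earlier listing (which requires two GPUs), you could reuse the deployment command shown in Profile Selection with that profile name. This is a sketch based on the flags used earlier on this page:
docker run -it --rm --name=$CONTAINER_NAME \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp2-latency" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME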
Quantization#
For some models and GPU configurations, quantized engines with reduced numerical precision are available.
These can be identified by the numeric format included in the profile name. For example, fp8 is included in the name for models that have been quantized to 8-bit floating-point values, as opposed to fp16 for non-quantized 16-bit floating-point values.
All quantized engines are rigorously tested to meet the same accuracy criteria as the default fp16 engines.
Because of this accuracy testing, and because quantization leads to reduced memory requirements and therefore significant improvements in both latency and throughput, fp8 models are chosen by default when available.
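To check whether a quantized profile is available for your system, you can filter the profile listing for the numeric format, for example:
docker run --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles | grep fp8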
Currently, quantized profiles fall into the following categories:
optimized: NIM currently supports fp8 quantization of HF and Nemotron models.
To deploy non-quantized fp16 engines, follow the Profile Selection steps.