Date: 2026-03-03

Model Profiles in NVIDIA NIM for LLMs#

A NIM model profile defines two things: the runtime engines that NIM can use, and the criteria NIM uses to choose among those engines. Each profile is identified by a unique string derived from a hash of the profile contents.
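As a rough illustration of content-derived identifiers, the sketch below hashes a canonical serialization of a profile description. The use of SHA-256, the JSON serialization, and the `profile_id` helper are assumptions made for this example only; this page does not specify how NIM actually computes its profile IDs.

```python
import hashlib
import json


def profile_id(profile: dict) -> str:
    """Return a content-derived ID for a profile description (illustrative only)."""
    # Serialize with sorted keys so that key order does not change the hash.
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical profile descriptions: same content in a different key order,
# and one with a different tensor-parallel value.
id_a = profile_id({"llm_engine": "vllm", "precision": "fp16", "tp": 1})
id_b = profile_id({"tp": 1, "precision": "fp16", "llm_engine": "vllm"})
id_c = profile_id({"llm_engine": "vllm", "precision": "fp16", "tp": 2})
```

In this sketch, identical contents always map to the same ID, while any change to the contents produces a different one.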

NVIDIA provides optimized model profiles for popular data-center GPU models, different GPU counts, and specific numeric precisions. For LLM-specific NIMs with NVIDIA validated models, the optimized model profiles have model-specific and hardware-specific optimizations to improve the performance of the model. Optimized profiles can be provided as pre-compiled TensorRT-LLM engines or as backend runtime configurations across supported backends (for example, tensorrt_llm PyTorch, vllm, and sglang). When runtime configurations are used, files such as runtime_params.json may appear under a backend_optimization/ namespace in the model workspace to avoid naming conflicts. There may be several different valid optimized model profiles for a given system, in which case you can select a profile at deployment time by following the steps in Profile Selection. If you do not manually select a profile at deployment time, NIM chooses a profile automatically according to the rules laid out in Automatic Profile Selection.

NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. A generic profile is one that does not have a specific gpu tag in its definition. On systems where there are no compatible hardware-specific profiles, generic profiles are chosen automatically. Hardware-specific profiles are always preferred over generic profiles when available, but you can choose to deploy a generic profile on any system by following the steps in Profile Selection.

Model profiles are embedded within the NIM container in a model manifest file, which is placed by default at /opt/nim/etc/default/model_manifest.yaml within the container file system.

Important: LLM-specific NIM containers provide profiles whose accuracy and performance characteristics are validated across different hardware configurations (model/GPU combinations). The multi-LLM compatible NIM container provides a set of generic runtime options for profile selection, although not all runtimes are supported by all model architectures.

Listing Profiles#

You can list the profiles using the list-model-profiles utility, as shown in the following example:

    # $IMG_NAME must be the name of a valid NIM container image.
    docker run --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles

Output:

    MODEL PROFILES
    - Compatible with system and runnable:
      - a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-A100-fp16-tp2-latency)
      - 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
      - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
      - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
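If you want to post-process this listing in a script, a minimal sketch such as the following can extract the profile IDs and names. It assumes the output keeps the `- <hash> (<name>)` shape shown above; the `parse_profiles` helper and the regular expression are illustrative and not part of NIM.

```python
import re

# Matches entries such as "- <64-hex-id> (<profile-name>)", tolerating extra spaces.
PROFILE_LINE = re.compile(r"-\s*([0-9a-f]{64})\s*\(\s*([^)\s]+)\s*\)")


def parse_profiles(output: str) -> list[tuple[str, str]]:
    """Return (profile_id, profile_name) pairs found in the listing text."""
    return [(m.group(1), m.group(2)) for m in PROFILE_LINE.finditer(output)]


# Two entries copied from the example output above.
sample = """MODEL PROFILES
- Compatible with system and runnable:
  - a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-A100-fp16-tp2-latency)
  - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
"""
profiles = parse_profiles(sample)
```

Header lines such as "Compatible with system and runnable:" are skipped because they do not contain a hash followed by a parenthesized name.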

Profile Selection#

To select a profile for deployment, set a specific profile with -e NIM_MODEL_PROFILE=ID, where ID is a profile ID or profile name returned by the list-model-profiles utility, as shown in the previous example. For example, you can set -e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp1-throughput" or -e NIM_MODEL_PROFILE="751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c" to run the A100 TP1 profile.

    docker run -it --rm --name=$CONTAINER_NAME \
      --gpus all \
      --shm-size=16GB \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp1-throughput" \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
      -u $(id -u) \
      -p 8000:8000 \
      $IMG_NAME

To list the profiles available for a specific model, pass the --model flag to list-model-profiles:

    docker run --rm --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles --model <path to local model or HF link>
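Because NIM_MODEL_PROFILE can be given either a profile name or a hash, a deployment script may want to check the value against the listed profiles before starting the container. The following is a minimal sketch of such a check; `resolve_profile` and the `listed` mapping are hypothetical helpers for this example, not NIM functionality.

```python
def resolve_profile(selection: str, profiles: dict[str, str]) -> str:
    """Map a profile name or ID to its profile ID; raise if it is not listed."""
    if selection in profiles:  # already a profile ID
        return selection
    for profile_id, name in profiles.items():  # otherwise look it up by name
        if name == selection:
            return profile_id
    raise ValueError(f"profile not in list-model-profiles output: {selection}")


# Mapping of profile ID -> profile name, taken from the listing example.
listed = {
    "751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c": "tensorrt_llm-A100-fp16-tp1-throughput",
    "8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d": "vllm-fp16-tp1",
}
selected_id = resolve_profile("tensorrt_llm-A100-fp16-tp1-throughput", listed)
# Value to pass to docker with "-e".
env_flag = f"NIM_MODEL_PROFILE={selected_id}"
```

Failing early in the script gives a clearer error than discovering a mistyped profile name only after the container has started.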

Automatic Profile Selection#

For LLM-specific NIMs, profiles are sorted by their parameters according to the ordered criteria described in this section. For the multi-LLM compatible NIM container, automatic profile selection depends on which engines are supported and is subject to change.

If you prefer not to rely on automatic profile selection, you can manually select a profile using the following steps:

1. Run list-model-profiles for the model of interest by passing the --model $NIM_MODEL_NAME flag. This lists all supported profiles for that model.
2. Benchmark the listed profiles and evaluate them based on key metrics like Time to First Token (TTFT), Inter-Token Latency (ITL), and Total Throughput per GPU.
3. Select the profile that delivers the best performance based on the metric most relevant to your use case.

If no profile is explicitly chosen, NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of different parameters, which influence the selection process. NIM uses the following ordered criteria to select the best profile:

1. Custom Profiles: Before anything else, if a cached fine-tuned custom profile exists, it is selected.
2. Optimized Profiles: If any optimized profiles exist for the detected GPUs, one is chosen according to the following sorting logic, in order:
   - Hardware: NIM always prefers a profile that contains a gpu tag matching the system’s hardware over a profile that does not. This priority applies across all backend engines. For example, a hardware-specific vllm profile is chosen over a profile of any backend lacking a matching gpu tag.
   - Backend Engine: If multiple profiles are equally compatible after considering hardware specificity (for example, both are hardware-specific for the detected GPU or both are generic), the next selection factor is the backend engine. NIM prefers profiles in this order: tensorrt_llm > vllm > sglang. For example, all tensorrt_llm profiles are considered before any vllm profiles that have the same hardware specificity.
   - Precision: For profiles with the same backend and hardware specificity, the next factor is numeric precision. More efficient precisions are generally preferred. In general, the preference is MXFP4 > FP8 > INT8 > FP16 > BF16 > INT8WO > NVFP4 > INT4_AWQ. For more information, refer to Quantization.
   - Optimization Target: If a tie still exists, latency-optimized profiles are selected over throughput-optimized profiles by default.
   - Tensor Parallelism (TP): If multiple profiles are still tied after evaluating the criteria above, NIM prefers the profile with the highest Tensor Parallelism (TP) value that is supported by the number of available GPUs. For example, if both TP2 and TP4 profiles are available and the system has 4 GPUs, the TP4 profile is selected. Because precision is evaluated before TP, a profile with a lower TP value can still be chosen if it has a more preferred precision.

   Note: Within the TensorRT-LLM backend, non-buildable (pre-built) profiles are preferred over trtllm_buildable profiles when all other factors are equal.
3. Generic Profiles: If the multi-LLM NIM container is deployed, or if no optimized profiles are compatible in LLM-specific NIMs, a generic profile is chosen based on the following criteria:
   - Backend Engine: NIM defaults to a compatible profile following this engine order: tensorrt_llm > vllm > sglang.
   - Precision: More efficient precisions are preferred when available. The preference order follows the same logic as optimized profiles. Only relevant for LLM-specific NIMs.
   - Tensor Parallelism: Profiles with higher tensor parallelism (TP) values are preferred. Only relevant for LLM-specific NIMs.

This selection is logged at startup. For example:

    Detected 2 compatible profile(s).
    Valid profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput) on GPUs [0]
    Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
    Selected profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
    Profile metadata: precision: fp16
    Profile metadata: feat_lora: false
    Profile metadata: gpu: A100
    Profile metadata: gpu_device: 20b2:10de
    Profile metadata: tp: 1
    Profile metadata: llm_engine: tensorrt_llm
    Profile metadata: pp: 1
    Profile metadata: profile: throughput

Automatic Profile Selection for the Multi-LLM NIM#

For the multi-LLM compatible NIM container, if no profile is explicitly chosen, LLM NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware and the specified model. Each profile consists of different parameters, which influence the selection process. A profile is chosen based on the following criteria:

- Model Architecture
- Model Format
- Model Precision

This selection is logged at startup. For example:

    Detected 3 compatible profile(s).
    Valid profile: tensorrt_llm (tensorrt_llm) on GPUs [0]
    Valid profile: vllm (vllm) on GPUs [0]
    Valid profile: sglang (sglang) on GPUs [0]
    Selected profile: tensorrt_llm (tensorrt_llm)
    Profile metadata: llm_engine: tensorrt_llm

Model Architecture#

Model architectures are extracted from the config.json file of each supported input format (see Model Format).
Example of a HuggingFace Llama3.1-8b model configuration:

    "activation_function": "gelu",
    "architectures": ["LlamaForCausalLM"],
    "attention_dropout": 0.1,
    "residual_dropout": 0.1,
    "embedding_dropout": 0.1,

Example of a TRTLLM configuration:

    "mapping": {
        "pp_size": 2,
        "tp_size": 2,
        "world_size": 4
    },
    "num_key_value_heads": 2,
    "head_size": 64,
    "architecture": "LlamaForCausalLM",
    "dtype": "float32",
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "quantization": quantization_dict,
    "vocab_size": 49152,
    "max_position_embeddings": 16384,

LLM NIM parses the model configuration files to check whether your model architecture is supported by each inference backend, and makes only the supported backends available.

Known Architecture#

You can reliably deploy a model with a known architecture to any of the backends, as long as the architecture is present in that backend's architecture map:

- vLLM supported architectures. vLLM relies on Transformers models as a fallback option, so the architecture map is built using the transformers library.
- TRTLLM supported architectures
- SGLang supported architectures

Unknown Architecture#

If a model with an unknown architecture is deployed, LLM NIM raises an exception and terminates execution.

Model Format#

You can deploy local models or pull weights from HuggingFace. NIM chooses an optimal backend for inference given weights in the HuggingFace format, the TRTLLM checkpoint format, or the TRTLLM engine format. The model can be specified in either of the following ways:

- HuggingFace repository link: Requires HF_TOKEN to be set in the environment. NIM downloads the HF repository using nim-lib if the URL is of the form hf://<org>/<model-name>.
- Local model folder path

Note: After downloading from the HF repository, NGC repository, or using a local model folder, the cached model must follow specific folder structures. If it does not, NIM raises an exception: “Unknown weight format detected.” Refer to the list of supported model formats for more details on supported folder structures.
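The architecture and format checks described above can be sketched roughly as follows. This is not NIM's implementation: the helper names and the `EXAMPLE_ARCH_MAP` contents are hypothetical, and the field lookups simply follow the two configuration examples shown earlier (an `architectures` list for HuggingFace, a single `architecture` value for TRTLLM).

```python
import json
from pathlib import Path

# Hypothetical architecture map for illustration; the real maps come from each backend.
EXAMPLE_ARCH_MAP = {
    "tensorrt_llm": {"LlamaForCausalLM"},
    "vllm": {"LlamaForCausalLM", "MistralForCausalLM"},
    "sglang": {"LlamaForCausalLM"},
}


def model_source(model: str) -> str:
    """Classify a model reference as a HuggingFace link (hf://...) or a local path."""
    return "huggingface" if model.startswith("hf://") else "local"


def read_architecture(model_dir: str) -> str:
    """Return the architecture name from config.json in a local model folder."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    if "architectures" in config:  # HuggingFace style: list of names
        return config["architectures"][0]
    if "architecture" in config:  # TRTLLM style: single name
        return config["architecture"]
    raise ValueError("no architecture field found in config.json")


def supported_backends(arch: str, arch_map: dict[str, set[str]]) -> list[str]:
    """Return the backends that list this architecture; raise if none do."""
    backends = [backend for backend, archs in arch_map.items() if arch in archs]
    if not backends:
        raise ValueError(f"unknown architecture: {arch}")
    return backends
```

Raising on an unknown architecture mirrors the behavior described under Unknown Architecture, where deployment stops instead of falling back to an arbitrary backend.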
Model Precision#

Models in full-precision and quantized formats can be reliably deployed in NIM. NIM picks an optimal backend based on the weight format and quantization format detected in the model configuration files. Refer to quantized model support for more details on supported precision formats.
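For LLM-specific NIMs, the ordered criteria listed under Automatic Profile Selection (hardware match, backend engine, precision, optimization target, tensor parallelism) amount to a multi-key sort. The sketch below illustrates that idea only. The `Profile` class, the `select_profile` function, and the compatibility filter (TP no larger than the GPU count, and a gpu tag that is either absent or matches) are assumptions for this example. Custom profiles and the pre-built versus trtllm_buildable tie-break are left out.

```python
from dataclasses import dataclass
from typing import Optional

# Preference orders as stated in the selection criteria (earlier is preferred).
BACKEND_ORDER = ["tensorrt_llm", "vllm", "sglang"]
PRECISION_ORDER = ["mxfp4", "fp8", "int8", "fp16", "bf16", "int8wo", "nvfp4", "int4_awq"]
TARGET_ORDER = ["latency", "throughput"]


@dataclass
class Profile:
    name: str
    llm_engine: str
    precision: str
    tp: int
    gpu: Optional[str] = None     # None marks a generic profile (no gpu tag)
    target: Optional[str] = None  # "latency" or "throughput", if tagged


def rank(order: list, value) -> int:
    """Position in a preference list; unknown values sort last."""
    return order.index(value) if value in order else len(order)


def select_profile(profiles: list, detected_gpu: str, num_gpus: int) -> Profile:
    """Pick the most preferred compatible profile (simplified sketch)."""
    # Assumed compatibility rule: enough GPUs for TP, and no conflicting gpu tag.
    compatible = [p for p in profiles
                  if p.tp <= num_gpus and p.gpu in (None, detected_gpu)]
    if not compatible:
        raise ValueError("no compatible profile")
    return min(compatible, key=lambda p: (
        0 if p.gpu == detected_gpu else 1,   # hardware-specific before generic
        rank(BACKEND_ORDER, p.llm_engine),   # backend engine
        rank(PRECISION_ORDER, p.precision),  # numeric precision
        rank(TARGET_ORDER, p.target),        # latency before throughput
        -p.tp,                               # highest supported TP
    ))


# The four profiles from the listing example.
example_profiles = [
    Profile("tensorrt_llm-A100-fp16-tp2-latency", "tensorrt_llm", "fp16", 2, "A100", "latency"),
    Profile("tensorrt_llm-A100-fp16-tp1-throughput", "tensorrt_llm", "fp16", 1, "A100", "throughput"),
    Profile("vllm-fp16-tp2", "vllm", "fp16", 2),
    Profile("vllm-fp16-tp1", "vllm", "fp16", 1),
]
one_gpu_choice = select_profile(example_profiles, "A100", 1)
two_gpu_choice = select_profile(example_profiles, "A100", 2)
```

With a single A100, only the two TP1 profiles pass the filter and the sketch picks tensorrt_llm-A100-fp16-tp1-throughput, the same profile selected in the startup log example. With two A100 GPUs, the latency-tagged TP2 profile sorts first.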

Profile Details for LLM-specific NIMs#

Optimized Profiles vs. Local Build Profiles#

Both optimized and generic local build profiles use TensorRT-LLM, but they differ in a few ways. Optimized profiles leverage GPU-specific TensorRT-LLM options to achieve optimal throughput and latency, and using an optimized profile causes NIM to download a pre-compiled TensorRT-LLM engine. Local build profiles use heuristics to choose a balanced set of options, and using a local build profile causes NIM to download the raw model weights and perform compilation on the local system. This can lead to longer startup times when deploying a local build profile that has not been previously deployed and cached.

Optimization Targets#

Optimized profiles can have different optimization targets, tailored either to minimize latency or to maximize throughput. These engine profiles are tagged as latency and throughput, respectively, in the profile name and in the model manifest file included with the NIM.

latency profiles are designed to minimize:

- Time to First Token (TTFT): The latency between the initial inference request to the model and the return of the first token.
- Inter-Token Latency (ITL): The latency between each token after the first.

throughput profiles are designed to maximize:

- Total Throughput per GPU: The total number of tokens generated per second by the NIM, divided by the number of GPUs used.

While there can be many differences between the throughput and latency variants to meet these different criteria, one of the most significant is that throughput variants use the minimum number of GPUs required to host a model (typically constrained by memory utilization). Latency variants use additional GPUs to decrease request latency at the cost of decreased total throughput per GPU relative to the throughput variant.
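When benchmarking profiles yourself, the three metrics defined above can be computed from request and token timestamps. The following is a minimal sketch under simple assumptions: timestamps are in seconds, ITL is reported as the mean gap between consecutive tokens, and the function names are illustrative and not part of any NIM tooling.

```python
def ttft(request_time: float, token_times: list[float]) -> float:
    """Time to First Token: delay between the request and the first token."""
    return token_times[0] - request_time


def mean_itl(token_times: list[float]) -> float:
    """Inter-Token Latency: average gap between consecutive tokens after the first."""
    # Requires at least two tokens, otherwise there is no gap to average.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)


def throughput_per_gpu(total_tokens: int, elapsed_seconds: float, num_gpus: int) -> float:
    """Total tokens generated per second, divided by the number of GPUs used."""
    return total_tokens / elapsed_seconds / num_gpus


# Made-up measurements: one request sent at t=0 with tokens arriving at these times,
# and 4000 tokens generated across all requests in 10 seconds on 2 GPUs.
request_time = 0.0
token_times = [0.5, 0.75, 1.0, 1.25]
first_token_latency = ttft(request_time, token_times)
inter_token_latency = mean_itl(token_times)
tokens_per_second_per_gpu = throughput_per_gpu(4000, 10.0, 2)
```

Dividing by the GPU count is what makes throughput variants look better on this metric: a latency variant that spreads the same model over more GPUs may respond faster per request while producing fewer tokens per second per GPU.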