Model Profiles#

A NIM model profile defines two things: which model engines NIM can use, and which criteria NIM uses to choose among those engines. Each profile is identified by a unique string derived from a hash of the profile contents.

Users may select a profile at deployment time by following the Profile Selection steps. If the user does not manually select a profile at deployment time, NIM will choose a profile automatically according to the rules laid out in Automatic Profile Selection. To understand how profiles and their corresponding engines are created, see How Profiles are Created.

Model profiles are embedded within the NIM container in a Model Manifest file, which is by default placed at /etc/nim/config/model_manifest.yaml within the container filesystem.

Profile Selection#

To select a profile for deployment, set a specific profile ID with -e NIM_MODEL_PROFILE=<value>. You can find the valid profile IDs by using the list-model-profiles utility, as shown in the following example:

    docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles

    MODEL PROFILES
    - Compatible with system and runnable:
      - a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-A100-fp16-tp2-latency)
      - 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
      - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
      - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)

For example, to run the A100 TP1 profile, you can set either -e NIM_MODEL_PROFILE="tensorrt_llm-A100-fp16-tp1-throughput" or -e NIM_MODEL_PROFILE="751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c".
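When scripting a deployment, it can be convenient to resolve a readable profile name to its ID before building the docker run command. The following is a minimal sketch, assuming the listing keeps the "- <id> (<name>)" line shape shown in the example above. The names LISTING, PROFILE_LINE, and parse_profiles are hypothetical helpers for illustration and are not part of NIM.

```python
import re

# Sample output copied from the list-model-profiles example in this section.
LISTING = """\
MODEL PROFILES
- Compatible with system and runnable:
  - a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-A100-fp16-tp2-latency)
  - 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
  - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
  - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
"""

# Assumed line shape: "- <64 lowercase hex chars> (<profile name>)".
# Optional spaces inside the parentheses are tolerated.
PROFILE_LINE = re.compile(r"-\s+([0-9a-f]{64})\s+\(\s*([^\s)]+)\s*\)")


def parse_profiles(listing: str) -> dict[str, str]:
    """Map each profile name found in the listing to its profile ID."""
    return {name: pid for pid, name in PROFILE_LINE.findall(listing)}


profiles = parse_profiles(LISTING)
profile_id = profiles["tensorrt_llm-A100-fp16-tp1-throughput"]

# Value to pass to docker run as: -e NIM_MODEL_PROFILE=<id>
env_flag = f"NIM_MODEL_PROFILE={profile_id}"
print(env_flag)
```

In practice the listing text would come from running the list-model-profiles command rather than from a hard-coded string. The regular expression is an assumption about the output format and may need adjusting if that format changes.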

Automatic Profile Selection#

NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of different parameters, which influence the selection process. The sorting logic based on the parameters involved is outlined below:

1. Compatibility Check: First, NIM filters out the profiles that are not runnable with the detected configuration based on the number and type of GPUs available.
2. Backend: This can be either TensorRT-LLM or vLLM. The optimized TensorRT-LLM profiles are preferred over vLLM when available.
3. Precision: Lower precision profiles are preferred when available. For example, NIM will automatically select FP8 profiles over FP16. See Quantization for more details.
4. Optimization Profile: Latency-optimized profiles are selected over throughput-optimized profiles by default.
5. Tensor Parallelism: Profiles with higher tensor parallelism values are preferred. For example, a profile that requires 8 GPUs to run will be selected over one which requires 4 GPUs.

This selection will be logged at startup. For example:

    Detected 2 compatible profile(s).
    Valid profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput) on GPUs [0]
    Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
    Selected profile: 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-A100-fp16-tp1-throughput)
    Profile metadata: precision: fp16
    Profile metadata: feat_lora: false
    Profile metadata: gpu: A100
    Profile metadata: gpu_device: 20b2:10de
    Profile metadata: tp: 1
    Profile metadata: llm_engine: tensorrt_llm
    Profile metadata: pp: 1
    Profile metadata: profile: throughput
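The selection rules can be pictured as a filter followed by a multi-key sort. The sketch below is an illustration only and is not NIM's actual implementation. It assumes that the criteria apply in the order they are listed, that a profile is runnable when its tp value does not exceed the GPU count and its gpu field (if any) matches the detected GPU, and that a profile with no optimization target ranks after latency and throughput. All function names, rank tables, and the PROFILES list are hypothetical; the field names mirror the "Profile metadata" keys in the log.

```python
# Illustrative preference tables (assumptions for this sketch).
BACKEND_RANK = {"tensorrt_llm": 0, "vllm": 1}   # TensorRT-LLM preferred
PRECISION_BITS = {"fp8": 8, "fp16": 16}         # lower precision preferred
PROFILE_RANK = {"latency": 0, "throughput": 1}  # latency preferred

# Hypothetical metadata for the four profiles named in the listing example.
PROFILES = [
    {"name": "tensorrt_llm-A100-fp16-tp2-latency", "llm_engine": "tensorrt_llm",
     "gpu": "A100", "precision": "fp16", "tp": 2, "profile": "latency"},
    {"name": "tensorrt_llm-A100-fp16-tp1-throughput", "llm_engine": "tensorrt_llm",
     "gpu": "A100", "precision": "fp16", "tp": 1, "profile": "throughput"},
    {"name": "vllm-fp16-tp2", "llm_engine": "vllm", "precision": "fp16", "tp": 2},
    {"name": "vllm-fp16-tp1", "llm_engine": "vllm", "precision": "fp16", "tp": 1},
]


def is_runnable(profile: dict, gpu: str, gpu_count: int) -> bool:
    """Compatibility check: matching GPU type (if specified) and enough GPUs."""
    gpu_ok = profile.get("gpu") in (None, gpu)
    return gpu_ok and profile["tp"] <= gpu_count


def sort_key(profile: dict) -> tuple:
    """Smaller tuples are preferred; criteria follow the listed order."""
    return (
        BACKEND_RANK[profile["llm_engine"]],
        PRECISION_BITS[profile["precision"]],
        PROFILE_RANK.get(profile.get("profile"), len(PROFILE_RANK)),
        -profile["tp"],  # higher tensor parallelism preferred
    )


def select_profile(profiles: list, gpu: str, gpu_count: int) -> dict:
    """Filter out profiles that cannot run, then pick the most preferred one."""
    candidates = [p for p in profiles if is_runnable(p, gpu, gpu_count)]
    if not candidates:
        raise ValueError("no compatible profile for the detected hardware")
    return min(candidates, key=sort_key)


print(select_profile(PROFILES, gpu="A100", gpu_count=1)["name"])
# prints: tensorrt_llm-A100-fp16-tp1-throughput
```

With a single A100, only the two tp1 profiles pass the filter and the TensorRT-LLM profile wins on backend, which matches the selection shown in the startup log. Under the same assumptions, two A100 GPUs would make all four profiles runnable and the sketch would pick tensorrt_llm-A100-fp16-tp2-latency.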