Model Profiles and Selection#
Every model-specific NIM LLM container ships with a model manifest — a catalog of one or more profiles that NIM can use to select a model configuration at startup. Each profile represents a specific, pre-validated configuration, defined by a set of tags corresponding to the backend engine, model precision, tensor parallelism size (TP), pipeline parallelism size (PP), and LoRA support.
In the case of model-free NIM LLM, the model manifest is generated at runtime with a set of generic profiles that can be used to deploy the NIM across a wide range of system configurations.
At container startup, NIM selects exactly one profile from the manifest. The selected profile determines which model files are downloaded and how the inference backend is launched.
Profile Naming Convention#
Deployment profiles follow the naming pattern:
```
vllm-<precision>-tp<N>-pp1[-lora]
```
Where:
- `<precision>` is the quantization format (`bf16`, `fp8`, `mxfp4`, or `nvfp4`)
- `tp<N>` is the tensor parallelism degree (number of GPUs)
- `pp1` indicates single-stage pipeline parallelism
- The `-lora` suffix indicates the profile supports LoRA adapter loading
For example, `vllm-bf16-tp4-pp1-lora` uses BF16 precision across four GPUs with LoRA support.
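As an illustration, the naming pattern can be decomposed with a small parser. `parse_profile` and the regular expression below are hypothetical helpers for this page, not part of NIM:

```python
import re

# Hypothetical helper (not part of NIM): split a profile description such as
# "vllm-bf16-tp4-pp1-lora" into its component tags.
PROFILE_RE = re.compile(
    r"^vllm-(?P<precision>bf16|fp8|mxfp4|nvfp4)"
    r"-tp(?P<tp>\d+)-pp(?P<pp>\d+)(?P<lora>-lora)?$"
)

def parse_profile(description: str) -> dict:
    match = PROFILE_RE.match(description)
    if match is None:
        raise ValueError(f"unrecognized profile description: {description}")
    return {
        "precision": match["precision"],
        "tp": int(match["tp"]),             # tensor parallelism (GPU count)
        "pp": int(match["pp"]),             # pipeline parallelism stages
        "lora": match["lora"] is not None,  # LoRA adapter support
    }
```

For instance, `parse_profile("vllm-bf16-tp4-pp1-lora")` yields BF16 precision, `tp=4`, `pp=1`, and LoRA enabled.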
Listing Available Profiles#
Run the list-model-profiles command to see which profiles are available in a container:
```bash
docker run --rm --gpus=all \
  <nim_llm_image> \
  list-model-profiles
```
Example output:
```
MODEL PROFILES
- Compatible with system and runnable:
  - dcec66a50892315842bdc46d5b2d8648fed3fe3d3382437f0a811c56eff8c39c (vllm-bf16-tp1-pp1) [requires >=18 GB/gpu]
  - With LoRA support:
    - d66193b819d2bc2ae40aefcec0da5997b5f9187dd79b8155ec111b16999d18e0 (vllm-bf16-tp1-pp1-feat_lora) [requires >=22 GB/gpu]
- Compatible with system but low memory:
  - a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2 (vllm-bf16-tp1-pp1) [requires >=45 GB/gpu, try --max-model-len=4096 to reduce to >=30 GB/gpu]
- Incompatible with system:
  - 27af459c9caa0f9b34d5e07e5962960df6b0120df2039d06148e0e63595195e5 (vllm-bf16-tp2-pp1)
  - 30d16624c8100d40e6cde3af7f4e4ff6028f776e92efdcf09fcb515ae65662c0 (vllm-bf16-tp4-pp1)
  - 6f888502f35dc189f8c67f3e11174028a4ce42e92868e6a0ca10ef1d84953874 (vllm-bf16-tp8-pp1)
```
Each profile has:

- Profile ID: A unique 64-character identifier.
- Profile description: A human-readable string constructed by joining tag values with hyphens (for example, `vllm-fp16-tp1-pp1`).
- Memory annotation: An estimated VRAM requirement per GPU, shown in brackets (for example, `[requires >=18 GB/gpu]`).
Memory-Based Profile Classification#
NIM estimates the GPU VRAM required by each profile and classifies it into one of three categories based on the available memory on the system:
| Category | Meaning | Action |
|---|---|---|
| Compatible | Estimated VRAM fits within available GPU memory | Profile can be selected and deployed |
| Low memory | Model weights fit, but the full context length exceeds available memory | Profile can run with a reduced `--max-model-len` |
| Incompatible | Model weights alone exceed available GPU memory | Profile cannot run on this hardware. Consider a profile with higher tensor parallelism or a quantized precision. |
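The three categories amount to a simple comparison of per-GPU estimates. The sketch below mirrors that classification; the function name and arguments are illustrative, and NIM's internal estimator is more detailed:

```python
def classify_profile(weights_gb: float, full_context_gb: float,
                     available_gb: float) -> str:
    """Illustrative classification from per-GPU estimates (not NIM's API)."""
    if full_context_gb <= available_gb:
        return "compatible"     # weights + full-context KV cache fit
    if weights_gb <= available_gb:
        return "low memory"     # weights fit; reduce --max-model-len
    return "incompatible"       # weights alone exceed available VRAM
```

For example, a profile whose weights need 30 GB but whose full-context footprint is 45 GB lands in "low memory" on a 40 GB GPU.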
If a profile is classified as low memory, the listing output includes a suggestion. For example:
```
[requires >=45 GB/gpu, try --max-model-len=4096 to reduce to >=30 GB/gpu]
```
You can apply the suggestion by passing the --max-model-len argument:
```bash
docker run --rm -it --gpus=all \
  -p 8000:8000 \
  <nim_llm_image> \
  --max-model-len 4096
```
Note
Reducing --max-model-len limits the maximum sequence length (input + output tokens) per request. Choose a value that fits your use case.
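The suggestion works because KV-cache memory grows roughly linearly with the maximum sequence length, so the per-GPU requirement is approximately model weights plus a per-token KV cost. A rough sketch with made-up numbers (not NIM's estimator):

```python
def estimated_vram_gb(weights_gb: float, kv_gb_per_1k_tokens: float,
                      max_model_len: int) -> float:
    # KV-cache memory scales roughly linearly with the maximum sequence
    # length, so lowering --max-model-len lowers the per-GPU requirement.
    # Illustrative formula only; NIM also accounts for activations and overhead.
    return weights_gb + kv_gb_per_1k_tokens * (max_model_len / 1024)
```

With 30 GB of weights and a hypothetical 1 GB of KV cache per 1k tokens, `--max-model-len 4096` needs about 34 GB per GPU instead of, say, 38 GB at 8192 tokens.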
How Profile Selection Works#
NIM uses a priority-ordered selection chain to decide which profile to use. The chain is evaluated top-to-bottom; the first selector that produces a match wins.
| Priority | Selector | Trigger | Description |
|---|---|---|---|
| 1 (highest) | Default profile selector | `NIM_MODEL_PROFILE="default"` | Selects the first hardware-compatible profile using backend priority. |
| 2 | Environment-based profile selector | `NIM_MODEL_PROFILE` set to a profile ID or description | Matches an explicit profile by checksum or description. |
| 3 | Memory-aware profile selector | (automatic) | Estimates VRAM requirements for each profile and filters out profiles that exceed available GPU memory. Prefers non-LoRA profiles unless LoRA is enabled. |
| 4 (lowest) | Manifest profile selector | (no env var set) | Falls back to the profiles as ordered in the manifest. |
The memory-aware selector runs automatically as part of the selection chain. It estimates GPU memory requirements for each candidate profile by analyzing model weights, KV cache, activations, and overhead. Profiles that do not fit in available GPU memory are excluded from selection.
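The chain above can be pictured as ordinary short-circuiting code. This is a simplified illustration; the selector names, the `fits()` memory check, and the profile dictionaries are assumptions for this page, not NIM's internal API:

```python
def select_profile(profiles, env, fits):
    """Sketch of the priority-ordered chain: selectors are evaluated
    top-to-bottom and the first one that produces a match wins.
    `profiles` preserves manifest order; `fits(p)` is the memory check."""
    requested = env.get("NIM_MODEL_PROFILE")

    def default_selector():       # priority 1: NIM_MODEL_PROFILE="default"
        if requested == "default":
            return next((p for p in profiles if fits(p)), None)

    def env_selector():           # priority 2: explicit ID or description
        if requested and requested != "default":
            return next((p for p in profiles
                         if requested in (p["id"], p["description"])), None)

    def memory_selector():        # priority 3: automatic, prefers non-LoRA
        compatible = [p for p in profiles if fits(p)]
        non_lora = [p for p in compatible if not p.get("lora")]
        return (non_lora or compatible or [None])[0]

    for selector in (default_selector, env_selector, memory_selector):
        chosen = selector()
        if chosen is not None:
            return chosen
    return None
```

With no environment variable set, the memory-aware step picks the first compatible non-LoRA profile; an explicit `NIM_MODEL_PROFILE` value short-circuits that.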
Selecting a Profile#
The method you use to select a profile depends on your requirements and environment. You can allow NIM to pick a suitable profile automatically, or you can explicitly specify the exact profile you want by ID or by description.
Automatic Selection (Default)#
If you do not set NIM_MODEL_PROFILE, NIM automatically selects the best compatible profile from the manifest based on your hardware (GPU device, available VRAM, estimated memory requirements, and parallelism constraints).
```bash
docker run --rm -it --gpus=all \
  -p 8000:8000 \
  <nim_llm_image>
```
Intelligent Default Selection#
Setting NIM_MODEL_PROFILE to "default" triggers intelligent default selection. NIM picks the best compatible profile based on:
Hardware compatibility (GPU device, VRAM)
Backend priority
LoRA configuration
```bash
docker run --rm -it --gpus=all \
  -e NIM_MODEL_PROFILE="default" \
  -p 8000:8000 \
  <nim_llm_image>
```
Explicit Selection by Profile ID#
Specify the full Profile ID for deterministic, version-safe selection — the profile is guaranteed to match even if tags are later modified.
```bash
docker run --rm -it --gpus=all \
  -e NIM_MODEL_PROFILE="70edb8bb9f8511ce2ea195e3caebcc3c7191dc27fea0c8d4acf9c0d9a69e43cd" \
  -p 8000:8000 \
  <nim_llm_image>
```
Explicit Selection by Profile Description (Friendly Name)#
If the value of NIM_MODEL_PROFILE is not a valid Profile ID, NIM tries to match it against the profile description — a human-readable string constructed from ordered profile tags.
```bash
docker run --rm -it --gpus=all \
  -e NIM_MODEL_PROFILE=vllm-fp16-tp1-pp1 \
  -p 8000:8000 \
  <nim_llm_image>
```
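One way to picture the fallback from ID to description: profile IDs in the listing are 64-character hex checksums, so a simple shape check can tell the two forms apart. This heuristic is illustrative only, not NIM's exact matching logic:

```python
import re

def looks_like_profile_id(value: str) -> bool:
    # Profile IDs in the listing output are 64-character lowercase hex
    # checksums; anything else would be matched as a profile description.
    # (Illustrative heuristic, not NIM's internal implementation.)
    return re.fullmatch(r"[0-9a-f]{64}", value) is not None
```

For example, the checksum from the previous section matches, while `vllm-fp16-tp1-pp1` does not.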
Tip
Use list-model-profiles to discover the exact profile IDs and descriptions available in your container.
Configuration Precedence#
NIM_MODEL_PROFILE provides a convenient way to specify deployment defaults, but it can be overridden by backend-native arguments. The precedence hierarchy is:

1. Backend-native arguments (highest precedence): CLI arguments or flags passed directly to the backend (for example, vLLM CLI arguments that control tensor parallelism) always take precedence.
2. `NIM_MODEL_PROFILE` configuration (lower precedence): Settings parsed from the profile are applied as defaults unless explicitly overridden by a backend argument.
For example, if a profile specifies tp=2 but the user also passes --tensor-parallel-size 4 as a vLLM CLI argument, the backend launches with TP=4.
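That override behavior amounts to a lookup with the profile value as the fallback. `resolve_tensor_parallel` is a hypothetical helper for illustration; the real resolution happens inside NIM:

```python
def resolve_tensor_parallel(profile_tp: int, cli_args: dict) -> int:
    # Backend-native arguments win; the profile value is only a default.
    # (Illustrative helper, not NIM's internal implementation.)
    return cli_args.get("--tensor-parallel-size", profile_tp)
```

So a profile with `tp=2` resolves to TP=4 when `--tensor-parallel-size 4` is passed, and to TP=2 when no backend argument is given.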
Important
When backend arguments override profile settings, the overridden values are resolved before model download. NIM selects and downloads the profile that matches the final resolved configuration, so the downloaded model files always match the launch configuration.
Using vLLM CLI Arguments#
You can also control parallelism and other settings directly through vLLM CLI arguments instead of (or in addition to) NIM_MODEL_PROFILE:
```bash
docker run --rm -it --gpus=all \
  -p 8000:8000 \
  <nim_llm_image> \
  --tensor-parallel-size 2
```
Common vLLM CLI arguments include the following:
| vLLM CLI Argument | Purpose | Default |
|---|---|---|
| `--tensor-parallel-size` | Number of tensor-parallel GPUs | 1 |
| `--pipeline-parallel-size` | Number of pipeline-parallel stages | 1 |
| `--enable-lora` | Enable LoRA adapter support | Disabled |
Changes from NIM LLM 1.x#
The following profile selection mechanisms from NIM LLM version 1.x are no longer supported:
| Removed Feature | 1.x Example |
|---|---|
| Custom profile selectors | |
| LLM-based profile selector (backend priority chain) | Automatic backend priority: TensorRT-LLM > vLLM > SGLang |
| Tag-based profile selector | |
Tip
Use NIM_MODEL_PROFILE with a profile ID or description as a replacement for these deprecated mechanisms. For further guidance, see the 1.x Migration Guide.