Model Profiles and Selection#
Every model-specific NIM LLM container ships with a model manifest — a catalog of one or more profiles that NIM can use to select a model configuration at startup. Each profile represents a specific, pre-validated configuration, defined by a set of tags corresponding to the backend engine, model precision, tensor parallelism size (TP), pipeline parallelism size (PP), and LoRA support.
In the case of model-free NIM LLM, the model manifest is generated at runtime with a set of generic profiles that can be used to deploy the NIM across a wide range of system configurations.
At container startup, NIM selects exactly one profile from the manifest. The selected profile determines which model files are downloaded and how the inference backend is launched.
Profile Naming Convention#
Deployment profiles follow the naming pattern:
```
vllm-<precision>-tp<N>-pp1[-lora]
```
Where:
- `<precision>` is the quantization format (`bf16`, `fp8`, `mxfp4`, or `nvfp4`)
- `tp<N>` is the tensor parallelism degree (number of GPUs)
- `pp1` indicates single-stage pipeline parallelism
- The `-lora` suffix indicates the profile supports LoRA adapter loading
For example, `vllm-bf16-tp4-pp1-lora` uses BF16 precision across four GPUs with LoRA support.
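As an illustration, the naming pattern can be decomposed with a small parser. `parse_profile` and the regular expression below are hypothetical helpers for this page, not part of NIM:

```python
import re

# Hypothetical helper (not part of NIM): split a profile description such as
# "vllm-bf16-tp4-pp1-lora" into its component tags.
PROFILE_RE = re.compile(
    r"^vllm-(?P<precision>bf16|fp8|mxfp4|nvfp4)"
    r"-tp(?P<tp>\d+)-pp(?P<pp>\d+)(?P<lora>-lora)?$"
)

def parse_profile(description: str) -> dict:
    match = PROFILE_RE.match(description)
    if match is None:
        raise ValueError(f"unrecognized profile description: {description}")
    return {
        "precision": match["precision"],
        "tp": int(match["tp"]),             # tensor parallelism (GPU count)
        "pp": int(match["pp"]),             # pipeline parallelism stages
        "lora": match["lora"] is not None,  # LoRA adapter support
    }
```

For instance, `parse_profile("vllm-bf16-tp4-pp1-lora")` yields BF16 precision, `tp=4`, `pp=1`, and LoRA enabled.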
Listing Available Profiles#
Run the list-model-profiles command to see which profiles are available in a container:
```bash
docker run --rm --gpus=all \
  <nim_llm_image> \
  list-model-profiles
```
Example output:
```
MODEL PROFILES
- Compatible with system and runnable:
  - dcec66a50892315842bdc46d5b2d8648fed3fe3d3382437f0a811c56eff8c39c (vllm-bf16-tp1-pp1) [requires >=18 GB/gpu]
  - With LoRA support:
    - d66193b819d2bc2ae40aefcec0da5997b5f9187dd79b8155ec111b16999d18e0 (vllm-bf16-tp1-pp1-feat_lora) [requires >=22 GB/gpu]
- Compatible with system but low memory:
  - a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2 (vllm-bf16-tp1-pp1) [requires >=45 GB/gpu, try --max-model-len=4096 to reduce to >=30 GB/gpu]
- Incompatible with system:
  - 27af459c9caa0f9b34d5e07e5962960df6b0120df2039d06148e0e63595195e5 (vllm-bf16-tp2-pp1)
  - 30d16624c8100d40e6cde3af7f4e4ff6028f776e92efdcf09fcb515ae65662c0 (vllm-bf16-tp4-pp1)
  - 6f888502f35dc189f8c67f3e11174028a4ce42e92868e6a0ca10ef1d84953874 (vllm-bf16-tp8-pp1)
```
Each profile has:

- Profile ID: A unique 64-character identifier.
- Profile description: A human-readable string constructed by joining tag values with hyphens (for example, `vllm-fp16-tp1-pp1`).
- Memory annotation: An estimated VRAM requirement per GPU, shown in brackets (for example, `[requires >=18 GB/gpu]`).
Memory-Based Profile Classification#
NIM estimates the GPU VRAM required by each profile and classifies it into one of three categories based on the available memory on the system:
| Category | Meaning | Action |
|---|---|---|
| Compatible | Estimated VRAM fits within available GPU memory | Profile can be selected and deployed |
| Low memory | Model weights fit, but the full context length exceeds available memory | Profile can run with a reduced `--max-model-len` |
| Incompatible | Model weights alone exceed available GPU memory | Profile cannot run on this hardware. Consider a profile with higher tensor parallelism or a quantized precision. |
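The three categories amount to a simple comparison of per-GPU estimates. The sketch below mirrors that classification; the function name and arguments are illustrative, and NIM's internal estimator is more detailed:

```python
def classify_profile(weights_gb: float, full_context_gb: float,
                     available_gb: float) -> str:
    """Illustrative classification from per-GPU estimates (not NIM's API)."""
    if full_context_gb <= available_gb:
        return "compatible"     # weights + full-context KV cache fit
    if weights_gb <= available_gb:
        return "low memory"     # weights fit; reduce --max-model-len
    return "incompatible"       # weights alone exceed available VRAM
```

For example, a profile whose weights need 30 GB but whose full-context footprint is 45 GB lands in "low memory" on a 40 GB GPU.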
If a profile is classified as low memory, the listing output includes a suggestion. For example:
```
[requires >=45 GB/gpu, try --max-model-len=4096 to reduce to >=30 GB/gpu]
```
You can apply the suggestion by passing the --max-model-len argument:
```bash
docker run --rm -it --gpus=all \
  -p 8000:8000 \
  <nim_llm_image> \
  --max-model-len 4096
```
Note
Reducing --max-model-len limits the maximum sequence length (input + output tokens) per request. Choose a value that fits your use case.
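The suggestion works because KV-cache memory grows roughly linearly with the maximum sequence length, so the per-GPU requirement is approximately model weights plus a per-token KV cost. A rough sketch with made-up numbers (not NIM's estimator):

```python
def estimated_vram_gb(weights_gb: float, kv_gb_per_1k_tokens: float,
                      max_model_len: int) -> float:
    # KV-cache memory scales roughly linearly with the maximum sequence
    # length, so lowering --max-model-len lowers the per-GPU requirement.
    # Illustrative formula only; NIM also accounts for activations and overhead.
    return weights_gb + kv_gb_per_1k_tokens * (max_model_len / 1024)
```

With 30 GB of weights and a hypothetical 1 GB of KV cache per 1k tokens, `--max-model-len 4096` needs about 34 GB per GPU instead of, say, 38 GB at 8192 tokens.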
How Profile Selection Works#
NIM uses a priority-ordered selection chain to decide which profile to use. The chain is evaluated top-to-bottom; the first selector that produces a match wins.
| Priority | Selector | Trigger | Description |
|---|---|---|---|
| 1 (highest) | Default profile selector | `NIM_MODEL_PROFILE="default"` | Selects the first hardware-compatible profile using backend priority. |
| 2 | Environment-based profile selector | `NIM_MODEL_PROFILE` set to a profile ID or description | Matches an explicit profile by checksum or description. |
| 3 | Memory-aware profile selector | (automatic) | Estimates VRAM requirements for each profile and filters out profiles that exceed available GPU memory. Prefers non-LoRA profiles unless LoRA is enabled. |
| 4 (lowest) | Manifest profile selector | (no env var set) | Falls back to the profiles as ordered in the manifest. |
The memory-aware selector runs automatically as part of the selection chain. It estimates GPU memory requirements for each candidate profile by analyzing model weights, KV cache, activations, and overhead. Profiles that do not fit in available GPU memory are excluded from selection.
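The chain above can be pictured as ordinary short-circuiting code. This is a simplified illustration; the selector names, the `fits()` memory check, and the profile dictionaries are assumptions for this page, not NIM's internal API:

```python
def select_profile(profiles, env, fits):
    """Sketch of the priority-ordered chain: selectors are evaluated
    top-to-bottom and the first one that produces a match wins.
    `profiles` preserves manifest order; `fits(p)` is the memory check."""
    requested = env.get("NIM_MODEL_PROFILE")

    def default_selector():       # priority 1: NIM_MODEL_PROFILE="default"
        if requested == "default":
            return next((p for p in profiles if fits(p)), None)

    def env_selector():           # priority 2: explicit ID or description
        if requested and requested != "default":
            return next((p for p in profiles
                         if requested in (p["id"], p["description"])), None)

    def memory_selector():        # priority 3: automatic, prefers non-LoRA
        compatible = [p for p in profiles if fits(p)]
        non_lora = [p for p in compatible if not p.get("lora")]
        return (non_lora or compatible or [None])[0]

    for selector in (default_selector, env_selector, memory_selector):
        chosen = selector()
        if chosen is not None:
            return chosen
    return None
```

With no environment variable set, the memory-aware step picks the first compatible non-LoRA profile; an explicit `NIM_MODEL_PROFILE` value short-circuits that.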
Selecting a Profile#
The method you use to select a profile depends on your requirements and environment. You can allow NIM to pick a suitable profile automatically, or you can explicitly specify the exact profile you want by ID or by description.
Automatic Selection (Default)#
If you do not set NIM_MODEL_PROFILE, NIM automatically selects the best compatible profile from the manifest based on your hardware (GPU device, available VRAM, estimated memory requirements, and parallelism constraints).
```bash
docker run --rm -it --gpus=all \
  -p 8000:8000 \
  <nim_llm_image>
```
Intelligent Default Selection#
Setting NIM_MODEL_PROFILE to "default" triggers intelligent default selection. NIM picks the best compatible profile based on:
Hardware compatibility (GPU device, VRAM)
Backend priority
LoRA configuration
```bash
docker run --rm -it --gpus=all \
  -e NIM_MODEL_PROFILE="default" \
  -p 8000:8000 \
  <nim_llm_image>
```
Explicit Selection by Profile ID#
Specify the full Profile ID for deterministic, version-safe selection — the profile is guaranteed to match even if tags are later modified.
```bash
docker run --rm -it --gpus=all \
  -e NIM_MODEL_PROFILE="70edb8bb9f8511ce2ea195e3caebcc3c7191dc27fea0c8d4acf9c0d9a69e43cd" \
  -p 8000:8000 \
  <nim_llm_image>
```
Explicit Selection by Profile Description (Friendly Name)#
If the value of NIM_MODEL_PROFILE is not a valid Profile ID, NIM tries to match it against the profile description — a human-readable string constructed from ordered profile tags.
```bash
docker run --rm -it --gpus=all \
  -e NIM_MODEL_PROFILE=vllm-fp16-tp1-pp1 \
  -p 8000:8000 \
  <nim_llm_image>
```
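One way to picture the fallback from ID to description: profile IDs in the listing are 64-character hex checksums, so a simple shape check can tell the two forms apart. This heuristic is illustrative only, not NIM's exact matching logic:

```python
import re

def looks_like_profile_id(value: str) -> bool:
    # Profile IDs in the listing output are 64-character lowercase hex
    # checksums; anything else would be matched as a profile description.
    # (Illustrative heuristic, not NIM's internal implementation.)
    return re.fullmatch(r"[0-9a-f]{64}", value) is not None
```

For example, the checksum from the previous section matches, while `vllm-fp16-tp1-pp1` does not.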
Tip
Use list-model-profiles to discover the exact profile IDs and descriptions available in your container.
Configuration Precedence#
NIM_MODEL_PROFILE provides a convenient way to specify deployment defaults, but it can be overridden by backend-native arguments. The precedence hierarchy is:

1. Backend-native arguments (highest precedence): CLI arguments or flags passed directly to the backend (for example, vLLM CLI arguments that control tensor parallelism) always take precedence.
2. `NIM_MODEL_PROFILE` configuration (lower precedence): Settings parsed from the profile are applied as defaults unless explicitly overridden by a backend argument.
For example, if a profile specifies tp=2 but the user also passes --tensor-parallel-size 4 as a vLLM CLI argument, the backend launches with TP=4.
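That override behavior amounts to a lookup with the profile value as the fallback. `resolve_tensor_parallel` is a hypothetical helper for illustration; the real resolution happens inside NIM:

```python
def resolve_tensor_parallel(profile_tp: int, cli_args: dict) -> int:
    # Backend-native arguments win; the profile value is only a default.
    # (Illustrative helper, not NIM's internal implementation.)
    return cli_args.get("--tensor-parallel-size", profile_tp)
```

So a profile with `tp=2` resolves to TP=4 when `--tensor-parallel-size 4` is passed, and to TP=2 when no backend argument is given.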
Important
When backend arguments override profile settings, the overridden values are resolved before model download. NIM selects and downloads the profile that matches the final resolved configuration, so the downloaded model files always match the launch configuration.
Using vLLM CLI Arguments#
You can also control parallelism and other settings directly through vLLM CLI arguments instead of (or in addition to) NIM_MODEL_PROFILE:
```bash
docker run --rm -it --gpus=all \
  -p 8000:8000 \
  <nim_llm_image> \
  --tensor-parallel-size 2
```
Common vLLM CLI arguments include the following:
| vLLM CLI Argument | Purpose | Default |
|---|---|---|
| `--tensor-parallel-size` | Number of tensor-parallel GPUs | 1 |
| `--pipeline-parallel-size` | Number of pipeline-parallel stages | 1 |
| `--enable-lora` | Enable LoRA adapter support | Disabled |
Changes from NIM LLM 1.x#
The following profile selection mechanisms from NIM LLM version 1.x are no longer supported:
| Removed Feature | 1.x Example |
|---|---|
| Custom profile selectors | |
| LLM-based profile selector (backend priority chain) | Automatic backend priority: TensorRT-LLM > vLLM > SGLang |
| Tag-based profile selector | |
Tip
Use NIM_MODEL_PROFILE with a profile ID or description as a replacement for these deprecated mechanisms. For further guidance, see the 1.x Migration Guide.