Advanced Configuration#
NIM LLM uses a layered configuration system that resolves values from multiple sources — CLI arguments, environment variables, runtime config files, and model profile tags — with well-defined priorities and provenance tracking. This page describes how the configuration system works and how to use it for advanced deployment scenarios.
For basic configuration (model path, cache, logging), see the environment variables reference. For vLLM-specific CLI arguments, see the vLLM CLI documentation.
Configuration Priority#
Configuration values are resolved from multiple sources with the following priority (highest to lowest):
| Priority | Source | Description |
|---|---|---|
| 1 (highest) | CLI Arguments | Arguments passed after `nim-serve` |
| 2 | Passthrough Arguments | CLI-style arguments supplied via `NIM_PASSTHROUGH_ARGS` |
| 3 | Environment Variables | NIM-specific environment variables |
| 4 | Runtime Config | Values from `runtime_config.json` in the model workspace |
| 5 | Profile Tags | Values from nimlib profile metadata |
| 6 (lowest) | NIM Defaults | Built-in defaults applied only when no other source sets the field |
Higher-priority sources overwrite lower-priority sources. If the same parameter is set in multiple sources, the highest-priority value wins.
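The priority chain above can be sketched as a simple last-write-wins merge over per-source dictionaries. This is an illustrative sketch, not NIM's actual implementation; the source names mirror the table.

```python
# Sources listed lowest priority first, so later dicts overwrite earlier ones.
PRIORITY_ORDER = ["DEFAULT", "PROFILE", "RUNTIME", "ENV", "PASSTHROUGH", "CLI"]

def resolve(sources: dict) -> dict:
    """Merge per-source config dicts; the highest-priority value wins."""
    resolved = {}
    for source in PRIORITY_ORDER:
        resolved.update(sources.get(source, {}))
    return resolved

config = resolve({
    "RUNTIME": {"tensor_parallel_size": 1, "enable_prefix_caching": True},
    "ENV": {"tensor_parallel_size": 2},
    "CLI": {"tensor_parallel_size": 4},
})
# tensor_parallel_size comes from CLI; enable_prefix_caching survives from RUNTIME.
```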
Configuration Sources#
CLI Arguments#
Command-line arguments passed to the container after nim-serve. These use vLLM’s CLI argument format:
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-p 8000:8000 \
${NIM_LLM_MODEL_FREE_IMAGE}:2.0.0 \
nim-serve --tensor-parallel-size 4 --enable-prefix-caching --gpu-memory-utilization 0.9
vLLM uses Python’s argparse.BooleanOptionalAction for boolean flags:
--enable-prefix-caching sets the value to True
--no-enable-prefix-caching sets the value to False
If both NIM_PASSTHROUGH_ARGS and direct CLI arguments set the same parameter, the direct CLI argument takes precedence. If there are duplicate or contradictory CLI arguments (e.g., --enable-xyz followed by --no-enable-xyz), the last one wins.
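The boolean-flag and last-one-wins behavior can be reproduced with the standard library; the parser below is a minimal illustration, not vLLM's actual argument definitions.

```python
import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction auto-generates the paired --no- variant of the flag.
parser.add_argument("--enable-prefix-caching",
                    action=argparse.BooleanOptionalAction, default=None)
parser.add_argument("--tensor-parallel-size", type=int)

# With contradictory flags, argparse keeps the last occurrence.
args = parser.parse_args(
    ["--enable-prefix-caching", "--no-enable-prefix-caching",
     "--tensor-parallel-size", "4"])
print(args.enable_prefix_caching)  # False (last flag wins)
```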
Passthrough Arguments (NIM_PASSTHROUGH_ARGS)#
For environments where direct CLI arguments are not available (e.g., container orchestrators like Kubernetes), use NIM_PASSTHROUGH_ARGS to pass CLI-style arguments via an environment variable:
export NIM_PASSTHROUGH_ARGS="--tensor-parallel-size 4 --enable-prefix-caching --gpu-memory-utilization 0.9"
Passthrough arguments support all vLLM CLI arguments, boolean flags, and shell-style quoting via shlex.split().
Kubernetes example:
env:
- name: NIM_PASSTHROUGH_ARGS
value: "--tensor-parallel-size 4 --enable-prefix-caching"
Passthrough with JSON values:
export NIM_PASSTHROUGH_ARGS="--compilation-config '{\"pass_config\": {\"fuse_allreduce_rms\": false}}'"
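The tokenization step described above can be demonstrated directly with `shlex.split()`; the input string is taken from the examples, and the snippet only illustrates how quoting keeps a JSON payload intact as a single argument.

```python
import shlex

# Value as it would appear in NIM_PASSTHROUGH_ARGS, including a quoted JSON blob.
raw = ("--tensor-parallel-size 4 --compilation-config "
       "'{\"pass_config\": {\"fuse_allreduce_rms\": false}}'")

tokens = shlex.split(raw)
# Shell-style quoting keeps the JSON payload as one token.
print(tokens)
```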
NIM Environment Variables#
NIM defines a small set of environment variables that map to vLLM arguments. These provide a stable, NIM-specific interface for the most commonly used parameters:
| NIM Environment Variable | vLLM Argument | Type | Default |
|---|---|---|---|
| | | int | |
| | | int | |
| | | int | auto |
| | | bool | |
| | | bool | |
For any vLLM argument that does not have a dedicated NIM environment variable, use NIM_PASSTHROUGH_ARGS.
Runtime Config (runtime_config.json)#
A JSON file placed in the model workspace directory. NIM reads it automatically if present:
{
"tensor_parallel_size": 2,
"enable_prefix_caching": true,
"max_model_len": 8192
}
Unknown keys are passed through to vLLM as-is.
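A minimal sketch of this behavior, assuming only what the section states (the file is read if present, and unknown keys are forwarded untouched); the workspace path handling here is illustrative.

```python
import json
import os
import tempfile

# Stand-in for a model workspace containing runtime_config.json.
workspace = tempfile.mkdtemp()
path = os.path.join(workspace, "runtime_config.json")
with open(path, "w") as f:
    json.dump({"tensor_parallel_size": 2, "some_future_vllm_flag": "x"}, f)

config = {}
if os.path.exists(path):  # read automatically only if the file is present
    with open(path) as f:
        config = json.load(f)
# Unknown keys (e.g. "some_future_vllm_flag") are kept and passed to vLLM as-is.
```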
NIM Defaults#
Default values that are applied only if no other source sets the field:
| Parameter | Default |
|---|---|
| | |
| | |
Override Warnings and Strict Mode#
Override Detection#
When a higher-priority source overwrites a value from a lower-priority source, NIM logs a warning:
WARNING: Config override: 'tensor_parallel_size' changed from 2 (RUNTIME) to 8 (CLI)
Strict Mode#
Set NIM_STRICT_ARG_PROCESSING=true to treat override warnings as errors:
export NIM_STRICT_ARG_PROCESSING=true
In strict mode, the container exits with an error if any configuration override is detected between non-default sources. This is useful for:
CI/CD pipelines where you want to catch configuration conflicts.
Production deployments where configuration should be deterministic.
Example:
export NIM_STRICT_ARG_PROCESSING=true
export NIM_TENSOR_PARALLEL_SIZE=2
docker run ... nim-serve --tensor-parallel-size 4
Result:
ERROR: Config override detected in strict mode: 'tensor_parallel_size' changed from 2 (ENV) to 4 (CLI).
Set NIM_STRICT_ARG_PROCESSING=false to allow intentional overrides.
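The warn-versus-fail behavior can be sketched as below; the function name and message format are illustrative, matching the log lines shown above rather than NIM's internal code.

```python
import os

def check_override(name, old, old_src, new, new_src):
    """Warn on a config override, or raise in strict mode."""
    msg = (f"Config override: '{name}' changed from "
           f"{old} ({old_src}) to {new} ({new_src})")
    strict = os.environ.get("NIM_STRICT_ARG_PROCESSING", "").lower() == "true"
    if strict:
        raise ValueError(f"Config override detected in strict mode: {msg}")
    print(f"WARNING: {msg}")

os.environ["NIM_STRICT_ARG_PROCESSING"] = "true"
try:
    check_override("tensor_parallel_size", 2, "ENV", 4, "CLI")
except ValueError as exc:
    print(exc)  # in strict mode the container would exit with this error
```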
Dry Run#
Use --dry-run with nim-serve to print the fully resolved configuration and the resulting vLLM arguments without starting the server:
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-e NIM_TENSOR_PARALLEL_SIZE=2 \
${NIM_LLM_MODEL_FREE_IMAGE}:2.0.0 \
nim-serve --dry-run
This prints the resolved configuration with provenance for each parameter, showing which source provided each value.
Denied Arguments#
Certain vLLM CLI arguments are blocked in NIM containers because they are managed by the nginx proxy or other system components. If you pass a denied argument, NIM logs a warning and ignores it.
| Denied Argument | Reason | NIM Alternative |
|---|---|---|
| `--host` | Networking is managed by nginx | — |
| `--port` | Port is managed by nginx | |
| `--ssl-keyfile` | SSL/TLS is managed by nginx | |
| `--ssl-certfile` | SSL/TLS is managed by nginx | |
| | SSL/TLS is managed by nginx | |
| | SSL/TLS is managed by nginx | |
| | API routing is managed by nginx | — |
| | Authentication should be managed externally | — |
| | Logging is managed by NIM | |
| | Middleware is managed by nginx | — |
| `--allowed-origins` | CORS should be configured in nginx | |
| `--allowed-methods` | CORS should be configured in nginx | |
| `--allowed-headers` | CORS should be configured in nginx | |
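The warn-and-ignore behavior can be sketched as a filter over the argument list. This is an illustrative function, not NIM's implementation, and the deny list below includes only arguments this page names.

```python
DENIED = {"--ssl-keyfile", "--ssl-certfile",
          "--allowed-origins", "--allowed-methods", "--allowed-headers"}

def filter_args(argv):
    """Drop denied flags (and their values) with a warning; keep the rest."""
    kept, skip_value = [], False
    for i, arg in enumerate(argv):
        if skip_value:
            skip_value = False
            continue
        flag = arg.split("=", 1)[0]
        if flag in DENIED:
            print(f"WARNING: argument '{flag}' is denied in NIM; ignoring")
            # Also drop the flag's value when it was given as a separate token.
            if "=" not in arg and i + 1 < len(argv) and not argv[i + 1].startswith("--"):
                skip_value = True
            continue
        kept.append(arg)
    return kept

args = filter_args(["--ssl-keyfile", "k.pem", "--tensor-parallel-size", "4"])
# args == ["--tensor-parallel-size", "4"]
```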
Validation and Error Handling#
Where Validation Occurs#
| Stage | What's Validated | Error Behavior |
|---|---|---|
| CLI Parsing | Argument format, type conversion | Warning logged, invalid arg skipped |
| Env Parsing | Type conversion | Warning logged, invalid value skipped |
| Runtime Config | JSON syntax, type conversion | Warning logged, invalid key skipped |
| Config Merge | Override detection | Warning (or error in strict mode) |
| vLLM Startup | Argument validity, model compatibility | vLLM exits with error |
Invalid Parameter Handling#
Invalid NIM parameters: Logged as warnings; the container continues with the remaining valid values.
Unknown parameters: Passed through to vLLM as-is. vLLM handles its own validation.
Invalid vLLM parameters: vLLM validates and reports errors at startup.
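The NIM-side warn-and-skip handling can be sketched as a lenient conversion helper; the function and its message format are illustrative.

```python
def parse_typed(name, raw, cast):
    """Convert raw to the target type; warn and skip on failure."""
    try:
        return cast(raw)
    except (TypeError, ValueError) as exc:
        print(f"WARNING: Failed to parse '{name}' with value '{raw}': {exc}")
        return None  # field left unset; a lower-priority source may fill it

tp = parse_typed("--tensor-parallel-size", "abc", int)     # warns, returns None
gpu = parse_typed("--gpu-memory-utilization", "0.9", float)
```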
Example Error Messages#
# NIM-side type error (warning, non-fatal)
WARNING: Failed to parse CLI arg '--tensor-parallel-size' with value 'abc': invalid literal for int()
# vLLM-side validation error (fatal)
ValueError: tensor_parallel_size must be a positive integer
Provenance Tracking#
NIM tracks the source of every configuration parameter. At startup, a structured config_resolved JSON event is emitted showing the provenance of each resolved value:
{
"tensor_parallel_size": {"value": 4, "source": "CLI"},
"pipeline_parallel_size": {"value": 1, "source": "PROFILE"},
"gpu_memory_utilization": {"value": 0.9, "source": "ENV"},
"enable_prefix_caching": {"value": true, "source": "RUNTIME"}
}
When a value is overridden, the previous value and source are also recorded:
{
"tensor_parallel_size": {
"value": 4,
"source": "CLI",
"previous_value": 2,
"previous_source": "ENV"
}
}
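The provenance records above can be modeled with a small container that remembers the previous value and source on every override; this class is an illustration of the event structure, not NIM's implementation.

```python
class ProvenanceConfig:
    """Track each parameter's value, source, and any overridden predecessor."""

    def __init__(self):
        self._entries = {}

    def set(self, name, value, source):
        entry = {"value": value, "source": source}
        if name in self._entries:  # record the value being overridden
            prev = self._entries[name]
            entry["previous_value"] = prev["value"]
            entry["previous_source"] = prev["source"]
        self._entries[name] = entry

    def report(self):
        return self._entries

cfg = ProvenanceConfig()
cfg.set("tensor_parallel_size", 2, "ENV")
cfg.set("tensor_parallel_size", 4, "CLI")
# report() now includes previous_value/previous_source for the overridden field.
```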
Use nim-serve --dry-run to inspect the full provenance report without starting the server.
GPU Memory Management#
NIM includes automatic GPU memory management to prevent out-of-memory (OOM) failures on constrained hardware. This involves two mechanisms: automatic clamping of gpu_memory_utilization and a post-selection memory warning.
Automatic GPU Memory Clamping#
On certain hardware configurations, NIM automatically reduces the gpu_memory_utilization parameter (which controls what fraction of GPU memory vLLM is allowed to use) to prevent OOM errors:
| Condition | Cap | Reason |
|---|---|---|
| vGPU guest (partitioned GPU) | 0.75 | GPU memory is shared or partitioned across virtual machines |
| UMA device (DGX Spark, GH200) | 0.50 | CPU and GPU share the same physical memory |
| Busy GPU (other processes using memory) | | Prevents contention with other workloads |
The clamped value is applied at a low internal priority, so any explicit setting of gpu_memory_utilization (from an environment variable, runtime config, or the --gpu-memory-utilization CLI argument) takes precedence. The minimum floor is 0.10: NIM never sets GPU memory utilization below 10%.
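The caps and floor can be sketched as a small helper. This is illustrative code, not NIM's internal logic; the busy-GPU cap is omitted because its value depends on runtime measurement.

```python
def clamp_gpu_mem_util(requested, is_vgpu=False, is_uma=False):
    """Apply the hardware caps (0.75 vGPU, 0.50 UMA) and the 0.10 floor."""
    cap = 1.0
    if is_vgpu:
        cap = min(cap, 0.75)
    if is_uma:
        cap = min(cap, 0.50)
    # Never go below the 10% floor, even for very small requested values.
    return max(0.10, min(requested, cap))

print(clamp_gpu_mem_util(0.9, is_uma=True))   # capped to 0.5
print(clamp_gpu_mem_util(0.05))               # raised to the 0.1 floor
```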
Post-Selection Memory Warning#
After selecting a profile and resolving all configuration, NIM estimates the total VRAM required for the model (weights, KV cache, activations, and overhead) and compares it to available GPU memory. If the estimate exceeds available memory, NIM logs an advisory warning:
WARNING: Estimated VRAM (45.2 GB) exceeds available GPU memory (39.6 GB).
Consider reducing context length with --max-model-len=4096 (estimated 30.1 GB).
This warning is advisory only – NIM proceeds with startup regardless. The suggested --max-model-len value, if provided, indicates a context length that would fit within available memory.
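The advisory-only behavior can be sketched as a comparison that warns but never aborts; the estimation itself (weights, KV cache, activations, overhead) is not reproduced here, and the function is illustrative.

```python
def vram_advisory(estimated_gb, available_gb):
    """Warn if the VRAM estimate exceeds available memory; never raise."""
    if estimated_gb > available_gb:
        print(f"WARNING: Estimated VRAM ({estimated_gb:.1f} GB) exceeds "
              f"available GPU memory ({available_gb:.1f} GB).")
        return False  # advisory only; startup continues regardless
    return True

vram_advisory(45.2, 39.6)  # logs a warning but does not stop startup
```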
To apply the suggestion, pass --max-model-len as a CLI argument:
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-p 8000:8000 \
${NIM_LLM_MODEL_FREE_IMAGE}:2.0.0 \
nim-serve --max-model-len 4096
SSL/TLS Configuration#
NIM terminates TLS at the nginx proxy layer. vLLM’s native SSL arguments (--ssl-keyfile, --ssl-certfile, etc.) are denied — use the NIM SSL variables instead.
| Variable | Description | Default |
|---|---|---|
| `NIM_SSL_MODE` | SSL mode: `TLS` or `MTLS` | |
| `NIM_SSL_KEY_PATH` | Path to TLS private key | — |
| `NIM_SSL_CERTS_PATH` | Path to TLS certificate | — |
| `NIM_SSL_CA_CERTS_PATH` | Path to CA certificate (required for `MTLS`) | — |
TLS mode (NIM_SSL_MODE=TLS): Server presents a certificate; client certificate is not required. Requires NIM_SSL_CERTS_PATH and NIM_SSL_KEY_PATH.
Mutual TLS mode (NIM_SSL_MODE=MTLS): Both server and client present certificates. Requires NIM_SSL_CERTS_PATH, NIM_SSL_KEY_PATH, and NIM_SSL_CA_CERTS_PATH.
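The per-mode requirements above can be expressed as a small validation sketch; the function and error text are illustrative, while the variable names and mode requirements come from this section.

```python
# Required variables per mode, as stated above.
REQUIRED = {
    "TLS": ["NIM_SSL_CERTS_PATH", "NIM_SSL_KEY_PATH"],
    "MTLS": ["NIM_SSL_CERTS_PATH", "NIM_SSL_KEY_PATH", "NIM_SSL_CA_CERTS_PATH"],
}

def validate_ssl_env(env):
    """Raise if the configured SSL mode is missing any required variable."""
    mode = env.get("NIM_SSL_MODE")
    missing = [v for v in REQUIRED.get(mode, []) if not env.get(v)]
    if missing:
        raise ValueError(f"NIM_SSL_MODE={mode} requires: {', '.join(missing)}")

validate_ssl_env({"NIM_SSL_MODE": "TLS",
                  "NIM_SSL_CERTS_PATH": "/certs/server.crt",
                  "NIM_SSL_KEY_PATH": "/certs/server.key"})  # passes
```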
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-e NIM_SSL_MODE=TLS \
-e NIM_SSL_KEY_PATH=/certs/server.key \
-e NIM_SSL_CERTS_PATH=/certs/server.crt \
-v /path/to/certs:/certs:ro \
-p 8000:8000 \
${NIM_LLM_MODEL_FREE_IMAGE}:2.0.0
CORS Configuration#
CORS is handled by the nginx proxy. The following variables control CORS headers:
| Variable | Description | Default |
|---|---|---|
| | Allowed origins | |
| | Allowed HTTP methods | |
| | Allowed request headers | |
| | Headers exposed to the browser | |
| | Preflight cache duration (seconds) | |
Note
vLLM’s --allowed-origins, --allowed-methods, and --allowed-headers arguments are denied in NIM because CORS is managed at the nginx layer. Use the NIM_CORS_* environment variables instead.
Examples#
CLI Overrides Environment Variable#
export NIM_TENSOR_PARALLEL_SIZE=2
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-e NIM_TENSOR_PARALLEL_SIZE \
-p 8000:8000 \
${NIM_LLM_MODEL_FREE_IMAGE}:2.0.0 \
nim-serve --tensor-parallel-size 4
Result: tensor_parallel_size = 4 (CLI wins over env)
WARNING: Config override: 'tensor_parallel_size' changed from 2 (ENV) to 4 (CLI)
Full Priority Chain#
Given all sources set tensor_parallel_size:
runtime_config.json: {"tensor_parallel_size": 1, "enable_prefix_caching": true}
Environment: NIM_TENSOR_PARALLEL_SIZE=2
NIM_PASSTHROUGH_ARGS: --tensor-parallel-size 3
CLI: nim-serve --tensor-parallel-size 4 --no-enable-prefix-caching
Result:
tensor_parallel_size = 4 (CLI overrides passthrough overrides env overrides runtime)
enable_prefix_caching = False (CLI --no- flag overrides runtime true)
Kubernetes with Passthrough Args#
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
containers:
- name: nim-llm
image: <NIM_LLM_MODEL_FREE_IMAGE>:2.0.0
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
name: ngc-secret
key: api-key
- name: NIM_MODEL_PATH
value: "ngc://nim/meta/llama-3.1-8b-instruct"
- name: NIM_CACHE_PATH
value: "/opt/nim/.cache"
- name: NIM_PASSTHROUGH_ARGS
value: "--enable-prefix-caching --max-num-batched-tokens 8192 --enable-chunked-prefill"
ports:
- containerPort: 8000
livenessProbe:
httpGet:
path: /v1/health/live
port: 8000
readinessProbe:
httpGet:
path: /v1/health/ready
port: 8000
Enabling TLS#
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-e NIM_SSL_MODE=TLS \
-e NIM_SSL_KEY_PATH=/certs/server.key \
-e NIM_SSL_CERTS_PATH=/certs/server.crt \
-v /path/to/certs:/certs:ro \
-p 8000:8000 \
${NIM_LLM_MODEL_FREE_IMAGE}:2.0.0
curl --cacert /path/to/certs/ca.crt \
https://localhost:8000/v1/health/ready