Advanced Configuration#

NIM LLM uses a layered configuration system that resolves values from multiple sources — CLI arguments, environment variables, runtime config files, and model profile tags — with well-defined priorities and provenance tracking. This page describes how the configuration system works and how to use it for advanced deployment scenarios.

For basic configuration such as model path, cache, and logging, refer to Environment Variables. For vLLM-specific CLI arguments, refer to the vLLM CLI documentation.

Configuration Priority#

Configuration values are resolved from multiple sources with the following priority (highest to lowest):

| Priority | Source | Description |
|----------|--------|-------------|
| 1 (highest) | CLI Arguments | Arguments passed after nim-serve, e.g., nim-serve --tensor-parallel-size 4 |
| 2 | Passthrough Arguments | NIM_PASSTHROUGH_ARGS environment variable, parsed as CLI-style arguments |
| 3 | Environment Variables | NIM-specific NIM_* variables that map to vLLM arguments |
| 4 | Runtime Config | runtime_config.json in the model workspace |
| 5 | Profile Tags | Values from nimlib profile metadata (e.g., tp, pp) |
| 6 (lowest) | NIM Defaults | Built-in defaults applied only when no other source sets the field |

Higher-priority sources overwrite lower-priority sources. If the same parameter is set in multiple sources, the highest-priority value wins.
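The merge behavior can be illustrated with a short Python sketch. This is illustrative only, not NIM internals; the function name and layer structure are assumptions. Layers are supplied lowest priority first, so later layers win:

```python
def resolve(*layers):
    """Merge config layers (dicts); later (higher-priority) layers win."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

# Lowest to highest priority: defaults, profile tags, environment variables.
defaults = {"tensor_parallel_size": 1, "pipeline_parallel_size": 1}
profile = {"tensor_parallel_size": 2}
env = {"tensor_parallel_size": 4}

config = resolve(defaults, profile, env)
print(config["tensor_parallel_size"])   # 4: ENV outranks PROFILE and defaults
print(config["pipeline_parallel_size"]) # 1: only the defaults set this field
```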

Configuration Sources#

NIM resolves configuration values from the following sources.

CLI Arguments#

Command line arguments passed to the container after nim-serve use the vLLM CLI argument format:

docker run --gpus=all \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1 \
  nim-serve --tensor-parallel-size 4 --enable-prefix-caching --gpu-memory-utilization 0.9

vLLM uses Python’s argparse.BooleanOptionalAction for boolean flags:

  • --enable-prefix-caching sets the value to True

  • --no-enable-prefix-caching sets the value to False

If both NIM_PASSTHROUGH_ARGS and direct CLI arguments set the same parameter, the direct CLI argument takes precedence. If there are duplicate or contradictory CLI arguments (e.g., --enable-xyz followed by --no-enable-xyz), the last one wins.
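The boolean-flag behavior, including last-one-wins for contradictory flags, can be reproduced with Python's standard argparse module directly (a standalone demonstration, not NIM code):

```python
import argparse

# BooleanOptionalAction generates both --flag and --no-flag forms.
parser = argparse.ArgumentParser()
parser.add_argument("--enable-prefix-caching",
                    action=argparse.BooleanOptionalAction)

print(parser.parse_args(["--enable-prefix-caching"]).enable_prefix_caching)
# True
print(parser.parse_args(["--no-enable-prefix-caching"]).enable_prefix_caching)
# False

# Contradictory flags: the last occurrence wins.
args = parser.parse_args(["--enable-prefix-caching",
                          "--no-enable-prefix-caching"])
print(args.enable_prefix_caching)
# False
```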

Passthrough Arguments (NIM_PASSTHROUGH_ARGS)#

In environments where you cannot pass CLI arguments directly, such as Kubernetes and other container orchestrators, use NIM_PASSTHROUGH_ARGS to supply CLI-style arguments through an environment variable:

export NIM_PASSTHROUGH_ARGS="--tensor-parallel-size 4 --enable-prefix-caching --gpu-memory-utilization 0.9"

Passthrough arguments support all vLLM CLI arguments and boolean flags; the string is tokenized with shell-style quoting via shlex.split().

For example, in Kubernetes:

env:
  - name: NIM_PASSTHROUGH_ARGS
    value: "--tensor-parallel-size 4 --enable-prefix-caching"

To pass JSON values through NIM_PASSTHROUGH_ARGS, use a command such as the following:

export NIM_PASSTHROUGH_ARGS="--compilation-config '{\"pass_config\": {\"fuse_allreduce_rms\": false}}'"
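Because the variable is tokenized with shlex.split(), the single quotes protect the embedded JSON so it reaches vLLM as a single argument value. A quick standalone check of that tokenization:

```python
import json
import shlex

# The string as it would appear inside NIM_PASSTHROUGH_ARGS.
raw = "--compilation-config '{\"pass_config\": {\"fuse_allreduce_rms\": false}}'"

tokens = shlex.split(raw)
print(tokens[0])  # --compilation-config
print(tokens[1])  # {"pass_config": {"fuse_allreduce_rms": false}}

# The second token is valid JSON.
print(json.loads(tokens[1])["pass_config"]["fuse_allreduce_rms"])  # False
```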

NIM Environment Variables#

NIM defines a small set of environment variables that map to vLLM arguments. These provide a stable, NIM-specific interface for the most commonly used parameters:

| NIM Environment Variable | vLLM Argument | Type | Default |
|--------------------------|---------------|------|---------|
| NIM_TENSOR_PARALLEL_SIZE | --tensor-parallel-size | int | 1 |
| NIM_PIPELINE_PARALLEL_SIZE | --pipeline-parallel-size | int | 1 |
| NIM_MAX_MODEL_LEN | --max-model-len | int | auto |
| NIM_TRUST_CUSTOM_CODE | --trust-remote-code | bool | false |
| NIM_DISABLE_CUDA_GRAPH | --enforce-eager | bool | false |

For any vLLM argument that does not have a dedicated NIM environment variable, use NIM_PASSTHROUGH_ARGS.
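The translation from environment variables to vLLM arguments can be sketched as a simple lookup table. The variable names on the left are from the table above; the helper function and its behavior are illustrative assumptions, not the actual NIM implementation:

```python
# Maps NIM environment variables to (vLLM flag, type). Illustrative subset.
NIM_ENV_MAP = {
    "NIM_TENSOR_PARALLEL_SIZE": ("--tensor-parallel-size", int),
    "NIM_MAX_MODEL_LEN": ("--max-model-len", int),
    "NIM_TRUST_CUSTOM_CODE": ("--trust-remote-code", bool),
}

def env_to_cli(environ):
    """Translate set NIM_* variables into a vLLM argument list (sketch)."""
    args = []
    for var, (flag, typ) in NIM_ENV_MAP.items():
        if var not in environ:
            continue
        if typ is bool:
            # Boolean variables emit the flag only when truthy.
            if environ[var].lower() in ("1", "true"):
                args.append(flag)
        else:
            args.extend([flag, environ[var]])
    return args

print(env_to_cli({"NIM_TENSOR_PARALLEL_SIZE": "2",
                  "NIM_TRUST_CUSTOM_CODE": "true"}))
# ['--tensor-parallel-size', '2', '--trust-remote-code']
```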

Runtime Config#

You can place a runtime_config.json file in the model workspace directory. NIM reads it automatically if it is present:

{
  "tensor_parallel_size": 2,
  "enable_prefix_caching": true,
  "max_model_len": 8192
}

Unknown keys are passed through to vLLM as-is.
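A minimal sketch of that behavior, assuming a hypothetical loader (NIM reads the file automatically; the function and KNOWN_KEYS set here are illustrative only): recognized keys map to NIM-managed parameters, and everything else is forwarded to vLLM unchanged.

```python
import json

# Illustrative subset of keys NIM handles itself.
KNOWN_KEYS = {"tensor_parallel_size", "pipeline_parallel_size",
              "enable_prefix_caching", "max_model_len"}

def load_runtime_config(path):
    """Load runtime_config.json; return (full config, keys passed to vLLM)."""
    with open(path) as f:
        config = json.load(f)
    unknown = {k: v for k, v in config.items() if k not in KNOWN_KEYS}
    return config, unknown
```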

Profile Tags#

Model profiles include metadata tags that configure parallelism. These values are extracted automatically when a profile is selected:

| Profile Tag | vLLM Parameter |
|-------------|----------------|
| tp | tensor_parallel_size |
| pp | pipeline_parallel_size |

For example, a profile named vllm-fp8-tp2-pp1 sets tensor_parallel_size=2 and pipeline_parallel_size=1 at profile priority.
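NIM reads these tags from nimlib profile metadata, not from the name string; the sketch below only illustrates how tags embedded in a name like vllm-fp8-tp2-pp1 correspond to the parameters above (the parsing function itself is hypothetical):

```python
import re

def parse_profile_tags(profile_name):
    """Extract tp/pp tags from a profile name (illustrative sketch)."""
    tags = {}
    m = re.search(r"-tp(\d+)", profile_name)
    if m:
        tags["tensor_parallel_size"] = int(m.group(1))
    m = re.search(r"-pp(\d+)", profile_name)
    if m:
        tags["pipeline_parallel_size"] = int(m.group(1))
    return tags

print(parse_profile_tags("vllm-fp8-tp2-pp1"))
# {'tensor_parallel_size': 2, 'pipeline_parallel_size': 1}
```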

NIM Defaults#

The following default values are applied only if no other source sets the field:

| Parameter | Default |
|-----------|---------|
| tensor_parallel_size | 1 |
| pipeline_parallel_size | 1 |

Override Warnings and Strict Mode#

By default, NIM reports configuration overrides as warnings; strict mode escalates them to fatal errors.

Override Detection#

When a higher-priority source overwrites a value from a lower-priority source, NIM logs a warning:

WARNING: Config override: 'tensor_parallel_size' changed from 2 (RUNTIME) to 8 (CLI)

Strict Mode#

Set NIM_STRICT_ARG_PROCESSING=true to treat override warnings as errors:

export NIM_STRICT_ARG_PROCESSING=true

In strict mode, the container exits with an error if any configuration override is detected between non-default sources. This setting is useful in the following cases:

  • CI/CD pipelines, where you want to catch configuration conflicts.

  • Production deployments, where configuration should be deterministic.

Example:

export NIM_STRICT_ARG_PROCESSING=true
export NIM_TENSOR_PARALLEL_SIZE=2

docker run ... nim-serve --tensor-parallel-size 4

Result:

ERROR: Config override detected in strict mode: 'tensor_parallel_size' changed from 2 (ENV) to 4 (CLI).
Set NIM_STRICT_ARG_PROCESSING=false to allow intentional overrides.

Dry Run#

Use --dry-run with nim-serve to print the fully resolved configuration and the resulting vLLM arguments without starting the server:

docker run --gpus=all \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  -e NIM_TENSOR_PARALLEL_SIZE=2 \
  ${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1 \
  nim-serve --dry-run

This prints the resolved configuration with provenance for each parameter, showing which source provided each value.

Denied Arguments#

Certain vLLM CLI arguments are blocked in NIM containers because the nginx proxy or other system components manage them. If you pass a denied argument, NIM logs a warning and ignores it.

| Denied Argument | Reason | NIM Alternative |
|-----------------|--------|-----------------|
| --host | Networking is managed by nginx | (none) |
| --port | Port is managed by nginx | NIM_SERVER_PORT (external) |
| --ssl-keyfile | SSL/TLS is managed by nginx | NIM_SSL_MODE, NIM_SSL_KEY_PATH |
| --ssl-certfile | SSL/TLS is managed by nginx | NIM_SSL_MODE, NIM_SSL_CERTS_PATH |
| --ssl-ca-certs | SSL/TLS is managed by nginx | NIM_SSL_CA_CERTS_PATH |
| --ssl-cert-reqs | SSL/TLS is managed by nginx | NIM_SSL_MODE=MTLS |
| --root-path | API routing is managed by nginx | (none) |
| --api-key | Authentication should be managed externally | (none) |
| --uvicorn-log-level | Logging is managed by NIM | NIM_LOG_LEVEL |
| --middleware | Middleware is managed by nginx | (none) |
| --allowed-origins | CORS should be configured in nginx | NIM_CORS_ALLOW_ORIGINS |
| --allowed-methods | CORS should be configured in nginx | NIM_CORS_ALLOW_METHODS |
| --allowed-headers | CORS should be configured in nginx | NIM_CORS_ALLOW_HEADERS |

Validation and Error Handling#

NIM validates configuration at multiple stages and handles errors differently depending on where validation occurs.

Where Validation Occurs#

| Stage | Validated Item | Error Behavior |
|-------|----------------|----------------|
| CLI Parsing | Argument format, type conversion | Warning logged, invalid arg skipped |
| Env Parsing | Type conversion | Warning logged, invalid value skipped |
| Runtime Config | JSON syntax, type conversion | Warning logged, invalid key skipped |
| Config Merge | Override detection | Warning (or error in strict mode) |
| vLLM Startup | Argument validity, model compatibility | vLLM exits with error |

Invalid Parameter Handling#

  • Invalid NIM parameters: Logged as warnings; the container continues with the remaining valid values.

  • Unknown parameters: Passed through to vLLM as-is. vLLM handles its own validation.

  • Invalid vLLM parameters: vLLM validates and reports errors at startup.

Example Error Messages#

The following examples show representative error messages:

# NIM-side type error (warning, non-fatal)
WARNING: Failed to parse CLI arg '--tensor-parallel-size' with value 'abc': invalid literal for int()

# vLLM-side validation error (fatal)
ValueError: tensor_parallel_size must be a positive integer

Provenance Tracking#

NIM tracks the source of every configuration parameter. At startup, a structured config_resolved JSON event is emitted showing the provenance of each resolved value:

{
  "tensor_parallel_size": {"value": 4, "source": "CLI"},
  "pipeline_parallel_size": {"value": 1, "source": "PROFILE"},
  "gpu_memory_utilization": {"value": 0.9, "source": "ENV"},
  "enable_prefix_caching": {"value": true, "source": "RUNTIME"}
}

When a value is overridden, the previous value and source are also recorded:

{
  "tensor_parallel_size": {
    "value": 4,
    "source": "CLI",
    "previous_value": 2,
    "previous_source": "ENV"
  }
}
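The recording logic can be sketched in a few lines. This is an illustrative model of the provenance structure shown above, not the NIM implementation:

```python
def set_value(store, key, value, source):
    """Record a value and its source; on override, keep the prior entry."""
    entry = {"value": value, "source": source}
    if key in store:
        entry["previous_value"] = store[key]["value"]
        entry["previous_source"] = store[key]["source"]
    store[key] = entry

store = {}
set_value(store, "tensor_parallel_size", 2, "ENV")   # first assignment
set_value(store, "tensor_parallel_size", 4, "CLI")   # higher-priority override
print(store["tensor_parallel_size"])
# {'value': 4, 'source': 'CLI', 'previous_value': 2, 'previous_source': 'ENV'}
```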

Use nim-serve --dry-run to inspect the full provenance report without starting the server.

GPU Memory Management#

NIM includes automatic GPU memory management to prevent out-of-memory (OOM) failures on constrained hardware. This process uses two mechanisms: automatic clamping of gpu_memory_utilization and a post-selection memory warning.

Automatic GPU Memory Clamping#

On certain hardware configurations, NIM automatically reduces the gpu_memory_utilization parameter, which controls what fraction of GPU memory vLLM can use, to prevent OOM errors:

| Condition | Cap | Reason |
|-----------|-----|--------|
| vGPU guest (partitioned GPU) | 0.75 | GPU memory is shared or partitioned across virtual machines |
| UMA device (DGX Spark, GH200) | 0.50 | CPU and GPU share the same physical memory |
| Busy GPU (other processes using memory) | free_ratio - 0.05 | Prevents contention with other workloads |

The clamped value is applied at a low internal priority, so any explicit gpu_memory_utilization setting from environment variables, passthrough arguments, or the --gpu-memory-utilization CLI argument takes precedence. The minimum floor is 0.10 (NIM never sets GPU memory utilization below 10%).
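How the caps in the table combine with the 0.10 floor can be sketched as follows. This is an illustrative model of the rules described above, starting from vLLM's default of 0.9; the function name and parameters are assumptions, not NIM internals:

```python
def auto_gpu_memory_utilization(default=0.9, *, vgpu=False, uma=False,
                                free_ratio=1.0):
    """Value NIM would apply when no explicit setting overrides it (sketch)."""
    cap = 1.0
    if vgpu:                       # partitioned GPU in a VM guest
        cap = min(cap, 0.75)
    if uma:                        # CPU and GPU share physical memory
        cap = min(cap, 0.50)
    if free_ratio < 1.0:           # other processes already hold GPU memory
        cap = min(cap, free_ratio - 0.05)
    return max(0.10, min(default, cap))  # never below the 0.10 floor

print(auto_gpu_memory_utilization(uma=True))                    # 0.5
print(round(auto_gpu_memory_utilization(free_ratio=0.30), 2))   # 0.25
```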

Post-Selection Memory Warning#

After selecting a profile and resolving all configuration, NIM estimates the total VRAM required for the model (weights, KV cache, activations, and overhead) and compares it to available GPU memory. If the estimate exceeds available memory, NIM logs an advisory warning:

WARNING: Estimated VRAM (45.2 GB) exceeds available GPU memory (39.6 GB).
Consider reducing context length with --max-model-len=4096 (estimated 30.1 GB).

This warning is advisory only. NIM proceeds with startup regardless. The suggested --max-model-len value, if provided, indicates a context length that would fit within available memory.

To apply the suggestion, pass --max-model-len as a CLI argument:

docker run --gpus=all \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1 \
  nim-serve --max-model-len 4096

SSL/TLS Configuration#

NIM terminates TLS at the nginx proxy layer. Native vLLM SSL arguments such as --ssl-keyfile and --ssl-certfile are denied. Use the NIM SSL variables instead. Refer to Architecture for more information about the proxy layer.

| Variable | Description | Default |
|----------|-------------|---------|
| NIM_SSL_MODE | SSL mode: DISABLED, TLS, or MTLS | DISABLED |
| NIM_SSL_KEY_PATH | Path to TLS private key | (none) |
| NIM_SSL_CERTS_PATH | Path to TLS certificate | (none) |
| NIM_SSL_CA_CERTS_PATH | Path to CA certificate (required for MTLS) | (none) |

Use one of the following NIM_SSL_MODE values to enable SSL:

  • TLS mode: Server presents a certificate; client certificate is not required. To activate, set NIM_SSL_MODE=TLS. Requires NIM_SSL_CERTS_PATH and NIM_SSL_KEY_PATH.

  • Mutual TLS mode: Both server and client present certificates. To activate, set NIM_SSL_MODE=MTLS. Requires NIM_SSL_CERTS_PATH, NIM_SSL_KEY_PATH, and NIM_SSL_CA_CERTS_PATH.

The following example configures NIM to use TLS mode:

docker run --gpus=all \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  -e NIM_SSL_MODE=TLS \
  -e NIM_SSL_KEY_PATH=/certs/server.key \
  -e NIM_SSL_CERTS_PATH=/certs/server.crt \
  -v /path/to/certs:/certs:ro \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1

CORS Configuration#

CORS is handled by the nginx proxy. The following variables control CORS headers:

| Variable | Description | Default |
|----------|-------------|---------|
| NIM_CORS_ALLOW_ORIGINS | Allowed origins | * |
| NIM_CORS_ALLOW_METHODS | Allowed HTTP methods | GET, POST, PUT, DELETE, PATCH, OPTIONS |
| NIM_CORS_ALLOW_HEADERS | Allowed request headers | Content-Type, Authorization, X-Request-Id, X-Session-Id, X-Correlation-Id |
| NIM_CORS_EXPOSE_HEADERS | Headers exposed to the browser | X-Request-Id |
| NIM_CORS_MAX_AGE | Preflight cache duration (seconds) | 3600 |

Note

vLLM’s --allowed-origins, --allowed-methods, and --allowed-headers arguments are denied in NIM because CORS is managed at the nginx layer. Use the NIM_CORS_* environment variables instead.

Examples#

The following examples show how configuration precedence, override warnings, and passthrough arguments work in practice.

CLI Overrides an Environment Variable#

The following example shows how a direct CLI argument overrides an environment variable:

export NIM_TENSOR_PARALLEL_SIZE=2

docker run --gpus=all \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  -e NIM_TENSOR_PARALLEL_SIZE \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1 \
  nim-serve --tensor-parallel-size 4

This resolves tensor_parallel_size to 4, because the CLI value overrides the environment variable value.

NIM logs the following warning:

WARNING: Config override: 'tensor_parallel_size' changed from 2 (ENV) to 4 (CLI)

Full Priority Chain#

The following example shows how tensor_parallel_size resolves when every source sets a value:

runtime_config.json:        {"tensor_parallel_size": 1, "enable_prefix_caching": true}
Environment:                NIM_TENSOR_PARALLEL_SIZE=2
NIM_PASSTHROUGH_ARGS:       --tensor-parallel-size 3
CLI:                        nim-serve --tensor-parallel-size 4 --no-enable-prefix-caching

The resolved values are:

  • tensor_parallel_size = 4 (CLI overrides passthrough overrides env overrides runtime)

  • enable_prefix_caching = False (CLI --no- overrides runtime true)

Kubernetes with Passthrough Args#

The following Kubernetes example passes vLLM arguments through NIM_PASSTHROUGH_ARGS:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: nim-llm
          image: <NIM_LLM_MODEL_FREE_IMAGE>:2.0.1
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-secret
                  key: api-key
            - name: NIM_MODEL_PATH
              value: "ngc://nim/meta/llama-3.1-8b-instruct"
            - name: NIM_CACHE_PATH
              value: "/opt/nim/.cache"
            - name: NIM_PASSTHROUGH_ARGS
              value: "--enable-prefix-caching --max-num-batched-tokens 8192 --enable-chunked-prefill"
          ports:
            - containerPort: 8000
          livenessProbe:
            httpGet:
              path: /v1/health/live
              port: 8000
          readinessProbe:
            httpGet:
              path: /v1/health/ready
              port: 8000

Enabling TLS#

The following example enables TLS for a container deployment:

docker run --gpus=all \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  -e NIM_SSL_MODE=TLS \
  -e NIM_SSL_KEY_PATH=/certs/server.key \
  -e NIM_SSL_CERTS_PATH=/certs/server.crt \
  -v /path/to/certs:/certs:ro \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1

After the container starts, you can validate the TLS configuration with a command such as the following:

curl --cacert /path/to/certs/ca.crt \
  https://localhost:8000/v1/health/ready