Configure Your NIM with NVIDIA NIM for LLMs#

NVIDIA NIM for LLMs (NIM for LLMs) uses Docker containers under the hood. Each NIM is its own Docker container and there are several ways to configure it. Below is a full reference of all the ways to configure a NIM container.

GPU Selection#

Passing --gpus all to docker run is acceptable in homogeneous environments with one or more of the same GPU.

Note

--gpus all only works if your configuration has the same number of GPUs as specified for the model in Supported Models for NVIDIA NIM for LLMs. Running an inference on a configuration with fewer or more GPUs can result in a runtime error.

In heterogeneous environments with a combination of GPUs (for example: A6000 + a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:

the --gpus flag (ex: --gpus='"device=1"')
the environment variable NVIDIA_VISIBLE_DEVICES (ex: -e NVIDIA_VISIBLE_DEVICES=1)

The device ID(s) to use as input(s) are listed in the output of nvidia-smi -L:

GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)

Refer to the NVIDIA Container Toolkit documentation for more instructions.

How many GPUs do I need?#

Optimized Models#

For models that have been optimized by NVIDIA (LLM-specific NIMs), there are recommended tensor and pipeline parallelism configurations (see Supported Models for NVIDIA NIM for LLMs for details).

Each Profile will have a TP (Tensor Parallelism) and PP (Pipeline Parallelism), decipherable through their readable name (example: tensorrt_llm-trtllm_buildable-bf16-tp8-pp2).

In most cases, you will need TP * PP amount of GPUs to run a specific profile.

For example, for the profile tensorrt_llm-trtllm_buildable-bf16-tp8-pp2 you will need either 2 nodes with 8 GPUs or 2 * 8 = 16 GPUs on one Node.

Other Models#

For supported models, the multi-LLM compatible NIM will attempt to set TP to all the exposed GPUs in the container (see Supported Architectures for Multi-LLM NIM for details). Users can set NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE to specify any arbitrary inference configuration.

In most cases, you will need TP * PP amount of GPUs to run a specific profile. For more information about profiles, refer to Model Profiles in NVIDIA NIM for LLMs.

Shared memory flag#

Passing --shm-size=16GB to docker run. Not required for single GPU models or GPUs with NVLink enabled.

Environment Variables#

The following are the environment variables that you can pass to a NIM (add -e to docker run).

General#

These environment variables apply to both NIM options. For environment variables that are applicable only to certain NIMs, see Multi-LLM NIM and LLM-specific NIMs.

Authentication#

These variables manage service endpoints and authentication for accessing model data. Configure the service port, provide your NGC API key for authentication, and specify a custom model repository for air-gapped or mirrored environments.

NGC_API_KEY

Your personal NGC API key. Required.

NIM_SERVER_PORT

Publish the NIM service to the specified port inside the container. Make sure to adjust the port passed to the -p/--publish flag of docker run to reflect this (for example, -p $NIM_SERVER_PORT:$NIM_SERVER_PORT). The left-hand side of this : is your host address:port, and does NOT have to match with $NIM_SERVER_PORT. The right-hand side of the : is the port inside the container which MUST match NIM_SERVER_PORT (or 8000 if not set). Default value: 8000

NIM_REPOSITORY_OVERRIDE

If set to a non-empty string, the NIM_REPOSITORY_OVERRIDE value replaces the hard-coded location of the repository and the protocol for access to the repository. The structure of the value for this environment variable is as follows: <repository type>://<repository location>. Only the protocols ngc://, s3://, and https:// are supported, and only the first component of the URI is replaced. Default value: None

If the URI in the manifest is ngc://org/meta/llama3-8b-instruct:hf?file=config.json and NIM_REPOSITORY_OVERRIDE=ngc://myrepo.ai/, the domain name for the API endpoint is set to myrepo.ai.
If NIM_REPOSITORY_OVERRIDE=s3://mybucket/, the result of the replacement will be s3://mybucket/nim%2Fmeta%2Fllama3-8b-instruct%3Ahf%3Ffile%3Dconfig.json.
If NIM_REPOSITORY_OVERRIDE=https://mymodel.ai/some_path_optional, the result of the replacement will be https://mymodel.ai/some_path/nim%2Fmeta%2Fllama3-8b-instruct%3Ahf%3Ffile%3Dconfig.json.

This repository override feature supports basic authentication mechanisms:

https assumes authorization using the Authorization header and the credential value in NIM_HTTPS_CREDENTIAL.
ngc requires a credential in the NGC_API_KEY environment variable.
s3 requires the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and (if using temporary credentials) AWS_SESSION_TOKEN.

NIM_PROXY_CONNECTIVITY_TARGETS

A comma-separated list of host names to verify through the proxy when https_proxy is set. These hosts are tested for connectivity during startup to ensure the proxy allows access to required services. If not set, the default list is used. If set to an empty string, no connectivity checks are performed. If connectivity checks fail, verify that your proxy allows connections to these domains. Default value: authn.nvidia.com,api.ngc.nvidia.com,xfiles.ngc.nvidia.com,huggingface.co,cas-bridge.xethub.hf.co

Model Configuration#

These variables define which model the NIM will serve and how it behaves. Use them to select a model profile, load a custom model, set the context length, manage fine-tuning options, and enable features like deterministic outputs.

NIM_MODEL_PROFILE: Override the NIM optimization profile that is automatically selected by specifying a profile ID from the manifest located at /opt/nim/etc/default/model_manifest.yaml. If not specified, NIM will attempt to select an optimal profile compatible with available GPUs. To get a list of the compatible profiles, append list-model-profiles at the end of the docker run command. Using the profile name default, NIM will select a profile that is maximally compatible and may not be optimal for your hardware. Default value: None
NIM_MODEL_NAME: The path to a model directory. For LLM-specific NIMs, set this only if NIM_MANIFEST_ALLOW_UNSAFE is set to 1. For the multi-LLM compatible NIM container, you can specify HuggingFace paths of the form hf://<org>/<model-repo>, such as hf://meta-llama/Meta-Llama-3-8B. Any local directory structure must comply with the model formats supported by the multi-LLM compatible NIM container. Default value: "Model Name"
NIM_SERVED_MODEL_NAME: The model name(s) used in the API. If multiple names are provided (comma-separated), the server will respond to any of them. The model name in the model field of a response will be the first name in this list. If not specified, the model name will be inferred from the manifest located at /opt/nim/etc/default/model_manifest.yaml. Note that this name will also be used in the model_name tag content of Prometheus metrics. If multiple names are provided, the metrics tag will take the first one. Default value: None
NIM_CUSTOM_MODEL_NAME: The model name given to a locally-built engine. If set, the locally-built engine is named NIM_CUSTOM_MODEL_NAME and is cached with the same name in the NIM cache. The name must be unique among all cached custom engines. This cached engine will also be visible with the same name with the list-model-profiles command and will behave like every other profile. On subsequent docker runs, a locally cached engine will take precedence over every other type of profile. You may also set NIM_MODEL_PROFILE to be a specific custom model name to force NIM LLM to serve that cached engine. Default value: None
NIM_MANIFEST_ALLOW_UNSAFE: Set to 1 to enable selection of a model profile not included in the original model_manifest.yaml. If set, you must also set NIM_MODEL_NAME to be the path to the model directory or an NGC path. Default value: None
NIM_MAX_MODEL_LEN: The model context length. If unspecified, will be automatically derived from the model configuration. Note that this setting has an effect on only models running on the TRTLLM backend and models where the selected profile has trtllm-buildable equal to true. In the case where trtllm-buildable is equal to true, the TRT-LLM build parameter max_seq_len will be set to this value. Default value: None
NIM_MAX_NUM_SEQS: The maximum number of sequences processed in a single iteration. Default value: None
NIM_TOKENIZER_MODE: The tokenizer mode. auto will use the fast tokenizer if available. slow will always use the slow tokenizer. Only set to slow if auto does not work for the model. Default value: auto
NIM_FT_MODEL: Points to the path of the custom fine-tuned weights in the container. Default value: None
NIM_ENABLE_PROMPT_LOGPROBS: Set to 1 to enable a buildable path for context logits generation, allowing echo functionality to work with log probabilities and also enable the top_logprobs feature for the response. Default value: None
NIM_FORCE_DETERMINISTIC: Set to 1 to force deterministic builds and enable runtime deterministic behavior. Deterministic mode is only supported on TRT-LLM buildable profiles. Default value: None

Performance#

These variables allow you to optimize the performance and resource utilization of the NIM. You can tune the maximum batch size, manage GPU memory, and configure scheduling policies to strike the right balance between latency, throughput, and system load.

NIM_MAX_BATCH_SIZE: The maximum batch size for TRTLLM inference. If unspecified, will be automatically derived from the detected GPUs. Note that this setting has an effect on only models running on the TRTLLM backend and models where the selected profile has trtllm-buildable equal to true. In the case where trtllm-buildable is equal to true, the TRT-LLM build parameter max_batch_size will be set to this value. Default value: None
NIM_DISABLE_CUDA_GRAPH: Set to 1 to disable the use of CUDA graph. Default value: None
NIM_LOW_MEMORY_MODE: Set to 1 to enable offloading locally-built TRTLLM engines to disk. This reduces runtime host memory requirement, but may increase the startup time and disk usage. Default value: None
NIM_RELAX_MEM_CONSTRAINTS: If set to 1, use the value provided in NIM_NUM_KV_CACHE_SEQ_LENS. The recommended default for NIM LLM is for all GPUs to have >= 95% of memory free. Setting this variable to true overrides this default and runs the model regardless of memory constraints. It also uses heuristics to determine if the GPU will likely meet or fail memory requirements and provides a warning if applicable. If set to 1 and NIM_NUM_KV_CACHE_SEQ_LENS is not specified then NIM_NUM_KV_CACHE_SEQ_LENS is automatically set to 1. Default value: None
NIM_SCHEDULER_POLICY: The runtime scheduler policy to use. The possible values are guarantee_no_evict or max_utilization. Must be set only for the TRTLLM backend. It does not impact any vLLM or SGLang profiles. Default value: guarantee_no_evict
NIM_SDK_MAX_PARALLEL_DOWNLOAD_REQUESTS: The maximum number of parallel download requests when downloading models. Default value: 1

Caching#

These variables control how the NIM caches model artifacts and manages the KV cache for faster inference. You can define the cache location, enable KV cache reuse between requests, and configure host memory offloading to optimize performance for recurring prompts.

NIM_CACHE_PATH: The location in the container where the container caches model artifacts. If this volume is not mounted, the container does a fresh download of the model every time the container starts. Default value: /opt/nim/.cache
NIM_ENABLE_KV_CACHE_REUSE: Set to 1 to enable automatic prefix caching / KV cache reuse. For use cases where large prompts frequently appear and a cache for KV caches across requests would speed up inference. Default value: None
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD: Set to 1 to enable host-based KV cache offloading, 0 to disable. This only takes effect with the TensorRT-LLM backend and if NIM_ENABLE_KV_CACHE_REUSE is set to 1. Leave unset (None) to use the optimal offloading strategy for your system. Default value: None
NIM_KV_CACHE_HOST_MEM_FRACTION: The fraction of free host memory to use for KV cache host offloading. This only takes effect if NIM_ENABLE_KV_CACHE_HOST_OFFLOAD is enabled. Default value: 0.1
NIM_KVCACHE_PERCENT: Percentage of total GPU memory to allocate for the key-value (KV) cache during model inference. Considering a machine with 80GB of GPU memory, where the model weights occupy 60GB, setting NIM_KVCACHE_PERCENT to 0.9 allocates memory as follows: the KV cache receives 80GB × 0.9 − 60GB = 12GB, and intermediate results receive 80GB × (1.0 − 0.9) = 8GB. Default value: 0.9
NIM_NUM_KV_CACHE_SEQ_LENS: Set to a value greater than or equal to 1 to override the default KV cache memory allocation settings for NIM LLM. The specified value is used to determine how many maximum sequence lengths can fit within the KV cache (for example 2 or 3.75). The maximum sequence length is the context size of the model. NIM_RELAX_MEM_CONSTRAINTS must be set to 1 for this environment variable to take effect. Default value: None

PEFT and LoRA#

These variables enable you to serve models with Parameter-Efficient Fine-Tuning (PEFT) LoRA adapters for customized inference. You can specify the source for LoRA modules, configure automatic refreshing, and set limits on the number of adapters that can be loaded into GPU and CPU memory.

NIM_PEFT_SOURCE: If you want to enable PEFT inference with local PEFT modules, then set a NIM_PEFT_SOURCE environment variable and pass that into the run container command. If your PEFT source is a local directory at LOCAL_PEFT_DIRECTORY, mount your local PEFT directory to the container at the path specified by NIM_PEFT_SOURCE. Make sure that your directory only contains PEFT modules for the base NIM. Also make sure that the PEFT directory and all the contents inside it are readable by NIM. Default value: None
NIM_PEFT_REFRESH_INTERVAL: How often to check NIM_PEFT_SOURCE for new and removed models, in seconds. If not set, PEFT cache will not refresh. When enabled, new LoRA adapters are made available and removed adapters become inaccessible for new inference requests without requiring a NIM restart. If you choose to enable PEFT refreshing by setting this environment variable, we recommend setting the number greater than 30. Default value: None
NIM_MAX_GPU_LORAS: The number of LoRAs that can fit in GPU PEFT cache. This is the maximum number of LoRAs that can be used in a single batch. Default value: 8
NIM_MAX_CPU_LORAS: The number of LoRAs that can fit in CPU PEFT cache. This should be set >= max concurrency or the number of LoRAs you are serving, whichever is less. If you have more concurrent LoRA requests than NIM_MAX_CPU_LORAS you may see “cache is full” errors. This value must be >= NIM_MAX_GPU_LORAS. Default value: 16
NIM_MAX_LORA_RANK: The maximum LoRA rank. Default value: 32

SSL / HTTPS#

These variables control SSL/TLS configuration for secure connections to and from the NIM service, and secure model downloads. Use these variables to enable HTTPS, configure certificates, and manage proxy and certificate authority settings.

NIM_SSL_MODE

Specify a value to enable SSL/TLS in served endpoints, or skip the environment variables NIM_SSL_KEY_PATH, NIM_SSL_CERTS_PATH, NIM_SSL_CA_CERTS_PATH. Default value: "DISABLED"

The possible values are as follows:

"DISABLED": No HTTPS
"TLS": HTTPS with only server-side TLS (client certificate not required). TLS requires NIM_SSL_CERTS_PATH and NIM_SSL_KEY_PATH to be set.
"MTLS": HTTPS with mTLS (client certificate required). MTLS requires NIM_SSL_CERTS_PATH, NIM_SSL_KEY_PATH, and NIM_SSL_CA_CERTS_PATH to be set.

NIM_SSL_KEY_PATH

The path to the server’s TLS private key file (required for TLS HTTPS). It’s used to decrypt incoming messages and sign outgoing ones. Required if NIM_SSL_MODE is enabled. Default value: None

NIM_SSL_CERTS_PATH

The path to the server’s certificate file (required for TLS HTTPS). It contains the public key and server identification information. Required if NIM_SSL_MODE is enabled. Default value: None

NIM_SSL_CA_CERTS_PATH

The path to the CA (Certificate Authority) certificate. Required if NIM_SSL_MODE="MTLS". Default value: None

SSL_CERT_FILE

The path to the SSL certificate used for downloading models when NIM is run behind a proxy. The certificate of the proxy must be used together with the https_proxy environment variable. Default value: None

Structured Generation#

These variables control structured generation features, allowing you to enforce specific output formats like JSON or follow a defined grammar. You can select a built-in decoding backend or provide your own custom backend for advanced use cases.

NIM_GUIDED_DECODING_BACKEND: The guided decoding backend to use. Can be one of "xgrammar", "outlines", "lm-format-enforcer" or a custom guided decoding backend. Note: When using SGLang profiles, only “xgrammar” and “outlines” backends are supported. Default value: "xgrammar"
NIM_CUSTOM_GUIDED_DECODING_BACKENDS: The path to a directory of custom guided decoding backend directories see custom guided decoding backend for details. Default value: None
NIM_TRUST_CUSTOM_CODE: Set to 1 to enable custom guided decoding backend. This enables arbitrary Python code execution as part of the custom guided decoding. Default value: None

Reward Models#

These variables control the use of reward models to evaluate or score model-generated responses. You can enable the reward model, specify which model to use, and define the range of logits used for scoring.

NIM_REWARD_MODEL: Set to 1 to enable reward score collection from the model’s response. Default value: None
NIM_REWARD_MODEL_STRING: The reward model string. Default value: None
NIM_REWARD_LOGITS_RANGE: The range in generation logits to extract reward scores. It should be a comma-separated list of two integers. For example, "0,1" means the first logit is the reward score and “3,5” means 4th and 5th logits are the reward score. Default value: None

Logging#

These variables control how the NIM service generates logs. You can adjust the log verbosity, switch to a machine-readable JSON format, and control whether request details are logged.

NIM_LOG_LEVEL: The log level of the NIM for LLMs service. Possible values are DEFAULT, TRACE, DEBUG, INFO, WARNING, ERROR, and CRITICAL. The effects of DEBUG, INFO, WARNING, ERROR, and CRITICAL are described in the Python 3 logging docs. The TRACE log level enables printing of diagnostic information for debugging purposes in TRT-LLM and in uvicorn. When NIM_LOG_LEVEL is DEFAULT, all log levels are set to INFO, except for TRT-LLM log level, which is set to ERROR. When NIM_LOG_LEVEL is CRITICAL, the TRT-LLM log level is set to ERROR. Default value: DEFAULT
NIM_JSONL_LOGGING: Set to 1 to enable JSON-formatted logs. By default, human-readable text logs are enabled. Default value: None
NIM_DISABLE_LOG_REQUESTS: Set to 0 to view logs of request details to v1/completions and v1/chat/completions. These logs contain sensitive attributes of the request including prompt, sampling_params, and prompt_token_ids. You should be aware that these attributes are exposed to the container logs when you set this to 0. Default value: 1

OpenTelemetry#

These variables configure OpenTelemetry integration for observability platforms. You can enable OpenTelemetry instrumentation and configure exporters to send tracing and metrics data to your preferred monitoring solution.

NIM_ENABLE_OTEL: Set to 1 to enable OpenTelemetry instrumentation in NIMs. Default value: None
NIM_OTEL_TRACES_EXPORTER: The OpenTelemetry exporter to use for tracing. Set this flag to otlp to export the traces using the OpenTelemetry Protocol (OTLP). Set it to console to print the traces to the standard output. Default value: console
NIM_OTEL_METRICS_EXPORTER: Similar to NIM_OTEL_TRACES_EXPORTER, but for metrics. Default value: console
NIM_OTEL_SERVICE_NAME: The name of your service, to help with identifying and categorizing data. Default value: None
NIM_OTEL_EXPORTER_OTLP_ENDPOINT: The endpoint where the OpenTelemetry Collector is listening for OTLP data. Adjust the URL to match your OpenTelemetry Collector’s configuration. Default value: None

Multi-LLM NIM#

When using the multi-LLM compatible NIM container with other supported models, the following are additional environment variables that can be used to tune the behavior per the above instructions.

NIM_CHAT_TEMPLATE: The absolute path to the .jinja file that contains the chat template. Useful for instructing the LLM to format the output response in a way that the tool-call parser can understand. Default value: None
NIM_ENABLE_AUTO_TOOL_CHOICE: Set to 1 to enable tool calling functionality. Default value: 0
NIM_PIPELINE_PARALLEL_SIZE: NIM will pick the pipeline parallel size that is provided here. Default value: None
NIM_TENSOR_PARALLEL_SIZE: NIM will pick the tensor parallel size that is provided here. Default value: None
NIM_TOOL_CALL_PARSER: How the model post-processes the LLM response text into a tool call data structure. Possible values are "pythonic", "mistral", "llama3_json", "granite-20b-fc", "granite", "hermes", "jamba", or a custom value. Default value: None
NIM_TOOL_PARSER_PLUGIN: The absolute path of a python file that is a custom tool-call parser. Required when NIM_TOOL_CALL_PARSER is specified with a custom value. Default value: None
NIM_SERVED_MODEL_NAME: The model name(s) used in the API. If multiple names are provided (comma-separated), the server will respond to any of the provided names. The model name in the model field of a response will be the first name in this list. If not specified, the model name will be inferred from the HF URL. For local model paths, NIM will set the absolute path to the local model directory as model name by default. It is highly recommended to set this environment variable in such cases. Note that this name(s) will also be used in model_name tag content of Prometheus metrics, if multiple names provided, metrics tag will take the first one. Default value: None
NIM_FORCE_TRUST_REMOTE_CODE: Set this to 1 to make sure models which require the flag --trust-remote-code have it turned on when using the multi-LLM NIM. For example, Llama Nemotron Super 49B needs this flag enabled when running the model via the multi-LLM NIM. Default value: None

LLM-specific NIMs#

For LLM-specific NIMs downloaded from NVIDIA, the following are additional environment variables that can be used to tune the behavior per the above instructions.

NIM_FT_MODEL: Points to the path of the custom fine-tuned weights in the container. Default value: None
NIM_MANIFEST_ALLOW_UNSAFE: Set to 1 to enable selection of a model profile not included in the original model_manifest.yaml. If set, you must also set NIM_MODEL_NAME to be the path to the model directory or an NGC path. Default value: 0
NIM_SERVED_MODEL_NAME: The model name(s) used in the API. If multiple names are provided (comma-separated), the server will respond to any of the provided names. The model name in the model field of a response will be the first name in this list. If not specified, the model name will be inferred from the manifest located at /opt/nim/etc/default/model_manifest.yaml. Note that this name(s) will also be used in model_name tag content of Prometheus metrics; if multiple names are provided, the metrics tag will take the first one Default value: None

Volumes#

These settings define how to mount local file system paths into the NIM container.

/opt/nim/.cache

This is the default directory where models are downloaded and cached inside the container. Mount a directory from your host machine to this path to preserve the cache between container runs. If this volume is not mounted, the container will download the model every time it starts. You can customize this path with the NIM_CACHE_PATH environment variable.

For example, to use ~/.cache/nim on your host machine as the cache directory:

Create the directory on your host: mkdir -p ~/.cache/nim
Mount the directory by running the docker run command with the -v and -u options: docker run ... -v ~/.cache/nim:/opt/nim/.cache -u $(id -u)