# Configure Your NIM with NVIDIA NIM for LLMs
NVIDIA NIM for LLMs (NIM for LLMs) uses Docker containers under the hood. Each NIM is its own Docker container, and there are several ways to configure it. The following is a full reference of the ways to configure a NIM container.
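All of these configuration mechanisms are passed on the `docker run` command line. As a rough orientation, a minimal sketch of a typical launch follows; the image name, port, API key variable, and cache mount are illustrative placeholders rather than values defined in this reference, and the individual pieces are explained in the sections below.

```bash
# Minimal sketch of a NIM launch (placeholders throughout; see the sections
# below for GPU selection, environment variables, and volumes).
docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/<org>/<model>:<tag>
```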
## GPU Selection
Passing `--gpus all` to `docker run` is acceptable in homogeneous environments with one or more of the same GPU.
**Note:** `--gpus all` only works if your configuration has the same number of GPUs as specified for the model in Supported Models for NVIDIA NIM for LLMs. Running inference on a configuration with fewer or more GPUs can result in a runtime error.
In heterogeneous environments with a combination of GPUs (for example, an A6000 plus a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:

- the `--gpus` flag (for example, `--gpus='"device=1"'`)
- the environment variable `NVIDIA_VISIBLE_DEVICES` (for example, `-e NVIDIA_VISIBLE_DEVICES=1`)
The device IDs to use are listed in the output of `nvidia-smi -L`:
GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
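For example, to expose only the H100 from the listing above, either approach below works; the image name is a placeholder, and depending on how the NVIDIA Container Toolkit is configured you may also need `--runtime=nvidia` when using the environment-variable form.

```bash
# Expose only GPU 0 (the H100 in the listing above) via the --gpus flag.
docker run --gpus='"device=0"' nvcr.io/nim/<org>/<model>:<tag>

# Equivalent selection via NVIDIA_VISIBLE_DEVICES.
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 nvcr.io/nim/<org>/<model>:<tag>
```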
Refer to the NVIDIA Container Toolkit documentation for more instructions.
## How many GPUs do I need?
### Optimized Models
For models that have been optimized by NVIDIA, there are recommended tensor and pipeline parallelism configurations. Each profile has a TP (tensor parallelism) and a PP (pipeline parallelism) value, which can be read from its human-readable name (for example, `tensorrt_llm-trtllm_buildable-bf16-tp8-pp2`). In most cases, you will need `TP * PP` GPUs to run a specific profile.
For example, the profile `tensorrt_llm-trtllm_buildable-bf16-tp8-pp2` requires either 2 nodes with 8 GPUs each, or 2 * 8 = 16 GPUs on a single node.
### Other Models
For other models supported by NIM for LLMs, NIM attempts to set TP to the number of GPUs exposed in the container. You can set `NIM_TENSOR_PARALLEL_SIZE` and `NIM_PIPELINE_PARALLEL_SIZE` to specify an arbitrary inference configuration. In most cases, you will need `TP * PP` GPUs to run a specific profile. For more information about profiles, refer to Model Profiles in NVIDIA NIM for LLMs.
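For instance, the sketch below requests tensor parallelism of 4 with no pipeline parallelism, which means 4 GPUs must be visible inside the container; the image name is a placeholder.

```bash
# Sketch: TP=4, PP=1, so TP * PP = 4 GPUs must be exposed to the container.
docker run --gpus all \
  -e NIM_TENSOR_PARALLEL_SIZE=4 \
  -e NIM_PIPELINE_PARALLEL_SIZE=1 \
  nvcr.io/nim/<org>/<model>:<tag>
```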
## Environment Variables
The following are the environment variables that you can pass to a NIM (`-e` added to `docker run`).
| Variable | Required? | Default | Notes |
|---|---|---|---|
| | Yes | — | Your personal NGC API key for downloading inference containers. For LLM-specific NIMs, set this for downloading NGC models from manifest. |
| | No | | The location in the container where the container caches model artifacts. |
| | No | | The path to a directory of custom guided decoding backend directories. See Custom Guided Decoding Backends with NVIDIA NIM for LLMs for details. |
| | No | | The model name given to a locally-built engine. If set, the locally-built engine is named |
| | No | | Set to |
| | No | | Set to |
| | No | | |
| | No | | Set to |
| | No | | Set to |
| | No | | Set to |
| | No | | Set to |
| | No | | Set to |
| | No | | The guided decoding backend to use. Can be one of |
| | No | | Set to |
| | No | | The fraction of free host memory to use for KV cache host offloading. This only takes effect if |
| | No | | The log level of the NIM for LLMs service. Possible values of the variable are |
| | No | | Set to |
| | No | | The maximum batch size for TRTLLM inference. If unspecified, will be automatically derived from the detected GPUs. Note that this setting has an effect on only models running on the TRTLLM backend and models where the selected profile has |
| | No | | The number of LoRAs that can fit in the CPU PEFT cache. This should be set >= max concurrency or the number of LoRAs you are serving, whichever is less. If you have more concurrent LoRA requests than |
| | No | | The number of LoRAs that can fit in the GPU PEFT cache. This is the maximum number of LoRAs that can be used in a single batch. |
| | No | | The maximum LoRA rank. |
| | No | | The model context length. If unspecified, will be automatically derived from the model configuration. Note that this setting has an effect on only models running on the TRTLLM backend and models where the selected profile has |
| | No | "Model Name" | The path to a model directory. For LLM-specific NIMs, set this only if |
| | No | None | Override the NIM optimization profile that is automatically selected by specifying a profile ID from the manifest located at |
| | No | | Set to a value greater than or equal to |
| | No | | The endpoint where the OpenTelemetry Collector is listening for OTLP data. Adjust the URL to match your OpenTelemetry Collector's configuration. |
| | No | | Similar to |
| | No | | The name of your service, to help with identifying and categorizing data. |
| | No | | The OpenTelemetry exporter to use for tracing. Set this flag to |
| | No | | How often to check |
| | No | | If you want to enable PEFT inference with local PEFT modules, then set a |
| | No | | If set to |
| | No | | If set to a non-empty string, the |
| | No | None | The range in generation logits to extract reward scores. It should be a comma-separated list of two integers. For example, |
| | No | | Set to |
| | No | None | The reward model string. Supported in version 1.10 and later. |
| | No | | The runtime scheduler policy to use. Possible values: |
| | No | | Set to |
| | No | | Publish the NIM service to the specified port inside the container. Make sure to adjust the port passed to the |
| | Required if | | The path to the CA (Certificate Authority) certificate. |
| | Required if | | The path to the server's certificate file (required for TLS HTTPS). It contains the public key and server identification information. |
| | Required if | | The path to the server's TLS private key file (required for TLS HTTPS). It's used to decrypt incoming messages and sign outgoing ones. |
| | No | | Specify a value to enable SSL/TLS in served endpoints or skip environment variables |
| | No | | The tokenizer mode. |
| | No | | Set to |
| | No | | Set this to |
| | No | | The path to the SSL certificate used for downloading models when NIM is run behind a proxy. The certificate of the proxy must be used together with |
| | No | | The maximum number of parallel download requests when downloading models. |
| | No | | Percentage of total GPU memory to allocate for the key-value (KV) cache during model inference. Considering a machine with 80 GB of GPU memory, where the model weights occupy 60 GB, setting |
| | No | | Controls which TensorRT-LLM backend to use. If |
| | No | | When |
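As a sketch of how several of these variables are commonly combined on one command line, the example below sets an API key, a log level, a served model name, and a profile override. The variable names shown (NGC_API_KEY, NIM_LOG_LEVEL, NIM_SERVED_MODEL_NAME, NIM_MODEL_PROFILE) and the image name are assumptions for illustration; confirm the exact names and accepted values against your NIM's documentation and manifest.

```bash
# Sketch only: variable names and the image are assumed, not definitive.
docker run --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -e NIM_LOG_LEVEL=INFO \
  -e NIM_SERVED_MODEL_NAME=my-model \
  -e NIM_MODEL_PROFILE=<profile-id-from-manifest> \
  -p 8000:8000 \
  nvcr.io/nim/<org>/<model>:<tag>
```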
### LLM-specific NIM Environment Variables
For LLM-specific NIMs downloaded from NVIDIA, the following additional environment variables can be used to tune behavior. Pass them to the container as described above.
| Variable | Required? | Default | Notes |
|---|---|---|---|
| | No | | Points to the path of the custom fine-tuned weights in the container. |
| | No | | Set to |
| | No | | The model name(s) used in the API. If multiple names are provided (comma-separated), the server will respond to any of the provided names. The model name in the model field of a response will be the first name in this list. If not specified, the model name will be inferred from the manifest located at |
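For example, a model can be exposed under multiple API names by supplying a comma-separated list; `NIM_SERVED_MODEL_NAME` is an assumed variable name here, and the model aliases and image name are placeholders.

```bash
# Sketch: respond to two names; the first is echoed in the "model" field of
# responses. Variable name, aliases, and image are illustrative assumptions.
docker run --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -e NIM_SERVED_MODEL_NAME="my-model,my-model-alias" \
  nvcr.io/nim/<org>/<model>:<tag>
```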
### LLM-agnostic NIM Environment Variables
When using the LLM-agnostic NIM container with other supported models, the following additional environment variables can be used to tune behavior. Pass them to the container as described above.
| Variable | Required? | Default | Notes |
|---|---|---|---|
| | No | | The absolute path to the |
| | No | | Set to |
| | No | | NIM will pick the pipeline parallel size that is provided here. |
| | No | | NIM will pick the tensor parallel size that is provided here. |
| | No | | How the model post-processes the LLM response text into a tool-call data structure. One of: |
| | No | | The absolute path of a Python file that is a custom tool-call parser. Required when |
| | No | | The model name(s) used in the API. If multiple names are provided (comma-separated), the server will respond to any of the provided names. The model name in the model field of a response will be the first name in this list. If not specified, the model name will be inferred from the HF URL. For local model paths, NIM will set the absolute path to the local model directory as the model name by default. It is highly recommended to set this environment variable in such cases. Note that this name(s) will also be used in |
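As a sketch of pointing the LLM-agnostic container at a locally downloaded checkpoint, the example below mounts the weights into the container, names the model explicitly for the API, and sets the parallelism. `NIM_MODEL_NAME` and `NIM_SERVED_MODEL_NAME` are assumed variable names, and the paths and image name are placeholders.

```bash
# Sketch only: NIM_MODEL_NAME / NIM_SERVED_MODEL_NAME are assumed names;
# paths and the image are placeholders.
docker run --gpus all \
  -v /models/my-model:/models/my-model \
  -e NIM_MODEL_NAME=/models/my-model \
  -e NIM_SERVED_MODEL_NAME=my-model \
  -e NIM_TENSOR_PARALLEL_SIZE=2 \
  -e NIM_PIPELINE_PARALLEL_SIZE=1 \
  nvcr.io/nim/<org>/<model>:<tag>
```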
## Volumes
Local paths can be mounted to the following container paths.
| Container path | Required? | Notes | Docker argument example |
|---|---|---|---|
| | No; however, if this volume is not mounted, the container does a fresh download of the model every time the container starts. | This is the directory to which models are downloaded inside the container. You can access this directory from within the container by adding the | |
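For example, a host directory can be mounted over the container's model cache so downloads persist across restarts. The container path `/opt/nim/.cache` is assumed here as the cache location, running as the current user is shown so the mounted directory stays writable, and the image name is a placeholder.

```bash
# Sketch: persist downloaded model artifacts across container runs.
# /opt/nim/.cache is an assumed cache path; adjust to your NIM's documentation.
mkdir -p ~/.cache/nim
docker run --gpus all \
  -u "$(id -u)" \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v ~/.cache/nim:/opt/nim/.cache \
  nvcr.io/nim/<org>/<model>:<tag>
```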