Configure Your NIM with NVIDIA NIM for LLMs#
NVIDIA NIM for LLMs (NIM for LLMs) uses Docker containers under the hood. Each NIM is its own Docker container, and there are several ways to configure it. The following is a full reference of the ways to configure a NIM container.
GPU Selection#
Passing --gpus all to docker run is acceptable in homogeneous environments with one or more identical GPUs.
Note
--gpus all only works if your configuration has the same number of GPUs as specified for the model in Supported Models for NVIDIA NIM for LLMs.
Running inference on a configuration with fewer or more GPUs can result in a runtime error.
In heterogeneous environments with a combination of GPUs (for example: A6000 + a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:
- the --gpus flag (for example: --gpus='"device=1"')
- the environment variable NVIDIA_VISIBLE_DEVICES (for example: -e NVIDIA_VISIBLE_DEVICES=1)
The device ID(s) to use as input(s) are listed in the output of nvidia-smi -L:
GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
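For example, to expose only the H100 from the listing above, either of the following invocations works. This is a minimal sketch: $IMG_NAME is a placeholder for your NIM container image, and the second form assumes the NVIDIA Container Toolkit runtime is configured.

docker run --rm --gpus='"device=0"' -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME

docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME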
Refer to the NVIDIA Container Toolkit documentation for more instructions.
How Many GPUs Do I Need?#
Optimized Models#
For models that have been optimized by NVIDIA (LLM-specific NIMs), there are recommended tensor and pipeline parallelism configurations (refer to Supported Models for NVIDIA NIM for LLMs for details).
Each profile has a TP (tensor parallelism) and PP (pipeline parallelism) value, which you can read from its human-readable name (for example: tensorrt_llm-trtllm_buildable-bf16-tp8-pp2).
In most cases, you need TP * PP GPUs to run a specific profile.
For example, the profile tensorrt_llm-trtllm_buildable-bf16-tp8-pp2 requires either 2 nodes with 8 GPUs each or 2 * 8 = 16 GPUs on one node.
Other Models#
For supported models, the multi-LLM compatible NIM attempts to set TP to the number of GPUs exposed in the container (refer to Supported Architectures for Multi-LLM NIM for details). You can set NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE to specify an arbitrary inference configuration, as shown in the sketch below.
In most cases, you need TP * PP GPUs to run a specific profile. For more information about profiles, refer to Model Profiles in NVIDIA NIM for LLMs.
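For example, a hypothetical multi-LLM NIM run on an 8-GPU node split across tensor and pipeline parallelism might look like the following. The parallelism values are illustrative, and $IMG_NAME is a placeholder for your NIM container image.

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_TENSOR_PARALLEL_SIZE=4 \
  -e NIM_PIPELINE_PARALLEL_SIZE=2 \
  $IMG_NAME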
Environment Variables#
The following are the environment variables that you can pass to a NIM (add each one to docker run with the -e flag).
General#
These environment variables apply to both NIM options. For environment variables that are applicable only to certain NIMs, see Multi-LLM NIM and LLM-specific NIMs.
Remote Model Repository#
These variables manage downloading of model data. Provide your NGC API key for authentication, and specify a custom model repository for air-gapped or mirrored environments.
NGC_API_KEY
Your personal NGC API key. Required.

HF_TOKEN
A Hugging Face token used to download models from Hugging Face. Refer to the Hugging Face Hub documentation for more information. The value must be a non-empty string. Default value: None

HF_ENDPOINT
The Hugging Face Hub base URL. You might want to set this variable if your organization is using a Private Hub. Refer to the Hugging Face Hub documentation for more information. Default value: None

AWS_ENDPOINT_URL
The Oracle Cloud Infrastructure (OCI) Object Storage S3-compatible endpoint. For example, AWS_ENDPOINT_URL=https://<namespace>.compat.objectstorage.<region>.oraclecloud.com.

AWS_REGION
Specifies the AWS region used as part of the credentials to authenticate the user. See the AWS documentation for more information. Can be used for downloading models from AWS; see the NIM_REPOSITORY_OVERRIDE variable description. The value must be a non-empty string. Default value: None

AWS_ACCESS_KEY_ID
Specifies the AWS access key used as part of the credentials to authenticate the user. See the AWS documentation for more information. Can be used for downloading models from AWS; see the NIM_REPOSITORY_OVERRIDE variable description. The value must be a non-empty string. Default value: None

AWS_SECRET_ACCESS_KEY
Specifies the AWS secret access key used as part of the credentials to authenticate the user. See the AWS documentation for more information. Can be used for downloading models from AWS; see the NIM_REPOSITORY_OVERRIDE variable description. The value must be a non-empty string. Default value: None

AWS_SESSION_TOKEN
Specifies an AWS session token used as part of the credentials to authenticate the user. See the AWS documentation for more information. Can be used for downloading models from AWS; see the NIM_REPOSITORY_OVERRIDE variable description. The value must be a non-empty string. Default value: None

NIM_REPOSITORY_OVERRIDE
If set to a non-empty string, the NIM_REPOSITORY_OVERRIDE value replaces the hard-coded location of the repository and the protocol used to access it. The structure of the value is <repository type>://<repository location>. Only the protocols ngc://, s3://, and https:// are supported, and only the first component of the URI is replaced. Default value: None
- If the URI in the manifest is ngc://org/meta/llama3-8b-instruct:hf?file=config.json and NIM_REPOSITORY_OVERRIDE=ngc://myrepo.ai/, the domain name for the API endpoint is set to myrepo.ai.
- If NIM_REPOSITORY_OVERRIDE=s3://mybucket/, the result of the replacement is s3://mybucket/nim%2Fmeta%2Fllama3-8b-instruct%3Ahf%3Ffile%3Dconfig.json.
- If NIM_REPOSITORY_OVERRIDE=https://mymodel.ai/some_path_optional, the result of the replacement is https://mymodel.ai/some_path/nim%2Fmeta%2Fllama3-8b-instruct%3Ahf%3Ffile%3Dconfig.json.

This repository override feature supports basic authentication mechanisms:
- https assumes authorization using the Authorization header and the credential value in NIM_HTTPS_CREDENTIAL.
- ngc requires a credential in the NGC_API_KEY environment variable.
- s3 requires the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and (if using temporary credentials) AWS_SESSION_TOKEN.
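As a sketch, pulling model data from an S3-compatible mirror might combine these variables as follows. The bucket name, region, endpoint, and $IMG_NAME are placeholders, and the credentials are assumed to be exported in the host environment.

docker run --rm --gpus all \
  -e NIM_REPOSITORY_OVERRIDE=s3://mybucket/ \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_SESSION_TOKEN \
  -e AWS_REGION=<region> \
  -e AWS_ENDPOINT_URL=https://<namespace>.compat.objectstorage.<region>.oraclecloud.com \
  $IMG_NAME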
Local Model Cache#
These variables manage the location of the model cache. Downloaded models and custom profiles are saved to the cache. Mount a cache volume and provide the path to the volume in the NIM_CACHE_PATH environment variable.
NIM_CACHE_PATH
The location in the container where the container caches model artifacts. If this volume is not mounted, the container does a fresh download of the model every time the container starts. Default value: /opt/nim/.cache

HF_HOME
The location of the cached files downloaded from Hugging Face. Default value: ${NIM_CACHE_PATH}/huggingface/hub
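For example, a sketch that stores the cache under a custom container path and persists it on the host. The container path /opt/nim/custom-cache and $IMG_NAME are arbitrary placeholders; the host directory ~/.cache/nim is assumed to exist and be writable by the container user.

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_CACHE_PATH=/opt/nim/custom-cache \
  -v ~/.cache/nim:/opt/nim/custom-cache \
  -u $(id -u) \
  $IMG_NAME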
HTTPS Proxy#
These variables manage downloading of model data through a proxy. Provide the proxy URL and, if the proxy uses a self-signed certificate, the certificate before startup.
https_proxy
The URL of the proxy through which outgoing requests are routed. Can be used to download a model when NIM runs behind a corporate proxy. Default value: None

SSL_CERT_FILE
The path to the SSL certificate used for downloading models when NIM is run behind a proxy. The certificate of the proxy must be used together with the https_proxy environment variable. Default value: None

NIM_PROXY_CONNECTIVITY_TARGETS
A comma-separated list of host names to verify through the proxy when https_proxy is set. These hosts are tested for connectivity during startup to ensure the proxy allows access to required services. If not set, the default list is used. If set to an empty string, no connectivity checks are performed. If connectivity checks fail, verify that your proxy allows connections to these domains. Default value: authn.nvidia.com,api.ngc.nvidia.com,xfiles.ngc.nvidia.com,huggingface.co,cas-bridge.xethub.hf.co
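A sketch of running behind a corporate proxy follows. The proxy URL, the host certificate path /etc/ssl/proxy-ca.pem, and $IMG_NAME are placeholders.

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e https_proxy=http://proxy.example.com:3128 \
  -e SSL_CERT_FILE=/etc/ssl/proxy-ca.pem \
  -v /etc/ssl/proxy-ca.pem:/etc/ssl/proxy-ca.pem:ro \
  $IMG_NAME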
Model Configuration#
These variables define which model the NIM will serve and how it behaves. Use them to select a model profile, load a custom model, set the context length, manage fine-tuning options, and enable features like deterministic outputs.
NIM_MODEL_PROFILE
Override the automatically selected NIM optimization profile by specifying a profile ID from the manifest located at /opt/nim/etc/default/model_manifest.yaml. If not specified, NIM attempts to select an optimal profile compatible with the available GPUs. To get a list of the compatible profiles, append list-model-profiles to the end of the docker run command. With the profile name default, NIM selects a profile that is maximally compatible and may not be optimal for your hardware. Default value: None

NIM_MODEL_NAME
The path to a model directory. The behavior depends on the NIM type:
- LLM-specific NIMs: Use this variable only if NIM_MANIFEST_ALLOW_UNSAFE is set to 1. Any local directory structure must comply with the model formats supported by the multi-LLM compatible NIM container.
- Multi-LLM compatible NIM container: You can set NIM_MODEL_NAME to one of the following paths:
  - Hugging Face: hf://<org>/<model-repo> (for example: hf://meta-llama/Meta-Llama-3-8B)
  - Oracle Cloud Infrastructure using the Amazon S3 Compatibility API: s3repo://<org>/<model-repo>[:<version>] (for example: s3repo://meta-llama/Meta-Llama-3-8B or s3repo://meta-llama/Meta-Llama-3-8B:1.14) or s3repo://<bucket>/<org>/<model-name>[:<version>] (for example: s3repo://llama-models/meta/llama-3.1-8b or s3repo://llama-models/meta/llama-3.1-8b:1.14)
Default value: "Model Name"

NIM_SERVED_MODEL_NAME
The model name(s) used in the API. If multiple names are provided (comma-separated), the server responds to any of them. The model name in the model field of a response matches the name in the request. If not specified, the model name is inferred from the manifest located at /opt/nim/etc/default/model_manifest.yaml. Note that this name is also used in the model_name tag of Prometheus metrics; if multiple names are provided, the metrics tag takes the first one. Default value: None

NIM_SERVER_PORT
Publish the NIM service to the specified port inside the container. Make sure to adjust the port passed to the -p/--publish flag of docker run to reflect this (for example, -p $NIM_SERVER_PORT:$NIM_SERVER_PORT). The left-hand side of this : is your host address:port and does NOT have to match $NIM_SERVER_PORT. The right-hand side of the : is the port inside the container, which MUST match NIM_SERVER_PORT (or 8000 if not set). Default value: 8000

NIM_CUSTOM_MODEL_NAME
The model name given to a locally-built engine. If set, the locally-built engine is named NIM_CUSTOM_MODEL_NAME and is cached under the same name in the NIM cache. The name must be unique among all cached custom engines. This cached engine is also visible under the same name in the list-model-profiles command and behaves like every other profile. On subsequent docker runs, a locally cached engine takes precedence over every other type of profile. You can also set NIM_MODEL_PROFILE to a specific custom model name to force NIM LLM to serve that cached engine. Default value: None

NIM_MANIFEST_ALLOW_UNSAFE
Set to 1 to enable selection of a model profile not included in the original model_manifest.yaml. If set, you must also set NIM_MODEL_NAME to the path to the model directory or an NGC path. Default value: None

NIM_MAX_MODEL_LEN
The model context length. If unspecified, it is automatically derived from the model configuration. Note that this setting only affects models running on the TRTLLM backend where the selected profile has trtllm-buildable equal to true; in that case, the TRT-LLM build parameter max_seq_len is set to this value. Default value: None

NIM_MAX_NUM_SEQS
The maximum number of sequences processed in a single iteration. Default value: None

NIM_TOKENIZER_MODE
The tokenizer mode. auto uses the fast tokenizer if available; slow always uses the slow tokenizer. Only set to slow if auto does not work for the model. Default value: auto

NIM_FT_MODEL
Points to the path of the custom fine-tuned weights in the container. Default value: None

NIM_ENABLE_PROMPT_LOGPROBS
Set to 1 to enable a buildable path for context logits generation, allowing the echo functionality to work with log probabilities and enabling the top_logprobs feature in the response. Default value: None

NIM_FORCE_DETERMINISTIC
Set to 1 to force deterministic builds and enable runtime deterministic behavior. Deterministic mode is only supported on TRT-LLM buildable profiles. Default value: None
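Putting a few of these together: first list the profiles compatible with your GPUs, then pin one of them together with the other settings. This is a minimal sketch; <profile_id>, the illustrative values, and $IMG_NAME are placeholders.

docker run --rm --gpus all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile_id> \
  -e NIM_MAX_MODEL_LEN=8192 \
  -e NIM_SERVED_MODEL_NAME=my-llm \
  -e NIM_SERVER_PORT=9000 \
  -p 9000:9000 \
  $IMG_NAME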
Performance and Resource Management#
These variables allow you to optimize the performance and resource utilization of the NIM. You can tune the maximum batch size, manage GPU memory, and configure scheduling policies to strike the right balance between latency, throughput, and system load.
NIM_MAX_BATCH_SIZE
The maximum batch size for TRTLLM inference. If unspecified, it is automatically derived from the detected GPUs. Note that this setting only affects models running on the TRTLLM backend where the selected profile has trtllm-buildable equal to true; in that case, the TRT-LLM build parameter max_batch_size is set to this value. Default value: None

NIM_DISABLE_CUDA_GRAPH
Set to 1 to disable the use of CUDA graphs. Default value: None

NIM_LOW_MEMORY_MODE
Set to 1 to enable offloading locally-built TRTLLM engines to disk. This reduces the runtime host memory requirement, but may increase startup time and disk usage. Default value: None

NIM_RELAX_MEM_CONSTRAINTS
If set to 1, use the value provided in NIM_NUM_KV_CACHE_SEQ_LENS. The recommended default for NIM LLM is for all GPUs to have >= 95% of memory free. Setting this variable to 1 overrides this default and runs the model regardless of memory constraints. It also uses heuristics to determine whether the GPU is likely to meet or fail memory requirements and provides a warning if applicable. If set to 1 and NIM_NUM_KV_CACHE_SEQ_LENS is not specified, then NIM_NUM_KV_CACHE_SEQ_LENS is automatically set to 1. Default value: None

NIM_SCHEDULER_POLICY
The runtime scheduler policy to use. The possible values are guarantee_no_evict or max_utilization. Must be set only for the TRTLLM backend; it does not impact any vLLM or SGLang profiles. Default value: guarantee_no_evict

NIM_SDK_MAX_PARALLEL_DOWNLOAD_REQUESTS
The maximum number of parallel download requests when downloading models. Default value: 1

NIM_ENABLE_KV_CACHE_REUSE
Set to 1 to enable automatic prefix caching / KV cache reuse. Useful for use cases where large prompts frequently reappear and reusing KV caches across requests would speed up inference. Default value: None

NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
Set to 1 to enable host-based KV cache offloading, or 0 to disable it. This only takes effect with the TensorRT-LLM backend and if NIM_ENABLE_KV_CACHE_REUSE is set to 1. Leave unset (None) to use the optimal offloading strategy for your system. Default value: None

NIM_KV_CACHE_HOST_MEM_FRACTION
The fraction of free host memory to use for KV cache host offloading. This only takes effect if NIM_ENABLE_KV_CACHE_HOST_OFFLOAD is enabled. Default value: 0.1

NIM_KVCACHE_PERCENT
The percentage of total GPU memory to allocate for the key-value (KV) cache during model inference. Consider a machine with 80GB of GPU memory where the model weights occupy 60GB: setting NIM_KVCACHE_PERCENT to 0.9 allocates memory as follows: the KV cache receives 80GB × 0.9 − 60GB = 12GB, and intermediate results receive 80GB × (1.0 − 0.9) = 8GB. Default value: 0.9

NIM_NUM_KV_CACHE_SEQ_LENS
Set to a value greater than or equal to 1 to override the default KV cache memory allocation settings for NIM LLM. The specified value determines how many maximum sequence lengths can fit within the KV cache (for example, 2 or 3.75). The maximum sequence length is the context size of the model. NIM_RELAX_MEM_CONSTRAINTS must be set to 1 for this environment variable to take effect. Default value: None
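As a sketch, a deployment with frequently repeated prompts and a larger KV cache share might be launched like this. The values are illustrative only, and $IMG_NAME is a placeholder for your NIM container image.

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_ENABLE_KV_CACHE_REUSE=1 \
  -e NIM_KVCACHE_PERCENT=0.95 \
  -e NIM_MAX_BATCH_SIZE=64 \
  -e NIM_LOW_MEMORY_MODE=1 \
  $IMG_NAME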
PEFT and LoRA#
These variables enable you to serve models with Parameter-Efficient Fine-Tuning (PEFT) LoRA adapters for customized inference. You can specify the source for LoRA modules, configure automatic refreshing, and set limits on the number of adapters that can be loaded into GPU and CPU memory.
NIM_PEFT_SOURCE
To enable PEFT inference with local PEFT modules, set the NIM_PEFT_SOURCE environment variable and pass it into the run container command. If your PEFT source is a local directory at LOCAL_PEFT_DIRECTORY, mount your local PEFT directory to the container at the path specified by NIM_PEFT_SOURCE. Make sure that your directory contains only PEFT modules for the base NIM, and that the PEFT directory and all of its contents are readable by NIM. Default value: None

NIM_PEFT_REFRESH_INTERVAL
How often to check NIM_PEFT_SOURCE for new and removed models, in seconds. If not set, the PEFT cache does not refresh. When enabled, new LoRA adapters become available and removed adapters become inaccessible for new inference requests without requiring a NIM restart. If you enable PEFT refreshing by setting this environment variable, we recommend a value greater than 30. Default value: None

NIM_MAX_GPU_LORAS
The number of LoRAs that can fit in the GPU PEFT cache. This is the maximum number of LoRAs that can be used in a single batch. Default value: 8

NIM_MAX_CPU_LORAS
The number of LoRAs that can fit in the CPU PEFT cache. This should be set to >= the maximum concurrency or the number of LoRAs you are serving, whichever is less. If you have more concurrent LoRA requests than NIM_MAX_CPU_LORAS, you may see "cache is full" errors. This value must be >= NIM_MAX_GPU_LORAS. Default value: 16

NIM_MAX_LORA_RANK
The maximum LoRA rank. Default value: 32
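For example, a sketch that serves LoRA adapters from a local directory and refreshes the PEFT cache every 60 seconds. The in-container path /opt/nim/loras, the host variable LOCAL_PEFT_DIRECTORY, and $IMG_NAME are placeholders.

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_PEFT_SOURCE=/opt/nim/loras \
  -e NIM_PEFT_REFRESH_INTERVAL=60 \
  -e NIM_MAX_GPU_LORAS=8 \
  -v $LOCAL_PEFT_DIRECTORY:/opt/nim/loras:ro \
  $IMG_NAME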
SSL/TLS Configuration#
These variables control SSL/TLS configuration for secure connections to and from the NIM service. Use these variables to enable HTTPS, configure certificates, and manage proxy and certificate authority settings.
NIM_SSL_MODE
Specify a value to enable SSL/TLS on served endpoints. When disabled, you can skip the environment variables NIM_SSL_KEY_PATH, NIM_SSL_CERTS_PATH, and NIM_SSL_CA_CERTS_PATH. Default value: "DISABLED"
The possible values are as follows:
- "DISABLED": No HTTPS.
- "TLS": HTTPS with only server-side TLS (client certificate not required). TLS requires NIM_SSL_CERTS_PATH and NIM_SSL_KEY_PATH to be set.
- "MTLS": HTTPS with mTLS (client certificate required). MTLS requires NIM_SSL_CERTS_PATH, NIM_SSL_KEY_PATH, and NIM_SSL_CA_CERTS_PATH to be set.

NIM_SSL_KEY_PATH
The path to the server's TLS private key file (required for TLS HTTPS). It is used to decrypt incoming messages and sign outgoing ones. Required if NIM_SSL_MODE is enabled. Default value: None

NIM_SSL_CERTS_PATH
The path to the server's certificate file (required for TLS HTTPS). It contains the public key and server identification information. Required if NIM_SSL_MODE is enabled. Default value: None

NIM_SSL_CA_CERTS_PATH
The path to the CA (Certificate Authority) certificate. Required if NIM_SSL_MODE="MTLS". Default value: None
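A sketch of enabling server-side TLS, assuming the key and certificate live under /etc/nim/certs on the host. The file names, paths, and $IMG_NAME are placeholders.

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_SSL_MODE=TLS \
  -e NIM_SSL_KEY_PATH=/etc/nim/certs/server.key \
  -e NIM_SSL_CERTS_PATH=/etc/nim/certs/server.crt \
  -v /etc/nim/certs:/etc/nim/certs:ro \
  -p 8000:8000 \
  $IMG_NAME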
Structured Generation#
These variables control structured generation features, allowing you to enforce specific output formats like JSON or follow a defined grammar. You can select a built-in decoding backend or provide your own custom backend for advanced use cases.
NIM_GUIDED_DECODING_BACKEND
The guided decoding backend to use. Can be one of "auto", "guidance", "xgrammar", "outlines", "lm-format-enforcer", or a custom guided decoding backend. Note: when using SGLang profiles, only the "xgrammar" and "outlines" backends are supported. Default value: "xgrammar"

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
The path to a directory of custom guided decoding backend directories; see custom guided decoding backend for details. Default value: None

NIM_TRUST_CUSTOM_CODE
Set to 1 to enable a custom guided decoding backend. This enables arbitrary Python code execution as part of the custom guided decoding. Default value: None
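For example, a sketch that switches the built-in backend, and another that points NIM at a directory of custom backends. The host path ~/my-backends, the in-container path /opt/nim/custom-backends, and $IMG_NAME are placeholders.

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_GUIDED_DECODING_BACKEND=outlines \
  $IMG_NAME

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_CUSTOM_GUIDED_DECODING_BACKENDS=/opt/nim/custom-backends \
  -e NIM_TRUST_CUSTOM_CODE=1 \
  -v ~/my-backends:/opt/nim/custom-backends:ro \
  $IMG_NAME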
Thinking Budget Control#
Thinking budget control is a feature that caps the number of “thinking” tokens a model may produce. Refer to the dedicated guide, Thinking Budget Control, for usage details.
NIM_ENABLE_BUDGET_CONTROL
Master switch. Set to 1 to register the built-in budget_control backend. Default value: None

NIM_BUDGET_CONTROL_THINKING_START_STRING
The string that begins the thinking phase. Leave unset to use the default <think> tag. If the model's chat template already includes it, set this to an empty string "". Default value: <think>

NIM_BUDGET_CONTROL_THINKING_STOP_STRING
The string that terminates the thinking phase. Leave unset to use the default </think> tag. Default value: </think>
Reward Models#
These variables control the use of reward models to evaluate or score model-generated responses. You can enable the reward model, specify which model to use, and define the range of logits used for scoring.
NIM_REWARD_MODEL
Set to 1 to enable reward score collection from the model's response. Default value: None

NIM_REWARD_MODEL_STRING
The reward model string. Default value: None

NIM_REWARD_LOGITS_RANGE
The range of generation logits from which to extract reward scores. It should be a comma-separated list of two integers. For example, "0,1" means the first logit is the reward score, and "3,5" means the 4th and 5th logits are the reward scores. Default value: None
Logging#
These variables control how the NIM service generates logs. You can adjust the log verbosity, switch to a machine-readable JSON format, and control whether request details are logged.
NIM_LOG_LEVEL
The log level of the NIM for LLMs service. Possible values are DEFAULT, TRACE, DEBUG, INFO, WARNING, ERROR, and CRITICAL. The effects of DEBUG, INFO, WARNING, ERROR, and CRITICAL are described in the Python 3 logging docs. The TRACE log level enables printing of diagnostic information for debugging purposes in TRT-LLM and in uvicorn. When NIM_LOG_LEVEL is DEFAULT, all log levels are set to INFO, except for the TRT-LLM log level, which is set to ERROR. When NIM_LOG_LEVEL is CRITICAL, the TRT-LLM log level is set to ERROR. Default value: DEFAULT

NIM_JSONL_LOGGING
Set to 1 to enable JSON-formatted logs. By default, human-readable text logs are enabled. Default value: None

NIM_DISABLE_LOG_REQUESTS
Set to 0 to log request details sent to v1/completions and v1/chat/completions. These logs contain sensitive attributes of the request, including prompt, sampling_params, and prompt_token_ids. Be aware that these attributes are exposed in the container logs when you set this to 0. Default value: 1
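For example, a debugging sketch that raises verbosity, switches to JSON logs, and (with the caveat above about sensitive attributes) logs request details. $IMG_NAME is a placeholder for your NIM container image.

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_LOG_LEVEL=DEBUG \
  -e NIM_JSONL_LOGGING=1 \
  -e NIM_DISABLE_LOG_REQUESTS=0 \
  $IMG_NAME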
OpenTelemetry#
These variables configure OpenTelemetry integration for observability platforms. You can enable OpenTelemetry instrumentation and configure exporters to send tracing and metrics data to your preferred monitoring solution.
NIM_ENABLE_OTEL
Set to 1 to enable OpenTelemetry instrumentation in NIMs. Default value: None

OTEL_TRACES_EXPORTER or NIM_OTEL_TRACES_EXPORTER (deprecated)
The OpenTelemetry exporter to use for tracing. Set either of these variables to otlp to export the traces using the OpenTelemetry Protocol (OTLP), or to console to print the traces to standard output. Default value: console. The OTLP traces endpoint can be set using any of the following environment variables (in precedence order):
- OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: Endpoint URL for trace data only, with an optionally-specified port number. Typically ends with v1/traces when using OTLP/HTTP. Default value: None
- OTEL_EXPORTER_OTLP_ENDPOINT: A base endpoint URL for any signal type, with an optionally-specified port number. Helpful when you send more than one signal to the same endpoint and want one environment variable to control the endpoint. Default value: None
- NIM_OTEL_EXPORTER_OTLP_ENDPOINT (deprecated): Same as OTEL_EXPORTER_OTLP_ENDPOINT; it was the only option in older NIM versions. Use OTEL_EXPORTER_OTLP_ENDPOINT instead. Default value: None

OTEL_METRICS_EXPORTER or NIM_OTEL_METRICS_EXPORTER (deprecated)
The OpenTelemetry exporter to use for metrics. Set either of these variables to otlp to export the metrics using the OpenTelemetry Protocol (OTLP), or to console to print the metrics to standard output. Default value: console. The OTLP metrics endpoint can be set using any of the following environment variables (in precedence order): OTEL_EXPORTER_OTLP_METRICS_ENDPOINT, OTEL_EXPORTER_OTLP_ENDPOINT, NIM_OTEL_EXPORTER_OTLP_ENDPOINT.
- OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: Endpoint URL for metric data only, with an optionally-specified port number. Typically ends with v1/metrics when using OTLP/HTTP. Default value: None

OTEL_LOGS_EXPORTER or NIM_OTEL_LOGS_EXPORTER (deprecated)
Similar to the traces and metrics exporters, but OTEL_EXPORTER_OTLP_LOGS_ENDPOINT is not supported. Set either of these variables to otlp to export the logs using the OpenTelemetry Protocol (OTLP). If OTEL_EXPORTER_OTLP_LOGS_ENDPOINT is set, an exception is raised, which terminates the NIM container. Default value: console. The OTLP logs endpoint can be set using OTEL_EXPORTER_OTLP_ENDPOINT or NIM_OTEL_EXPORTER_OTLP_ENDPOINT.

NIM_OTEL_SERVICE_NAME
The name of your service, to help with identifying and categorizing data. Default value: None
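A sketch that exports traces and metrics over OTLP to a collector. The collector URL, the service name, and $IMG_NAME are placeholders.

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_ENABLE_OTEL=1 \
  -e OTEL_TRACES_EXPORTER=otlp \
  -e OTEL_METRICS_EXPORTER=otlp \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318 \
  -e NIM_OTEL_SERVICE_NAME=my-nim \
  $IMG_NAME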
Experimental#
Important
The per-request metrics feature and the TRT-LLM PyTorch backend are experimental and subject to change in future releases.
TRT-LLM PyTorch Backend#
The following environment variables control experimental support for the TRT-LLM PyTorch backend, which enables compatibility with a wider range of models and may offer performance improvements.
NIM_DISABLE_TRTLLM_PYTORCH_RT
Set to 0 to enable the TRT-LLM PyTorch RT backend. This feature additionally requires NIM_USE_TRTLLM_LEGACY_BACKEND to be disabled. This mode is required to support models such as DeepSeek R1, Llama 4, and others through the multi-LLM NIM container with the TRT-LLM backend. There may also be a performance benefit for other models. Default value: 1

NIM_USE_TRTLLM_LEGACY_BACKEND
Set to 0 to disable the TRT-LLM legacy backend. Default value: 1
Per-Request Metrics#
This variable enables detailed per-request performance metrics in API responses, providing granular insights into token processing, timing, and throughput for individual requests.
NIM_PER_REQ_METRICS_ENABLE
Set to 1 to enable per-request metrics in API responses. When enabled, /v1/completions and /v1/chat/completions responses include a stats field with detailed timing and token metrics for each request. This is an experimental feature that may change in future releases. Default value: 0
Note
Per-request metrics provide different timing granularity depending on the backend:
- vLLM: High precision with queue timing separation and pure LLM processing metrics
- TensorRT-LLM: Estimation-based with end-to-end timing (includes queue time)
For more details, see here.
Multi-LLM NIM#
When using the multi-LLM compatible NIM container with other supported models, the following are additional environment variables that can be used to tune the behavior per the above instructions.
NIM_CHAT_TEMPLATE
The absolute path to the .jinja file that contains the chat template. Useful for instructing the LLM to format the output response in a way that the tool-call parser can understand. Default value: None

NIM_ENABLE_AUTO_TOOL_CHOICE
Set to 1 to enable tool calling functionality. Default value: 0

NIM_PIPELINE_PARALLEL_SIZE
NIM uses the pipeline parallel size provided here. Default value: None

NIM_TENSOR_PARALLEL_SIZE
NIM uses the tensor parallel size provided here. Default value: None

NIM_TOOL_CALL_PARSER
How the model post-processes the LLM response text into a tool call data structure. Possible values are "pythonic", "mistral", "llama3_json", "granite-20b-fc", "granite", "hermes", "jamba", or a custom value. Default value: None

NIM_TOOL_PARSER_PLUGIN
The absolute path of a Python file that implements a custom tool-call parser. Required when NIM_TOOL_CALL_PARSER is specified with a custom value. Default value: None

NIM_SERVED_MODEL_NAME
The model name(s) used in the API. If multiple names are provided (comma-separated), the server responds to any of the provided names. The model name in the model field of a response matches the name in the request. If not specified, the model name is inferred from the HF URL. For local model paths, NIM sets the absolute path to the local model directory as the model name by default; it is highly recommended to set this environment variable in such cases. Note that this name is also used in the model_name tag of Prometheus metrics; if multiple names are provided, the metrics tag takes the first one. Default value: None

NIM_FORCE_TRUST_REMOTE_CODE
Set to 1 to make sure that models which require the --trust-remote-code flag have it turned on when using the multi-LLM NIM. For example, Llama Nemotron Super 49B needs this flag enabled when running the model via the multi-LLM NIM. Default value: None
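Putting several of these together, a hypothetical multi-LLM NIM run that pulls a model from Hugging Face and enables tool calling might look like the following sketch. The model, parser choice, served name, and $IMG_NAME are placeholders; HF_TOKEN is assumed to be exported in the host environment.

docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e HF_TOKEN \
  -e NIM_MODEL_NAME=hf://meta-llama/Meta-Llama-3-8B \
  -e NIM_SERVED_MODEL_NAME=meta-llama-3-8b \
  -e NIM_ENABLE_AUTO_TOOL_CHOICE=1 \
  -e NIM_TOOL_CALL_PARSER=llama3_json \
  -p 8000:8000 \
  $IMG_NAME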
LLM-specific NIMs#
For LLM-specific NIMs downloaded from NVIDIA, the following are additional environment variables that can be used to tune the behavior per the above instructions.
NIM_FT_MODEL
Points to the path of the custom fine-tuned weights in the container. Default value: None

NIM_MANIFEST_ALLOW_UNSAFE
Set to 1 to enable selection of a model profile not included in the original model_manifest.yaml. If set, you must also set NIM_MODEL_NAME to the path to the model directory or an NGC path. Default value: 0

NIM_SERVED_MODEL_NAME
The model name(s) used in the API. If multiple names are provided (comma-separated), the server responds to any of the provided names. The model name in the model field of a response matches the name in the request. If not specified, the model name is inferred from the manifest located at /opt/nim/etc/default/model_manifest.yaml. Note that this name is also used in the model_name tag of Prometheus metrics; if multiple names are provided, the metrics tag takes the first one. Default value: None
Volumes#
These settings define how to mount local file system paths into the NIM container.
/opt/nim/.cache
This is the default directory where models are downloaded and cached inside the container. Mount a directory from your host machine to this path to preserve the cache between container runs. If this volume is not mounted, the container downloads the model every time it starts. You can customize this path with the NIM_CACHE_PATH environment variable.

For example, to use ~/.cache/nim on your host machine as the cache directory:
1. Create the directory on your host: mkdir -p ~/.cache/nim
2. Mount the directory by running the docker run command with the -v and -u options: docker run ... -v ~/.cache/nim:/opt/nim/.cache -u $(id -u)