| Name | Required | Default | Description |
| --- | --- | --- | --- |
| NGC_API_KEY | Yes | — | Your personal NGC API key. See the docker run example after this table. |
| NIM_CACHE_PATH | No | /opt/nim/.cache | The location in the container where model artifacts are cached. |
| NIM_CUSTOM_GUIDED_DECODING_BACKENDS | No | None | The path to a directory of custom guided decoding backend directories. See custom guided decoding backend for details. |
| NIM_CUSTOM_MODEL_NAME | No | None | The model name given to a locally-built engine. If set, the locally-built engine is named NIM_CUSTOM_MODEL_NAME and is cached under that name in the NIM cache. The name must not duplicate that of any other cached custom engine. The cached engine is also listed under the same name by the list-model-profiles command and behaves like any other profile. On subsequent docker runs, a locally cached engine takes precedence over every other type of profile. You can also set NIM_MODEL_PROFILE to a specific custom model name to force NIM LLM to serve that cached engine. |
| NIM_DISABLE_CUDA_GRAPH | No | 0 | Set to 1 to disable the use of CUDA graphs. |
| NIM_DISABLE_LOG_REQUESTS | No | 1 | Set to 0 to log request details for v1/completions and v1/chat/completions. These logs contain sensitive attributes of the request, including prompt, sampling_params, and prompt_token_ids. Be aware that these attributes are exposed in the container logs when you set this to 0. |
| NIM_DISABLE_OVERLAP_SCHEDULING | No | 0 | Set to 1 to disable CPU overlap with the forward pass. Supported in version 1.7 only. |
| NIM_ENABLE_DP_ATTENTION | No | 0 | Set to 1 to enable DP attention when using SGLang. |
| NIM_ENABLE_KV_CACHE_HOST_OFFLOAD | No | None | Set to 1 to enable host-based KV cache offloading, or 0 to disable it. This only takes effect with the TensorRT-LLM backend and when NIM_ENABLE_KV_CACHE_REUSE is set to 1. Leave unset (None) to use the optimal offloading strategy for your system. |
| NIM_ENABLE_KV_CACHE_REUSE | No | 0 | Set to 1 to enable automatic prefix caching (KV cache reuse). Useful when large prompts appear frequently and reusing KV caches across requests would speed up inference. See the KV cache reuse example after this table. |
| NIM_ENABLE_OTEL | No | 0 | Set to 1 to enable OpenTelemetry instrumentation in NIMs. See the OpenTelemetry example after this table. |
| NIM_ENABLE_PROMPT_LOGPROBS | No | 0 | Set to 1 to enable a buildable path for context logits generation, which allows the echo functionality to work with log probabilities and also enables the top_logprobs feature in the response. Supported in version 1.8 and later. |
| NIM_FORCE_DETERMINISTIC | No | 0 | Set to 1 to force deterministic builds and enable runtime deterministic behavior. Supported in version 1.10 and later. |
| NIM_FT_MODEL | No | | The path to the custom fine-tuned weights in the container. Supported in version 1.8 and later. |
| NIM_GUIDED_DECODING_BACKEND | No | "xgrammar" | The guided decoding backend to use. Can be one of "xgrammar", "outlines", "lm-format-enforcer", or a custom guided decoding backend. |
| NIM_JSONL_LOGGING | No | 0 | Set to 1 to enable JSON-formatted logs. Readable text logs are enabled by default. |
| NIM_KV_CACHE_HOST_MEM_FRACTION | No | 0.1 | The fraction of free host memory to use for KV cache host offloading. This only takes effect if NIM_ENABLE_KV_CACHE_HOST_OFFLOAD is enabled. |
| NIM_LOG_LEVEL | No | DEFAULT | The log level of the NIM for LLMs service. Possible values are DEFAULT, TRACE, DEBUG, INFO, WARNING, ERROR, and CRITICAL. The effect of DEBUG, INFO, WARNING, ERROR, and CRITICAL is described in the Python 3 logging docs. The TRACE log level enables printing of diagnostic information for debugging purposes in TRT-LLM and in uvicorn. When NIM_LOG_LEVEL is DEFAULT, all log levels are set to INFO except the TRT-LLM log level, which is set to ERROR. When NIM_LOG_LEVEL is CRITICAL, the TRT-LLM log level is ERROR. |
| NIM_LOW_MEMORY_MODE | No | 0 | Set to 1 to offload locally-built TRT-LLM engines to disk. This reduces the runtime host memory requirement, but may increase startup time and disk usage. |
| NIM_MANIFEST_ALLOW_UNSAFE | No | 0 | Set to 1 to enable selection of a model profile not included in the original model_manifest.yaml. If set, you must also set NIM_MODEL_NAME to the path of the model directory or an NGC path. |
| NIM_MAX_BATCH_SIZE | No | None | The maximum batch size for TRT-LLM inference. If unspecified, it is automatically derived from the detected GPUs. This setting only affects models running on the TRT-LLM backend where the selected profile has trtllm-buildable equal to true; in that case, the TRT-LLM build parameter max_batch_size is set to this value. |
| NIM_MAX_CPU_LORAS | No | 16 | The number of LoRAs that can fit in the CPU PEFT cache. Set this to at least the maximum concurrency or the number of LoRAs you are serving, whichever is lower. If you have more concurrent LoRA requests than NIM_MAX_CPU_LORAS, you may see "cache is full" errors. This value must be >= NIM_MAX_GPU_LORAS. |
| NIM_MAX_GPU_LORAS | No | 8 | The number of LoRAs that can fit in the GPU PEFT cache. This is the maximum number of LoRAs that can be used in a single batch. |
| NIM_MAX_LORA_RANK | No | 32 | The maximum LoRA rank. |
| NIM_MAX_MODEL_LEN | No | None | The model context length. If unspecified, it is automatically derived from the model configuration. This setting only affects models running on the TRT-LLM backend where the selected profile has trtllm-buildable equal to true; in that case, the TRT-LLM build parameter max_seq_len is set to this value. |
| NIM_MODEL_NAME | No | "Model Name" | The path to a model directory or an NGC path of the form ngc://<org>/<team>/<model_name>:<version>, for example ngc://nim/meta/llama3-8b-instruct:hf. Set this only if NIM_MANIFEST_ALLOW_UNSAFE is set to 1. |
| NIM_MODEL_PROFILE | No | None | Override the automatically selected NIM optimization profile by specifying a profile ID from the manifest located at /opt/nim/etc/default/model_manifest.yaml. If not specified, NIM attempts to select an optimal profile compatible with the available GPUs. A list of compatible profiles can be obtained by appending list-model-profiles to the end of the docker run command. Using the profile name default selects a profile that is maximally compatible but may not be optimal for your hardware. |
| NIM_NUM_KV_CACHE_SEQ_LENS | No | None | Set to a value greater than or equal to 1 to override the default KV cache memory allocation settings for NIM LLM. The value specifies how many maximum sequence lengths can fit within the KV cache (for example, 2 or 3.75). The maximum sequence length is the context size of the model. NIM_RELAX_MEM_CONSTRAINTS must be set to 1 for this environment variable to take effect. |
| NIM_OTEL_EXPORTER_OTLP_ENDPOINT | No | None | The endpoint where the OpenTelemetry Collector is listening for OTLP data. Adjust the URL to match your OpenTelemetry Collector's configuration. |
| NIM_OTEL_METRICS_EXPORTER | No | console | Similar to NIM_OTEL_TRACES_EXPORTER, but for metrics. |
| NIM_OTEL_SERVICE_NAME | No | None | The name of your service, to help with identifying and categorizing data. |
| NIM_OTEL_TRACES_EXPORTER | No | console | The OpenTelemetry exporter to use for tracing. Set this to otlp to export traces using the OpenTelemetry Protocol, or to console to print them to standard output. |
| NIM_PEFT_REFRESH_INTERVAL | No | None | How often to check NIM_PEFT_SOURCE for new models, in seconds. If not set, the PEFT cache does not refresh. If you enable PEFT refreshing with this variable, a value greater than 30 is recommended. |
| NIM_PEFT_SOURCE | No | | To enable PEFT inference with local PEFT modules, set the NIM_PEFT_SOURCE environment variable and pass it into the run container command. If your PEFT source is a local directory at LOCAL_PEFT_DIRECTORY, mount that directory to the container path set by NIM_PEFT_SOURCE. Make sure that the directory contains only PEFT modules for the base NIM and that the directory and all of its contents are readable by NIM. See the PEFT example after this table. |
| NIM_RELAX_MEM_CONSTRAINTS | No | 0 | If set to 1, use the value provided in NIM_NUM_KV_CACHE_SEQ_LENS. The recommended default for NIM LLM is for all GPUs to have at least 95% of memory free; setting this variable to 1 overrides that default and runs the model regardless of memory constraints. NIM also uses heuristics to determine whether the GPU is likely to meet or fail memory requirements and prints a warning if applicable. If set to 1 and NIM_NUM_KV_CACHE_SEQ_LENS is not specified, NIM_NUM_KV_CACHE_SEQ_LENS is automatically set to 1. |
| NIM_REPOSITORY_OVERRIDE | No | None | If set to a non-empty string, the NIM_REPOSITORY_OVERRIDE value replaces the hard-coded location of the repository and the protocol used to access it. The value has the form <repository type>://<repository location>. Only the ngc://, s3://, and https:// protocols are supported, and only the first component of the URI is replaced. For example: if the URI in the manifest is ngc://org/meta/llama3-8b-instruct:hf?file=config.json and NIM_REPOSITORY_OVERRIDE=ngc://myrepo.ai/, the domain name for the API endpoint is set to myrepo.ai; if NIM_REPOSITORY_OVERRIDE=s3://mybucket/, the result of the replacement is s3://mybucket/nim%2Fmeta%2Fllama3-8b-instruct%3Ahf%3Ffile%3Dconfig.json; if NIM_REPOSITORY_OVERRIDE=https://mymodel.ai/some_path_optional, the result of the replacement is https://mymodel.ai/some_path/nim%2Fmeta%2Fllama3-8b-instruct%3Ahf%3Ffile%3Dconfig.json. The repository override feature supports basic authentication mechanisms: https assumes authorization using the Authorization header and the credential value in NIM_HTTPS_CREDENTIAL; ngc requires a credential in the NGC_API_KEY environment variable; s3 requires the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and (if using temporary credentials) AWS_SESSION_TOKEN. See the repository override example after this table. |
| NIM_REWARD_LOGITS_RANGE | No | None | The range of generation logits from which to extract reward scores. It should be a comma-separated list of two integers. For example, "0,1" means the first logit is the reward score, and "3,5" means the 4th and 5th logits are the reward scores. Supported in version 1.8 and later. |
| NIM_REWARD_MODEL | No | 0 | Set to 1 to enable reward score collection from the model's response. Supported in version 1.8 and later. |
| NIM_REWARD_MODEL_STRING | No | None | The reward model string. Supported in version 1.10 and later. |
| NIM_SCHEDULER_POLICY | No | guarantee_no_evict | The runtime scheduler policy to use. Possible values: guarantee_no_evict or max_utilization. This applies only to the TRT-LLM backend and does not affect vLLM profiles. |
| NIM_SDK_USE_NATIVE_TLS | No | 0 | Set to 1 to use the native TLS stack for downloading from NGC. By default, rustls-tls is used, which can fail if a custom CA is used. To use native TLS: (1) mount the certificate file, (2) provide the path to the certificate in SSL_CERT_FILE, and (3) set https_proxy to the address of the proxy. |
| NIM_SERVED_MODEL_NAME | No | None | The model name(s) used in the API. If multiple comma-separated names are provided, the server responds to any of them, and the model field of a response contains the first name in the list. If not specified, the model name is inferred from the manifest located at /opt/nim/etc/default/model_manifest.yaml. The name is also used in the model_name tag of Prometheus metrics; if multiple names are provided, the metrics tag uses the first one. |
| NIM_PROXY_CONNECTIVITY_TARGETS | No | authn.nvidia.com,api.ngc.nvidia.com,xfiles.ngc.nvidia.com,huggingface.co,cas-bridge.xethub.hf.co | A comma-separated list of host names to verify through the proxy when https_proxy is set. These hosts are tested for connectivity during startup to ensure the proxy allows access to required services. If not set, the default list is used. If set to an empty string, no connectivity checks are performed. If connectivity checks fail, verify that your proxy allows connections to these domains. |
| NIM_SERVER_PORT | No | 8000 | Publish the NIM service on the specified port inside the container. Make sure to adjust the port passed to the -p/--publish flag of docker run accordingly (for example, -p $NIM_SERVER_PORT:$NIM_SERVER_PORT). The left-hand side of the : is the host address:port and does not have to match $NIM_SERVER_PORT; the right-hand side is the port inside the container, which must match NIM_SERVER_PORT (or 8000 if not set). See the docker run example after this table. |
| NIM_SSL_CA_CERTS_PATH | Required if NIM_SSL_MODE="MTLS" | None | The path to the CA (Certificate Authority) certificate. |
| NIM_SSL_CERTS_PATH | Required if NIM_SSL_MODE is enabled | None | The path to the server's certificate file (required for TLS HTTPS). It contains the public key and server identification information. |
| NIM_SSL_KEY_PATH | Required if NIM_SSL_MODE is enabled | None | The path to the server's TLS private key file (required for TLS HTTPS). It is used to decrypt incoming messages and sign outgoing ones. |
| NIM_SSL_MODE | No | "DISABLED" | Set a value to enable SSL/TLS on the served endpoints, or leave it unset to skip the NIM_SSL_KEY_PATH, NIM_SSL_CERTS_PATH, and NIM_SSL_CA_CERTS_PATH variables. Possible values: (1) "DISABLED" - no HTTPS, (2) "TLS" - HTTPS with server-side TLS only (client certificate not required), (3) "MTLS" - HTTPS with mTLS (client certificate required). "TLS" requires NIM_SSL_CERTS_PATH and NIM_SSL_KEY_PATH; "MTLS" requires NIM_SSL_CERTS_PATH, NIM_SSL_KEY_PATH, and NIM_SSL_CA_CERTS_PATH. See the TLS example after this table. |
| NIM_TOKENIZER_MODE | No | auto | The tokenizer mode. auto uses the fast tokenizer if available; slow always uses the slow tokenizer. |
| NIM_TRUST_CUSTOM_CODE | No | 0 | Set to 1 to enable a custom guided decoding backend. This enables arbitrary Python code execution as part of the custom guided decoding. |
| SSL_CERT_FILE | No | None | The path to the SSL certificate used for downloading models when NIM runs behind a proxy. The certificate of the proxy must be used together with the NIM_SDK_USE_NATIVE_TLS and https_proxy environment variables. |
| NIM_SDK_MAX_PARALLEL_DOWNLOAD_REQUESTS | No | 1 | The maximum number of parallel download requests when downloading models. |
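
A minimal sketch of a docker run invocation that wires together NGC_API_KEY, NIM_CACHE_PATH, and NIM_SERVER_PORT from the table above. The container image name, the host cache directory, and the host port are placeholders; substitute the values for your deployment.

```bash
# Your personal NGC API key (required).
export NGC_API_KEY=<your-ngc-api-key>

# Host directory used to persist the model cache between runs (placeholder path).
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_CACHE_PATH=/opt/nim/.cache \
  -e NIM_SERVER_PORT=8000 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest   # placeholder image name
```

The left-hand side of -p can be any free host port; the right-hand side must match NIM_SERVER_PORT (or 8000 if it is unset).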
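
A minimal sketch of enabling mTLS on the served endpoints. Only the environment variable names come from the table; the certificate file names and the /certs mount point are assumptions for illustration.

```bash
# Certificates live on the host in ./certs (assumed layout: server.crt, server.key, ca.crt).
docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_SSL_MODE="MTLS" \
  -e NIM_SSL_CERTS_PATH=/certs/server.crt \
  -e NIM_SSL_KEY_PATH=/certs/server.key \
  -e NIM_SSL_CA_CERTS_PATH=/certs/ca.crt \
  -v "$(pwd)/certs:/certs:ro" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest   # placeholder image name
```

For server-side TLS only, set NIM_SSL_MODE="TLS" and omit NIM_SSL_CA_CERTS_PATH.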
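
A minimal sketch of serving local LoRA adapters with NIM_PEFT_SOURCE and the LoRA cache-size variables. The host directory ./loras, the in-container adapter path, and the refresh interval of 60 seconds are assumptions chosen for illustration.

```bash
# Host directory containing one subdirectory per PEFT module for the base model (placeholder path).
export LOCAL_PEFT_DIRECTORY=$(pwd)/loras
# In-container path the NIM reads adapters from (placeholder path).
export NIM_PEFT_SOURCE=/home/nvs/loras

docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_PEFT_SOURCE \
  -e NIM_PEFT_REFRESH_INTERVAL=60 \
  -e NIM_MAX_GPU_LORAS=8 \
  -e NIM_MAX_CPU_LORAS=16 \
  -v "$LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE:ro" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest   # placeholder image name
```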
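
A minimal sketch of exporting traces and metrics over OTLP using the NIM_OTEL_* variables. The collector endpoint and the service name are placeholders; point them at your own OpenTelemetry Collector.

```bash
docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_ENABLE_OTEL=1 \
  -e NIM_OTEL_TRACES_EXPORTER=otlp \
  -e NIM_OTEL_METRICS_EXPORTER=otlp \
  -e NIM_OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
  -e NIM_OTEL_SERVICE_NAME=my-llm-nim \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest   # placeholder image name
```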
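
A minimal sketch of redirecting model downloads to an S3 bucket with NIM_REPOSITORY_OVERRIDE. The bucket name and credentials are placeholders; only the first component of each manifest URI is replaced, as described in the table.

```bash
# AWS credentials are read from the environment; AWS_SESSION_TOKEN is only needed for temporary credentials.
docker run -it --rm \
  --gpus all \
  -e NIM_REPOSITORY_OVERRIDE=s3://mybucket/ \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_SESSION_TOKEN \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest   # placeholder image name
```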
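
A minimal sketch of enabling prefix caching (KV cache reuse) with host offloading on the TensorRT-LLM backend. The host memory fraction of 0.2 is an arbitrary illustration; 0.1 is the documented default.

```bash
docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_ENABLE_KV_CACHE_REUSE=1 \
  -e NIM_ENABLE_KV_CACHE_HOST_OFFLOAD=1 \
  -e NIM_KV_CACHE_HOST_MEM_FRACTION=0.2 \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest   # placeholder image name
```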