NGC_API_KEY Yes None You must set this variable to the value of your personal NGC API key.

NIM_CACHE_PATH No /opt/nim/.cache Location (in container) where the container caches model artifacts.

NIM_DISABLE_LOG_REQUESTS No 1 Set to 0 to view request logs. By default, logs of request details to v1/completions and v1/chat/completions are disabled. These logs contain sensitive attributes of the request, including prompt, sampling_params, and prompt_token_ids. Be aware that setting this variable to 0 exposes these attributes in the container logs.

NIM_JSONL_LOGGING No 0 Set to 1 to enable JSON-formatted logs. Readable text logs are enabled by default.

NIM_LOG_LEVEL No DEFAULT Log level of the NIM for LLMs service. Possible values are DEFAULT, TRACE, DEBUG, INFO, WARNING, ERROR, and CRITICAL. The effect of DEBUG, INFO, WARNING, ERROR, and CRITICAL is mostly as described in the Python 3 logging docs. The TRACE log level enables printing of diagnostic information for debugging purposes in TRT-LLM and in uvicorn. When NIM_LOG_LEVEL is DEFAULT, all log levels are set to INFO except the TRT-LLM log level, which is set to ERROR. When NIM_LOG_LEVEL is CRITICAL, the TRT-LLM log level is ERROR.

NIM_SERVER_PORT No 8000 Publish the NIM service to the prescribed port inside the container. Make sure to adjust the port passed to the -p/--publish flag of docker run accordingly (for example, -p $NIM_SERVER_PORT:$NIM_SERVER_PORT). The left-hand side of the : is your host address:port and does NOT have to match $NIM_SERVER_PORT. The right-hand side of the : is the port inside the container, which MUST match NIM_SERVER_PORT (or 8000 if not set).
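The port rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of NIM: the helper name nim_port_args and the port values are hypothetical, and the only docker run options it emits are -e and -p.

```python
from typing import List


def nim_port_args(host_port: int, nim_server_port: int = 8000) -> List[str]:
    """Build docker run arguments that keep -p consistent with NIM_SERVER_PORT.

    Hypothetical helper for illustration only.
    """
    for name, port in (("host_port", host_port), ("nim_server_port", nim_server_port)):
        if not 1 <= port <= 65535:
            raise ValueError(f"{name} must be between 1 and 65535, got {port}")
    return [
        "-e", f"NIM_SERVER_PORT={nim_server_port}",  # port the service uses inside the container
        "-p", f"{host_port}:{nim_server_port}",      # host side is free; container side must match
    ]


if __name__ == "__main__":
    # Example values are arbitrary: host port 9000 forwards to container port 8080.
    print(" ".join(nim_port_args(host_port=9000, nim_server_port=8080)))
```

Deriving both arguments from one value makes it harder for the container side of -p to drift away from NIM_SERVER_PORT.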

NIM_MODEL_PROFILE No None Override the automatically selected NIM optimization profile by specifying a profile ID from the manifest located at /opt/nim/etc/default/model_manifest.yaml. If not specified, NIM attempts to select an optimal profile compatible with the available GPUs. To list the compatible profiles, append list-model-profiles to the end of the docker run command. The profile name default selects a maximally compatible profile that may not be optimal for your hardware.

NIM_MANIFEST_ALLOW_UNSAFE No 0 If set to 1, enables selection of a model profile not included in the original model_manifest.yaml. If set, you must also set NIM_MODEL_NAME to the path to the model directory or to an NGC path.

NIM_MODEL_NAME No “Model Name” Must be set only if NIM_MANIFEST_ALLOW_UNSAFE is set to 1. This must be a path to a model directory or an NGC path of the form ngc://<org>/<team>/<model_name>:<version>. For example: ngc://nim/meta/llama3-8b-instruct:hf.
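As a rough illustration of the NGC path form given above, the following Python sketch splits such a path into its four parts. It assumes only the ngc://<org>/<team>/<model_name>:<version> shape stated in this entry. The names NgcPath and parse_ngc_path are hypothetical, and NIM itself may accept other variants that this check would reject.

```python
import re
from typing import NamedTuple


class NgcPath(NamedTuple):
    org: str
    team: str
    model_name: str
    version: str


# Matches only the documented shape: ngc://<org>/<team>/<model_name>:<version>
_NGC_PATH_RE = re.compile(
    r"^ngc://(?P<org>[^/:]+)/(?P<team>[^/:]+)/(?P<model_name>[^/:]+):(?P<version>[^/:]+)$"
)


def parse_ngc_path(value: str) -> NgcPath:
    """Split an NGC path into its components (hypothetical helper)."""
    match = _NGC_PATH_RE.match(value)
    if match is None:
        raise ValueError(f"not of the form ngc://<org>/<team>/<model_name>:<version>: {value!r}")
    return NgcPath(**match.groupdict())


if __name__ == "__main__":
    print(parse_ngc_path("ngc://nim/meta/llama3-8b-instruct:hf"))
```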

NIM_PEFT_SOURCE No To enable PEFT inference with local PEFT modules, set the NIM_PEFT_SOURCE environment variable and pass it to the command that runs the container. If your PEFT source is a local directory at LOCAL_PEFT_DIRECTORY, mount that directory at the container path set by NIM_PEFT_SOURCE. Make sure that the directory contains only PEFT modules for the base NIM, and that the directory and all of its contents are readable by NIM.

NIM_MAX_LORA_RANK No 32 Set the maximum LoRA rank.

NIM_MAX_GPU_LORAS No 8 Set the number of LoRAs that can fit in the GPU PEFT cache. This is the maximum number of LoRAs that can be used in a single batch.

NIM_MAX_CPU_LORAS No 16 Set the number of LoRAs that can fit in the CPU PEFT cache. This should be set >= the max concurrency or the number of LoRAs you are serving, whichever is less. If you have more concurrent LoRA requests than NIM_MAX_CPU_LORAS, you may see “cache is full” errors. This value must be >= NIM_MAX_GPU_LORAS.
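The two sizing rules for the LoRA caches can be checked before starting the container. The sketch below is an assumption-laden illustration: check_lora_cache_sizes is a hypothetical helper, and max_concurrency and loras_served are values you would supply from your own deployment, not NIM settings.

```python
from typing import List, Optional


def check_lora_cache_sizes(
    max_gpu_loras: int = 8,
    max_cpu_loras: int = 16,
    max_concurrency: Optional[int] = None,  # assumed: your expected concurrent requests
    loras_served: Optional[int] = None,     # assumed: number of LoRAs you serve
) -> List[str]:
    """Return a list of problems with the chosen cache sizes (empty if none)."""
    problems = []
    # Hard requirement: the CPU cache must be at least as large as the GPU cache.
    if max_cpu_loras < max_gpu_loras:
        problems.append("NIM_MAX_CPU_LORAS must be >= NIM_MAX_GPU_LORAS")
    # Recommendation: CPU cache >= min(max concurrency, LoRAs served).
    # If only one of the two is known, it is used as a conservative bound.
    known = [n for n in (max_concurrency, loras_served) if n is not None]
    if known and max_cpu_loras < min(known):
        problems.append(
            "NIM_MAX_CPU_LORAS should be >= min(max concurrency, LoRAs served)"
        )
    return problems


if __name__ == "__main__":
    for problem in check_lora_cache_sizes(8, 16, max_concurrency=64, loras_served=20):
        print(problem)
```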

NIM_PEFT_REFRESH_INTERVAL No None How often, in seconds, to check NIM_PEFT_SOURCE for new models. If not set, the PEFT cache will not refresh. If you enable PEFT refreshing by setting this variable, we recommend a value greater than 30.

NIM_SERVED_MODEL_NAME No None The model name(s) used in the API. If multiple names are provided (comma-separated), the server will respond to any of the provided names. The model name in the model field of a response will be the first name in this list. If not specified, the model name will be inferred from the manifest located at /opt/nim/etc/default/model_manifest.yaml. Note that the name(s) will also be used in the model_name tag of Prometheus metrics. If multiple names are provided, the metrics tag takes the first one.
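The naming behavior described above can be pictured with a short Python sketch. It is not NIM's implementation: split_served_model_names is a hypothetical helper, the example names are made up, and trimming whitespace around each name is an assumption of this sketch.

```python
from typing import List, Tuple


def split_served_model_names(value: str) -> Tuple[str, List[str]]:
    """Return (primary_name, all_names) for a comma-separated name list.

    The primary name is the first entry, which the description above says is
    used in the response's model field and in the Prometheus model_name tag.
    """
    # Assumption: surrounding whitespace and empty entries are ignored.
    names = [name.strip() for name in value.split(",") if name.strip()]
    if not names:
        raise ValueError("at least one model name is required")
    return names[0], names


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    primary, accepted = split_served_model_names("my-llama,llama-alias")
    print(primary)   # name reported in responses and metrics
    print(accepted)  # names a request may use
```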

NIM_CUSTOM_MODEL_NAME No None The model name given to a locally-built engine. If set, the locally-built engine will be named NIM_CUSTOM_MODEL_NAME and cached under the same name in the NIM cache. The name must be unique among cached custom engines. The cached engine will also be listed under the same name by the list-model-profiles command and will behave like every other profile. On subsequent docker runs, a locally cached engine takes precedence over every other type of profile. You may also set NIM_MODEL_PROFILE to a specific custom model name to force NIM LLM to serve that cached engine.

NIM_LOW_MEMORY_MODE No 0 Set this flag to 1 to enable offloading the locally-built TRT-LLM engines to disk. This reduces the runtime host memory requirement, but may increase the startup time and disk usage.

NIM_ENABLE_OTEL No 0 Set this flag to 1 to enable OpenTelemetry instrumentation in NIMs.

NIM_OTEL_TRACES_EXPORTER No console Specifies the OpenTelemetry exporter to use for tracing. Set this flag to otlp to export the traces using the OpenTelemetry Protocol. Set it to console to print the traces to standard output.

NIM_OTEL_METRICS_EXPORTER No console Similar to NIM_OTEL_TRACES_EXPORTER, but for metrics.

NIM_OTEL_EXPORTER_OTLP_ENDPOINT No None The endpoint where the OpenTelemetry Collector is listening for OTLP data. Adjust the URL to match your OpenTelemetry Collector’s configuration.

NIM_OTEL_SERVICE_NAME No None Sets the name of your service to help with identifying and categorizing data.

NIM_TOKENIZER_MODE No auto The tokenizer mode. auto will use the fast tokenizer if available. slow will always use the slow tokenizer.

NIM_ENABLE_KV_CACHE_REUSE No 0 Set to 1 to enable automatic prefix caching / KV cache reuse. Intended for use cases where large prompts frequently recur and reusing KV caches across requests would speed up inference.

NIM_RELAX_MEM_CONSTRAINTS No 0 If set to 1 and NIM_NUM_KV_CACHE_SEQ_LENS is not specified, NIM_NUM_KV_CACHE_SEQ_LENS is automatically set to 1. If set to 1 and NIM_NUM_KV_CACHE_SEQ_LENS is specified, the provided value is used. By default, NIM LLM expects all GPUs to have >= 95% of memory free. Setting this variable to 1 overrides that default and instead determines whether each GPU meets the memory requirements based on memory estimates.

NIM_NUM_KV_CACHE_SEQ_LENS No None NIM_RELAX_MEM_CONSTRAINTS must be set to 1 for this environment variable to take effect. Set to a value greater than or equal to 1 to override the default KV cache memory allocation settings for NIM LLM. The value provided will be used to determine how many maximum sequence lengths can fit within the KV cache (for example 2 or 3.75). The maximum sequence length is the context size of the model.
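The interaction between NIM_RELAX_MEM_CONSTRAINTS and NIM_NUM_KV_CACHE_SEQ_LENS can be summarized in a small Python sketch. It only restates the rules from the two entries above. The function effective_num_kv_cache_seq_lens is hypothetical, and returning None to mean "default allocation applies" is a convention of this sketch, not NIM behavior.

```python
from typing import Mapping, Optional


def effective_num_kv_cache_seq_lens(env: Mapping[str, str]) -> Optional[float]:
    """Resolve the KV cache sizing factor from environment-style settings.

    Returns None when NIM_RELAX_MEM_CONSTRAINTS is not 1, because
    NIM_NUM_KV_CACHE_SEQ_LENS then has no effect.
    """
    if env.get("NIM_RELAX_MEM_CONSTRAINTS", "0") != "1":
        return None
    raw = env.get("NIM_NUM_KV_CACHE_SEQ_LENS")
    if raw is None:
        return 1.0  # documented fallback when the value is not specified
    value = float(raw)  # fractional values such as 3.75 are allowed
    if value < 1:
        raise ValueError("NIM_NUM_KV_CACHE_SEQ_LENS must be >= 1")
    return value


if __name__ == "__main__":
    print(effective_num_kv_cache_seq_lens({"NIM_RELAX_MEM_CONSTRAINTS": "1"}))
    print(effective_num_kv_cache_seq_lens(
        {"NIM_RELAX_MEM_CONSTRAINTS": "1", "NIM_NUM_KV_CACHE_SEQ_LENS": "3.75"}
    ))
```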

NIM_MAX_MODEL_LEN No None Model context length. If unspecified, it is automatically derived from the model configuration. Note that this setting affects only models running on the vLLM backend and models where the selected profile has trtllm-buildable equal to true. When trtllm-buildable is equal to true, the TRT-LLM build parameter max_seq_len is set to this value.