# Configure Your NIM with NVIDIA NIM for LLMs
NVIDIA NIM for LLMs (NIM for LLMs) uses Docker containers under the hood. Each NIM is its own Docker container, and there are several ways to configure it. The following is a full reference of the ways to configure a NIM container.
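All of these configuration mechanisms are passed on the `docker run` command line. As a rough orientation, a minimal sketch of a typical launch follows; the image name, port, API key variable, and cache mount are illustrative placeholders rather than values defined in this reference, and the individual pieces are explained in the sections below.

```bash
# Minimal sketch of a NIM launch (placeholders throughout; see the sections
# below for GPU selection, environment variables, and volumes).
docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/<org>/<model>:<tag>
```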
## GPU Selection
Passing `--gpus all` to `docker run` is acceptable in homogeneous environments with one or more of the same GPU.
**Note:** `--gpus all` only works if your configuration has the same number of GPUs as specified for the model in Supported Models for NVIDIA NIM for LLMs. Running inference on a configuration with fewer or more GPUs can result in a runtime error.
In heterogeneous environments with a combination of GPUs (for example, an A6000 plus a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:

- the `--gpus` flag (for example, `--gpus='"device=1"'`)
- the environment variable `NVIDIA_VISIBLE_DEVICES` (for example, `-e NVIDIA_VISIBLE_DEVICES=1`)
The device IDs to use are listed in the output of `nvidia-smi -L`:
GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
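For example, to expose only the H100 from the listing above, either approach below works; the image name is a placeholder, and depending on how the NVIDIA Container Toolkit is configured you may also need `--runtime=nvidia` when using the environment-variable form.

```bash
# Expose only GPU 0 (the H100 in the listing above) via the --gpus flag.
docker run --gpus='"device=0"' nvcr.io/nim/<org>/<model>:<tag>

# Equivalent selection via NVIDIA_VISIBLE_DEVICES.
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 nvcr.io/nim/<org>/<model>:<tag>
```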
Refer to the NVIDIA Container Toolkit documentation for more instructions.
## How many GPUs do I need?
### Optimized Models
For models that have been optimized by NVIDIA, there are recommended tensor and pipeline parallelism configurations. Each profile has a TP (tensor parallelism) and a PP (pipeline parallelism) value, which can be read from its human-readable name (for example, `tensorrt_llm-trtllm_buildable-bf16-tp8-pp2`). In most cases, you will need `TP * PP` GPUs to run a specific profile.
For example, the profile `tensorrt_llm-trtllm_buildable-bf16-tp8-pp2` requires either 2 nodes with 8 GPUs each, or 2 * 8 = 16 GPUs on a single node.
### Other Models
For other models supported by NIM for LLMs, NIM attempts to set TP to the number of GPUs exposed in the container. You can set `NIM_TENSOR_PARALLEL_SIZE` and `NIM_PIPELINE_PARALLEL_SIZE` to specify an arbitrary inference configuration. In most cases, you will need `TP * PP` GPUs to run a specific profile. For more information about profiles, refer to Model Profiles in NVIDIA NIM for LLMs.
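For instance, the sketch below requests tensor parallelism of 4 with no pipeline parallelism, which means 4 GPUs must be visible inside the container; the image name is a placeholder.

```bash
# Sketch: TP=4, PP=1, so TP * PP = 4 GPUs must be exposed to the container.
docker run --gpus all \
  -e NIM_TENSOR_PARALLEL_SIZE=4 \
  -e NIM_PIPELINE_PARALLEL_SIZE=1 \
  nvcr.io/nim/<org>/<model>:<tag>
```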
## Environment Variables
The following are the environment variables that you can pass to a NIM (`-e` added to `docker run`).
| Variable | Required? | Default | Notes |
|---|---|---|---|
| | Yes | — | Your personal NGC API key for downloading inference containers. For LLM-specific NIMs, set this for downloading NGC models from manifest. |
| | No | | The location in the container where the container caches model artifacts. |
| | No | | The path to a directory of custom guided decoding backend directories. See Custom Guided Decoding Backends with NVIDIA NIM for LLMs for details. |
| | No | | The model name given to a locally-built engine. If set, the locally-built engine is named |
| | No | | Set to |
| | No | | Set to |
| | No | | |
| | No | | Set to |
| | No | | Set to |
| | No | | Set to |
| | No | | Set to |
| | No | | Set to |
| | No | | The guided decoding backend to use. Can be one of |
| | No | | Set to |
| | No | | The fraction of free host memory to use for KV cache host offloading. This only takes effect if |
| | No | | The log level of the NIM for LLMs service. Possible values of the variable are |
| | No | | Set to |
| | No | | The maximum batch size for TRTLLM inference. If unspecified, will be automatically derived from the detected GPUs. Note that this setting has an effect on only models running on the TRTLLM backend and models where the selected profile has |
| | No | | The number of LoRAs that can fit in the CPU PEFT cache. This should be set >= max concurrency or the number of LoRAs you are serving, whichever is less. If you have more concurrent LoRA requests than |
| | No | | The number of LoRAs that can fit in the GPU PEFT cache. This is the maximum number of LoRAs that can be used in a single batch. |
| | No | | The maximum LoRA rank. |
| | No | | The model context length. If unspecified, will be automatically derived from the model configuration. Note that this setting has an effect on only models running on the TRTLLM backend and models where the selected profile has |
| | No | "Model Name" | The path to a model directory. For LLM-specific NIMs, set this only if |
| | No | None | Override the NIM optimization profile that is automatically selected by specifying a profile ID from the manifest located at |
| | No | | Set to a value greater than or equal to |
| | No | | The endpoint where the OpenTelemetry Collector is listening for OTLP data. Adjust the URL to match your OpenTelemetry Collector's configuration. |
| | No | | Similar to |
| | No | | The name of your service, to help with identifying and categorizing data. |
| | No | | The OpenTelemetry exporter to use for tracing. Set this flag to |
| | No | | How often to check |
| | No | | If you want to enable PEFT inference with local PEFT modules, then set a |
| | No | | If set to |
| | No | | If set to a non-empty string, the |
| | No | None | The range in generation logits to extract reward scores. It should be a comma-separated list of two integers. For example, |
| | No | | Set to |
| | No | None | The reward model string. Supported in version 1.10 and later. |
| | No | | The runtime scheduler policy to use. Possible values: |
| | No | | Set to |
| | No | | Publish the NIM service to the specified port inside the container. Make sure to adjust the port passed to the |
| | Required if | | The path to the CA (Certificate Authority) certificate. |
| | Required if | | The path to the server's certificate file (required for TLS HTTPS). It contains the public key and server identification information. |
| | Required if | | The path to the server's TLS private key file (required for TLS HTTPS). It's used to decrypt incoming messages and sign outgoing ones. |
| | No | | Specify a value to enable SSL/TLS in served endpoints or skip environment variables |
| | No | | The tokenizer mode. |
| | No | | Set to |
| | No | | Set this to |
| | No | | The path to the SSL certificate used for downloading models when NIM is run behind a proxy. The certificate of the proxy must be used together with |
| | No | | The maximum number of parallel download requests when downloading models. |
| | No | | Percentage of total GPU memory to allocate for the key-value (KV) cache during model inference. Considering a machine with 80 GB of GPU memory, where the model weights occupy 60 GB, setting |
| | No | | Controls which TensorRT-LLM backend to use. If |
| | No | | When |
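As a sketch of how several of these variables are commonly combined on one command line, the example below sets an API key, a log level, a served model name, and a profile override. The variable names shown (NGC_API_KEY, NIM_LOG_LEVEL, NIM_SERVED_MODEL_NAME, NIM_MODEL_PROFILE) and the image name are assumptions for illustration; confirm the exact names and accepted values against your NIM's documentation and manifest.

```bash
# Sketch only: variable names and the image are assumed, not definitive.
docker run --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -e NIM_LOG_LEVEL=INFO \
  -e NIM_SERVED_MODEL_NAME=my-model \
  -e NIM_MODEL_PROFILE=<profile-id-from-manifest> \
  -p 8000:8000 \
  nvcr.io/nim/<org>/<model>:<tag>
```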
### LLM-specific NIM Environment Variables
For LLM-specific NIMs downloaded from NVIDIA, the following additional environment variables can be used to tune behavior. Pass them to the container as described above.
| Variable | Required? | Default | Notes |
|---|---|---|---|
| | No | | Points to the path of the custom fine-tuned weights in the container. |
| | No | | Set to |
| | No | | The model name(s) used in the API. If multiple names are provided (comma-separated), the server will respond to any of the provided names. The model name in the model field of a response will be the first name in this list. If not specified, the model name will be inferred from the manifest located at |
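For example, a model can be exposed under multiple API names by supplying a comma-separated list; `NIM_SERVED_MODEL_NAME` is an assumed variable name here, and the model aliases and image name are placeholders.

```bash
# Sketch: respond to two names; the first is echoed in the "model" field of
# responses. Variable name, aliases, and image are illustrative assumptions.
docker run --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -e NIM_SERVED_MODEL_NAME="my-model,my-model-alias" \
  nvcr.io/nim/<org>/<model>:<tag>
```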
### LLM-agnostic NIM Environment Variables
When using the LLM-agnostic NIM container with other supported models, the following additional environment variables can be used to tune behavior. Pass them to the container as described above.
| Variable | Required? | Default | Notes |
|---|---|---|---|
| | No | | The absolute path to the |
| | No | | Set to |
| | No | | NIM will pick the pipeline parallel size that is provided here. |
| | No | | NIM will pick the tensor parallel size that is provided here. |
| | No | | How the model post-processes the LLM response text into a tool-call data structure. One of: |
| | No | | The absolute path of a Python file that is a custom tool-call parser. Required when |
| | No | | The model name(s) used in the API. If multiple names are provided (comma-separated), the server will respond to any of the provided names. The model name in the model field of a response will be the first name in this list. If not specified, the model name will be inferred from the HF URL. For local model paths, NIM will set the absolute path to the local model directory as the model name by default. It is highly recommended to set this environment variable in such cases. Note that this name(s) will also be used in |
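As a sketch of pointing the LLM-agnostic container at a locally downloaded checkpoint, the example below mounts the weights into the container, names the model explicitly for the API, and sets the parallelism. `NIM_MODEL_NAME` and `NIM_SERVED_MODEL_NAME` are assumed variable names, and the paths and image name are placeholders.

```bash
# Sketch only: NIM_MODEL_NAME / NIM_SERVED_MODEL_NAME are assumed names;
# paths and the image are placeholders.
docker run --gpus all \
  -v /models/my-model:/models/my-model \
  -e NIM_MODEL_NAME=/models/my-model \
  -e NIM_SERVED_MODEL_NAME=my-model \
  -e NIM_TENSOR_PARALLEL_SIZE=2 \
  -e NIM_PIPELINE_PARALLEL_SIZE=1 \
  nvcr.io/nim/<org>/<model>:<tag>
```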
## Volumes
Local paths can be mounted to the following container paths.
| Container path | Required? | Notes | Docker argument example |
|---|---|---|---|
| | No; however, if this volume is not mounted, the container does a fresh download of the model every time the container starts. | This is the directory to which models are downloaded inside the container. You can access this directory from within the container by adding the | |
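For example, a host directory can be mounted over the container's model cache so downloads persist across restarts. The container path `/opt/nim/.cache` is assumed here as the cache location, running as the current user is shown so the mounted directory stays writable, and the image name is a placeholder.

```bash
# Sketch: persist downloaded model artifacts across container runs.
# /opt/nim/.cache is an assumed cache path; adjust to your NIM's documentation.
mkdir -p ~/.cache/nim
docker run --gpus all \
  -u "$(id -u)" \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v ~/.cache/nim:/opt/nim/.cache \
  nvcr.io/nim/<org>/<model>:<tag>
```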