Configuring a NIM#
NVIDIA NIM for LLMs (NIM for LLMs) uses Docker containers under the hood. Each NIM is its own Docker container, and there are several ways to configure it. Below is a full reference for configuring a NIM container.
GPU Selection#
Passing `--gpus all` to `docker run` is acceptable in homogeneous environments with one or more of the same GPU.

Note

`--gpus all` only works if your configuration has the same number of GPUs as specified for the model in the Supported Models. Running inference on a configuration with fewer or more GPUs can result in a runtime error.
In heterogeneous environments with a combination of GPUs (for example: A6000 + a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:

- the `--gpus` flag (ex: `--gpus='"device=1"'`)
- the environment variable `NVIDIA_VISIBLE_DEVICES` (ex: `-e NVIDIA_VISIBLE_DEVICES=1`)

The device ID(s) to use as input(s) are listed in the output of `nvidia-smi -L`:
GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
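For example, a minimal sketch that exposes only GPU 0 (the H100 above) to the container; the image name, tag, and port mapping are placeholders rather than values taken from this page:

```shell
# Sketch: run a NIM container on one specific GPU, selected by device ID.
# "nvcr.io/nim/meta/llama3-8b-instruct:latest" is a placeholder image name.
docker run -it --rm \
    --gpus='"device=0"' \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest

# Equivalent selection with the environment variable (with the NVIDIA Container Runtime configured):
#   docker run -it --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 ...
```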
Refer to the NVIDIA Container Toolkit documentation for more instructions.
How many GPUs do I need?#
Each profile has a TP (tensor parallelism) and a PP (pipeline parallelism) value, which you can read from its name (example: `tensorrt_llm-trtllm_buildable-bf16-tp8-pp2` has TP=8 and PP=2).
In most cases, you will need TP * PP GPUs to run a specific profile.
For example, the profile `tensorrt_llm-trtllm_buildable-bf16-tp8-pp2` needs 8 * 2 = 16 GPUs, provided either as 2 nodes with 8 GPUs each or as 16 GPUs on a single node.
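As a quick check on the command line, here is a minimal sketch that assumes only the `tpN`/`ppM` naming convention shown above:

```shell
# Sketch: compute the GPU count implied by a profile name (assumes the tpN/ppM naming above).
profile="tensorrt_llm-trtllm_buildable-bf16-tp8-pp2"
tp=$(echo "$profile" | grep -o 'tp[0-9]*' | tr -d 'tp')   # -> 8
pp=$(echo "$profile" | grep -o 'pp[0-9]*' | tr -d 'p')    # -> 2
echo "GPUs needed: $((tp * pp))"                          # -> GPUs needed: 16
```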
Environment Variables#
Below is a reference of required and optional environment variables that can be passed into a NIM (with `-e` added to `docker run`):
| ENV | Required? | Default | Notes |
|---|---|---|---|
| | Yes | None | You must set this variable to the value of your personal NGC API key. |
| | No | | Location (in container) where the container caches model artifacts. |
| | No | | Set to |
| | No | | Set to |
| | No | | Log level of NIM for LLMs service. Possible values of the variable are DEFAULT, TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL. The effect of DEBUG, INFO, WARNING, ERROR, and CRITICAL is mostly described in the Python 3 logging docs. |
| | No | | Publish the NIM service to the prescribed port inside the container. Make sure to adjust the port passed to the |
| | No | None | Override the NIM optimization profile that is automatically selected by specifying a profile ID from the manifest located at |
| | No | | If set to |
| | No | “Model Name” | Must be set only if |
| | No | | If you want to enable PEFT inference with local PEFT modules, then set a |
| | No | | Set the maximum LoRA rank. |
| | No | | Set the number of LoRAs that can fit in GPU PEFT cache. This is the maximum number of LoRAs that can be used in a single batch. |
| | No | | Set the number of LoRAs that can fit in CPU PEFT cache. This should be set >= max concurrency or the number of LoRAs you are serving, whichever is less. If you have more concurrent LoRA requests than |
| | No | | How often to check |
| | No | | The model name(s) used in the API. If multiple names are provided (comma-separated), the server will respond to any of the provided names. The model name in the model field of a response will be the first name in this list. If not specified, the model name will be inferred from the manifest located at |
| | No | | The model name given to a locally-built engine. If set, the locally-built engine will be named |
| | No | | Set this flag to |
| | No | | Set this flag to |
| | No | | Specifies the OpenTelemetry exporter to use for tracing. Set this flag to |
| | No | | Similar to |
| | No | | The endpoint where the OpenTelemetry Collector is listening for OTLP data. Adjust the URL to match your OpenTelemetry Collector’s configuration. |
| | No | | Sets the name of your service to help with identifying and categorizing data. |
| | No | | The tokenizer mode. |
| | No | | Set to |
| | No | | If set to |
| | No | | |
| | No | | Model context length. If unspecified, will be automatically derived from the model configuration. Note that this setting has an effect only on models running on the vLLM backend and models where the selected profile has |
| | No | | If set to a non-empty string, the |
| | No | | The guided decoding backend to use. Can be one of |
| | No | | Set to |
| | No | | Main switch to enable SSL/TLS or skip environment variables |
| | Required if | | Path to the server’s TLS private key file (required for TLS HTTPS). It’s used to decrypt incoming messages and sign outgoing ones. |
| | Required if | | Path to the server’s certificate file (required for TLS HTTPS). It contains the public key and server identification information. |
| | Required if | | Path to the CA (Certificate Authority) certificate. This file is used to verify client certificates in mutual TLS (mTLS) setups. |
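As a minimal sketch of how these variables are passed at startup (the image name is a placeholder, and the `NGC_API_KEY` and `NIM_LOG_LEVEL` names are assumptions based on NVIDIA's NIM for LLMs documentation; verify them against the documentation for your NIM version):

```shell
# Sketch: pass configuration into a NIM container with -e.
# The image name is a placeholder, and the variable names are assumptions based on
# NVIDIA's NIM for LLMs documentation; verify them for your NIM version.
docker run -it --rm \
    --gpus all \
    -e NGC_API_KEY="$NGC_API_KEY" \
    -e NIM_LOG_LEVEL=INFO \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest
```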
Volumes#
Local paths can be mounted to the following container paths.
| Container path | Required? | Notes | Docker argument example |
|---|---|---|---|
| | No; however, if this volume is not mounted, the container does a fresh download of the model every time the container starts. | This directory is where models are downloaded inside the container. You can access this directory from within the container by adding the | |
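As a sketch, you can mount a local directory over the container cache path so that downloaded models survive container restarts; the `~/.cache/nim` host path and `/opt/nim/.cache` container path are assumptions based on NVIDIA's NIM documentation, not values taken from the table above:

```shell
# Sketch: persist the model cache across container runs.
# The host path (~/.cache/nim), container path (/opt/nim/.cache), and image name are
# assumptions; check your NIM's documentation for the cache location it actually uses.
mkdir -p ~/.cache/nim
docker run -it --rm \
    --gpus all \
    -v ~/.cache/nim:/opt/nim/.cache \
    nvcr.io/nim/meta/llama3-8b-instruct:latest
```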