# Configuring a NIM
NVIDIA NIM for LLMs uses Docker containers under the hood. Each NIM runs as its own Docker container, and there are several ways to configure it. Below is a full reference for configuring a NIM container.
## GPU Selection
Passing `--gpus all` to `docker run` is acceptable in homogeneous environments with one or more of the same GPU.

In heterogeneous environments with a combination of GPUs (for example, an A6000 plus a GeForce display GPU), workloads should run only on compute-capable GPUs. Expose specific GPUs inside the container using either:

- the `--gpus` flag (for example, `--gpus='"device=1"'`)
- the environment variable `NVIDIA_VISIBLE_DEVICES` (for example, `-e NVIDIA_VISIBLE_DEVICES=1`)
The device ID(s) to use as input(s) are listed in the output of `nvidia-smi -L`:

    GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
    GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
Refer to the NVIDIA Container Toolkit documentation for more instructions.
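In a heterogeneous environment, the device index can be extracted from `nvidia-smi -L` output and passed to `--gpus` as described above. The sketch below operates on a captured sample of that output (the GPU names, UUIDs, and image tag are illustrative placeholders, not from a real machine), and only prints the resulting command rather than running it:

```shell
# Sample `nvidia-smi -L` output (placeholder IDs/UUIDs).
smi_output='GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-deadbeef-0000-1111-2222-333344445555)'

# Pick the index of the first non-GeForce (compute-capable) GPU.
device_id=$(printf '%s\n' "$smi_output" | grep -v 'GeForce' | head -n1 \
  | sed 's/^GPU \([0-9]*\):.*/\1/')

# Expose only that device to the container. The image name is a placeholder;
# substitute the NIM image you actually pulled from NGC.
echo docker run --gpus "\"device=${device_id}\"" nvcr.io/nim/example-llm:latest
```

The same `device_id` could equally be passed via `-e NVIDIA_VISIBLE_DEVICES=${device_id}`.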
## Environment Variables
Below is a reference of required and optional environment variables that can be passed into a NIM (via `-e` added to `docker run`):
| ENV | Required? | Default | Notes |
|---|---|---|---|
|  | Yes | None | You must set this variable to the value of your personal NGC API key. |
|  | No |  | Location (in container) where the container caches model artifacts. |
|  | No |  | Set to |
|  | No |  | Set to |
|  | No |  | Log level of the NIM for LLMs service. Possible values: DEFAULT, TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL. The behavior of DEBUG, INFO, WARNING, ERROR, and CRITICAL is described in the Python 3 logging docs. |
|  | No |  | Publish the NIM service to the prescribed port inside the container. Make sure to adjust the port passed to the |
|  | No | None | Override the automatically selected NIM optimization profile by specifying a profile ID from the manifest located at |
|  | No |  | If set to |
|  | No |  | If you want to enable PEFT inference with local PEFT modules, set a |
|  | No |  | Set the max LoRA rank. |
|  | No |  | Set the number of LoRAs that can fit in the GPU PEFT cache. This is the maximum number of LoRAs that can be used in a single batch. |
|  | No |  | Set the number of LoRAs that can fit in the CPU PEFT cache. This should be set >= the max concurrency or the number of LoRAs you are serving, whichever is less. If you have more concurrent LoRA requests than |
|  | No |  | How often to check |
|  | No |  | The model name(s) used in the API. If multiple names are provided (comma-separated), the server responds to any of the provided names. The model name in the `model` field of a response is the first name in this list. If not specified, the model name is inferred from the manifest located at |
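One row in the table above constrains the service log level to a fixed set of values. Before passing a level with `-e`, it can be checked against that set with a small POSIX-shell helper (the helper name is an illustration, not part of NIM; the actual environment-variable name for the log level is the one listed in the table):

```shell
# Allowed log levels, taken from the table above.
valid_levels="DEFAULT TRACE DEBUG INFO WARNING ERROR CRITICAL"

# Print "yes" if $1 is an allowed level, "no" otherwise.
is_valid_level() {
  case " $valid_levels " in
    *" $1 "*) echo yes ;;
    *)        echo no  ;;
  esac
}

is_valid_level DEBUG    # prints: yes
is_valid_level VERBOSE  # prints: no
```

Validating up front avoids starting the container only to discover a misspelled level in its logs.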
## Volumes
Below are the paths inside the container into which local paths can be mounted.
| Container path | Required? | Notes | Docker argument example |
|---|---|---|---|
|  | Not required, but if this volume is not mounted, the container does a fresh download of the model each time it is brought up. | This is the directory within which models are downloaded inside the container. It is very important that this directory can be accessed from inside the container. This can be achieved by adding the option |  |
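Mounting the cache volume described above can be sketched as follows. Both paths here are assumptions for illustration (a host cache under `$HOME` and a guessed container path); check the table in your NIM's documentation for the actual container cache path before using this:

```shell
# Host directory that will persist downloaded model artifacts across runs.
host_cache="$HOME/.cache/nim"

# Placeholder for the container cache path -- substitute the path from the
# Volumes table for your NIM.
container_cache="/opt/nim/.cache"

# Ensure the host directory exists, then build the -v argument for docker run.
mkdir -p "$host_cache"
volume_arg="-v ${host_cache}:${container_cache}"
echo "$volume_arg"
```

The resulting `$volume_arg` is then added to the `docker run` invocation so subsequent container starts reuse the cached model instead of re-downloading it.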