# Configuring a NIM
NVIDIA NIM for LLMs uses Docker containers under the hood. Each NIM is its own Docker container, and there are several ways to configure it. Below is a full reference of the ways to configure a NIM container.
## GPU Selection
Passing `--gpus all` to `docker run` is acceptable in homogeneous environments with one or more of the same GPU.
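For example, a minimal invocation on a homogeneous node might look like the following sketch; the image name, tag, and published port are placeholders, so substitute the NIM container you are deploying:

```bash
# Homogeneous node: every GPU is the same model, so expose them all.
# The image name and tag below are placeholders for the NIM image pulled from NGC.
docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
```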
In heterogeneous environments with a combination of GPUs (for example: A6000 + a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:
- the `--gpus` flag (ex: `--gpus='"device=1"'`)
- the environment variable `NVIDIA_VISIBLE_DEVICES` (ex: `-e NVIDIA_VISIBLE_DEVICES=1`)
The device ID(s) to use as input(s) are listed in the output of `nvidia-smi -L`:
GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
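For example, to run only on GPU 0 (the H100 in the listing above), either of the following forms could be used; the image name and tag are placeholders, and the second form assumes the NVIDIA Container Toolkit runtime is available on the host:

```bash
# Option 1: select GPU 0 with the --gpus flag.
# Note the nested quoting required by the device= syntax.
docker run -it --rm \
  --gpus '"device=0"' \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest

# Option 2: select GPU 0 with the NVIDIA_VISIBLE_DEVICES environment variable.
docker run -it --rm \
  --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
```

Either the device index or the GPU UUID shown by `nvidia-smi -L` can be used as the device identifier.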
Refer to the NVIDIA Container Toolkit documentation for more instructions.
## Environment Variables
Below is a reference of the required and optional environment variables that can be passed into a NIM (with `-e` added to `docker run`):
| ENV | Required? | Default | Notes |
|---|---|---|---|
| `NGC_API_KEY` | Yes | None | You must set this variable to the value of your personal NGC API key. |
| `NIM_CACHE_PATH` | No | | Location (in container) where the container caches model artifacts. |
| | No | | Set to … |
| | No | | Set to … |
| `NIM_LOG_LEVEL` | No | | Log level of the NVIDIA NIM for LLMs service. Possible values: DEFAULT, TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL. DEBUG, INFO, WARNING, ERROR, and CRITICAL behave as described in the Python 3 logging docs. |
| `NIM_SERVER_PORT` | No | | Publish the NIM service to the prescribed port inside the container. Make sure to adjust the port passed to the `-p`/`--publish` option of `docker run` to match. |
| `NIM_MODEL_PROFILE` | No | None | Override the NIM optimization profile that’s automatically selected by specifying a profile ID from the manifest located at … |
| | No | | If set to … |
| `NIM_PEFT_SOURCE` | No | | To enable PEFT inference with local PEFT modules, set this to the directory containing those modules. |
| `NIM_MAX_LORA_RANK` | No | | Set the maximum LoRA rank. |
| `NIM_MAX_GPU_LORAS` | No | | Set the number of LoRAs that can fit in the GPU PEFT cache. This is the maximum number of LoRAs that can be used in a single batch. |
| `NIM_MAX_CPU_LORAS` | No | | Set the number of LoRAs that can fit in the CPU PEFT cache. This should be set to at least the maximum concurrency or the number of LoRAs you are serving, whichever is less. If you have more concurrent LoRA requests than `NIM_MAX_CPU_LORAS` … |
| `NIM_PEFT_REFRESH_INTERVAL` | No | | How often to check `NIM_PEFT_SOURCE` for new PEFT modules. |
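As an illustration, the following sketch combines several of the variables above in a single `docker run` command; the image name, tag, and values shown are placeholders to adapt to your deployment:

```bash
# Pass configuration to the NIM with -e; the values here are illustrative only.
docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -e NIM_LOG_LEVEL=INFO \
  -e NIM_SERVER_PORT=8000 \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
```

PEFT-related variables such as `NIM_PEFT_SOURCE` are passed the same way; if you point the NIM at local PEFT modules, the path must also be visible inside the container (for example, via a volume mount).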
## Volumes
Below are the paths inside the container into which local paths can be mounted.
| Container path | Required? | Notes | Docker argument example |
|---|---|---|---|
| | Not required, but if this volume isn’t mounted, the container will do a fresh download of the model each time it’s brought up. | This is the directory within which models are downloaded inside the container. It’s very important that this directory can be accessed from inside the container, for example by running the container as a user with write access to the mounted host directory (see the example below). | |
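A typical invocation that persists downloaded model artifacts across container restarts might look like the following sketch; the host path, image name, and tag are placeholders, and the in-container path `/opt/nim/.cache` is an assumption, so use the cache location configured for your NIM:

```bash
# Create a host directory for the model cache and make it writable.
mkdir -p ~/.cache/nim
chmod -R a+w ~/.cache/nim

# Mount the host directory over the container's model cache so downloads are reused,
# and run as the current user so the container can write to the mounted directory.
# The in-container path /opt/nim/.cache is an assumption; adjust it if your NIM
# caches artifacts elsewhere.
docker run -it --rm \
  --gpus all \
  -u "$(id -u)" \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
```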