Configuring a NIM
NVIDIA NIM for LLMs uses Docker containers under the hood. Each NIM is its own Docker container, and there are several ways to configure it. Below is a full reference of all the ways to configure a NIM container.
Passing --gpus all to docker run is acceptable in homogeneous environments with one or more of the same GPU.
In heterogeneous environments with a combination of GPUs (for example, an A6000 plus a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:
- the --gpus flag (for example, --gpus='"device=1"')
- the environment variable NVIDIA_VISIBLE_DEVICES (for example, -e NVIDIA_VISIBLE_DEVICES=1)
The device ID(s) to use as input(s) are listed in the output of nvidia-smi -L:
GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
Refer to the NVIDIA Container Toolkit documentation for more instructions.
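As a minimal sketch of the GPU selection described above, the snippet below extracts device indices from nvidia-smi -L style output and prints a matching --gpus flag. The helper name compute_gpu_ids is hypothetical, and the assumption that display GPUs can be recognized by "GeForce" in their name is only an illustration; adjust the pattern to your hardware.

```shell
# Hypothetical helper (not part of NIM or Docker): read `nvidia-smi -L` style
# lines on stdin, drop GPUs whose name matches the pattern in $1, and print
# the remaining device indices as a comma-separated list.
compute_gpu_ids() {
  exclude_pattern="$1"
  grep -v -- "$exclude_pattern" \
    | sed -n 's/^GPU \([0-9][0-9]*\):.*/\1/p' \
    | paste -sd, -
}

# Sample input copied from the listing above. On a real host you would use:
#   ids=$(nvidia-smi -L | compute_gpu_ids GeForce)
sample='GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)'

ids=$(printf '%s\n' "$sample" | compute_gpu_ids GeForce)

# Print (rather than run) the resulting command fragment.
printf '%s\n' "docker run --gpus '\"device=$ids\"' ..."
```

The same index list can be passed as -e NVIDIA_VISIBLE_DEVICES=$ids instead of the --gpus flag.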
Passing --shm-size=16GB to docker run is required for multi-GPU setups that do not use NVLink. It’s not required on SXM systems or when using profiles that use only one GPU (for example, NIM_TENSOR_PARALLEL_SIZE=1).
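The rule above can be sketched as a small decision helper. The function name shm_flag and its two arguments are hypothetical; whether your GPUs are connected with NVLink is something you need to determine yourself (for example, by inspecting the output of nvidia-smi topo -m).

```shell
# Hypothetical helper: print --shm-size=16GB only when it is needed.
#   $1: number of GPUs used by the profile (e.g. the tensor parallel size)
#   $2: "yes" if the GPUs are connected with NVLink (e.g. SXM systems), else "no"
shm_flag() {
  gpu_count="$1"
  nvlink="$2"
  if [ "$gpu_count" -gt 1 ] && [ "$nvlink" != "yes" ]; then
    printf '%s' '--shm-size=16GB'
  fi
}

# Two GPUs without NVLink: the flag is emitted.
printf '%s\n' "docker run $(shm_flag 2 no) ..."
```

For a single-GPU profile or an NVLink-connected system, the helper prints nothing and the flag is left off.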
Below is a reference of the required and optional environment variables that can be passed into a NIM (with -e added to docker run):
ENV | Required? | Default | Notes |
---|---|---|---|
NGC_API_KEY | Yes | None | You must set this variable to the value of your personal NGC API key. |
NIM_CACHE_PATH | No | /opt/nim/.cache | Location (in container) where the container caches model artifacts. |
NIM_DISABLE_LOG_REQUESTS | No | 1 | Set to 0 to view request logs. By default, logs of request details to v1/completions and v1/chat/completions are disabled. These logs contain sensitive attributes of the request, including prompt, sampling_params, and prompt_token_ids. Users should be aware that these attributes will be exposed in the container logs when request logging is enabled. |
NIM_JSONL_LOGGING | No | 0 | Set to 1 to enable JSON-formatted logs. Readable text logs are enabled by default. |
NIM_LOG_LEVEL | No | DEFAULT | Log level of the NVIDIA NIM for LLMs service. Possible values are DEFAULT, TRACE, DEBUG, INFO, WARNING, ERROR, and CRITICAL. The effect of DEBUG, INFO, WARNING, ERROR, and CRITICAL is mostly as described in the Python 3 logging docs. The TRACE log level enables printing of diagnostic information for debugging purposes in TRT-LLM and in uvicorn. When NIM_LOG_LEVEL is DEFAULT, all log levels are set to INFO except the TRT-LLM log level, which is ERROR. When NIM_LOG_LEVEL is CRITICAL, the TRT-LLM log level is ERROR. |
NIM_SERVER_PORT | No | 8000 | Publish the NIM service to the prescribed port inside the container. Make sure to adjust the port passed to the -p/--publish flag of docker run to reflect that (for example, -p $NIM_SERVER_PORT:$NIM_SERVER_PORT). The left-hand side of the : is your host address:port and doesn’t have to match $NIM_SERVER_PORT. The right-hand side of the : is the port inside the container, which MUST match NIM_SERVER_PORT (or 8000 if not set). |
NIM_MODEL_PROFILE | No | None | Override the NIM optimization profile that’s automatically selected by specifying a profile ID from the manifest located at /etc/nim/config/model_manifest.yaml. If not specified, NIM will attempt to select an optimal profile compatible with the available GPUs. A list of the compatible profiles can be obtained by appending list-model-profiles at the end of the docker run command. Using the profile name default will select a profile that’s maximally compatible and may not be optimal for your hardware. |
NIM_MANIFEST_ALLOW_UNSAFE | No | 0 | If set to 1, enable selection of a model profile not included in the original model_manifest.yaml or a profile that’s not detected to be compatible with the deployed hardware. |
NIM_PEFT_SOURCE | No | | If you want to enable PEFT inference with local PEFT modules, set the NIM_PEFT_SOURCE environment variable and pass it into the docker run command. If your PEFT source is a local directory at LOCAL_PEFT_DIRECTORY, mount your local PEFT directory to the container’s PEFT source set by NIM_PEFT_SOURCE. Make sure that your directory only contains PEFT modules for the base NIM. Also make sure that the PEFT directory and all the contents inside it are readable by NIM. |
NIM_MAX_LORA_RANK | No | 32 | Set the max LoRA rank. |
NIM_MAX_GPU_LORAS | No | 8 | Set the number of LoRAs that can fit in the GPU PEFT cache. This is the max number of LoRAs that can be used in a single batch. |
NIM_MAX_CPU_LORAS | No | 16 | Set the number of LoRAs that can fit in the CPU PEFT cache. This should be set >= max concurrency or the number of LoRAs you are serving, whichever is less. If you have more concurrent LoRA requests than NIM_MAX_CPU_LORAS, you may see “cache is full” errors. This value must be >= NIM_MAX_GPU_LORAS. |
NIM_PEFT_REFRESH_INTERVAL | No | None | How often, in seconds, to check NIM_PEFT_SOURCE for new models. If not set, the PEFT cache won’t refresh. If you choose to enable PEFT refreshing by setting this variable, we recommend a value greater than 30. |
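To illustrate two of the constraints in the table, the sketch below builds the port-related flags so that the container side of -p always matches NIM_SERVER_PORT, and checks that NIM_MAX_CPU_LORAS is at least NIM_MAX_GPU_LORAS before printing a command. The helper names port_flags and lora_sizes_ok are hypothetical, and the host port 9000 is an arbitrary example value.

```shell
# Hypothetical helper: build the -e/-p flags for a custom server port.
#   $1: host port to expose
#   $2: NIM_SERVER_PORT inside the container (defaults to 8000, as in the table)
port_flags() {
  host_port="$1"
  nim_port="${2:-8000}"
  # The right-hand side of -p must equal NIM_SERVER_PORT.
  printf '%s' "-e NIM_SERVER_PORT=$nim_port -p $host_port:$nim_port"
}

# Hypothetical check: NIM_MAX_CPU_LORAS ($2) must be >= NIM_MAX_GPU_LORAS ($1).
# Defaults mirror the table (8 and 16).
lora_sizes_ok() {
  gpu_loras="${1:-8}"
  cpu_loras="${2:-16}"
  [ "$cpu_loras" -ge "$gpu_loras" ]
}

if lora_sizes_ok 8 16; then
  # "-e NGC_API_KEY" with no value forwards the variable from the host shell.
  printf '%s\n' "docker run -e NGC_API_KEY $(port_flags 9000 8080) -e NIM_MAX_GPU_LORAS=8 -e NIM_MAX_CPU_LORAS=16 ..."
else
  printf '%s\n' "NIM_MAX_CPU_LORAS must be >= NIM_MAX_GPU_LORAS" >&2
fi
```

Here the service would be reachable on host port 9000 while listening on port 8080 inside the container.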
Below are the paths inside the container into which local paths can be mounted.
Container path | Required? | Notes | Docker argument example |
---|---|---|---|
/opt/nim/.cache (or NIM_CACHE_PATH if present) | Not required, but if this volume isn’t mounted, the container will do a fresh download of the model each time it’s brought up. | This is the directory within which models are downloaded inside the container. It’s very important that this directory can be accessed from inside the container. This can be achieved by adding the option -u $(id -u) to the docker run command. For example, to use ~/.cache/nim as the host machine directory for caching models, first run mkdir -p ~/.cache/nim before running the docker run ... command. | -v ~/.cache/nim:/opt/nim/.cache -u $(id -u) |
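The cache mount can be sketched as follows. The helper name cache_flags is hypothetical; it creates the host directory first and then prints the -v and -u flags from the table, defaulting to ~/.cache/nim on the host and /opt/nim/.cache in the container.

```shell
# Hypothetical helper: create the host cache directory, then print the
# volume and user flags for docker run.
#   $1: host directory (defaults to ~/.cache/nim)
#   $2: container cache path (defaults to /opt/nim/.cache)
cache_flags() {
  host_cache="${1:-$HOME/.cache/nim}"
  container_cache="${2:-/opt/nim/.cache}"
  # The directory must exist before the container starts.
  mkdir -p "$host_cache" || return 1
  # -u $(id -u) runs the container as the current user so it can use the directory.
  printf '%s' "-v $host_cache:$container_cache -u $(id -u)"
}

# Print (rather than run) the resulting command fragment.
printf '%s\n' "docker run $(cache_flags) ..."
```

If NIM_CACHE_PATH is set to a different location, pass that path as the second argument so the mount target matches it.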