Configuring a NIM#
NVIDIA NIM for VLMs uses Docker containers under the hood. Each NIM has its own Docker container, and there are several ways to configure it. Below is a complete reference for configuring a NIM container.
GPU Selection#
Passing `--gpus all` to `docker run` is acceptable in homogeneous environments with one or more of the same GPU.

In heterogeneous environments with a combination of GPUs (for example, an A6000 plus a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:

- the `--gpus` flag (ex: `--gpus='"device=1"'`)
- the environment variable `CUDA_VISIBLE_DEVICES` (ex: `-e CUDA_VISIBLE_DEVICES=1`)

The device ID(s) to use as input(s) are listed in the output of `nvidia-smi -L`:
GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
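For example, a minimal `docker run` sketch that pins the workload to GPU 1 from the listing above (the image name, tag, and port are placeholders, not values specific to your deployment):

```bash
# Select GPU 1 only; equivalently, use --gpus all together with -e CUDA_VISIBLE_DEVICES=1
docker run -it --rm \
    --gpus='"device=1"' \
    -e NGC_API_KEY=$NGC_API_KEY \
    -p 8000:8000 \
    nvcr.io/nim/<model-image>:<tag>   # placeholder image name and tag
```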
Refer to the NVIDIA Container Toolkit documentation for more instructions.
Environment Variables#
Below is a reference for environment variables that can be passed into a NIM (by adding `-e` to `docker run`). An example invocation combining several of these variables follows the table.
| ENV | Required? | Default | Notes |
|---|---|---|---|
| `NGC_API_KEY` | Yes | None | You must set this variable to the value of your personal NGC API key. |
| `NIM_CACHE_PATH` | No | `/opt/nim/.cache` | Location (in container) where the container caches model artifacts. |
| `NIM_DISABLE_LOG_REQUESTS` | No | `1` | Set to `0` to view request logs. By default, logging of request details to `v1/chat/completions` is disabled. These logs contain sensitive attributes of the request, including `prompt`, `sampling_params`, and `prompt_token_ids`. Be aware that these attributes are exposed in the container logs when request logging is enabled. |
| `NIM_JSONL_LOGGING` | No | `0` | Set to `1` to enable JSON-formatted logs. Human-readable text logs are enabled by default. |
| `NIM_LOG_LEVEL` | No | `DEFAULT` | Log level of the NVIDIA NIM for VLMs service. Possible values are `DEFAULT`, `TRACE`, `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. The behavior of `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL` mostly follows the Python 3 logging docs. The `TRACE` log level additionally prints diagnostic information for debugging purposes in TRT-LLM and in uvicorn. When `NIM_LOG_LEVEL` is `DEFAULT`, all log levels are set to `INFO`, except for the TRT-LLM log level, which is set to `ERROR`. When `NIM_LOG_LEVEL` is `CRITICAL`, the TRT-LLM log level is `ERROR`. |
| `NIM_SERVER_PORT` | No | `8000` | Publish the NIM service on the specified port inside the container. Make sure to adjust the port passed to the `-p`/`--publish` flag of `docker run` accordingly (ex: `-p $NIM_SERVER_PORT:$NIM_SERVER_PORT`). The left-hand side of this `:` is your host address:port and does NOT have to match `$NIM_SERVER_PORT`. The right-hand side of the `:` is the port inside the container, which MUST match `NIM_SERVER_PORT` (or `8000` if not set). |
| `NIM_MODEL_PROFILE` | No | None | Override the automatically selected NIM optimization profile by specifying a profile ID from the manifest located at `/etc/nim/config/model_manifest.yaml`. If not specified, NIM attempts to select an optimal profile compatible with the available GPUs. A list of the compatible profiles can be obtained by appending `list-model-profiles` to the end of the `docker run` command. Using the profile name `default` selects a profile that is maximally compatible but may not be optimal for your hardware. |
| `NIM_MANIFEST_ALLOW_UNSAFE` | No | `0` | If set to `1`, enables selection of a model profile that is not included in the original `model_manifest.yaml` or that is not detected to be compatible with the deployed hardware. |
| `NIM_SERVED_MODEL_NAME` | No | None | The model name(s) used in the API. If multiple names are provided (comma-separated), the server responds to any of them. The model name in the `model` field of a response is the first name in this list. If not specified, the model name is inferred from the manifest located at `/etc/nim/config/model_manifest.yaml`. |
| `NIM_ENABLE_OTEL` | No | `0` | Set this flag to `1` to enable OpenTelemetry instrumentation in NIMs. |
| `OTEL_TRACES_EXPORTER` | No | `console` | Specifies the OpenTelemetry exporter to use for tracing. |
| `OTEL_METRICS_EXPORTER` | No | `console` | Similar to `OTEL_TRACES_EXPORTER`, but for metrics. |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | No | None | The endpoint where the OpenTelemetry Collector is listening for OTLP data. Adjust the URL to match your OpenTelemetry Collector’s configuration. |
| `OTEL_SERVICE_NAME` | No | None | Sets the name of your service to help with identifying and categorizing data. |
| `NIM_TOKENIZER_MODE` | No | `auto` | The tokenizer mode. |
| `NIM_ENABLE_KV_CACHE_REUSE` | No | `0` | Set to `1` to enable automatic prefix caching / KV cache reuse. Useful for use cases where large prompts frequently recur, so that reusing the KV cache across requests speeds up inference. |
| `NIM_MAX_NUM_SEQS` | No | None | Maximum number of sequences that can be processed in parallel. Can be set to a lower value to limit memory usage at the cost of performance. If unspecified, it is automatically derived from the model configuration (TRT-LLM) or the vLLM default. |
| `NIM_MAX_MODEL_LEN` | No | None | Model context length. If unspecified, it is automatically derived from the model configuration. Note that this setting only affects models running on the vLLM backend. |
| `NIM_KVCACHE_PERCENT` | No | `0.9` | Fraction of the free memory allocated to the KV cache. Can be set to a lower value to limit memory usage, at the cost of performance or of losing long-context support. |
| `NIM_ENCODER_BATCHING_MS` | No | `10` | Time (in milliseconds) the vision encoder waits for additional requests to arrive, increasing the opportunity to batch vision processing at the cost of latency. |
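For example, a sketch of a `docker run` invocation combining several of the variables above (the image name and tag are placeholders, and the chosen values are illustrative rather than recommendations):

```bash
docker run -it --rm \
    --gpus all \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NIM_SERVER_PORT=8000 \
    -e NIM_LOG_LEVEL=INFO \
    -e NIM_ENABLE_KV_CACHE_REUSE=1 \
    -p 8000:8000 \
    nvcr.io/nim/<model-image>:<tag>   # placeholder image name and tag
```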
Volumes#
Below are the paths inside the container to which local paths can be mounted.

| Container path | Required? | Notes | Docker argument example |
|---|---|---|---|
| `/opt/nim/.cache` | Not required, but if this volume is not mounted, the container does a fresh download of the model each time it is brought up. | This is the directory within which models are downloaded inside the container. It is very important that this directory can be accessed from inside the container. | See the example below the table. |
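For example, a sketch of caching models on the host so that subsequent starts skip the download (the host path `~/.cache/nim` is an assumption; any writable host directory works):

```bash
# Create a host directory for the cache and make sure the container user can write to it.
mkdir -p ~/.cache/nim
chmod a+rwX ~/.cache/nim   # or run the container with a matching UID, e.g. -u $(id -u)

docker run -it --rm \
    --gpus all \
    -e NGC_API_KEY=$NGC_API_KEY \
    -v ~/.cache/nim:/opt/nim/.cache \
    -p 8000:8000 \
    nvcr.io/nim/<model-image>:<tag>   # placeholder image name and tag
```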
Advanced Performance Configuration#
NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN#
This parameter can maximize system throughput by balancing memory allocation between text and image KV caches, ensuring maximum concurrent sequences without bottlenecks in either self- or cross-attention KV cache.
Memory Allocation Overview
When NIM starts with the TRT-LLM backend, it:

- Reserves memory for model weights, activations, and runtime buffers
- Allocates the remaining memory to the KV cache
- Splits the KV cache into two pools for encoder-decoder models like Llama 3.2 Vision:
  - Text token pool (self-attention)
  - Image token pool (cross-attention)
The split ratio is determined at server startup. Requests exceeding either cache’s capacity are queued.
Example: Impact of Memory Split Ratio
Assuming that:

- Each image corresponds to 1k image tokens
- The free memory allows for 10k tokens to fit into the KV cache
- The expected text sequence length is 1.5k (ISL+OSL)
- All requests contain an image

If the memory split is chosen so that 20% of it is allocated to the image KV cache, there will be:

- up to 8k tokens in the text KV cache
- up to 2k tokens in the image KV cache

With that configuration, the server is bottlenecked by the image KV cache, and only 2 requests (2k / 1k) can ever run concurrently. In that scenario, over half of the text KV cache stays unused (5k of 8k).

If a better split is chosen, say 40%, there will be:

- up to 6k tokens in the text KV cache
- up to 4k tokens in the image KV cache

This scenario is perfectly balanced: 4 requests fit and fully occupy both KV caches (4 * 1k = 4k, 4 * 1.5k = 6k). Overall system throughput improves because the maximum concurrency is now 4 instead of 2.
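The arithmetic above can be reproduced with a small sketch (the numbers are the example's assumptions, not measured values):

```bash
# Reproduce the example: 10k total KV-cache tokens, 1k image tokens and
# 1.5k text tokens (ISL+OSL) per request, every request carrying one image.
for ratio in 0.20 0.40; do
  awk -v ratio="$ratio" 'BEGIN {
    total = 10000; text_per_req = 1500; image_per_req = 1000
    image_pool = total * ratio
    text_pool  = total - image_pool
    # Concurrency is limited by whichever pool fills up first.
    by_text  = int(text_pool / text_per_req)
    by_image = int(image_pool / image_per_req)
    max_conc = (by_text < by_image) ? by_text : by_image
    printf "image split %.0f%% -> max %d concurrent requests\n", ratio * 100, max_conc
  }'
done
# Prints: image split 20% -> max 2 concurrent requests
#         image split 40% -> max 4 concurrent requests
```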
Practical Configuration
The optimal split value depends on several factors like the relative number of cross-attention layers, their head size, variable-sized images, etc.
To simplify configuration, NIM for VLMs provides the `NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN` parameter.
For example, if a typical request has:

- ISL=2000
- OSL=500
- a maximum-size image

then `NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN=2500` should be set.

If only 75% of the requests are expected to contain an image, the value should be adjusted: `NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN=2500/0.75`.

If the images are expected to be only half the maximum size (which, in some models, corresponds to using half as many image tokens), then: `NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN=2500/(0.75*0.5)`.

General formula

`NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN = (EXPECTED_ISL + EXPECTED_OSL) / (EXPECTED_RATIO_OF_IMAGE_REQUESTS * EXPECTED_IMAGE_SIZE_RATIO)`
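As a quick check, here is the general formula applied to the numbers from this section (an illustrative sketch; the expected ISL/OSL, image-request ratio, and image-size ratio are the example's assumptions):

```bash
# Example values taken from the scenario above.
EXPECTED_ISL=2000
EXPECTED_OSL=500
EXPECTED_RATIO_OF_IMAGE_REQUESTS=0.75
EXPECTED_IMAGE_SIZE_RATIO=0.5

# (2000 + 500) / (0.75 * 0.5) = 2500 / 0.375 ≈ 6667
NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN=$(awk \
    -v isl="$EXPECTED_ISL" -v osl="$EXPECTED_OSL" \
    -v img="$EXPECTED_RATIO_OF_IMAGE_REQUESTS" -v size="$EXPECTED_IMAGE_SIZE_RATIO" \
    'BEGIN { printf "%.0f", (isl + osl) / (img * size) }')

echo "$NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN"   # 6667
# Pass the resulting value to the container, e.g.:
#   docker run ... -e NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN=$NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN ...
```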
Warning
Setting `NIM_ENCODER_KV_RATIO_TARGET_SEQ_LEN` too low may prevent the text KV cache from being able to handle the maximum supported sequence lengths.