Fine-Tuning with LoRA#

LoRA (Low-Rank Adaptation) lets you serve a base model plus one or more fine-tuned adapters without retraining or rebuilding the full model.

LoRA Serving Modes#

NIM LLM supports two LoRA serving modes:

  • Static LoRA: Adapters are discovered from a directory at startup.

  • Dynamic LoRA: Adapters can be loaded and unloaded while the server is running.

flowchart TD
    A[NIM container starts] --> B{LoRA mode}
    B -->|Static| C[Read adapters in NIM_PEFT_SOURCE]
    C --> D[Load valid adapters at startup]
    D --> E["/v1/models" includes base + adapters]
    B -->|Dynamic| F[Enable runtime adapter updates]
    F --> G[Watcher or API load and unload during runtime]
    G --> H["/v1/models" updates as adapters change]

Initial LoRA Setup#

Configure Adapter Discovery#

Complete the following setup for both static LoRA and dynamic LoRA:

  1. Mount a directory containing LoRA adapters into the container.

  2. Set NIM_PEFT_SOURCE to that directory.

  3. Optionally, pass native vLLM LoRA flags.

Expected adapter layout under NIM_PEFT_SOURCE:

/opt/nim/loras/
├── adapter_a/
│   ├── adapter_config.json
│   └── adapter_model.safetensors   # or adapter_model.bin
└── adapter_b/
    ├── adapter_config.json
    └── adapter_model.bin

Only valid, readable adapter directories are loaded.
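The validity check above can be sketched locally. The following is a minimal stand-in, not NIM's actual discovery logic: it treats an adapter directory as loadable only if it contains adapter_config.json plus a weights file, mirroring the layout shown above. The directory and adapter names are illustrative.

```shell
# Stand-in for NIM_PEFT_SOURCE; adapter names are illustrative.
PEFT_SOURCE=$(mktemp -d)

# One valid adapter and one missing its weights file.
mkdir -p "$PEFT_SOURCE/adapter_a" "$PEFT_SOURCE/adapter_broken"
touch "$PEFT_SOURCE/adapter_a/adapter_config.json"
touch "$PEFT_SOURCE/adapter_a/adapter_model.safetensors"
touch "$PEFT_SOURCE/adapter_broken/adapter_config.json"

# Keep only directories with a config plus .safetensors or .bin weights.
valid=""
for dir in "$PEFT_SOURCE"/*/; do
  name=$(basename "$dir")
  if [ -f "$dir/adapter_config.json" ] && \
     { [ -f "$dir/adapter_model.safetensors" ] || [ -f "$dir/adapter_model.bin" ]; }; then
    valid="$valid $name"
  fi
done
echo "loadable adapters:$valid"
```

Only adapter_a passes the check; adapter_broken is skipped because it has no weights file.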

NIM passes the following LoRA-related flags to vLLM:

  • --enable-lora

  • --max-loras

  • --max-cpu-loras

  • --max-lora-rank
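For reference, these are the same knobs vLLM exposes on its own server CLI. A direct vLLM invocation with equivalent settings might look like the following; the model name and limit values are illustrative, not defaults NIM uses.

```shell
# Illustrative only: the equivalent LoRA knobs on a plain vLLM server.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-cpu-loras 8 \
  --max-lora-rank 64
```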

Select a LoRA-Capable Profile#

When available for your model, select a -feat_lora profile so the deployment uses LoRA-compatible runtime settings.

Example:

export NIM_MODEL_PROFILE=vllm-fp16-tp1-pp1-feat_lora

If you are not sure which profiles are available in your deployment, query the model and profile metadata from your environment and choose a LoRA-capable profile.
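A quick way to pick out LoRA-capable entries from a captured profile list is to filter on the -feat_lora suffix. The profile IDs below are hypothetical placeholders; real IDs come from your deployment's profile metadata.

```shell
# Hypothetical profile IDs captured from your deployment's metadata.
profiles='vllm-fp16-tp1-pp1
vllm-fp16-tp1-pp1-feat_lora
vllm-fp16-tp2-pp1'

# Keep only LoRA-capable profiles (those with the -feat_lora suffix).
lora_profile=$(printf '%s\n' "$profiles" | grep 'feat_lora')
echo "$lora_profile"
```

Export the resulting ID as NIM_MODEL_PROFILE before starting the container.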

Static LoRA#

After you complete the required LoRA setup, NIM LLM uses static LoRA when NIM_PEFT_REFRESH_INTERVAL is not set. In static LoRA mode, the NIM container discovers adapters in NIM_PEFT_SOURCE during startup and loads the valid adapters. If you add, remove, or update adapters after startup, restart the NIM container to apply those changes.

Dynamic LoRA#

Configure Dynamic LoRA Updates#

After you complete the shared LoRA setup, set NIM_PEFT_REFRESH_INTERVAL to enable dynamic LoRA through directory monitoring.

  • NIM_PEFT_REFRESH_INTERVAL: how often, in seconds, the watcher polls NIM_PEFT_SOURCE for changes.

When NIM_PEFT_SOURCE and NIM_PEFT_REFRESH_INTERVAL are set, NIM starts the LoRA watcher and enables runtime LoRA updates for vLLM.

flowchart LR
    A[NIM_PEFT_SOURCE set] --> B[NIM_PEFT_REFRESH_INTERVAL set]
    B --> C[Watcher polls adapter directory]
    C --> D{Detected change}
    D -->|New adapter| E[Load adapter]
    D -->|Removed adapter| F[Unload adapter]
    E --> G["/v1/models" reflects loaded adapter]
    F --> H["/v1/models" no longer lists adapter]

You can also use vLLM runtime endpoints for manual control:

  • POST /v1/load_lora_adapter

  • POST /v1/unload_lora_adapter
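The requests to these endpoints can be sketched as follows. The JSON field names (lora_name, lora_path) follow vLLM's dynamic LoRA API; the adapter name and path here are illustrative. The curl calls are commented out because they require a running deployment; the final lines only sanity-check the payloads.

```shell
# Illustrative payloads for vLLM's runtime LoRA endpoints.
load_payload='{"lora_name": "adapter_c", "lora_path": "/opt/nim/loras/adapter_c"}'
unload_payload='{"lora_name": "adapter_c"}'

# Uncomment against a running deployment:
# curl -X POST localhost:8000/v1/load_lora_adapter \
#   -H "Content-Type: application/json" -d "$load_payload"
# curl -X POST localhost:8000/v1/unload_lora_adapter \
#   -H "Content-Type: application/json" -d "$unload_payload"

# Sanity-check that both payloads are valid JSON before sending.
echo "$load_payload" | python3 -c 'import json,sys; json.load(sys.stdin)' \
  && echo "$unload_payload" | python3 -c 'import json,sys; json.load(sys.stdin)' \
  && ok=yes
echo "payloads valid: $ok"
```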

Load and Unload Adapters at Runtime#

Directory-based flow:

  • Load: copy a new adapter folder into NIM_PEFT_SOURCE.

  • Unload: remove an adapter folder from NIM_PEFT_SOURCE.

  • Wait one refresh interval for /v1/models to reflect changes.
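The directory-based steps above can be sketched with a temporary directory standing in for the mounted NIM_PEFT_SOURCE; the adapter name is illustrative, and the server-side commands are commented out because they require a running deployment.

```shell
# Stand-in for the mounted NIM_PEFT_SOURCE directory.
PEFT_SOURCE=$(mktemp -d)

# Load: copy a new adapter folder into the watched directory.
mkdir -p ./new_adapter && touch ./new_adapter/adapter_config.json
cp -r ./new_adapter "$PEFT_SOURCE/"

# Wait one refresh interval, then check the served model list:
# sleep "${NIM_PEFT_REFRESH_INTERVAL:-10}"
# curl -s localhost:8000/v1/models

# Unload: remove the folder; the watcher unloads it on the next scan.
rm -r "$PEFT_SOURCE/new_adapter" ./new_adapter
remaining=$(ls "$PEFT_SOURCE" | wc -l | tr -d ' ')
echo "adapters remaining: $remaining"
```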

Manual API flow:

  • Load with POST /v1/load_lora_adapter.

  • Unload with POST /v1/unload_lora_adapter.

If you use the watcher and the manual API together, note that an adapter unloaded through the API but still present in NIM_PEFT_SOURCE is reloaded by the watcher on its next scan.

Serve Multiple Adapters#

You can serve multiple adapters at the same time, subject to GPU memory limits.

Use /v1/models to discover available adapter IDs, then send the adapter ID in the request model field.
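The discovery step can be sketched as follows. The JSON body here stands in for a live /v1/models response (a real deployment lists the base model plus each loaded adapter), and the model IDs are illustrative.

```shell
# Stand-in for a /v1/models response; IDs are illustrative.
models='{"object":"list","data":[{"id":"base-model"},{"id":"adapter_a"},{"id":"adapter_b"}]}'

# Extract the model IDs (base model plus loaded adapters).
ids=$(printf '%s' "$models" | python3 -c 'import json,sys; print(" ".join(m["id"] for m in json.load(sys.stdin)["data"]))')
echo "available model IDs: $ids"

# Each request then selects one adapter (or the base model) via the model field:
# curl -X POST localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
#   -d '{"model": "adapter_a", "messages": [{"role": "user", "content": "Hi"}]}'
```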

Code Examples#

Serve a Fine-Tuned Llama Model with LoRA#

# Common setup
export NIM_CACHE_PATH=$PWD/.cache
mkdir -p "$NIM_CACHE_PATH"
export NGC_API_KEY=<your_ngc_api_key>
export CUDA_VISIBLE_DEVICES=0
export NIM_MODEL_PROFILE=<lora-capable-profile>

# Prepare adapters
mkdir -p "$PWD/loras"

Static LoRA startup:

docker run -it --rm --gpus all \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  -v "$PWD/loras:/opt/nim/loras" \
  -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE \
  -e CUDA_VISIBLE_DEVICES \
  -e NIM_PEFT_SOURCE=/opt/nim/loras \
  <nim-llm-image>

Dynamic LoRA startup (watcher enabled):

docker run -it --rm --gpus all \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  -v "$PWD/loras:/opt/nim/loras" \
  -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE \
  -e CUDA_VISIBLE_DEVICES \
  -e NIM_PEFT_SOURCE=/opt/nim/loras \
  -e NIM_PEFT_REFRESH_INTERVAL=10 \
  <nim-llm-image>

Verify loaded models:

curl -s localhost:8000/v1/models | jq

Use an adapter for inference:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_lora_adapter",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }' | jq