Fine-Tuning with LoRA
LoRA (Low-Rank Adaptation) lets you serve a base model plus one or more fine-tuned adapters without retraining or rebuilding the full model.
LoRA Serving Modes
NIM LLM supports two LoRA serving modes:
Static LoRA: Adapters are discovered from a directory at startup.
Dynamic LoRA: Adapters can be loaded and unloaded while the server is running.
Initial LoRA Setup
Configure Adapter Discovery
Complete the following setup for both static LoRA and dynamic LoRA:
Mount a directory (containing LoRA adapters) to the container.
Set NIM_PEFT_SOURCE to that directory.
Optionally, pass native vLLM LoRA flags.
Expected adapter layout under NIM_PEFT_SOURCE:
/opt/nim/loras/
├── adapter_a/
│   ├── adapter_config.json
│   └── adapter_model.safetensors   # or adapter_model.bin
└── adapter_b/
    ├── adapter_config.json
    └── adapter_model.bin
Only valid, readable adapter directories are loaded.
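The layout requirements above can be checked before you mount the directory. The following is a minimal sketch, not part of NIM: a hypothetical `check_loras` helper that reports which adapter folders contain both an adapter_config.json and a weights file.

```shell
# Hypothetical pre-flight check: report which adapter folders under a
# directory look loadable (config file plus weights file present).
check_loras() {
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue
    name=$(basename "$dir")
    if [ ! -f "${dir}adapter_config.json" ]; then
      echo "skip $name: missing adapter_config.json"
    elif [ -f "${dir}adapter_model.safetensors" ] || [ -f "${dir}adapter_model.bin" ]; then
      echo "ok $name"
    else
      echo "skip $name: missing adapter weights"
    fi
  done
}

check_loras "$PWD/loras"
```

Running this against the directory you plan to pass as NIM_PEFT_SOURCE helps catch incomplete adapters before startup.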
NIM passes the following LoRA-related flags to vLLM:
--enable-lora
--max-loras
--max-cpu-loras
--max-lora-rank
Select a LoRA-Capable Profile
When available for your model, select a -feat_lora profile so the deployment uses LoRA-compatible runtime settings.
Example:
export NIM_MODEL_PROFILE=vllm-fp16-tp1-pp1-feat_lora
If you are not sure which profiles are available in your deployment, query the model and profile metadata from your environment and choose a LoRA-capable profile.
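One way to inspect the available profiles is the container's profile-listing utility; the `list-model-profiles` command name is an assumption here, so confirm it against your NIM version's documentation.

```shell
# List the profiles bundled with the image and keep only LoRA-capable ones.
# The list-model-profiles command name is assumed; check your NIM version.
docker run --rm --gpus all \
  -e NGC_API_KEY \
  <nim-llm-image> \
  list-model-profiles | grep feat_lora
```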
Static LoRA
After you complete the required LoRA setup, NIM LLM uses static LoRA when NIM_PEFT_REFRESH_INTERVAL is not set. In static LoRA mode, the NIM container discovers adapters in NIM_PEFT_SOURCE during startup and loads the valid adapters. If you add, remove, or update adapters after startup, restart the NIM container to apply those changes.
Dynamic LoRA
Configure Dynamic LoRA Updates
After you complete the shared LoRA setup, set NIM_PEFT_REFRESH_INTERVAL to enable dynamic LoRA through directory monitoring.
NIM_PEFT_REFRESH_INTERVAL: polling interval in seconds.
When NIM_PEFT_SOURCE and NIM_PEFT_REFRESH_INTERVAL are set, NIM starts the LoRA watcher and enables runtime LoRA updates for vLLM.
You can also use vLLM runtime endpoints for manual control:
POST /v1/load_lora_adapter
POST /v1/unload_lora_adapter
Load and Unload Adapters at Runtime
Directory-based flow:
Load: copy a new adapter folder into NIM_PEFT_SOURCE.
Unload: remove an adapter folder from NIM_PEFT_SOURCE.
Wait one refresh interval for /v1/models to reflect changes.
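The directory-based flow can be sketched end to end; the adapter name `adapter_c`, the staging path, and the 10-second interval below are illustrative.

```shell
# Load: stage the adapter elsewhere, then move a complete folder into the
# watched directory so the watcher never sees a half-copied adapter.
cp -r /tmp/staging/adapter_c "$PWD/loras/adapter_c"

# Wait at least one NIM_PEFT_REFRESH_INTERVAL (10 s assumed here).
sleep 10

# Confirm the adapter now appears among the served models.
curl -s localhost:8000/v1/models | jq -r '.data[].id'

# Unload: remove the folder and wait another interval for it to disappear.
rm -rf "$PWD/loras/adapter_c"
```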
Manual API flow:
Load with POST /v1/load_lora_adapter.
Unload with POST /v1/unload_lora_adapter.
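A sketch of the manual flow with curl; the `lora_name`/`lora_path` field names follow vLLM's runtime LoRA endpoints and may differ in your version, and `adapter_c` is an illustrative name.

```shell
# Load an adapter by name and path (the path as seen inside the container).
curl -s -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "adapter_c", "lora_path": "/opt/nim/loras/adapter_c"}'

# Unload it again by name.
curl -s -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "adapter_c"}'
```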
If you use the watcher and the manual API together, note that an adapter removed through the API but still present in the directory can be reloaded by the watcher on its next scan.
Serve Multiple Adapters
You can serve multiple adapters at the same time, subject to GPU memory limits.
Use /v1/models to discover available adapter IDs, then send the adapter ID in the request model field.
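For example, the adapter IDs can be pulled out of a /v1/models response with jq. The response below is a hypothetical sample; in a live deployment, pipe `curl -s localhost:8000/v1/models` instead.

```shell
# Hypothetical /v1/models response: the base model plus two adapters.
models_json='{"object":"list","data":[{"id":"base-model"},{"id":"adapter_a"},{"id":"adapter_b"}]}'

# Print every served model ID; any adapter ID can go in the request "model" field.
echo "$models_json" | jq -r '.data[].id'
```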
Code Examples
Serve a Fine-Tuned Llama Model with LoRA
# Common setup
export NIM_CACHE_PATH=$PWD/.cache
mkdir -p "$NIM_CACHE_PATH"
export NGC_API_KEY=<your_ngc_api_key>
export CUDA_VISIBLE_DEVICES=0
export NIM_MODEL_PROFILE=<lora-capable-profile>
# Prepare adapters
mkdir -p "$PWD/loras"
Static LoRA startup:
docker run -it --rm --gpus all \
-v "$NIM_CACHE_PATH:/opt/nim/.cache" \
-v "$PWD/loras:/opt/nim/loras" \
-p 8000:8000 \
-e NGC_API_KEY \
-e NIM_MODEL_PROFILE \
-e CUDA_VISIBLE_DEVICES \
-e NIM_PEFT_SOURCE=/opt/nim/loras \
<nim-llm-image>
Dynamic LoRA startup (watcher enabled):
docker run -it --rm --gpus all \
-v "$NIM_CACHE_PATH:/opt/nim/.cache" \
-v "$PWD/loras:/opt/nim/loras" \
-p 8000:8000 \
-e NGC_API_KEY \
-e NIM_MODEL_PROFILE \
-e CUDA_VISIBLE_DEVICES \
-e NIM_PEFT_SOURCE=/opt/nim/loras \
-e NIM_PEFT_REFRESH_INTERVAL=10 \
<nim-llm-image>
Verify loaded models:
curl -s localhost:8000/v1/models | jq
Use an adapter for inference:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my_lora_adapter",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 64
}' | jq