Fine-Tuning with LoRA#
LoRA (Low-Rank Adaptation) lets you serve a base model plus one or more fine-tuned adapters without retraining or rebuilding the full model.
NIM LLM supports two LoRA serving modes:
Static LoRA: Adapters are discovered from a directory at startup.
Dynamic LoRA: Adapters can be loaded and unloaded while the server is running.
Initial LoRA Setup#
Before you configure static LoRA or dynamic LoRA, complete the shared setup in this section. These settings tell NIM LLM where to find adapters and ensure that the deployment uses a LoRA-capable profile.
Configure Adapter Discovery#
Complete the following setup for both static LoRA and dynamic LoRA:
Mount a directory (containing LoRA adapters) to the container.
Set NIM_PEFT_SOURCE to that directory.
Optionally, pass native vLLM LoRA flags.
Expected adapter layout under NIM_PEFT_SOURCE:
/opt/nim/loras/
├── adapter_a/
│   ├── adapter_config.json
│   └── adapter_model.safetensors  # or adapter_model.bin
└── adapter_b/
    ├── adapter_config.json
    └── adapter_model.bin
Only valid, readable adapter directories are loaded.
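The layout above can be checked locally before you mount the directory. The following is a minimal sketch; the ./loras path and the adapter_a name are examples:

```shell
# Check each candidate adapter directory for the required files.
# The ./loras path and adapter_a name are examples.
LORA_DIR=./loras
mkdir -p "$LORA_DIR/adapter_a"
printf '{}' > "$LORA_DIR/adapter_a/adapter_config.json"
: > "$LORA_DIR/adapter_a/adapter_model.safetensors"

for dir in "$LORA_DIR"/*/; do
  name=$(basename "$dir")
  if [ -r "$dir/adapter_config.json" ] && \
     { [ -e "$dir/adapter_model.safetensors" ] || [ -e "$dir/adapter_model.bin" ]; }; then
    echo "valid adapter: $name"
  else
    echo "skipped (incomplete): $name"
  fi
done | tee lora_check.txt
```

A directory that fails this check is skipped at startup rather than causing the deployment to fail.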
NIM passes the following LoRA-related flags to vLLM:
--enable-lora
--max-loras
--max-cpu-loras
--max-lora-rank
Select a LoRA-Capable Profile#
When available for your model, select a -feat_lora profile so the deployment uses LoRA-compatible runtime settings.
Example:
export NIM_MODEL_PROFILE=vllm-fp16-tp1-pp1-feat_lora
If you are not sure which profiles are available in your deployment, query the model and profile metadata from your environment, and then choose a LoRA-capable profile.
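A quick way to narrow a profile listing to LoRA-capable entries is to filter on the -feat_lora suffix. The profile names below are illustrative examples; a real list comes from your deployment's profile metadata:

```shell
# Example profile names; a real list comes from your deployment's profile metadata.
printf '%s\n' \
  "vllm-fp16-tp1-pp1" \
  "vllm-fp16-tp2-pp1" \
  "vllm-fp16-tp1-pp1-feat_lora" \
  > profiles.txt

# Keep only the LoRA-capable profiles.
grep -- '-feat_lora$' profiles.txt
```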
Static LoRA#
After you complete the required LoRA setup, NIM LLM uses static LoRA when NIM_PEFT_REFRESH_INTERVAL is not set. In static LoRA mode, the NIM container discovers adapters in NIM_PEFT_SOURCE during startup and loads valid adapters. If you add, remove, or update adapters after startup, restart the NIM container to apply those changes.
Dynamic LoRA#
Use dynamic LoRA when you need to add or remove adapters without restarting the deployment. You can manage adapters through directory monitoring, runtime API calls, or both.
Configure Dynamic LoRA Updates#
After you complete the shared LoRA setup, set NIM_PEFT_REFRESH_INTERVAL (polling interval in seconds) to enable dynamic LoRA through directory monitoring.
When NIM_PEFT_SOURCE and NIM_PEFT_REFRESH_INTERVAL are set, NIM starts the LoRA watcher and enables runtime LoRA updates for vLLM.
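Conceptually, the watcher behaves like a polling loop that diffs the adapter directory on each interval. The following is an illustration of that mechanism, not NIM's actual implementation; the paths and interval are examples:

```shell
# Illustrative polling sketch; not NIM's implementation. Paths are examples.
PEFT_SOURCE=./loras_demo
mkdir -p "$PEFT_SOURCE/adapter_a"

scan() { ls -1 "$PEFT_SOURCE" | sort; }

scan > before.txt
mkdir -p "$PEFT_SOURCE/adapter_b"   # adapter dropped in while "running"
sleep 1                             # stands in for NIM_PEFT_REFRESH_INTERVAL
scan > after.txt

# Directories present only in the new scan would be loaded on this tick.
comm -13 before.txt after.txt > newly_added.txt
cat newly_added.txt
```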
You can also use vLLM runtime endpoints for manual control:
POST /v1/load_lora_adapter
POST /v1/unload_lora_adapter
Load and Unload Adapters at Runtime#
To manage adapters through the directory watcher, use the following actions:
Load: copy a new adapter folder into NIM_PEFT_SOURCE.
Unload: remove an adapter folder from NIM_PEFT_SOURCE.
Wait one refresh interval for /v1/models to reflect changes.
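With a local directory standing in for the mounted NIM_PEFT_SOURCE, the watcher-based workflow reduces to ordinary file operations. All paths and names here are examples:

```shell
# Stand-in for the mounted NIM_PEFT_SOURCE directory; all paths are examples.
PEFT_SOURCE=./peft_demo
mkdir -p "$PEFT_SOURCE"

# "Load": copy a complete adapter folder into the watched directory.
mkdir -p ./staging/my_adapter
printf '{}' > ./staging/my_adapter/adapter_config.json
: > ./staging/my_adapter/adapter_model.bin
cp -r ./staging/my_adapter "$PEFT_SOURCE/"
ls "$PEFT_SOURCE"   # my_adapter appears in /v1/models after the next scan

# "Unload": remove the folder; it drops from /v1/models after the next scan.
rm -rf "$PEFT_SOURCE/my_adapter"
```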
To manage adapters through the manual API, use the following actions:
Load with POST /v1/load_lora_adapter.
Unload with POST /v1/unload_lora_adapter.
If you use the watcher and the manual API together, an adapter that you unload through the API but leave in the directory can be reloaded by the watcher during its next scan.
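For the manual API, the vLLM load endpoint takes a JSON body with the adapter name and its path inside the container. The adapter name below is hypothetical, and the curl calls are shown commented out because they require a running deployment:

```shell
# Hypothetical adapter; the path must be visible inside the container.
ADAPTER=my_lora_adapter
PAYLOAD="{\"lora_name\": \"$ADAPTER\", \"lora_path\": \"/opt/nim/loras/$ADAPTER\"}"

# Sanity-check the JSON before sending it.
echo "$PAYLOAD" | python3 -m json.tool > payload_check.txt
cat payload_check.txt

# Against a running deployment (assumes localhost:8000):
# curl -X POST http://localhost:8000/v1/load_lora_adapter \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
# curl -X POST http://localhost:8000/v1/unload_lora_adapter \
#   -H "Content-Type: application/json" -d "{\"lora_name\": \"$ADAPTER\"}"
```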
Serve Multiple Adapters#
You can serve multiple adapters at the same time, subject to GPU memory limits.
Use /v1/models to discover available adapter IDs, then send the adapter ID in the request model field.
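The /v1/models response follows the OpenAI list shape, so adapter IDs can be pulled out of its data array. The response below is a fabricated illustration of that shape, not real output:

```shell
# Illustrative /v1/models-shaped response; a real one comes from your deployment.
cat > models_sample.json <<'EOF'
{"object": "list", "data": [{"id": "base-model"}, {"id": "adapter_a"}, {"id": "adapter_b"}]}
EOF

# Each ID in the data array is a valid value for a request's "model" field.
python3 -c 'import json
for m in json.load(open("models_sample.json"))["data"]:
    print(m["id"])' > model_ids.txt
cat model_ids.txt
```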
Code Examples#
The following example shows a minimal local workflow for serving a fine-tuned Llama model with LoRA. It includes shared setup, static and dynamic startup commands, model discovery, and an inference request that targets a loaded adapter.
Serve a Fine-Tuned Llama Model with LoRA#
To serve a fine-tuned Llama model with LoRA, complete the following steps:
Set up the environment variables and create the LoRA adapter directory.
# Common setup
export NIM_CACHE_PATH=$PWD/.cache
mkdir -p "$NIM_CACHE_PATH"
export NGC_API_KEY=<your_ngc_api_key>
export CUDA_VISIBLE_DEVICES=0
export NIM_MODEL_PROFILE=<lora-capable-profile>

# Prepare adapters
mkdir -p "$PWD/loras"
Start the model by using one of the following options:
Use static LoRA loading.
docker run -it --rm --gpus all \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  -v "$PWD/loras:/opt/nim/loras" \
  -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE \
  -e CUDA_VISIBLE_DEVICES \
  -e NIM_PEFT_SOURCE=/opt/nim/loras \
  <nim-llm-image>
Use dynamic LoRA loading with the watcher enabled.
docker run -it --rm --gpus all \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  -v "$PWD/loras:/opt/nim/loras" \
  -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE \
  -e CUDA_VISIBLE_DEVICES \
  -e NIM_PEFT_SOURCE=/opt/nim/loras \
  -e NIM_PEFT_REFRESH_INTERVAL=10 \
  <nim-llm-image>
Verify that the models loaded.
curl -s localhost:8000/v1/models | jq
Send an inference request to an adapter.
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_lora_adapter",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }' | jq