Fine-Tuning with LoRA#

LoRA (Low-Rank Adaptation) lets you serve a base model plus one or more fine-tuned adapters without retraining or rebuilding the full model.

NIM LLM supports two LoRA serving modes:

Static LoRA: Adapters are discovered from a directory at startup.
Dynamic LoRA: Adapters can be loaded and unloaded while the server is running.

flowchart TD
    A[NIM container starts] --> B{LoRA mode}
    B -->|Static| C[Read adapters in NIM_PEFT_SOURCE]
    C --> D[Load valid adapters at startup]
    D --> E["/v1/models includes base + adapters"]
    B -->|Dynamic| F[Enable runtime adapter updates]
    F --> G[Watcher or API load and unload during runtime]
    G --> H["/v1/models updates as adapters change"]

Initial LoRA Setup#

Before you configure static LoRA or dynamic LoRA, complete the shared setup in this section. These settings tell NIM LLM where to find adapters and ensure that the deployment uses a LoRA-capable profile.

Configure Adapter Discovery#

Complete the following setup for both static LoRA and dynamic LoRA:

Mount a directory that contains your LoRA adapters into the container.
Set NIM_PEFT_SOURCE to that directory.
Optionally, pass native vLLM LoRA flags.

Expected adapter layout under NIM_PEFT_SOURCE:

/opt/nim/loras/
├── adapter_a/
│   ├── adapter_config.json
│   └── adapter_model.safetensors   # or adapter_model.bin
└── adapter_b/
    ├── adapter_config.json
    └── adapter_model.bin

Only valid, readable adapter directories are loaded.
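
A pre-flight check on the host can catch layout problems before startup. The following sketch only tests for the files shown in the layout above. The function names `check_adapter_dir` and `check_adapter_root` are hypothetical, and NIM's own validation may apply additional rules that this script does not reproduce.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of NIM.
# An adapter directory passes if it has a readable adapter_config.json
# plus adapter_model.safetensors or adapter_model.bin.

check_adapter_dir() {
  local dir="$1"
  [ -r "$dir/adapter_config.json" ] || return 1
  [ -r "$dir/adapter_model.safetensors" ] || [ -r "$dir/adapter_model.bin" ]
}

# Checks every subdirectory of the adapter root and reports one line each.
# Returns non-zero if any subdirectory is incomplete.
check_adapter_root() {
  local root="$1" dir status=0
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue   # the glob matched nothing
    dir="${dir%/}"
    if check_adapter_dir "$dir"; then
      echo "ok: ${dir##*/}"
    else
      echo "incomplete: ${dir##*/}"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_adapter_root "$PWD/loras"` checks the host directory that you plan to mount as NIM_PEFT_SOURCE.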

NIM passes the following LoRA-related flags to vLLM:

--enable-lora
--max-loras
--max-cpu-loras
--max-lora-rank

Select a LoRA-Capable Profile#

When available for your model, select a -feat_lora profile so the deployment uses LoRA-compatible runtime settings.

Example:

export NIM_MODEL_PROFILE=vllm-fp16-tp1-pp1-feat_lora

If you are not sure which profiles are available in your deployment, query the model and profile metadata from your environment. Then choose the LoRA-capable profile.

Static LoRA#

After you complete the shared LoRA setup, NIM LLM uses static LoRA when NIM_PEFT_REFRESH_INTERVAL is not set. In static LoRA mode, the NIM container discovers adapters in NIM_PEFT_SOURCE during startup and loads the valid ones. To apply adapters that you add, remove, or update after startup, restart the NIM container.

Dynamic LoRA#

Use dynamic LoRA when you need to add or remove adapters without restarting the deployment. You can manage adapters through directory monitoring, runtime API calls, or both.

Configure Dynamic LoRA Updates#

After you complete the shared LoRA setup, set NIM_PEFT_REFRESH_INTERVAL (polling interval in seconds) to enable dynamic LoRA through directory monitoring.

When NIM_PEFT_SOURCE and NIM_PEFT_REFRESH_INTERVAL are set, NIM starts the LoRA watcher and enables runtime LoRA updates for vLLM.

flowchart LR
    A[NIM_PEFT_SOURCE set] --> B[NIM_PEFT_REFRESH_INTERVAL set]
    B --> C[Watcher polls adapter directory]
    C --> D{Detected change}
    D -->|New adapter| E[Load adapter]
    D -->|Removed adapter| F[Unload adapter]
    E --> G["/v1/models reflects loaded adapter"]
    F --> H["/v1/models no longer lists adapter"]

You can also use vLLM runtime endpoints for manual control:

POST /v1/load_lora_adapter
POST /v1/unload_lora_adapter

Load and Unload Adapters at Runtime#

To manage adapters through the directory watcher, use the following actions:

Load: copy a new adapter folder into NIM_PEFT_SOURCE.
Unload: remove an adapter folder from NIM_PEFT_SOURCE.
Wait one refresh interval for /v1/models to reflect changes.
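
The wait step can be scripted instead of timed by hand. The following sketch polls /v1/models until an adapter ID appears or a timeout expires. The function names, the `NIM_URL` variable, and the timeout defaults are assumptions for illustration. The check relies on `jq` and on the `data[].id` fields of the OpenAI-compatible model list, and it assumes that the adapter ID matches the folder name, which you should confirm against your own /v1/models output.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and defaults are illustrative, not part of NIM.
NIM_URL="${NIM_URL:-http://localhost:8000}"

# Succeeds if the model list read from stdin contains the given ID.
model_listed() {
  jq -e --arg id "$1" 'any(.data[]; .id == $id)' >/dev/null
}

# Polls /v1/models until the ID appears, or gives up after a timeout.
# Arguments: adapter ID, timeout in seconds (default 60), poll interval (default 5).
wait_for_adapter() {
  local id="$1" timeout="${2:-60}" interval="${3:-5}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if curl -sf "$NIM_URL/v1/models" | model_listed "$id"; then
      echo "loaded: $id"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out waiting for: $id" >&2
  return 1
}
```

After you copy a folder such as `adapter_c` into the mounted adapter directory, `wait_for_adapter adapter_c 60 5` returns once the adapter is listed. Choose a timeout that is longer than your NIM_PEFT_REFRESH_INTERVAL.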

To manage adapters through the manual API, use the following actions:

Load with POST /v1/load_lora_adapter.
Unload with POST /v1/unload_lora_adapter.
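
A minimal sketch of the manual API calls follows. In upstream vLLM, the load request takes a JSON body with `lora_name` and `lora_path`, and the unload request takes `lora_name`; this sketch assumes that NIM exposes the same request format. The function names, the `NIM_URL` variable, and the example adapter name and path are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around the vLLM runtime endpoints; not part of NIM.
NIM_URL="${NIM_URL:-http://localhost:8000}"

# Arguments: adapter name, adapter path.
# The path is read by the server, so it must be a path inside the container.
load_lora_adapter() {
  curl -s -X POST "$NIM_URL/v1/load_lora_adapter" \
    -H "Content-Type: application/json" \
    -d "{\"lora_name\": \"$1\", \"lora_path\": \"$2\"}"
}

# Argument: adapter name.
unload_lora_adapter() {
  curl -s -X POST "$NIM_URL/v1/unload_lora_adapter" \
    -H "Content-Type: application/json" \
    -d "{\"lora_name\": \"$1\"}"
}
```

For example, `load_lora_adapter adapter_c /opt/nim/loras/adapter_c` loads an adapter from the mounted directory, and `unload_lora_adapter adapter_c` removes it again.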

If you use the watcher and the manual API together, the watcher can reload an adapter that you unloaded through the API but left in the directory, because the next scan still finds its folder. To keep an adapter unloaded, also remove its folder from NIM_PEFT_SOURCE.

Serve Multiple Adapters#

You can serve multiple adapters at the same time, subject to GPU memory limits.

Use /v1/models to discover available adapter IDs, then send the adapter ID in the request model field.
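
The discover-then-request flow can be sketched with two small helpers. The function names, the `NIM_URL` variable, and the adapter ID `adapter_a` are hypothetical; the sketch assumes `jq` is installed and that the model list uses the OpenAI-compatible `data[].id` layout.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of NIM.
NIM_URL="${NIM_URL:-http://localhost:8000}"

# Prints one model or adapter ID per line from the model list on stdin.
list_model_ids() {
  jq -r '.data[].id'
}

# Builds a chat request body that targets the given model or adapter ID.
# Arguments: model ID, user prompt.
chat_body() {
  jq -n --arg model "$1" --arg prompt "$2" \
    '{model: $model, messages: [{role: "user", content: $prompt}], max_tokens: 64}'
}

# Usage (requires a running server):
#   curl -s "$NIM_URL/v1/models" | list_model_ids
#   curl -s "$NIM_URL/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -d "$(chat_body adapter_a 'Hello!')"
```

Building the body with `jq` rather than string concatenation keeps prompts that contain quotes or newlines valid as JSON.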

Code Examples#

The following example shows a minimal local workflow for serving a fine-tuned Llama model with LoRA. It includes shared setup, static and dynamic startup commands, model discovery, and an inference request that targets a loaded adapter.

Serve a Fine-Tuned Llama Model with LoRA#

To serve a fine-tuned Llama model with LoRA, complete the following steps:

Set up the environment variables and create the LoRA adapter directory.

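# Common setup
export LOCAL_NIM_CACHE="$PWD/.cache"   # host directory mounted at /opt/nim/.cache
mkdir -p "$LOCAL_NIM_CACHE"
# Placeholders are quoted so that the shell does not parse < and > as redirections.
export NGC_API_KEY="<your_ngc_api_key>"
export CUDA_VISIBLE_DEVICES=0
export NIM_MODEL_PROFILE="<lora-capable-profile>"

# Prepare adapters: host directory mounted at /opt/nim/loras
mkdir -p "$PWD/loras"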

Start the model by using one of the following options:

Use static LoRA loading.

docker run -it --rm --gpus all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v "$PWD/loras:/opt/nim/loras" \
  -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE \
  -e CUDA_VISIBLE_DEVICES \
  -e NIM_PEFT_SOURCE=/opt/nim/loras \
  <nim-llm-image>

Use dynamic LoRA loading with the watcher enabled.

docker run -it --rm --gpus all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v "$PWD/loras:/opt/nim/loras" \
  -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE \
  -e CUDA_VISIBLE_DEVICES \
  -e NIM_PEFT_SOURCE=/opt/nim/loras \
  -e NIM_PEFT_REFRESH_INTERVAL=10 \
  <nim-llm-image>

Verify that the base model and the adapters loaded.

```
curl -s localhost:8000/v1/models | jq
```

Send an inference request to an adapter.

curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_lora_adapter",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }' | jq