---
title: LoRA Adapters
subtitle: Serve fine-tuned LoRA adapters with dynamic loading and routing in Dynamo
---
LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing.
## Backend Support
| Backend | Status | Notes |
|---------|--------|-------|
| vLLM | ✅ | Full support including KV-aware routing |
| SGLang | 🚧 | In progress |
| TensorRT-LLM | ❌ | Not yet supported |
See the [Feature Matrix](/dynamo/dev/resources/feature-matrix) for full compatibility details.
## Overview
Dynamo's LoRA implementation provides:
- **Dynamic loading**: Load and unload LoRA adapters at runtime without restarting workers
- **Multiple sources**: Load from local filesystem (`file://`), S3-compatible storage (`s3://`), or Hugging Face Hub (`hf://`)
- **Automatic caching**: Downloaded adapters are cached locally to avoid repeated downloads
- **Discovery integration**: Loaded LoRAs are automatically registered and discoverable via `/v1/models`
- **KV-aware routing**: Route requests to workers with the appropriate LoRA loaded
- **Kubernetes native**: Declarative LoRA management via the `DynamoModel` CRD
### Architecture
```mermaid
flowchart TD
Frontend["Frontend"] --> Router["Router
(LoRA-aware)"]
Router --> Workers["Workers
(LoRA-loaded)"]
Workers --> ManagerNode["LoRA Manager"]
subgraph ManagerGroup["LoRA Manager"]
Downloader
Cache
end
ManagerNode --> Local["file://
Local"]
ManagerNode --> S3["s3://
S3/MinIO"]
ManagerNode --> HF["hf://
(custom)"]
```
The LoRA system consists of:
- **Rust Core** (`lib/llm/src/lora/`): High-performance downloading, caching, and validation
- **Python Manager** (`components/src/dynamo/common/lora/`): Extensible wrapper with custom source support
- **Worker Handlers** (`components/src/dynamo/vllm/handlers.py`): Load/unload API and inference integration
## Quick Start
### Prerequisites
- Dynamo installed with vLLM support
- For S3 sources: AWS credentials configured
- A LoRA adapter compatible with your base model
### Local Development
**1. Start Dynamo with LoRA support:**
```bash
# Start vLLM worker with LoRA flags
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
--enable-lora \
--max-lora-rank 64
```
**2. Load a LoRA adapter:**
```bash
curl -X POST http://localhost:8081/v1/loras \
-H "Content-Type: application/json" \
-d '{
"lora_name": "my-lora",
"source": {
"uri": "file:///path/to/my-lora"
}
}'
```
**3. Run inference with the LoRA:**
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-lora",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
```
### S3-Compatible Storage
For production deployments, store LoRA adapters in S3-compatible storage:
```bash
# Configure S3 credentials
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_ENDPOINT=http://minio:9000 # For MinIO
export AWS_REGION=us-east-1
# Load LoRA from S3
curl -X POST http://localhost:8081/v1/loras \
-H "Content-Type: application/json" \
-d '{
"lora_name": "customer-support-lora",
"source": {
"uri": "s3://my-loras/customer-support-v1"
}
}'
```
## Configuration
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `DYN_LORA_ENABLED` | Enable LoRA adapter support | `false` |
| `DYN_LORA_PATH` | Local cache directory for downloaded LoRAs | `~/.cache/dynamo_loras` |
| `AWS_ACCESS_KEY_ID` | S3 access key (for `s3://` URIs) | - |
| `AWS_SECRET_ACCESS_KEY` | S3 secret key (for `s3://` URIs) | - |
| `AWS_ENDPOINT` | Custom S3 endpoint (for MinIO, etc.) | - |
| `AWS_REGION` | AWS region | `us-east-1` |
| `AWS_ALLOW_HTTP` | Allow HTTP (non-TLS) connections | `false` |
### vLLM Arguments
| Argument | Description |
|----------|-------------|
| `--enable-lora` | Enable LoRA adapter support in vLLM |
| `--max-lora-rank` | Maximum LoRA rank (must be >= your LoRA's rank) |
| `--max-loras` | Maximum number of LoRAs to load simultaneously |
## Backend API Reference
### Load LoRA
Load a LoRA adapter from a source URI.
```text
POST /v1/loras
```
**Request:**
```json
{
"lora_name": "string",
"source": {
"uri": "string"
}
}
```
**Response:**
```json
{
"status": "success",
"message": "LoRA adapter 'my-lora' loaded successfully",
"lora_name": "my-lora",
"lora_id": 1207343256
}
```
### List LoRAs
List all loaded LoRA adapters.
```text
GET /v1/loras
```
**Response:**
```json
{
"status": "success",
"loras": {
"my-lora": 1207343256,
"another-lora": 987654321
},
"count": 2
}
```
### Unload LoRA
Unload a LoRA adapter from the worker.
```text
DELETE /v1/loras/{lora_name}
```
**Response:**
```json
{
"status": "success",
"message": "LoRA adapter 'my-lora' unloaded successfully",
"lora_name": "my-lora",
"lora_id": 1207343256
}
```
## Kubernetes Deployment
For Kubernetes deployments, use the `DynamoModel` Custom Resource to declaratively manage LoRA adapters.
### DynamoModel CRD
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
name: customer-support-lora
namespace: dynamo-system
spec:
modelName: customer-support-adapter-v1
baseModelName: Qwen/Qwen3-0.6B # Must match modelRef.name in DGD
modelType: lora
source:
uri: s3://my-models-bucket/loras/customer-support/v1
```
### How It Works
When you create a `DynamoModel`:
1. **Discovers endpoints**: Finds all pods running your `baseModelName`
2. **Creates service**: Automatically creates a Kubernetes Service
3. **Loads LoRA**: Calls the LoRA load API on each endpoint
4. **Updates status**: Reports which endpoints are ready
### Verify Deployment
```bash
# Check LoRA status
kubectl get dynamomodel customer-support-lora
# Expected output:
# NAME TOTAL READY AGE
# customer-support-lora 2 2 30s
```
For complete Kubernetes deployment details, see:
- [Managing Models with DynamoModel](/dynamo/dev/kubernetes-deployment/deployment-guide/managing-models-with-dynamo-model)
- [Kubernetes LoRA Deployment Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/lora/README.md)
## Examples
| Example | Description |
|---------|-------------|
| [Local LoRA with MinIO](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/lora/README.md) | Local development with S3-compatible storage |
| [Kubernetes LoRA Deployment](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/lora/README.md) | Production deployment with DynamoModel CRD |
## Troubleshooting
### LoRA Fails to Load
**Check S3 connectivity:**
```bash
# Verify LoRA exists in S3
aws --endpoint-url=$AWS_ENDPOINT s3 ls s3://my-loras/ --recursive
```
**Check cache directory:**
```bash
ls -la ~/.cache/dynamo_loras/
```
**Check worker logs:**
```bash
# Look for LoRA-related messages
kubectl logs deployment/my-worker | grep -i lora
```
### Model Not Found After Loading
- Verify the LoRA name matches exactly (case-sensitive)
- Check if the LoRA is listed: `curl http://localhost:8081/v1/loras`
- Ensure discovery registration succeeded (check worker logs)
### Inference Returns Base Model Response
- Verify the `model` field in your request matches the `lora_name`
- Check that the LoRA is loaded on the worker handling your request
- For disaggregated serving, ensure both prefill and decode workers have the LoRA
## KV Cache-Aware LoRA Routing
When KV-aware routing is enabled, the router automatically accounts for LoRA adapter identity when computing block hashes. This means:
- **Distinct hash spaces per adapter**: Blocks cached under adapter `A` will never be confused with blocks cached under adapter `B` or the base model, even if the token sequences are identical. The adapter name is mixed into the `LocalBlockHash` computation.
- **Automatic prefix sharing within the same adapter**: Requests targeting the same LoRA adapter benefit from KV cache prefix matching just like base model requests do.
- **No configuration required**: The LoRA name is propagated automatically through KV events (`BlockStored`) from the engine to the router. The router uses the `lora_name` field on events to route LoRA requests to workers that have matching cached blocks.
This works end-to-end across the publisher pipeline, the KV consolidator (for deduplication), and the routing query path.
## See Also
- [Feature Matrix](/dynamo/dev/resources/feature-matrix) - Backend compatibility overview
- [vLLM Backend](/dynamo/dev/backends/v-llm) - vLLM-specific configuration
- [Dynamo Operator](/dynamo/dev/kubernetes-deployment/deployment-guide/dynamo-operator) - Kubernetes operator overview
- [KV-Aware Routing](/dynamo/dev/user-guides/kv-cache-aware-routing) - LoRA-aware request routing
- [KV Events for Custom Engines](/dynamo/dev/integrations/kv-events-for-custom-engines) - Publishing LoRA-aware KV events