LoRA Adapters#
LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing.
Backend Support#
| Backend | Status | Notes |
|---|---|---|
| vLLM | ✅ | Full support including KV-aware routing |
| SGLang | 🚧 | In progress |
| TensorRT-LLM | ❌ | Not yet supported |
See the Feature Matrix for full compatibility details.
Overview#
Dynamo's LoRA implementation provides:

- **Dynamic loading**: Load and unload LoRA adapters at runtime without restarting workers
- **Multiple sources**: Load from the local filesystem (`file://`), S3-compatible storage (`s3://`), or Hugging Face Hub (`hf://`); see the example after this list
- **Automatic caching**: Downloaded adapters are cached locally to avoid repeated downloads
- **Discovery integration**: Loaded LoRAs are automatically registered and discoverable via `/v1/models`
- **KV-aware routing**: Route requests to workers with the appropriate LoRA loaded
- **Kubernetes native**: Declarative LoRA management via the `DynamoModel` CRD
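The same load API works for every scheme. For example, pulling an adapter from Hugging Face Hub looks like this (a sketch using the worker system port from the Quick Start below; `username/my-adapter` is a placeholder repo ID):

# Load a LoRA adapter from Hugging Face Hub (placeholder repo ID)
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-hf-lora",
    "source": {
      "uri": "hf://username/my-adapter"
    }
  }'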
Architecture#
┌──────────────────────────────────────────────────────────────────┐
│                        LoRA Architecture                         │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│   │   Frontend   │───▶│    Router    │───▶│   Workers    │       │
│   │  /v1/models  │    │  LoRA-aware  │    │ LoRA-loaded  │       │
│   └──────────────┘    └──────────────┘    └──────────────┘       │
│                                                  │               │
│                                                  ▼               │
│                  ┌─────────────────────────────────┐             │
│                  │          LoRA Manager           │             │
│                  │  ┌───────────┐ ┌─────────────┐  │             │
│                  │  │ Downloader│ │    Cache    │  │             │
│                  │  └───────────┘ └─────────────┘  │             │
│                  └─────────────────────────────────┘             │
│                                   │                              │
│               ┌───────────────────┼───────────────────┐          │
│               ▼                   ▼                   ▼          │
│        ┌────────────┐      ┌────────────┐      ┌──────────┐      │
│        │  file://   │      │   s3://    │      │  hf://   │      │
│        │   Local    │      │  S3/MinIO  │      │ (custom) │      │
│        └────────────┘      └────────────┘      └──────────┘      │
└──────────────────────────────────────────────────────────────────┘
The LoRA system consists of:

- **Rust Core** (`lib/llm/src/lora/`): High-performance downloading, caching, and validation
- **Python Manager** (`components/src/dynamo/common/lora/`): Extensible wrapper with custom source support
- **Worker Handlers** (`components/src/dynamo/vllm/handlers.py`): Load/unload API and inference integration
Quick Start#
Prerequisites#
- Dynamo installed with vLLM support
- For S3 sources: AWS credentials configured
- A LoRA adapter compatible with your base model
Local Development#
1. Start Dynamo with LoRA support:
# Start vLLM worker with LoRA flags
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
--connector none \
--enable-lora \
--max-lora-rank 64
2. Load a LoRA adapter:
curl -X POST http://localhost:8081/v1/loras \
-H "Content-Type: application/json" \
-d '{
"lora_name": "my-lora",
"source": {
"uri": "file:///path/to/my-lora"
}
}'
3. Run inference with the LoRA:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-lora",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
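Because loaded LoRAs are registered with discovery, you can confirm the adapter is visible to the frontend before sending traffic (assuming the frontend is on port 8000 as above):

# The loaded LoRA should appear alongside the base model
curl http://localhost:8000/v1/models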
S3-Compatible Storage#
For production deployments, store LoRA adapters in S3-compatible storage:
# Configure S3 credentials
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_ENDPOINT=http://minio:9000 # For MinIO
export AWS_REGION=us-east-1
# Load LoRA from S3
curl -X POST http://localhost:8081/v1/loras \
-H "Content-Type: application/json" \
-d '{
"lora_name": "customer-support-lora",
"source": {
"uri": "s3://my-loras/customer-support-v1"
}
}'
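If the adapter is not in the bucket yet, you can upload a local adapter directory with the standard AWS CLI (a sketch; the local path is a placeholder, and the bucket and prefix match the example above):

# Upload a local adapter directory to S3/MinIO
aws --endpoint-url=$AWS_ENDPOINT s3 cp /path/to/my-lora \
  s3://my-loras/customer-support-v1 --recursive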
Configuration#
Environment Variables#
| Variable | Description | Default |
|---|---|---|
| | Enable LoRA adapter support | |
| | Local cache directory for downloaded LoRAs | `~/.cache/dynamo_loras` |
| `AWS_ACCESS_KEY_ID` | S3 access key (for `s3://` sources) | - |
| `AWS_SECRET_ACCESS_KEY` | S3 secret key (for `s3://` sources) | - |
| `AWS_ENDPOINT` | Custom S3 endpoint (for MinIO, etc.) | - |
| `AWS_REGION` | AWS region | `us-east-1` |
| | Allow HTTP (non-TLS) connections | |
vLLM Arguments#
| Argument | Description |
|---|---|
| `--enable-lora` | Enable LoRA adapter support in vLLM |
| `--max-lora-rank` | Maximum LoRA rank (must be >= your LoRA's rank) |
| `--max-loras` | Maximum number of LoRAs to load simultaneously |
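Combining these with the Quick Start command, a worker sized for several concurrent adapters might be launched like this (a sketch; it assumes `--max-loras` is forwarded to vLLM like the other engine flags):

# Worker that accepts up to 4 concurrent adapters of rank <= 64
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
    --connector none \
    --enable-lora \
    --max-lora-rank 64 \
    --max-loras 4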
Backend API Reference#
Load LoRA#
Load a LoRA adapter from a source URI.
POST /v1/loras
Request:
{
"lora_name": "string",
"source": {
"uri": "string"
}
}
Response:
{
"status": "success",
"message": "LoRA adapter 'my-lora' loaded successfully",
"lora_name": "my-lora",
"lora_id": 1207343256
}
List LoRAs#
List all loaded LoRA adapters.
GET /v1/loras
Response:
{
"status": "success",
"loras": {
"my-lora": 1207343256,
"another-lora": 987654321
},
"count": 2
}
Unload LoRA#
Unload a LoRA adapter from the worker.
DELETE /v1/loras/{lora_name}
Response:
{
"status": "success",
"message": "LoRA adapter 'my-lora' unloaded successfully",
"lora_name": "my-lora",
"lora_id": 1207343256
}
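Putting the three endpoints together, a full adapter lifecycle against a worker's system port looks like this (a minimal sketch; the adapter path is a placeholder):

# 1. Load the adapter
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-lora", "source": {"uri": "file:///path/to/my-lora"}}'

# 2. Confirm it is loaded
curl http://localhost:8081/v1/loras

# 3. Unload it when no longer needed
curl -X DELETE http://localhost:8081/v1/loras/my-lora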
Kubernetes Deployment#
For Kubernetes deployments, use the DynamoModel Custom Resource to declaratively manage LoRA adapters.
DynamoModel CRD#
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
name: customer-support-lora
namespace: dynamo-system
spec:
modelName: customer-support-adapter-v1
baseModelName: Qwen/Qwen3-0.6B # Must match modelRef.name in DGD
modelType: lora
source:
uri: s3://my-models-bucket/loras/customer-support/v1
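Apply the manifest like any other Kubernetes resource (assuming it is saved as `customer-support-lora.yaml`):

kubectl apply -f customer-support-lora.yaml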
How It Works#
When you create a DynamoModel, the operator:

1. **Discovers endpoints**: Finds all pods running your `baseModelName`
2. **Creates service**: Automatically creates a Kubernetes Service
3. **Loads LoRA**: Calls the LoRA load API on each endpoint
4. **Updates status**: Reports which endpoints are ready
Verify Deployment#
# Check LoRA status
kubectl get dynamomodel customer-support-lora
# Expected output:
# NAME TOTAL READY AGE
# customer-support-lora 2 2 30s
For complete Kubernetes deployment details, see:
Kubernetes LoRA Deployment Example
Examples#
| Example | Description |
|---|---|
| Local LoRA with MinIO | Local development with S3-compatible storage |
| Kubernetes LoRA Deployment | Production deployment with DynamoModel CRD |
Troubleshooting#
LoRA Fails to Load#
Check S3 connectivity:
# Verify LoRA exists in S3
aws --endpoint-url=$AWS_ENDPOINT s3 ls s3://my-loras/ --recursive
Check cache directory:
ls -la ~/.cache/dynamo_loras/
Check worker logs:
# Look for LoRA-related messages
kubectl logs deployment/my-worker | grep -i lora
Model Not Found After Loading#
- Verify the LoRA name matches exactly (case-sensitive)
- Check if the LoRA is listed: `curl http://localhost:8081/v1/loras`
- Ensure discovery registration succeeded (check worker logs)
Inference Returns Base Model Response#
- Verify the `model` field in your request matches the `lora_name`
- Check that the LoRA is loaded on the worker handling your request
- For disaggregated serving, ensure both prefill and decode workers have the LoRA (see the check below)
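In a disaggregated deployment you can query each worker's system port directly (a sketch; 8081 and 8082 are placeholder ports for your prefill and decode workers):

# Both workers should report the adapter in their loaded set
curl http://localhost:8081/v1/loras   # prefill worker (placeholder port)
curl http://localhost:8082/v1/loras   # decode worker (placeholder port)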
See Also#
- Feature Matrix - Backend compatibility overview
- vLLM Backend - vLLM-specific configuration
- Dynamo Operator - Kubernetes operator overview
- KV-Aware Routing - LoRA-aware request routing