LoRA Adapters#
LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing.
Backend Support#
| Backend | Status | Notes |
|---|---|---|
| vLLM | ✅ | Full support including KV-aware routing |
| SGLang | 🚧 | In progress |
| TensorRT-LLM | ❌ | Not yet supported |
See the Feature Matrix for full compatibility details.
Overview#
Dynamo's LoRA implementation provides:

- **Dynamic loading**: Load and unload LoRA adapters at runtime without restarting workers
- **Multiple sources**: Load from the local filesystem (`file://`), S3-compatible storage (`s3://`), or Hugging Face Hub (`hf://`); see the example after this list
- **Automatic caching**: Downloaded adapters are cached locally to avoid repeated downloads
- **Discovery integration**: Loaded LoRAs are automatically registered and discoverable via `/v1/models`
- **KV-aware routing**: Route requests to workers with the appropriate LoRA loaded
- **Kubernetes native**: Declarative LoRA management via the `DynamoModel` CRD
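The same load API works for every scheme. For example, pulling an adapter from Hugging Face Hub looks like this (a sketch using the worker system port from the Quick Start below; `username/my-adapter` is a placeholder repo ID):

# Load a LoRA adapter from Hugging Face Hub (placeholder repo ID)
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-hf-lora",
    "source": {
      "uri": "hf://username/my-adapter"
    }
  }'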
Architecture#
┌──────────────────────────────────────────────────────────────────┐
│                        LoRA Architecture                         │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│   │   Frontend   │───▶│    Router    │───▶│   Workers    │       │
│   │  /v1/models  │    │  LoRA-aware  │    │ LoRA-loaded  │       │
│   └──────────────┘    └──────────────┘    └──────────────┘       │
│                                                  │               │
│                                                  ▼               │
│                  ┌─────────────────────────────────┐             │
│                  │          LoRA Manager           │             │
│                  │  ┌───────────┐ ┌─────────────┐  │             │
│                  │  │ Downloader│ │    Cache    │  │             │
│                  │  └───────────┘ └─────────────┘  │             │
│                  └─────────────────────────────────┘             │
│                                   │                              │
│               ┌───────────────────┼───────────────────┐          │
│               ▼                   ▼                   ▼          │
│        ┌────────────┐      ┌────────────┐      ┌──────────┐      │
│        │  file://   │      │   s3://    │      │  hf://   │      │
│        │   Local    │      │  S3/MinIO  │      │ (custom) │      │
│        └────────────┘      └────────────┘      └──────────┘      │
└──────────────────────────────────────────────────────────────────┘
The LoRA system consists of:

- **Rust Core** (`lib/llm/src/lora/`): High-performance downloading, caching, and validation
- **Python Manager** (`components/src/dynamo/common/lora/`): Extensible wrapper with custom source support
- **Worker Handlers** (`components/src/dynamo/vllm/handlers.py`): Load/unload API and inference integration
Quick Start#
Prerequisites#
- Dynamo installed with vLLM support
- For S3 sources: AWS credentials configured
- A LoRA adapter compatible with your base model
Local Development#
1. Start Dynamo with LoRA support:
# Start vLLM worker with LoRA flags
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
--connector none \
--enable-lora \
--max-lora-rank 64
2. Load a LoRA adapter:
curl -X POST http://localhost:8081/v1/loras \
-H "Content-Type: application/json" \
-d '{
"lora_name": "my-lora",
"source": {
"uri": "file:///path/to/my-lora"
}
}'
3. Run inference with the LoRA:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-lora",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
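Because loaded LoRAs are registered with discovery, you can confirm the adapter is visible to the frontend before sending traffic (assuming the frontend is on port 8000 as above):

# The loaded LoRA should appear alongside the base model
curl http://localhost:8000/v1/models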
S3-Compatible Storage#
For production deployments, store LoRA adapters in S3-compatible storage:
# Configure S3 credentials
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_ENDPOINT=http://minio:9000 # For MinIO
export AWS_REGION=us-east-1
# Load LoRA from S3
curl -X POST http://localhost:8081/v1/loras \
-H "Content-Type: application/json" \
-d '{
"lora_name": "customer-support-lora",
"source": {
"uri": "s3://my-loras/customer-support-v1"
}
}'
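If the adapter is not in the bucket yet, you can upload a local adapter directory with the standard AWS CLI (a sketch; the local path is a placeholder, and the bucket and prefix match the example above):

# Upload a local adapter directory to S3/MinIO
aws --endpoint-url=$AWS_ENDPOINT s3 cp /path/to/my-lora \
  s3://my-loras/customer-support-v1 --recursive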
Configuration#
Environment Variables#
| Variable | Description | Default |
|---|---|---|
| | Enable LoRA adapter support | |
| | Local cache directory for downloaded LoRAs | `~/.cache/dynamo_loras` |
| `AWS_ACCESS_KEY_ID` | S3 access key (for `s3://` sources) | - |
| `AWS_SECRET_ACCESS_KEY` | S3 secret key (for `s3://` sources) | - |
| `AWS_ENDPOINT` | Custom S3 endpoint (for MinIO, etc.) | - |
| `AWS_REGION` | AWS region | `us-east-1` |
| | Allow HTTP (non-TLS) connections | |
vLLM Arguments#
| Argument | Description |
|---|---|
| `--enable-lora` | Enable LoRA adapter support in vLLM |
| `--max-lora-rank` | Maximum LoRA rank (must be >= your LoRA's rank) |
| `--max-loras` | Maximum number of LoRAs to load simultaneously |
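Combining these with the Quick Start command, a worker sized for several concurrent adapters might be launched like this (a sketch; it assumes `--max-loras` is forwarded to vLLM like the other engine flags):

# Worker that accepts up to 4 concurrent adapters of rank <= 64
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
    --connector none \
    --enable-lora \
    --max-lora-rank 64 \
    --max-loras 4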
Backend API Reference#
Load LoRA#
Load a LoRA adapter from a source URI.
POST /v1/loras
Request:
{
"lora_name": "string",
"source": {
"uri": "string"
}
}
Response:
{
"status": "success",
"message": "LoRA adapter 'my-lora' loaded successfully",
"lora_name": "my-lora",
"lora_id": 1207343256
}
List LoRAs#
List all loaded LoRA adapters.
GET /v1/loras
Response:
{
"status": "success",
"loras": {
"my-lora": 1207343256,
"another-lora": 987654321
},
"count": 2
}
Unload LoRA#
Unload a LoRA adapter from the worker.
DELETE /v1/loras/{lora_name}
Response:
{
"status": "success",
"message": "LoRA adapter 'my-lora' unloaded successfully",
"lora_name": "my-lora",
"lora_id": 1207343256
}
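Putting the three endpoints together, a full adapter lifecycle against a worker's system port looks like this (a minimal sketch; the adapter path is a placeholder):

# 1. Load the adapter
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-lora", "source": {"uri": "file:///path/to/my-lora"}}'

# 2. Confirm it is loaded
curl http://localhost:8081/v1/loras

# 3. Unload it when no longer needed
curl -X DELETE http://localhost:8081/v1/loras/my-lora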
Kubernetes Deployment#
For Kubernetes deployments, use the DynamoModel Custom Resource to declaratively manage LoRA adapters.
DynamoModel CRD#
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
name: customer-support-lora
namespace: dynamo-system
spec:
modelName: customer-support-adapter-v1
baseModelName: Qwen/Qwen3-0.6B # Must match modelRef.name in DGD
modelType: lora
source:
uri: s3://my-models-bucket/loras/customer-support/v1
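Apply the manifest like any other Kubernetes resource (assuming it is saved as `customer-support-lora.yaml`):

kubectl apply -f customer-support-lora.yaml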
How It Works#
When you create a DynamoModel, the operator:

1. **Discovers endpoints**: Finds all pods running your `baseModelName`
2. **Creates service**: Automatically creates a Kubernetes Service
3. **Loads LoRA**: Calls the LoRA load API on each endpoint
4. **Updates status**: Reports which endpoints are ready
Verify Deployment#
# Check LoRA status
kubectl get dynamomodel customer-support-lora
# Expected output:
# NAME TOTAL READY AGE
# customer-support-lora 2 2 30s
For complete Kubernetes deployment details, see:
Kubernetes LoRA Deployment Example
Examples#
| Example | Description |
|---|---|
| Local LoRA with MinIO | Local development with S3-compatible storage |
| Kubernetes LoRA Deployment | Production deployment with DynamoModel CRD |
Troubleshooting#
LoRA Fails to Load#
Check S3 connectivity:
# Verify LoRA exists in S3
aws --endpoint-url=$AWS_ENDPOINT s3 ls s3://my-loras/ --recursive
Check cache directory:
ls -la ~/.cache/dynamo_loras/
Check worker logs:
# Look for LoRA-related messages
kubectl logs deployment/my-worker | grep -i lora
Model Not Found After Loading#
- Verify the LoRA name matches exactly (case-sensitive)
- Check if the LoRA is listed: `curl http://localhost:8081/v1/loras`
- Ensure discovery registration succeeded (check worker logs)
Inference Returns Base Model Response#
- Verify the `model` field in your request matches the `lora_name`
- Check that the LoRA is loaded on the worker handling your request
- For disaggregated serving, ensure both prefill and decode workers have the LoRA (see the check below)
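In a disaggregated deployment you can query each worker's system port directly (a sketch; 8081 and 8082 are placeholder ports for your prefill and decode workers):

# Both workers should report the adapter in their loaded set
curl http://localhost:8081/v1/loras   # prefill worker (placeholder port)
curl http://localhost:8082/v1/loras   # decode worker (placeholder port)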
See Also#
- Feature Matrix - Backend compatibility overview
- vLLM Backend - vLLM-specific configuration
- Dynamo Operator - Kubernetes operator overview
- KV-Aware Routing - LoRA-aware request routing