LoRA Adapters#

LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing.

Backend Support#

| Backend | Status | Notes |
|---|---|---|
| vLLM | ✅ | Full support including KV-aware routing |
| SGLang | 🚧 | In progress |
| TensorRT-LLM | ❌ | Not yet supported |

See the Feature Matrix for full compatibility details.

Overview#

Dynamo's LoRA implementation provides:

  • Dynamic loading: Load and unload LoRA adapters at runtime without restarting workers

  • Multiple sources: Load from local filesystem (file://), S3-compatible storage (s3://), or Hugging Face Hub (hf://)

  • Automatic caching: Downloaded adapters are cached locally to avoid repeated downloads

  • Discovery integration: Loaded LoRAs are automatically registered and discoverable via /v1/models

  • KV-aware routing: Route requests to workers with the appropriate LoRA loaded

  • Kubernetes native: Declarative LoRA management via the DynamoModel CRD
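As a rough illustration of the three source schemes listed above, the sketch below maps a URI to its source type using only the Python standard library. The function name `classify_lora_uri` and the labels are hypothetical and are not part of Dynamo's API; Dynamo's own resolution logic may differ.

```python
from urllib.parse import urlparse

# Scheme-to-source mapping taken from the list above; labels are illustrative.
SOURCES = {
    "file": "local filesystem",
    "s3": "S3-compatible storage",
    "hf": "Hugging Face Hub",
}


def classify_lora_uri(uri: str) -> str:
    """Return a readable source label for a LoRA URI, or raise ValueError."""
    scheme = urlparse(uri).scheme
    if scheme not in SOURCES:
        raise ValueError(f"unsupported LoRA source scheme: {scheme!r}")
    return SOURCES[scheme]


print(classify_lora_uri("s3://my-loras/customer-support-v1"))
```

Validating the scheme before calling the load API gives a clearer error than waiting for the worker to reject the request.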

Architecture#

┌──────────────────────────────────────────────────────────────────┐
│                        LoRA Architecture                         │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐      │
│  │   Frontend   │────▶│    Router    │────▶│   Workers    │      │
│  │  /v1/models  │     │  LoRA-aware  │     │  LoRA-loaded │      │
│  └──────────────┘     └──────────────┘     └──────────────┘      │
│                                                   │              │
│                                                   ▼              │
│                              ┌─────────────────────────────────┐ │
│                              │         LoRA Manager            │ │
│                              │  ┌───────────┐ ┌─────────────┐  │ │
│                              │  │ Downloader│ │    Cache    │  │ │
│                              │  └───────────┘ └─────────────┘  │ │
│                              └─────────────────────────────────┘ │
│                                         │                        │
│                     ┌───────────────────┼───────────────────┐    │
│                     ▼                   ▼                   ▼    │
│              ┌────────────┐      ┌────────────┐      ┌─────────┐ │
│              │  file://   │      │   s3://    │      │  hf://  │ │
│              │   Local    │      │  S3/MinIO  │      │(custom) │ │
│              └────────────┘      └────────────┘      └─────────┘ │
└──────────────────────────────────────────────────────────────────┘

The LoRA system consists of:

  • Rust Core (lib/llm/src/lora/): High-performance downloading, caching, and validation

  • Python Manager (components/src/dynamo/common/lora/): Extensible wrapper with custom source support

  • Worker Handlers (components/src/dynamo/vllm/handlers.py): Load/unload API and inference integration

Quick Start#

Prerequisites#

  • Dynamo installed with vLLM support

  • For S3 sources: AWS credentials configured

  • A LoRA adapter compatible with your base model

Local Development#

1. Start Dynamo with LoRA support:

# Start vLLM worker with LoRA flags
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
    python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
    --connector none \
    --enable-lora \
    --max-lora-rank 64

2. Load a LoRA adapter:

curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-lora",
    "source": {
      "uri": "file:///path/to/my-lora"
    }
  }'

3. Run inference with the LoRA:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-lora",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
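Steps 2 and 3 can also be scripted. The following is a minimal sketch using only the Python standard library; it assumes the same ports as the curl commands above and an OpenAI-style response body (`choices[0].message.content`) from the chat endpoint. The helper names (`json_post`, `load_lora_request`, `chat_request`, `send`, `main`) are hypothetical.

```python
import json
import urllib.request

SYSTEM_URL = "http://localhost:8081"    # worker system port (DYN_SYSTEM_PORT above)
FRONTEND_URL = "http://localhost:8000"  # frontend port used in step 3


def json_post(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load_lora_request(name: str, uri: str) -> urllib.request.Request:
    """Request body mirrors the curl call in step 2."""
    return json_post(f"{SYSTEM_URL}/v1/loras",
                     {"lora_name": name, "source": {"uri": uri}})


def chat_request(model: str, prompt: str, max_tokens: int = 100) -> urllib.request.Request:
    """Request body mirrors the curl call in step 3."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json_post(f"{FRONTEND_URL}/v1/chat/completions", body)


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def main() -> None:
    # Requires a running worker and frontend; not invoked automatically.
    print(send(load_lora_request("my-lora", "file:///path/to/my-lora")))
    reply = send(chat_request("my-lora", "Hello!"))
    # Assumes an OpenAI-compatible response shape.
    print(reply["choices"][0]["message"]["content"])
```

Call `main()` once the worker from step 1 is running. Keeping request construction separate from sending makes the payloads easy to inspect or log before anything reaches the worker.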

S3-Compatible Storage#

For production deployments, store LoRA adapters in S3-compatible storage:

# Configure S3 credentials
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_ENDPOINT=http://minio:9000  # For MinIO
export AWS_ALLOW_HTTP=true  # Likely needed for a plain-HTTP endpoint (defaults to false)
export AWS_REGION=us-east-1

# Load LoRA from S3
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "customer-support-lora",
    "source": {
      "uri": "s3://my-loras/customer-support-v1"
    }
  }'

Configuration#

Environment Variables#

| Variable | Description | Default |
|---|---|---|
| DYN_LORA_ENABLED | Enable LoRA adapter support | false |
| DYN_LORA_PATH | Local cache directory for downloaded LoRAs | ~/.cache/dynamo_loras |
| AWS_ACCESS_KEY_ID | S3 access key (for s3:// URIs) | - |
| AWS_SECRET_ACCESS_KEY | S3 secret key (for s3:// URIs) | - |
| AWS_ENDPOINT | Custom S3 endpoint (for MinIO, etc.) | - |
| AWS_REGION | AWS region | us-east-1 |
| AWS_ALLOW_HTTP | Allow HTTP (non-TLS) connections | false |
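A pre-flight check can catch missing settings before a worker starts. The sketch below applies the documented defaults to an environment mapping and reports which S3 credentials are absent. It reflects only the defaults listed above, not how Dynamo reads these variables internally. The helper names are hypothetical, and treating both keys as mandatory is an assumption that fits a static-key setup.

```python
import os
from typing import Dict, List, Mapping, Optional

# Defaults copied from the table above; "-" entries have no default (None).
DEFAULTS: Dict[str, Optional[str]] = {
    "DYN_LORA_ENABLED": "false",
    "DYN_LORA_PATH": "~/.cache/dynamo_loras",
    "AWS_ACCESS_KEY_ID": None,
    "AWS_SECRET_ACCESS_KEY": None,
    "AWS_ENDPOINT": None,
    "AWS_REGION": "us-east-1",
    "AWS_ALLOW_HTTP": "false",
}


def resolve_lora_env(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Overlay the given environment on the documented defaults."""
    resolved = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    # Expand "~" so the cache path can be used directly.
    resolved["DYN_LORA_PATH"] = os.path.expanduser(resolved["DYN_LORA_PATH"])
    return resolved


def missing_s3_credentials(env: Mapping[str, str]) -> List[str]:
    """Return the S3 credential variables that are unset or empty."""
    resolved = resolve_lora_env(env)
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    return [name for name in required if not resolved[name]]


print(missing_s3_credentials(os.environ))
```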

vLLM Arguments#

| Argument | Description |
|---|---|
| --enable-lora | Enable LoRA adapter support in vLLM |
| --max-lora-rank | Maximum LoRA rank (must be >= your LoRA's rank) |
| --max-loras | Maximum number of LoRAs to load simultaneously |
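To check the `--max-lora-rank` constraint before loading, you can read the adapter's rank from its configuration. The sketch below assumes the adapter follows the Hugging Face PEFT layout, where `adapter_config.json` stores the rank under the key `r`; adapters produced by other tools may store it elsewhere. The function names are hypothetical.

```python
import json


def adapter_rank(adapter_config_text: str) -> int:
    """Extract the LoRA rank from the text of a PEFT-style adapter_config.json."""
    return int(json.loads(adapter_config_text)["r"])


def fits_max_lora_rank(adapter_config_text: str, max_lora_rank: int) -> bool:
    """True when the adapter's rank does not exceed --max-lora-rank."""
    return adapter_rank(adapter_config_text) <= max_lora_rank


# Illustrative config; a real file contains more fields.
example = '{"peft_type": "LORA", "r": 16, "lora_alpha": 32}'
print(fits_max_lora_rank(example, 64))
```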

Backend API Reference#

Load LoRA#

Load a LoRA adapter from a source URI.

POST /v1/loras

Request:

{
  "lora_name": "string",
  "source": {
    "uri": "string"
  }
}

Response:

{
  "status": "success",
  "message": "LoRA adapter 'my-lora' loaded successfully",
  "lora_name": "my-lora",
  "lora_id": 1207343256
}

List LoRAs#

List all loaded LoRA adapters.

GET /v1/loras

Response:

{
  "status": "success",
  "loras": {
    "my-lora": 1207343256,
    "another-lora": 987654321
  },
  "count": 2
}

Unload LoRA#

Unload a LoRA adapter from the worker.

DELETE /v1/loras/{lora_name}

Response:

{
  "status": "success",
  "message": "LoRA adapter 'my-lora' unloaded successfully",
  "lora_name": "my-lora",
  "lora_id": 1207343256
}
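The list and unload endpoints can be wrapped in the same way as the load call. This sketch builds the two requests and looks up an adapter ID in a list response shaped like the one documented above. The base URL matches the earlier examples. The helper names are hypothetical, and percent-encoding the adapter name in the path is a precaution that assumes the server decodes path segments.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote

BASE_URL = "http://localhost:8081"  # worker system port from the Quick Start


def list_loras_request() -> urllib.request.Request:
    """GET /v1/loras"""
    return urllib.request.Request(f"{BASE_URL}/v1/loras", method="GET")


def unload_lora_request(lora_name: str) -> urllib.request.Request:
    """DELETE /v1/loras/{lora_name}, with the name percent-encoded."""
    encoded = quote(lora_name, safe="")
    return urllib.request.Request(f"{BASE_URL}/v1/loras/{encoded}", method="DELETE")


def loaded_lora_id(list_response: dict, lora_name: str) -> Optional[int]:
    """Return the adapter's ID from a list response, or None if not loaded."""
    return list_response.get("loras", {}).get(lora_name)


# Sample body copied from the List LoRAs response above.
sample = {
    "status": "success",
    "loras": {"my-lora": 1207343256, "another-lora": 987654321},
    "count": 2,
}
print(loaded_lora_id(sample, "my-lora"))
```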

Kubernetes Deployment#

For Kubernetes deployments, use the DynamoModel Custom Resource to declaratively manage LoRA adapters.

DynamoModel CRD#

apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: customer-support-lora
  namespace: dynamo-system
spec:
  modelName: customer-support-adapter-v1
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name in DGD
  modelType: lora
  source:
    uri: s3://my-models-bucket/loras/customer-support/v1

How It Works#

When you create a DynamoModel, Dynamo:

  1. Discovers endpoints: Finds all pods running your baseModelName

  2. Creates service: Automatically creates a Kubernetes Service

  3. Loads LoRA: Calls the LoRA load API on each endpoint

  4. Updates status: Reports which endpoints are ready
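The load-and-report part of this flow (steps 3 and 4) can be pictured with a small simulation. This is an illustration of the idea only, not the controller's actual implementation: endpoint discovery is replaced by a plain list, the `load` callable stands in for the LoRA load API call, and `LoraStatus` and `load_on_endpoints` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class LoraStatus:
    total: int = 0
    ready: int = 0
    failed: List[str] = field(default_factory=list)


def load_on_endpoints(endpoints: Iterable[str],
                      load: Callable[[str], bool]) -> LoraStatus:
    """Call `load` on every endpoint and tally which ones are ready."""
    status = LoraStatus()
    for endpoint in endpoints:
        status.total += 1
        try:
            ok = load(endpoint)
        except Exception:
            # A failed call counts as not ready rather than aborting the loop.
            ok = False
        if ok:
            status.ready += 1
        else:
            status.failed.append(endpoint)
    return status


# Stand-in loader: pretend the second endpoint rejects the adapter.
def fake_load(endpoint: str) -> bool:
    return endpoint != "10.0.0.2:8081"


print(load_on_endpoints(["10.0.0.1:8081", "10.0.0.2:8081"], fake_load))
```

The `total` and `ready` counters correspond to the TOTAL and READY columns reported by `kubectl get dynamomodel`.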

Verify Deployment#

# Check LoRA status
kubectl get dynamomodel customer-support-lora

# Expected output:
# NAME                    TOTAL   READY   AGE
# customer-support-lora   2       2       30s

For complete Kubernetes deployment details, see:

Examples#

| Example | Description |
|---|---|
| Local LoRA with MinIO | Local development with S3-compatible storage |
| Kubernetes LoRA Deployment | Production deployment with DynamoModel CRD |

Troubleshooting#

LoRA Fails to Load#

Check S3 connectivity:

# Verify LoRA exists in S3
aws --endpoint-url="$AWS_ENDPOINT" s3 ls s3://my-loras/ --recursive

Check cache directory:

# Uses DYN_LORA_PATH if set, otherwise the default cache location
ls -la "${DYN_LORA_PATH:-$HOME/.cache/dynamo_loras}"

Check worker logs:

# Look for LoRA-related messages
kubectl logs deployment/my-worker | grep -i lora

Model Not Found After Loading#

  • Verify the LoRA name matches exactly (case-sensitive)

  • Check if the LoRA is listed: curl http://localhost:8081/v1/loras

  • Ensure discovery registration succeeded (check worker logs)

Inference Returns Base Model Response#

  • Verify the model field in your request matches the lora_name

  • Check that the LoRA is loaded on the worker handling your request

  • For disaggregated serving, ensure both prefill and decode workers have the LoRA
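When several workers are involved, comparing their `GET /v1/loras` responses shows where an adapter is missing. The sketch below takes already-fetched list responses, keyed by worker name, and returns the workers that lack the adapter. The worker names and the helper `workers_missing_lora` are hypothetical; the response shape follows the List LoRAs example.

```python
from typing import Dict, List


def workers_missing_lora(list_responses: Dict[str, dict], lora_name: str) -> List[str]:
    """Return the sorted names of workers whose list response lacks the adapter."""
    return sorted(
        worker
        for worker, response in list_responses.items()
        if lora_name not in response.get("loras", {})
    )


# Illustrative responses from one prefill worker and one decode worker.
responses = {
    "prefill-0": {"status": "success", "loras": {"my-lora": 1207343256}, "count": 1},
    "decode-0": {"status": "success", "loras": {}, "count": 0},
}
print(workers_missing_lora(responses, "my-lora"))
```

Any worker returned here needs the adapter loaded before requests routed to it will use the LoRA.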

See Also#