Fine-Tuning with LoRA#

LoRA (Low-Rank Adaptation) lets you serve a base model plus one or more fine-tuned adapters without retraining or rebuilding the full model.

NIM LLM supports two LoRA serving modes:

  • Static LoRA: Adapters are discovered from a directory at startup.

  • Dynamic LoRA: Adapters can be loaded and unloaded while the server is running.

flowchart TD
    A[NIM container starts] --> B{LoRA mode}
    B -->|Static| C[Read adapters in NIM_PEFT_SOURCE]
    C --> D[Load valid adapters at startup]
    D --> E["/v1/models" includes base + adapters]
    B -->|Dynamic| F[Enable runtime adapter updates]
    F --> G[Watcher or API load and unload during runtime]
    G --> H["/v1/models" updates as adapters change]

Initial LoRA Setup#

Before you configure static LoRA or dynamic LoRA, complete the shared setup in this section. These settings tell NIM LLM where to find adapters and ensure that the deployment uses a LoRA-capable profile.

Configure Adapter Discovery#

Complete the following setup for both static LoRA and dynamic LoRA:

  1. Mount a directory that contains your LoRA adapters into the container.

  2. Set NIM_PEFT_SOURCE to that directory.

  3. Optionally, pass native vLLM LoRA flags.

Expected adapter layout under NIM_PEFT_SOURCE:

/opt/nim/loras/
├── adapter_a/
│   ├── adapter_config.json
│   └── adapter_model.safetensors   # or adapter_model.bin
└── adapter_b/
    ├── adapter_config.json
    └── adapter_model.bin

Only valid, readable adapter directories are loaded.
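Before starting the container, it can help to check the adapter directory on the host against the layout above. The following is a minimal sketch, not part of NIM: list_valid_adapters is a hypothetical helper that checks only for a readable adapter_config.json and a readable weights file in each subdirectory. NIM may apply further validation that this check does not cover.

```shell
# Hypothetical pre-flight helper (not a NIM command).
# Prints the name of every subdirectory of $1 that matches the expected
# adapter layout; reports skipped directories on stderr.
list_valid_adapters() {
  src=$1
  for dir in "$src"/*/; do
    # If the glob matches nothing it stays literal, so skip non-directories.
    [ -d "$dir" ] || continue
    name=$(basename "$dir")
    if [ -r "$dir/adapter_config.json" ] &&
       { [ -r "$dir/adapter_model.safetensors" ] || [ -r "$dir/adapter_model.bin" ]; }; then
      echo "$name"
    else
      echo "skipping $name: missing adapter_config.json or adapter weights" >&2
    fi
  done
}
```

For example, running list_valid_adapters "$PWD/loras" on the host prints one adapter name per line for the directories that pass the check.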

NIM passes the following LoRA-related flags to vLLM:

  • --enable-lora

  • --max-loras

  • --max-cpu-loras

  • --max-lora-rank

Select a LoRA-Capable Profile#

When available for your model, select a -feat_lora profile so the deployment uses LoRA-compatible runtime settings.

Example:

export NIM_MODEL_PROFILE=vllm-fp16-tp1-pp1-feat_lora

If you are not sure which profiles are available in your deployment, query the model and profile metadata from your environment. Then choose the LoRA-capable profile.

Static LoRA#

After you complete the required LoRA setup, NIM LLM uses static LoRA when NIM_PEFT_REFRESH_INTERVAL is not set. In static LoRA mode, the NIM container discovers adapters in NIM_PEFT_SOURCE during startup and loads valid adapters. If you add, remove, or update adapters after startup, restart the NIM container to apply those changes.
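The mode selection rule can be written down as a small check to run before you launch the container. This is a sketch under stated assumptions: lora_mode is a hypothetical helper, and the "none" result for an unset NIM_PEFT_SOURCE is an assumption about how to label that case, not documented NIM behavior.

```shell
# Hypothetical helper (not a NIM command): report which LoRA serving mode
# the current environment variables would select.
lora_mode() {
  if [ -z "${NIM_PEFT_SOURCE:-}" ]; then
    # Assumption: without an adapter source there is nothing to discover.
    echo "none"
  elif [ -z "${NIM_PEFT_REFRESH_INTERVAL:-}" ]; then
    # Adapter source set, no refresh interval: adapters load at startup only.
    echo "static"
  else
    # Both set: the directory watcher polls for changes.
    echo "dynamic"
  fi
}
```

Calling lora_mode in the shell that will run docker gives a quick confirmation that the variables you exported match the mode you intend to use.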

Dynamic LoRA#

Use dynamic LoRA when you need to add or remove adapters without restarting the deployment. You can manage adapters through directory monitoring, runtime API calls, or both.

Configure Dynamic LoRA Updates#

After you complete the shared LoRA setup, set NIM_PEFT_REFRESH_INTERVAL (polling interval in seconds) to enable dynamic LoRA through directory monitoring.

When NIM_PEFT_SOURCE and NIM_PEFT_REFRESH_INTERVAL are set, NIM starts the LoRA watcher and enables runtime LoRA updates for vLLM.

flowchart LR
    A[NIM_PEFT_SOURCE set] --> B[NIM_PEFT_REFRESH_INTERVAL set]
    B --> C[Watcher polls adapter directory]
    C --> D{Detected change}
    D -->|New adapter| E[Load adapter]
    D -->|Removed adapter| F[Unload adapter]
    E --> G["/v1/models" reflects loaded adapter]
    F --> H["/v1/models" no longer lists adapter]

You can also use runtime endpoints for manual control. The path depends on the active backend:

  • vLLM: POST /v1/load_lora_adapter, POST /v1/unload_lora_adapter

  • SGLang: POST /load_lora_adapter, POST /unload_lora_adapter (no /v1 prefix)

Load and Unload Adapters at Runtime#

To manage adapters through the directory watcher, use the following actions:

  • Load: copy a new adapter folder into NIM_PEFT_SOURCE.

  • Unload: remove an adapter folder from NIM_PEFT_SOURCE.

  • Wait one refresh interval for /v1/models to reflect changes.

To manage adapters through the manual API, call the load and unload endpoints for the active backend:

  • vLLM: load with POST /v1/load_lora_adapter, unload with POST /v1/unload_lora_adapter.

  • SGLang: load with POST /load_lora_adapter, unload with POST /unload_lora_adapter.
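The manual API calls above can be sketched as a pair of shell functions. Several things here are assumptions rather than documented NIM behavior: the helper names (lora_endpoint, load_adapter, unload_adapter) and the NIM_URL variable are hypothetical, and the request bodies use the lora_name and lora_path fields of the backends' native load and unload endpoints, on the assumption that NIM forwards the body unchanged. The lora_path value is the path as the server sees it, that is, inside the container.

```shell
# Assumed base URL of the running NIM container (hypothetical variable).
NIM_URL=${NIM_URL:-http://localhost:8000}

# Map a backend ("vllm" or "sglang") and an action ("load" or "unload")
# to the endpoint path listed above.
lora_endpoint() {
  backend=$1
  action=$2
  case "$action" in
    load|unload) ;;
    *) echo "unknown action: $action" >&2; return 1 ;;
  esac
  case "$backend" in
    vllm)   echo "/v1/${action}_lora_adapter" ;;
    sglang) echo "/${action}_lora_adapter" ;;
    *) echo "unknown backend: $backend" >&2; return 1 ;;
  esac
}

# usage: load_adapter <backend> <adapter-name> <adapter-path-in-container>
# The JSON body is built by plain string interpolation, so names and paths
# must not contain quotes or backslashes.
load_adapter() {
  path=$(lora_endpoint "$1" load) || return 1
  curl -sS -X POST "$NIM_URL$path" \
    -H "Content-Type: application/json" \
    -d "{\"lora_name\": \"$2\", \"lora_path\": \"$3\"}"
}

# usage: unload_adapter <backend> <adapter-name>
unload_adapter() {
  path=$(lora_endpoint "$1" unload) || return 1
  curl -sS -X POST "$NIM_URL$path" \
    -H "Content-Type: application/json" \
    -d "{\"lora_name\": \"$2\"}"
}
```

With a server running, load_adapter vllm my_lora_adapter /opt/nim/loras/my_lora_adapter would issue the load request, and unload_adapter vllm my_lora_adapter the matching unload. Check /v1/models afterward to confirm the result.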

If you use both the watcher and the manual API together, an adapter that the API removes but that is still present in the directory can be reloaded by the watcher during the next scan.
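To see why the watcher reloads such an adapter, it helps to model one scan as a comparison between the set of loaded adapters and the set of folders in the directory. The following is an illustrative model only, not NIM's implementation: plan_watcher_actions is a hypothetical function, and how the real watcher tracks loaded adapters is not specified here.

```shell
# Illustrative model of one watcher scan (not NIM's actual code).
# usage: plan_watcher_actions "<loaded names>" "<directory names>"
# Both arguments are whitespace-separated adapter names, so names must not
# contain spaces. Prints one "load NAME" or "unload NAME" line per change.
plan_watcher_actions() {
  loaded=$1
  on_disk=$2
  for name in $on_disk; do
    # Present in the directory but not loaded: the scan would load it.
    printf '%s\n' $loaded | grep -Fxq -- "$name" || echo "load $name"
  done
  for name in $loaded; do
    # Loaded but no longer in the directory: the scan would unload it.
    printf '%s\n' $on_disk | grep -Fxq -- "$name" || echo "unload $name"
  done
}
```

If adapter_b was unloaded through the API while its folder stayed in place, the next scan in this model sees plan_watcher_actions "adapter_a" "adapter_a adapter_b" and schedules a load for adapter_b again. To keep an adapter unloaded, remove its folder from NIM_PEFT_SOURCE as well.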

Serve Multiple Adapters#

You can serve multiple adapters at the same time, subject to GPU memory limits.

Use /v1/models to discover available adapter IDs, then send the adapter ID in the request model field. The exact shape depends on the active backend.

Reference an Adapter in a Request#

The model field on /v1/completions and /v1/chat/completions is interpreted differently by each backend.

  • vLLM dispatches by the registered adapter name verbatim. The string you see in /v1/models is the string you put in model.

    curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "my_lora_adapter", "messages": [{"role": "user", "content": "Hello!"}]}'
    
  • SGLang parses model as <base-model>:<adapter-name>. A request with model="<adapter-name>" and no colon is treated as a request to the base model with no LoRA applied: the response is a normal 200, but the adapter is not used. There is no error or warning.

    # Wrong: silent base-model fallback on SGLang
    curl ... -d '{"model": "my_lora_adapter", ...}'
    
    # Correct: SGLang colon form. <served-model-name> is the base model name
    # (the same string SGLang reports as its served model in /v1/models).
    curl ... -d '{"model": "<served-model-name>:my_lora_adapter", ...}'
    

    Use the served model name on the left-hand side of the colon; the right-hand side must match an adapter listed in /v1/models. SGLang does not validate the left-hand side, but using the served model name keeps the request self-documenting.

If your client is portable across backends, branch on the backend tag (for example, the BACKEND_TYPE environment variable inside the container, or a probe of the model strings in /v1/models) before constructing the request body.
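One way to do that branching is a small function that returns the right model string for each backend. This is a sketch: adapter_model_field is a hypothetical helper, and the backend tags "vllm" and "sglang" are assumed values. Check what your deployment actually reports before relying on them.

```shell
# Hypothetical helper: build the value of the request "model" field.
# usage: adapter_model_field <backend> <served-model-name> <adapter-name>
adapter_model_field() {
  backend=$1
  served=$2
  adapter=$3
  case "$backend" in
    # vLLM dispatches on the bare adapter name.
    vllm)   printf '%s\n' "$adapter" ;;
    # SGLang expects <base-model>:<adapter-name>.
    sglang) printf '%s:%s\n' "$served" "$adapter" ;;
    # Fail loudly rather than fall back to the base model silently.
    *) echo "unknown backend: $backend" >&2; return 1 ;;
  esac
}
```

A client can then compute the string once, for example model=$(adapter_model_field sglang "<served-model-name>" my_lora_adapter), and substitute it into the request body. Rejecting unknown backends avoids the silent base-model fallback described above.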

Code Examples#

The following example shows a minimal local workflow for serving a fine-tuned Llama model with LoRA. It includes shared setup, static and dynamic startup commands, model discovery, and an inference request that targets a loaded adapter.

Serve a Fine-Tuned Llama Model with LoRA#

To serve a fine-tuned Llama model with LoRA, complete the following steps:

  1. Set up the environment variables and create the LoRA adapter directory.

    # Common setup
    export LOCAL_NIM_CACHE=$PWD/.cache
    mkdir -p "$LOCAL_NIM_CACHE"
    export NGC_API_KEY=<your_ngc_api_key>
    export CUDA_VISIBLE_DEVICES=0
    export NIM_MODEL_PROFILE=<lora-capable-profile>
    
    # Prepare adapters
    mkdir -p "$PWD/loras"
    
  2. Start the model by using one of the following options:

    • Use static LoRA loading.

      docker run -it --rm --gpus all \
        -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
        -v "$PWD/loras:/opt/nim/loras" \
        -p 8000:8000 \
        -e NGC_API_KEY \
        -e NIM_MODEL_PROFILE \
        -e CUDA_VISIBLE_DEVICES \
        -e NIM_PEFT_SOURCE=/opt/nim/loras \
        <nim-llm-image>
      
    • Use dynamic LoRA loading with the watcher enabled.

      docker run -it --rm --gpus all \
        -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
        -v "$PWD/loras:/opt/nim/loras" \
        -p 8000:8000 \
        -e NGC_API_KEY \
        -e NIM_MODEL_PROFILE \
        -e CUDA_VISIBLE_DEVICES \
        -e NIM_PEFT_SOURCE=/opt/nim/loras \
        -e NIM_PEFT_REFRESH_INTERVAL=10 \
        <nim-llm-image>
      
  3. Verify that the base model and the adapters are loaded.

    curl -s localhost:8000/v1/models | jq
    
  4. Send an inference request to an adapter. The model field shape depends on the active backend (see Reference an Adapter in a Request).

    # vLLM: bare adapter name dispatches the LoRA.
    curl -s -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "my_lora_adapter",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }' | jq
    
    # SGLang: prefix with the served model name and a colon, otherwise the
    # request silently falls back to the base model with no LoRA applied.
    curl -s -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "<served-model-name>:my_lora_adapter",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }' | jq