Parameter-Efficient Fine-Tuning#
Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to new tasks. NVIDIA NIM for LLMs (NIM for LLMs) supports LoRA PEFT adapters trained by the NeMo Framework and Hugging Face Transformers libraries. When submitting inference requests to the NIM, the server supports dynamic multi-LoRA inference, enabling simultaneous inference requests with different LoRA models.
The following block diagram illustrates the architecture of dynamic multi-LoRA with NIM:
Adapters, trained using either the NVIDIA NeMo framework or Hugging Face PEFT library, are placed into an adapter store and given a unique name.
When making a request to the NIM, clients can specify that they want a particular customization by including the LoRA model name.
When the NIM receives a request for a customized model, it pulls the associated adapter from the adapter store into a multi-tier cache. Some adapters are resident in GPU memory and some in host memory, depending on how recently they were used.
During execution, the NIM runs specialized GPU kernels that enable data to simultaneously flow through both the foundation model and multiple different low-rank adapters. This technique enables the NIM to respond to requests for multiple different custom models at the same time.
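For example, a completion request can name a LoRA adapter instead of the base model; the request body is otherwise identical to a base-model request (a complete end-to-end example appears in Run Multi-LoRA Inference below, and <lora-name> here is a placeholder):
curl http://0.0.0.0:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<lora-name>", "prompt": "Hello", "max_tokens": 32}'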
LoRA Setup Overview#
You can extend a NIM to serve LoRA models by using configuration defined in environment variables. The underlying NIM that you use must match the base model of the LoRAs. For example, to set up a LoRA adapter compatible with llama3-8b-instruct, such as llama3-8b-instruct-lora:hf-math-v1, configure the nvcr.io/nim/meta/llama3-8b-instruct NIM. The process of configuring a NIM to serve compatible LoRAs is described in the following sections.
LoRA Adapters#
Download LoRA adapters from NGC or Hugging Face, or use your own custom LoRA adapters. Each LoRA adapter must be stored in its own directory, and one or more of these LoRA adapter directories must be placed within the LOCAL_PEFT_DIRECTORY directory. The name of each loaded LoRA adapter must match the name of its directory.
NIM for LLMs supports the NeMo format and the Hugging Face Transformers compatible format.
NeMo Format#
A NeMo-formatted LoRA directory must contain one file with the .nemo extension. The name of the .nemo file does not need to match the name of its parent directory.
The supported target modules are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj", "attention_qkv"].
Hugging Face Transformers Format#
LoRA adapters trained with Hugging Face Transformers are supported. The LoRA directory must contain an adapter_config.json file and either an adapter_model.safetensors or an adapter_model.bin file.
The supported target modules for NIM are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj"].
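For reference, a minimal adapter_config.json of the kind written by the Hugging Face PEFT library looks roughly like the following; the base model path, rank, alpha, dropout, and target modules shown here are illustrative values, not requirements, and in practice the file is produced by training rather than written by hand:
{
  "peft_type": "LORA",
  "base_model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
  "task_type": "CAUSAL_LM",
  "r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.0,
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
}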
LoRA Model Directory Structure#
The directory used for storing one or more LoRAs (your LOCAL_PEFT_DIRECTORY) should be organized according to the following example. In this example, loras is the name of the directory you pass into the Docker container as the value of LOCAL_PEFT_DIRECTORY. The LoRAs that get loaded would then be called llama3-8b-math, llama3-8b-math-hf, llama3-8b-squad, and llama3-8b-squad-hf.
loras
├── llama3-8b-math
│   └── llama3_8b_math.nemo
├── llama3-8b-math-hf
│   ├── adapter_config.json
│   └── adapter_model.bin
├── llama3-8b-squad
│   └── squad.nemo
└── llama3-8b-squad-hf
    ├── adapter_config.json
    └── adapter_model.safetensors
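Assuming your LoRA directory is ~/loras, as in the examples that follow, you can sanity-check the layout before starting the container:
# list every adapter file, one level below each adapter directory
find ~/loras -mindepth 2 -maxdepth 2 -type f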
Obtaining LoRA models#
You can download pre-trained adapters from model registries, or fine-tune custom adapters with popular frameworks such as Hugging Face Transformers and NVIDIA NeMo, and serve them with NIM for LLMs. Note that LoRA model weights are tied to a particular base model: only deploy LoRA models that were tuned on the same base model that the NIM is serving.
Downloading LoRA adapters from NGC#
LoRA adapters for llama3-8b-instruct#
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY
# downloading NeMo-format loras
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"
# downloading Hugging Face-format loras
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:hf-math-v1"
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:hf-squad-v1"
popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY
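After the downloads finish, each adapter should sit in its own directory under $LOCAL_PEFT_DIRECTORY; these directory names are the LoRA model names that NIM serves. A quick check (the exact names depend on the package versions you downloaded):
ls $LOCAL_PEFT_DIRECTORY
# expect one directory per adapter, for example:
# llama3-8b-instruct-lora_vnemo-math-v1  llama3-8b-instruct-lora_vnemo-squad-v1
# llama3-8b-instruct-lora_vhf-math-v1    llama3-8b-instruct-lora_vhf-squad-v1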
LoRA adapters for llama3-70b-instruct#
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY
# downloading NeMo-format loras
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:nemo-squad-v1"
# downloading Hugging Face-format loras
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:hf-math-v1"
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:hf-squad-v1"
popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY
Downloading LoRA Adapters from Hugging Face Hub#
If you do not have the huggingface-cli tool, install it using pip install -U "huggingface_hub[cli]".
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
# download a LoRA from Hugging Face Hub
mkdir $LOCAL_PEFT_DIRECTORY/llama3-lora
huggingface-cli download <Hugging Face LoRA name> adapter_config.json adapter_model.safetensors --local-dir $LOCAL_PEFT_DIRECTORY/llama3-lora
chmod -R 777 $LOCAL_PEFT_DIRECTORY
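The downloaded adapter directory should now contain the two files required for the Hugging Face Transformers format:
ls $LOCAL_PEFT_DIRECTORY/llama3-lora
# adapter_config.json  adapter_model.safetensors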
Using Your Own Custom LoRA Adapters#
If you’re using custom LoRA adapters that you’ve trained locally, create a $LOCAL_PEFT_DIRECTORY directory and copy your LoRA adapters to that directory. The names of your custom LoRA adapters must follow the naming conventions described in the previous section.
The LoRA adapters in the Downloading LoRA adapters from NGC example were trained with the NeMo framework. Use the NeMo training framework to customize the adapter bottleneck dimension and to specify the target modules for applying LoRA. LoRA can be applied to any linear layer within a transformer model, including:
Q, K, V attention projections
Attention output layer
Either or both of the two transformer MLP layers
For QKV projections, NeMo’s attention implementation fuses QKV into a single projection, so the LoRA implementation learns a single low-rank projection for the combined QKV.
The following example sets up the PEFT directory for NIM using a custom LoRA adapter that has already been trained and exists at local_lora_path.
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
# copy a custom LoRA adapter into its own directory under the local PEFT directory
cp -r local_lora_path $LOCAL_PEFT_DIRECTORY/llama3-lora
chmod -R 777 $LOCAL_PEFT_DIRECTORY
Handling Mixed Batch Requests#
The requests in one batch might use different LoRA adapters to support different tasks. Therefore, one traditional General Matrix Multiplication (GEMM) can’t be used to compute all the requests together. Computing them one-by-one sequentially would lead to significant additional overhead. To solve this problem, we used NVIDIA CUTLASS to implement a batched GEMM to fuse batched, heterogeneous request processing into a single kernel. This improves GPU utilization and performance.
PEFT Environment Variables#
You can enable PEFT in NIM for LLMs by setting the NIM_PEFT_SOURCE environment variable. See Environment Variables for further information.
PEFT Caching and Dynamic Mixed-batch LoRA (Multi-LoRA)#
LoRA inference is composed of three levels of PEFT (LoRA) storage and optimized kernels for mixed-batch LoRA inference.
PEFT Source. Configured by NIM_PEFT_SOURCE, this is the directory where all the served LoRAs for a particular model are stored. This environment variable must be set in order for PEFT LoRA to run with NIM. Any number of LoRAs can be stored here; there is no limit. See LoRA Model Directory Structure for details on directory layout and supported formats and modules. NIM for LLMs searches for LoRAs in NIM_PEFT_SOURCE when it starts.
PEFT Source Dynamic Refreshing. If you set NIM_PEFT_REFRESH_INTERVAL, NIM for LLMs checks the LOCAL_PEFT_DIRECTORY every NIM_PEFT_REFRESH_INTERVAL seconds and adds any new LoRAs it finds. For example, if NIM_PEFT_REFRESH_INTERVAL is set to 10, NIM for LLMs checks NIM_PEFT_SOURCE for new models every 10 seconds. If you add a new LoRA adapter, such as new-lora, to LOCAL_PEFT_DIRECTORY, new-lora is now in NIM_PEFT_SOURCE; at the next refresh interval, NIM for LLMs detects that new-lora is not in the existing list of LoRAs and adds it to the list of available models. If you check /v1/models after NIM_PEFT_REFRESH_INTERVAL seconds, you see new-lora in the list of models (see the example after this list). The default value for NIM_PEFT_REFRESH_INTERVAL is None, meaning that once LoRAs are added from NIM_PEFT_SOURCE at start time, NIM for LLMs does not check NIM_PEFT_SOURCE again; the service must be restarted if you want new LoRAs added to LOCAL_PEFT_DIRECTORY to show up in the list of available models.
CPU PEFT Cache. This cache holds a subset of the LoRAs in NIM_PEFT_SOURCE in host memory. LoRAs are loaded into the CPU cache when a request is issued for that LoRA. To speed up further requests, LoRAs are held in CPU memory until the cache is full; when more space is needed for a LoRA that is not in the cache, the least recently used LoRA is removed. Specify the maximum number of LoRAs that can be held in CPU memory by setting the NIM_MAX_CPU_LORAS environment variable. The size of the CPU cache is also an upper bound on the number of different LoRAs there can be among all the active requests. If there are more active LoRAs than can fit in the cache, the service returns a 429 error indicating that the cache is full and you should reduce the number of active LoRAs.
GPU PEFT Cache. This cache generally holds a subset of the LoRAs in the CPU PEFT cache and is where LoRAs are held for inference. LoRAs are dynamically loaded into the GPU cache as they are scheduled for execution. As with the CPU cache, LoRAs remain in the GPU cache as long as there is space, and the least recently used LoRA is removed first. The size of the GPU cache is configured by setting the NIM_MAX_GPU_LORAS environment variable. The number of LoRAs that can fit in the GPU cache is an upper bound on the number of LoRAs that can be executed in the same batch. Note that larger values cause higher memory usage.
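For example, the following sketch adds a new adapter while the service is running and confirms that it appears in the model list once the refresh interval has elapsed; the adapter path and name are placeholders:
# requires NIM_PEFT_REFRESH_INTERVAL to be set (for example, to 10 seconds)
cp -r /path/to/new-lora $LOCAL_PEFT_DIRECTORY/new-lora
sleep 10   # wait at least one refresh interval
curl -s http://0.0.0.0:8000/v1/models | jq '.data[].id'   # "new-lora" should now be listed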
Both the CPU and GPU caches are pre-allocated, according to NIM_MAX_CPU_LORAS, NIM_MAX_GPU_LORAS, and NIM_MAX_LORA_RANK. NIM_MAX_LORA_RANK sets the maximum supported low rank (adapter size).
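As a sketch, these cache sizes and the maximum rank can be exported before launching the container and passed through with -e flags; the values below are illustrative and should be tuned to your workload and available memory:
export NIM_MAX_GPU_LORAS=8      # upper bound on LoRAs per batch, held in GPU memory
export NIM_MAX_CPU_LORAS=16     # LoRAs held in host memory
export NIM_MAX_LORA_RANK=32     # maximum supported adapter rank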
PEFT Cache memory requirements#
The memory required for caching LoRAs is determined by the rank and number of LoRAs you wish to cache. The size of a LoRA is roughly low_rank * inner_dim * num_modules * num_layers, where inner_dim is the hidden dimension of the layer you are adapting and num_modules is the number of modules you are adapting per layer (for example, the q, k, and v tensors). Note that inner_dim can vary from module to module.
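As a rough worked example of this formula (all dimensions here are assumed, illustrative values rather than the parameters of any specific model):
# low_rank=32, inner_dim=4096, 4 adapted modules per layer, 32 layers
echo $(( 32 * 4096 * 4 * 32 )) values                      # 16777216
echo $(( 32 * 4096 * 4 * 32 * 2 / 1048576 )) MiB at FP16   # 32 (2 bytes per value)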
TensorRT-LLM backend: The cache is pre-allocated with enough memory for all the LoRAs to have NIM_MAX_LORA_RANK. LoRAs are not required to have the same rank, and if LoRAs with lower rank are used at inference time, more than the specified NIM_MAX_GPU_LORAS and NIM_MAX_CPU_LORAS fit into the cache. For example, if the GPU cache is configured for 8 rank-64 LoRAs, NIM for LLMs can run a batch of 32 rank-16 LoRAs. In addition to the cache for weights, the TensorRT-LLM engine pre-allocates additional memory for LoRA activations; the required space scales with max_batch_size * max_lora_rank. NIM for LLMs automatically estimates the memory required for activations and the PEFT cache on startup, and reserves the remaining memory for the key-value cache.
Launch NIM for LLMs with PEFT#
This section includes setup instructions for LoRAs tuned on Llama 3.1 8B Instruct; the workflow for other supported base models, such as Llama 3 70B Instruct, is similar. Note that the greater size of the underlying Llama 3 70B model results in greater memory requirements for its LoRAs. See PEFT Cache memory requirements.
Export all of your non-default environment variables, then run the server. If you use the four models downloaded from NGC, you will have one base model and four LoRAs available for inference. Refer to the GPU Selection section for a note about --gpus all.
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
export NIM_PEFT_SOURCE=/home/nvs/loras
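# NIM_PEFT_SOURCE is the path inside the container where LOCAL_PEFT_DIRECTORY is mounted (see the -v flag below)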
export NIM_PEFT_REFRESH_INTERVAL=3600 # will check NIM_PEFT_SOURCE for newly added models every hour
export CONTAINER_NAME=llama-3.1-8b-instruct
export NIM_CACHE_PATH=~/nim-cache
mkdir -p "$NIM_CACHE_PATH"
chmod -R 777 $NIM_CACHE_PATH
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_PEFT_SOURCE \
-e NIM_PEFT_REFRESH_INTERVAL \
-v $NIM_CACHE_PATH:/opt/nim/.cache \
-v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
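If you use the optional cache-sizing variables described earlier, export them and pass them through to the container as additional -e flags, for example:
export NIM_MAX_GPU_LORAS=8
export NIM_MAX_CPU_LORAS=16
# then add the following flags to the docker run command above:
#   -e NIM_MAX_GPU_LORAS \
#   -e NIM_MAX_CPU_LORAS \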
Run Multi-LoRA Inference#
The list of models available for inference can be found by running:
curl -X GET 'http://0.0.0.0:8000/v1/models'
To make the output easier to read, pipe the results of curl commands into a tool like jq or python -m json.tool. For example: curl -s http://0.0.0.0:8000/v1/models | jq.
Output:
{
"object": "list",
"data": [
{
"id": "meta/llama3-8b-instruct",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta/llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-8d8a74889cfb423c97b1002a0f0a0fa1",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-lora_vnemo-math-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta/llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-7c9916a6ba414093a6befe6e28937a34",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-lora_vhf-math-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta/llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-e88bf7b1b63e4a35b831e17e0b98cb67",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-lora_vnemo-squad-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta/llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-fbfcfd4e59974a0bad146d7ddda23f45",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-hf-squad-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta/llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-7a5509ab60f94e78b0433e7740b05934",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
}
]
}
Next, submit a completion or chat completion inference request against the base model or any LoRA.
You can make inference requests to any and all models returned by /v1/models. The first time you make an inference request to a LoRA adapter, there may be a loading time, but subsequent requests to that same LoRA adapter will have lower latency, as the weights are streamed from the cache.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3-8b-instruct-lora_vhf-math-v1",
"prompt": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
"max_tokens": 128
}'
This produces the following output:
{
"id": "cmpl-7996e1f532804a278535a632906bae07",
"object": "text_completion",
"created": 1715664944,
"model": "llama3-8b-instruct-lora_vhf-math-v1",
"choices": [
{
"index": 0,
"text": " (total) 10*20= <<10*20=200>>200\n200*1/4=<<200*1/4=50>>50\n50 of John's cards are uncommon cards.\n#### 50",
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 35,
"total_tokens": 82,
"completion_tokens": 47
}
}
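A chat completion request works the same way; point the OpenAI-compatible chat completions endpoint at the base model or any LoRA. The following is a minimal sketch with an illustrative prompt:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b-instruct-lora_vhf-math-v1",
    "messages": [
      {"role": "user", "content": "A train travels 60 miles per hour for 3 hours. How far does it travel?"}
    ],
    "max_tokens": 128
  }'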