Parameter-Efficient Fine-Tuning#
Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to new tasks. NVIDIA NIM for LLMs (NIM for LLMs) supports LoRA PEFT adapters trained by the NeMo Framework and Hugging Face Transformers libraries. When submitting inference requests to the NIM, the server supports dynamic multi-LoRA inference, enabling simultaneous inference requests with different LoRA models.
The following block diagram illustrates the architecture of dynamic multi-LoRA with NIM:
Adapters, trained using either the NVIDIA NeMo framework or Hugging Face PEFT library, are placed into an adapter store and given a unique name.
When making a request to the NIM, clients can specify that they want a particular customization by including the LoRA model name.
When the NIM receives a request for a customized model, it pulls the associated adapter from the adapter store into a multi-tier cache. Some adapters are resident in GPU memory and some in host memory, depending on how recently they were used.
During execution, the NIM runs specialized GPU kernels that enable data to simultaneously flow through both the foundation model and multiple different low-rank adapters. This technique enables the NIM to respond to requests for multiple different custom models at the same time.
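For example, a completion request can name a LoRA adapter instead of the base model; the request body is otherwise identical to a base-model request (a complete end-to-end example appears in Run Multi-LoRA Inference below, and <lora-name> here is a placeholder):
curl http://0.0.0.0:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<lora-name>", "prompt": "Hello", "max_tokens": 32}'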
LoRA Setup Overview#
You can extend a NIM to serve LoRA models by using configuration defined in environment variables. The underlying NIM that you use must match the base model of the LoRAs. For example, to set up a LoRA adapter compatible with llama3-8b-instruct, such as llama3-8b-instruct-lora:hf-math-v1, configure the nvcr.io/nim/meta/llama3-8b-instruct NIM. The process of configuring a NIM to serve compatible LoRAs is described in the following sections.
LoRA Adapters#
Download LoRA adapters from NGC or Hugging Face, or use your own custom LoRA adapters. Each LoRA adapter must be stored in its own directory, and one or more of these LoRA adapter directories must be placed within the LOCAL_PEFT_DIRECTORY directory. The name of each loaded LoRA adapter must match the name of its directory.
NIM for LLMs supports the NeMo format and the Hugging Face Transformers compatible format.
NeMo Format#
A NeMo-formatted LoRA directory must contain one file with the .nemo extension. The name of the .nemo file does not need to match the name of its parent directory.
The supported target modules are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj", "attention_qkv"].
Hugging Face Transformers Format#
LoRA adapters trained with Hugging Face Transformers are supported. The LoRA directory must contain an adapter_config.json file and either an adapter_model.safetensors or an adapter_model.bin file.
The supported target modules for NIM are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj"].
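For reference, a minimal adapter_config.json of the kind written by the Hugging Face PEFT library looks roughly like the following; the base model path, rank, alpha, dropout, and target modules shown here are illustrative values, not requirements, and in practice the file is produced by training rather than written by hand:
{
  "peft_type": "LORA",
  "base_model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
  "task_type": "CAUSAL_LM",
  "r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.0,
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
}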
LoRA Model Directory Structure#
The directory used for storing one or more LoRAs (your LOCAL_PEFT_DIRECTORY) should be organized according to the following example. In this example, loras is the name of the directory you pass into the Docker container as the value of LOCAL_PEFT_DIRECTORY. The LoRAs that get loaded would then be called llama3-8b-math, llama3-8b-math-hf, llama3-8b-squad, and llama3-8b-squad-hf.
loras
├── llama3-8b-math
│   └── llama3_8b_math.nemo
├── llama3-8b-math-hf
│   ├── adapter_config.json
│   └── adapter_model.bin
├── llama3-8b-squad
│   └── squad.nemo
└── llama3-8b-squad-hf
    ├── adapter_config.json
    └── adapter_model.safetensors
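Assuming your LoRA directory is ~/loras, as in the examples that follow, you can sanity-check the layout before starting the container:
# list every adapter file, one level below each adapter directory
find ~/loras -mindepth 2 -maxdepth 2 -type f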
Obtaining LoRA models#
You can download pre-trained adapters from model registries, or fine-tune custom adapters with popular frameworks such as Hugging Face Transformers and NVIDIA NeMo, and serve them with NIM for LLMs. Note that LoRA model weights are tied to a particular base model: only deploy LoRA models that were tuned on the same base model that the NIM is serving.
Downloading LoRA adapters from NGC#
LoRA adapters for llama3-8b-instruct#
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY
# downloading NeMo-format loras
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"
# downloading Hugging Face-format loras
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:hf-math-v1"
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:hf-squad-v1"
popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY
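After the downloads finish, each adapter should sit in its own directory under $LOCAL_PEFT_DIRECTORY; these directory names are the LoRA model names that NIM serves. A quick check (the exact names depend on the package versions you downloaded):
ls $LOCAL_PEFT_DIRECTORY
# expect one directory per adapter, for example:
# llama3-8b-instruct-lora_vnemo-math-v1  llama3-8b-instruct-lora_vnemo-squad-v1
# llama3-8b-instruct-lora_vhf-math-v1    llama3-8b-instruct-lora_vhf-squad-v1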
LoRA adapters for llama3-70b-instruct#
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY
# downloading NeMo-format loras
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:nemo-squad-v1"
# downloading Hugging Face-format loras
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:hf-math-v1"
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:hf-squad-v1"
popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY
Downloading LoRA Adapters from Hugging Face Hub#
If you do not have the huggingface-cli tool, install it using pip install -U "huggingface_hub[cli]".
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
# download a LoRA from Hugging Face Hub
mkdir $LOCAL_PEFT_DIRECTORY/llama3-lora
huggingface-cli download <Hugging Face LoRA name> adapter_config.json adapter_model.safetensors --local-dir $LOCAL_PEFT_DIRECTORY/llama3-lora
chmod -R 777 $LOCAL_PEFT_DIRECTORY
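The downloaded adapter directory should now contain the two files required for the Hugging Face Transformers format:
ls $LOCAL_PEFT_DIRECTORY/llama3-lora
# adapter_config.json  adapter_model.safetensors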
Using Your Own Custom LoRA Adapters#
If you’re using custom LoRA adapters that you’ve trained locally, create a $LOCAL_PEFT_DIRECTORY directory and copy your LoRA adapters to that directory. The names of your custom LoRA adapters must follow the naming conventions described in the previous section.
The LoRA adapters in the Downloading LoRA adapters from NGC example were trained with the NeMo framework. Use the NeMo training framework to customize the adapter bottleneck dimension and to specify the target modules for applying LoRA. LoRA can be applied to any linear layer within a transformer model, including:
Q, K, V attention projections
Attention output layer
Either or both of the two transformer MLP layers
For QKV projections, NeMo’s attention implementation fuses QKV into a single projection, so the LoRA implementation learns a single low-rank projection for the combined QKV.
The following example sets up the PEFT directory for NIM using a custom LoRA adapter that has already been trained and exists at local_lora_path.
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
# copy a custom LoRA adapter into its own directory under the local PEFT directory
cp -r local_lora_path $LOCAL_PEFT_DIRECTORY/llama3-lora
chmod -R 777 $LOCAL_PEFT_DIRECTORY
Handling Mixed Batch Requests#
The requests in one batch might use different LoRA adapters to support different tasks. Therefore, one traditional General Matrix Multiplication (GEMM) can’t be used to compute all the requests together. Computing them one-by-one sequentially would lead to significant additional overhead. To solve this problem, we used NVIDIA CUTLASS to implement a batched GEMM to fuse batched, heterogeneous request processing into a single kernel. This improves GPU utilization and performance.
PEFT Environment Variables#
You can enable PEFT in NIM for LLMs by setting the NIM_PEFT_SOURCE environment variable. See Environment Variables for further information.
PEFT Caching and Dynamic Mixed-batch LoRA (Multi-LoRA)#
LoRA inference is composed of three levels of PEFT (LoRA) storage and optimized kernels for mixed-batch LoRA inference.
PEFT Source. Configured by NIM_PEFT_SOURCE, this is the directory where all the served LoRAs for a particular model are stored. This environment variable must be set in order for PEFT LoRA to run with NIM. Any number of LoRAs can be stored here; there is no limit. See LoRA Model Directory Structure for details on directory layout and supported formats and modules. NIM for LLMs searches for LoRAs in NIM_PEFT_SOURCE when it starts.
PEFT Source Dynamic Refreshing. If you set NIM_PEFT_REFRESH_INTERVAL, NIM for LLMs checks the LOCAL_PEFT_DIRECTORY every NIM_PEFT_REFRESH_INTERVAL seconds and adds any new LoRAs it finds. For example, if NIM_PEFT_REFRESH_INTERVAL is set to 10, NIM for LLMs checks NIM_PEFT_SOURCE for new models every 10 seconds. If you add a new LoRA adapter, such as new-lora, to LOCAL_PEFT_DIRECTORY, new-lora is now in NIM_PEFT_SOURCE; at the next refresh interval, NIM for LLMs detects that new-lora is not in the existing list of LoRAs and adds it to the list of available models. If you check /v1/models after NIM_PEFT_REFRESH_INTERVAL seconds, you see new-lora in the list of models (see the example after this list). The default value for NIM_PEFT_REFRESH_INTERVAL is None, meaning that once LoRAs are added from NIM_PEFT_SOURCE at start time, NIM for LLMs does not check NIM_PEFT_SOURCE again; the service must be restarted if you want new LoRAs added to LOCAL_PEFT_DIRECTORY to show up in the list of available models.
CPU PEFT Cache. This cache holds a subset of the LoRAs in NIM_PEFT_SOURCE in host memory. LoRAs are loaded into the CPU cache when a request is issued for that LoRA. To speed up further requests, LoRAs are held in CPU memory until the cache is full; when more space is needed for a LoRA that is not in the cache, the least recently used LoRA is removed. Specify the maximum number of LoRAs that can be held in CPU memory by setting the NIM_MAX_CPU_LORAS environment variable. The size of the CPU cache is also an upper bound on the number of different LoRAs there can be among all the active requests. If there are more active LoRAs than can fit in the cache, the service returns a 429 error indicating that the cache is full and you should reduce the number of active LoRAs.
GPU PEFT Cache. This cache generally holds a subset of the LoRAs in the CPU PEFT cache and is where LoRAs are held for inference. LoRAs are dynamically loaded into the GPU cache as they are scheduled for execution. As with the CPU cache, LoRAs remain in the GPU cache as long as there is space, and the least recently used LoRA is removed first. The size of the GPU cache is configured by setting the NIM_MAX_GPU_LORAS environment variable. The number of LoRAs that can fit in the GPU cache is an upper bound on the number of LoRAs that can be executed in the same batch. Note that larger values cause higher memory usage.
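For example, the following sketch adds a new adapter while the service is running and confirms that it appears in the model list once the refresh interval has elapsed; the adapter path and name are placeholders:
# requires NIM_PEFT_REFRESH_INTERVAL to be set (for example, to 10 seconds)
cp -r /path/to/new-lora $LOCAL_PEFT_DIRECTORY/new-lora
sleep 10   # wait at least one refresh interval
curl -s http://0.0.0.0:8000/v1/models | jq '.data[].id'   # "new-lora" should now be listed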
Both the CPU and GPU caches are pre-allocated, according to NIM_MAX_CPU_LORAS, NIM_MAX_GPU_LORAS, and NIM_MAX_LORA_RANK. NIM_MAX_LORA_RANK sets the maximum supported low rank (adapter size).
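As a sketch, these cache sizes and the maximum rank can be exported before launching the container and passed through with -e flags; the values below are illustrative and should be tuned to your workload and available memory:
export NIM_MAX_GPU_LORAS=8      # upper bound on LoRAs per batch, held in GPU memory
export NIM_MAX_CPU_LORAS=16     # LoRAs held in host memory
export NIM_MAX_LORA_RANK=32     # maximum supported adapter rank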
PEFT Cache memory requirements#
The memory required for caching LoRAs is determined by the rank and number of LoRAs you wish to cache. The size of a LoRA is roughly low_rank * inner_dim * num_modules * num_layers, where inner_dim is the hidden dimension of the layer you are adapting and num_modules is the number of modules you are adapting per layer (for example, the q, k, and v tensors). Note that inner_dim can vary from module to module.
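As a rough worked example of this formula (all dimensions here are assumed, illustrative values rather than the parameters of any specific model):
# low_rank=32, inner_dim=4096, 4 adapted modules per layer, 32 layers
echo $(( 32 * 4096 * 4 * 32 )) values                      # 16777216
echo $(( 32 * 4096 * 4 * 32 * 2 / 1048576 )) MiB at FP16   # 32 (2 bytes per value)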
TensorRT-LLM backend: The cache is pre-allocated with enough memory for all the LoRAs to have NIM_MAX_LORA_RANK. LoRAs are not required to have the same rank, and if LoRAs with lower rank are used at inference time, more than the specified NIM_MAX_GPU_LORAS and NIM_MAX_CPU_LORAS fit into the cache. For example, if the GPU cache is configured for 8 rank-64 LoRAs, NIM for LLMs can run a batch of 32 rank-16 LoRAs. In addition to the cache for weights, the TensorRT-LLM engine pre-allocates additional memory for LoRA activations; the required space scales with max_batch_size * max_lora_rank. NIM for LLMs automatically estimates the memory required for activations and the PEFT cache on startup, and reserves the remaining memory for the key-value cache.
Launch NIM for LLMs with PEFT#
This section includes setup instructions for LoRAs tuned on Llama 3.1 8B Instruct; the workflow for other supported base models, such as Llama 3 70B Instruct, is similar. Note that the greater size of the underlying Llama 3 70B model results in greater memory requirements for its LoRAs. See PEFT Cache memory requirements.
Export all of your non-default environment variables, then run the server. If you use the four models downloaded from NGC, you will have one base model and four LoRAs available for inference. Refer to the GPU Selection section for a note about --gpus all.
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
export NIM_PEFT_SOURCE=/home/nvs/loras
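# NIM_PEFT_SOURCE is the path inside the container where LOCAL_PEFT_DIRECTORY is mounted (see the -v flag below)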
export NIM_PEFT_REFRESH_INTERVAL=3600 # will check NIM_PEFT_SOURCE for newly added models every hour
export CONTAINER_NAME=llama-3.1-8b-instruct
export NIM_CACHE_PATH=~/nim-cache
mkdir -p "$NIM_CACHE_PATH"
chmod -R 777 $NIM_CACHE_PATH
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_PEFT_SOURCE \
-e NIM_PEFT_REFRESH_INTERVAL \
-v $NIM_CACHE_PATH:/opt/nim/.cache \
-v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
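If you use the optional cache-sizing variables described earlier, export them and pass them through to the container as additional -e flags, for example:
export NIM_MAX_GPU_LORAS=8
export NIM_MAX_CPU_LORAS=16
# then add the following flags to the docker run command above:
#   -e NIM_MAX_GPU_LORAS \
#   -e NIM_MAX_CPU_LORAS \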
Run Multi-LoRA Inference#
The list of models available for inference can be found by running:
curl -X GET 'http://0.0.0.0:8000/v1/models'
To make the output easier to read, pipe the results of curl commands into a tool like jq or python -m json.tool. For example: curl -s http://0.0.0.0:8000/v1/models | jq.
Output:
{
"object": "list",
"data": [
{
"id": "meta/llama3-8b-instruct",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta/llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-8d8a74889cfb423c97b1002a0f0a0fa1",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-lora_vnemo-math-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta/llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-7c9916a6ba414093a6befe6e28937a34",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-lora_vhf-math-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta/llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-e88bf7b1b63e4a35b831e17e0b98cb67",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-lora_vnemo-squad-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta/llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-fbfcfd4e59974a0bad146d7ddda23f45",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-hf-squad-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta/llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-7a5509ab60f94e78b0433e7740b05934",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
}
]
}
Next, submit a completion or chat completion inference request against the base model or any LoRA.
You can make inference requests to any and all models returned by /v1/models. The first time you make an inference request to a LoRA adapter, there may be a loading time, but subsequent requests to that same LoRA adapter will have lower latency, as the weights are streamed from the cache.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3-8b-instruct-lora_vhf-math-v1",
"prompt": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
"max_tokens": 128
}'
This produces the following output:
{
"id": "cmpl-7996e1f532804a278535a632906bae07",
"object": "text_completion",
"created": 1715664944,
"model": "llama3-8b-instruct-lora_vhf-math-v1",
"choices": [
{
"index": 0,
"text": " (total) 10*20= <<10*20=200>>200\n200*1/4=<<200*1/4=50>>50\n50 of John's cards are uncommon cards.\n#### 50",
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 35,
"total_tokens": 82,
"completion_tokens": 47
}
}
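A chat completion request works the same way; point the OpenAI-compatible chat completions endpoint at the base model or any LoRA. The following is a minimal sketch with an illustrative prompt:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b-instruct-lora_vhf-math-v1",
    "messages": [
      {"role": "user", "content": "A train travels 60 miles per hour for 3 hours. How far does it travel?"}
    ],
    "max_tokens": 128
  }'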