Parameter-Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to new tasks. NIM supports LoRA PEFT adapters trained with the NVIDIA NeMo Framework and HuggingFace Transformers libraries. The server supports dynamic multi-LoRA inference, so a single deployment can handle simultaneous inference requests that target different LoRA models.
The block diagram below illustrates the architecture of dynamic multi-LoRA with NIM:
You can extend a NIM to serve LoRA models by using configuration defined in environment variables. The underlying NIM that you use must match the base model of the LoRAs. For example, to set up a LoRA adapter compatible with llama3-8b-instruct, such as llama3-8b-instruct-lora:hf-math-v1, configure the nvcr.io/nim/meta/llama3-8b-instruct NIM. The process of configuring a NIM to serve compatible LoRAs is described in the following sections.
You may download LoRA adapters from NGC or Huggingface, or you may use your own custom LoRA adapters. However your LoRAs are sourced, each LoRA adapter must be stored in its own directory, and one or more of these LoRA directories must be placed within a single LOCAL_PEFT_DIRECTORY directory. The name of each loaded LoRA adapter matches the name of its directory.
We support three formats of LoRAs: NeMo format, HuggingFace Transformers compatible format, and TensorRT-LLM compatible format.
NeMo Format
A NeMo-formatted LoRA should be a directory containing one .nemo file. The name of the .nemo file does not need to match the name of its parent directory.
The supported target modules are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj", "attention_qkv"].
Huggingface Transformers Format
LoRA adapters trained with Huggingface Transformers are supported.
The LoRA must contain an adapter_config.json file and one of adapter_model.safetensors or adapter_model.bin.
The supported target modules for NIM are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj"].
You may choose to use only a subset of these modules.
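For reference, a HuggingFace adapter_config.json for a compatible LoRA might look like the following minimal sketch; the base model path, rank, and other values are illustrative placeholders, and target_modules lists only the subset of supported modules this hypothetical adapter was trained on:
{
  "peft_type": "LORA",
  "task_type": "CAUSAL_LM",
  "base_model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
  "r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "bias": "none",
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
}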
TensorRT-LLM Format
A TensorRT-LLM formatted LoRA must contain one of model.lora_config.npy or model.lora_config.bin, and one of model.lora_weights.npy or model.lora_weights.bin.
The supported target modules for NIM are ["attn_qkv", "attn_q", "attn_k", "attn_v", "attn_dense", "mlp_h_to_4h", "mlp_gate", "mlp_4h_to_h"].
As with the HuggingFace Transformers LoRAs, you may choose to use only a subset of these modules.
The directory used for storing one or more LoRAs (your LOCAL_PEFT_DIRECTORY) should be organized according to the following example. In this example, loras is the directory you’d pass into the docker container as your LOCAL_PEFT_DIRECTORY. The LoRAs that get loaded would then be named llama3-8b-math, llama3-8b-math-hf, llama3-8b-squad, llama3-8b-math-trtllm, and llama3-8b-squad-hf.
loras
├── llama3-8b-math
│ └── llama3_8b_math.nemo
├── llama3-8b-math-hf
│ ├── adapter_config.json
│ └── adapter_model.bin
├── llama3-8b-squad
│ └── squad.nemo
├── llama3-8b-math-trtllm
│ ├── model.lora_config.npy
│ └── model.lora_weights.npy
└── llama3-8b-squad-hf
├── adapter_config.json
└── adapter_model.safetensors
You can download pre-trained adapters from model registries, or fine-tune custom adapters with popular frameworks such as HuggingFace Transformers and NVIDIA NeMo, and then serve them with NIM. Note that LoRA model weights are tied to a particular base model: only deploy LoRA models that were tuned on the same base model that the NIM is serving.
Downloading LoRA adapters from NGC
If you do not have the ngc CLI tool, please see the instructions for downloading and configuring NGC CLI.
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY
# downloading NeMo-format loras
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:nemo-squad-v1"
# downloading HuggingFace-format loras
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:hf-math-v1"
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:hf-squad-v1"
popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY
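After the downloads complete, you can confirm that each adapter landed in its own directory. The directory names shown in the comment below assume the NGC CLI's usual <model-name>_v<version> naming and match the model IDs that appear later in the /v1/models output:
ls $LOCAL_PEFT_DIRECTORY
# llama3-8b-instruct-lora_vnemo-math-v1  llama3-8b-instruct-lora_vnemo-squad-v1
# llama3-8b-instruct-lora_vhf-math-v1    llama3-8b-instruct-lora_vhf-squad-v1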
Downloading LoRA Adapters from Huggingface Hub
If you do not have the huggingface-cli CLI tool, install it via pip install -U "huggingface_hub[cli]".
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
# download a LoRA from Huggingface Hub
mkdir $LOCAL_PEFT_DIRECTORY/llama3-lora
huggingface-cli download <Huggingface LoRA name> adapter_config.json adapter_model.safetensors --local-dir $LOCAL_PEFT_DIRECTORY/llama3-lora
chmod -R 777 $LOCAL_PEFT_DIRECTORY
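You can verify that the two required files are in place; the llama3-lora directory name comes from the commands above:
ls $LOCAL_PEFT_DIRECTORY/llama3-lora
# adapter_config.json  adapter_model.safetensors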
Using Your Own Custom LoRA Adapters
If you’re using your own LoRA adapters that you’ve trained locally, create a $LOCAL_PEFT_DIRECTORY directory and copy your LoRA adapters into it. Your custom LoRA adapters must follow the file names and target modules specified in the previous section.
The LoRA adapters in the Downloading LoRA adapters from NGC example were trained with the NeMo Framework. In the NeMo training framework, you can customize the adapter bottleneck dimension and specify the target modules to which LoRA is applied. LoRA can be applied to any linear layer within a transformer model, including:
Q, K, V attention projections
Attention output layer
Either or both of the two transformer MLP layers
For QKV projections, NeMo’s attention implementation fuses QKV into a single projection, so the LoRA implementation learns a single low-rank projection for the combined QKV.
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
# copy your locally trained LoRA adapter into the PEFT directory
cp -r <local lora path> $LOCAL_PEFT_DIRECTORY/llama3-lora
chmod -R 777 $LOCAL_PEFT_DIRECTORY
PEFT is enabled in NIM by specifying environment variables. For a list of environment variables used for PEFT and what their default values are, see Environment Variables.
PEFT Caching and Dynamic Mixed-batch LoRA
LoRA inference is built on three levels of PEFT (LoRA) storage, described below, together with optimized kernels for mixed-batch LoRA inference.
PEFT Source. Configured by NIM_PEFT_SOURCE, this is the directory where all the served LoRAs are stored for a particular model. There is no limit on the number of LoRAs that can be stored here. See LoRA Model Directory Structure for details on directory layout and supported formats and modules. NIM discovers the LoRAs in NIM_PEFT_SOURCE when it starts up.
PEFT Source Dynamic Refreshing. NIM also rechecks LOCAL_PEFT_DIRECTORY and adds any newly added LoRAs at a time interval (in seconds) configured by NIM_PEFT_REFRESH_INTERVAL. If you add a new LoRA adapter, for example “new-lora”, to LOCAL_PEFT_DIRECTORY, “new-lora” will then be present in NIM_PEFT_SOURCE. If NIM_PEFT_REFRESH_INTERVAL is set to 10, NIM checks NIM_PEFT_SOURCE for new models every 10 seconds. At the next refresh interval, NIM sees that “new-lora” is not in the existing list of LoRAs and adds it to the list of available models. If you check /v1/models after NIM_PEFT_REFRESH_INTERVAL seconds, you will see “new-lora” in the list of models (see the example after this list). The default value for NIM_PEFT_REFRESH_INTERVAL is None, meaning that once LoRAs are added from NIM_PEFT_SOURCE at start time, NIM does not check NIM_PEFT_SOURCE again; the service must be restarted if you want new LoRAs added to LOCAL_PEFT_DIRECTORY to show up in the list of available models.
CPU PEFT Cache. This cache holds a subset of the LoRAs in NIM_PEFT_SOURCE in host memory. LoRAs are loaded into the CPU cache when a request is issued for that LoRA. To speed up further requests, LoRAs are held in CPU memory until the cache is full; when space is needed for a LoRA that is not in the cache, the least recently used LoRA is evicted first. The size of the CPU PEFT cache is configured by setting NIM_MAX_CPU_LORAS, which determines the maximum number of LoRAs that can be held in CPU memory. The size of the CPU cache is also an upper bound on the number of different LoRAs there can be among all active requests. If there are more active LoRAs than can fit in the cache, the service returns a 429, indicating the cache is full and you should reduce the number of active LoRAs.
GPU PEFT Cache. This cache generally holds a subset of the LoRAs in the CPU PEFT cache and is where LoRAs are held for inference. LoRAs are dynamically loaded into the GPU cache as they are scheduled for execution. As with the CPU cache, LoRAs remain in the GPU cache as long as there is space, and the least recently used LoRA is evicted first. The size of the GPU cache is configured, like the CPU cache, by setting NIM_MAX_GPU_LORAS. The number of LoRAs that fit in the GPU cache is an upper bound on the number of LoRAs that can be executed in the same batch. Note that larger values cause higher memory usage.
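As a minimal sketch of dynamic refreshing, assume the server is already running with NIM_PEFT_REFRESH_INTERVAL=10 and that new-lora is a hypothetical adapter directory you have prepared locally:
# copy a new adapter into the directory mounted as NIM_PEFT_SOURCE
cp -r ~/new-lora $LOCAL_PEFT_DIRECTORY/new-lora
# wait at least one refresh interval (10 seconds in this example)
sleep 10
# "new-lora" should now appear in the list of available models
curl -s 'http://0.0.0.0:8000/v1/models' | grep new-lora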
Both the CPU and GPU caches are preallocated according to NIM_MAX_CPU_LORAS, NIM_MAX_GPU_LORAS, and NIM_MAX_LORA_RANK. NIM_MAX_LORA_RANK sets the maximum supported low rank (adapter size).
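For example, the following settings (the values are illustrative, not recommendations) preallocate a GPU cache for 8 LoRAs and a CPU cache for 16 LoRAs, with each cache slot sized for adapters up to rank 32:
export NIM_MAX_GPU_LORAS=8
export NIM_MAX_CPU_LORAS=16
export NIM_MAX_LORA_RANK=32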
PEFT Cache memory requirements
The memory required for caching LoRAs is determined by the rank and number of LoRAs you wish to cache.
The size of a LoRA is roughly low_rank * inner_dim * num_modules * num_layers, where inner_dim is the hidden dimension of the layer you are adapting and num_modules is the number of modules you are adapting per layer (e.g. the q, k, and v tensors). Note that inner_dim can vary from module to module.
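As a rough worked example, assume a Llama-3-8B-class base model (hidden dimension 4096, 32 layers) with 4 adapted modules per layer and rank-32 adapters; these dimensions are illustrative assumptions, not exact figures:
# illustrative estimate only; real sizes depend on the exact modules and dimensions adapted
echo $(( 32 * 4096 * 4 * 32 ))   # = 16777216 values, roughly 34 MB per LoRA at 16-bit precision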
TensorRT-LLM backend: The cache is pre-allocated assuming all LoRA adapters have rank equal to NIM_MAX_LORA_RANK.
LoRAs are not required to have the same rank, and if LoRAs with a lower rank are used at inference time, more than the specified NIM_MAX_*_LORAS will fit into the cache.
For example, if the GPU cache was configured for 8 rank-64 LoRAs, it could run a batch of 32 rank-16 LoRAs (8 * 64 = 32 * 16 rank slots).
In addition to the cache for weights, the TensorRT-LLM engine preallocates additional memory for LoRA activations. The required space scales with max_batch_size * max_lora_rank.
NIM automatically estimates the memory required for activations and the PEFT cache at startup, and reserves the remaining memory for the key-value cache.
Export all of your non-default environment variables, then run the server. If you use the four models downloaded from NGC, you will have one base model and four LoRAs available for inference.
export NIM_PEFT_SOURCE=/home/nvs/loras
export NIM_PEFT_REFRESH_INTERVAL=3600 # will check NIM_PEFT_SOURCE for newly added models every hour
export CONTAINER_NAME=meta-llama3-8b-instruct
export NIM_CACHE_PATH=~/nim-cache
mkdir -p $NIM_CACHE_PATH
chmod -R 777 $NIM_CACHE_PATH
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-e NIM_PEFT_SOURCE \
-e NIM_PEFT_REFRESH_INTERVAL \
-v $NIM_CACHE_PATH:/opt/nim/.cache \
-v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
-p 8000:8000 \
nvcr.io/nvidian/nim-llm-dev/meta-llama3-8b-instruct:24.05.rc15
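Once the container is running, you can optionally poll the readiness endpoint before sending requests; /v1/health/ready is the readiness route documented for NIM for LLMs, so adjust it if your image version differs:
# wait for the server to report that it is ready to accept requests
curl -s 'http://0.0.0.0:8000/v1/health/ready'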
The list of models available for inference can be found by running:
curl -X GET 'http://0.0.0.0:8000/v1/models'
Output:
{
"object": "list",
"data": [
{
"id": "meta-llama3-8b-instruct",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta-llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-8d8a74889cfb423c97b1002a0f0a0fa1",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-lora_vnemo-math-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta-llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-7c9916a6ba414093a6befe6e28937a34",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-lora_vhf-math-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta-llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-e88bf7b1b63e4a35b831e17e0b98cb67",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-lora_vnemo-squad-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta-llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-fbfcfd4e59974a0bad146d7ddda23f45",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
},
{
"id": "llama3-8b-instruct-hf-squad-v1",
"object": "model",
"created": 1715702314,
"owned_by": "vllm",
"root": "meta-llama3-8b-instruct",
"parent": null,
"permission": [
{
"id": "modelperm-7a5509ab60f94e78b0433e7740b05934",
"object": "model_permission",
"created": 1715702314,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
}
]
}
Next, submit a completion or chat completion inference request against the base model or any LoRA.
You can make inference requests to any and all models returned by /v1/models. The first time you make an inference request to a LoRA adapter, there may be a loading delay, but subsequent requests to that same LoRA adapter will have lower latency because the weights are streamed from the cache.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3-8b-instruct-lora_vhf-math-v1",
"prompt": "John buys 10 packs of magic cards. each pack of 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
"max_tokens": 128
}'
This produces output similar to the following:
{
"id": "cmpl-7996e1f532804a278535a632906bae07",
"object": "text_completion",
"created": 1715664944,
"model": "llama3-8b-instruct-lora_vhf-math-v1",
"choices": [
{
"index": 0,
"text": " (total) 10*20= <<10*20=200>>200\n200*1/4=<<200*1/4=50>>50\n50 of John's cards are uncommon cards.\n#### 50",
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 35,
"total_tokens": 82,
"completion_tokens": 47
}
}
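A chat completion request against the same LoRA follows the same pattern; the message content below is illustrative:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3-8b-instruct-lora_vhf-math-v1",
"messages": [
{"role": "user", "content": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?"}
],
"max_tokens": 128
}'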