Parameter-Efficient Fine-Tuning

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to new tasks. NVIDIA NIM for LLMs supports LoRA PEFT adapters trained with the NVIDIA NeMo Framework and HuggingFace Transformers libraries. When submitting inference requests to the NIM, the server supports dynamic multi-LoRA inference, enabling simultaneous inference requests that each target a different LoRA model.

The following block diagram illustrates the architecture of dynamic multi-LoRA with NIM:

[Block diagram: dynamic multi-LoRA architecture with NIM]

You can extend a NIM to serve LoRA models by using configuration defined in environment variables. The underlying NIM that you use must match the base model of the LoRAs. For example, to set up a LoRA adapter compatible with llama3-8b-instruct, such as llama3-8b-instruct-lora:hf-math-v1, configure the nvcr.io/nim/meta/llama3-8b-instruct NIM. The process of configuring a NIM to serve compatible LoRAs is described in the following sections.

Download LoRA adapters from NGC or HuggingFace, or use your own custom LoRA adapters. Each LoRA adapter must be stored in its own directory, and one or more of these LoRA directories must be placed within the LOCAL_PEFT_DIRECTORY directory. The name of each loaded LoRA adapter is taken from the name of the adapter's directory. NVIDIA NIM for LLMs supports the NeMo format and the HuggingFace Transformers compatible format.

NeMo Format

A NeMo-formatted LoRA directory must contain one file with the .nemo extension. The name of the .nemo file does not need to match the name of its parent directory. The supported target modules are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj", "attention_qkv"].

Huggingface Transformers Format

LoRA adapters trained with Huggingface Transformers are supported. The LoRA must contain an adapter_config.json file and one of {adapter_model.safetensors, adapter_model.bin} files. The supported target modules for NIM are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj"].

The directory used for storing one or more LoRAs (your LOCAL_PEFT_DIRECTORY) should be organized according to the following example. In this example, loras is the name of the directory you pass into the docker container as the value of LOCAL_PEFT_DIRECTORY. Then the LoRAs that get loaded would be called llama3-8b-math, llama3-8b-math-hf, llama3-8b-squad, and llama3-8b-squad-hf.

loras
├── llama3-8b-math
│   └── llama3_8b_math.nemo
├── llama3-8b-math-hf
│   ├── adapter_config.json
│   └── adapter_model.bin
├── llama3-8b-squad
│   └── squad.nemo
└── llama3-8b-squad-hf
    ├── adapter_config.json
    └── adapter_model.safetensors
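
Optionally, you can sanity-check this layout before starting the NIM. The following is a minimal sketch (not part of NIM) that verifies each adapter directory under LOCAL_PEFT_DIRECTORY contains either a .nemo file or an adapter_config.json:

export LOCAL_PEFT_DIRECTORY=~/loras
# Report any adapter directory that is missing both a NeMo checkpoint and a
# HuggingFace adapter config.
for dir in "$LOCAL_PEFT_DIRECTORY"/*/; do
  name=$(basename "$dir")
  if ls "$dir"*.nemo >/dev/null 2>&1 || [ -f "$dir/adapter_config.json" ]; then
    echo "OK:      $name"
  else
    echo "MISSING: $name has neither a .nemo file nor an adapter_config.json"
  fi
done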

You can download pre-trained adapters from model registries or fine-tune custom adapters using popular frameworks such as HuggingFace Transformers and NVIDIA NeMo, and then serve them with NVIDIA NIM for LLMs. Note that LoRA model weights are tied to a particular base model. You must deploy only LoRA models that were tuned from the same base model that the NIM is serving.

Downloading LoRA adapters from NGC
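
The download commands below assume the NGC CLI (ngc) is installed and configured with your NGC API key. If you have not configured it yet, you can do so interactively first:

# One-time NGC CLI setup; you are prompted for your API key, org, and team.
ngc config set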

LoRA adapters for llama3-8b-instruct

export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY

# downloading NeMo-format loras
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"

# downloading vLLM-format loras
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:hf-math-v1"
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:hf-squad-v1"

popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY

LoRA adapters for llama3-70b-instruct

export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY

# downloading NeMo-format loras
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:nemo-squad-v1"

# downloading vLLM-format loras
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:hf-math-v1"
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:hf-squad-v1"

popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY

Downloading LoRA Adapters from Huggingface Hub

Hint

If you do not have the huggingface-cli CLI tool, install it via pip install -U "huggingface_hub[cli]".

export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY

# download a LoRA from Huggingface Hub
mkdir $LOCAL_PEFT_DIRECTORY/llama3-lora
huggingface-cli download <Huggingface LoRA name> adapter_config.json adapter_model.safetensors \
  --local-dir $LOCAL_PEFT_DIRECTORY/llama3-lora

chmod -R 777 $LOCAL_PEFT_DIRECTORY

Using Your Own Custom LoRA Adapters

If you’re using custom LoRA adapters that you’ve trained locally, create a $LOCAL_PEFT_DIRECTORY directory and copy your LoRA adapters to that directory. The names of your custom LoRA adapters must follow the naming conventions described in the previous section.

The LoRA adapters in the Downloading LoRA adapters from NGC example were trained with the NeMo framework. Use the NeMo training framework to customize the adapter bottleneck dimension and to specify the target modules for applying LoRA. LoRA can be applied to any linear layer within a transformer model, including:

  • Q, K, V attention projections

  • Attention output layer

  • Either or both of the two transformer MLP layers

For QKV projections, NeMo’s attention implementation fuses QKV into a single projection, so the LoRA implementation learns a single low-rank projection for the combined QKV.
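If you train your own adapter with NeMo, the customization points mentioned above are exposed through the PEFT training configuration. The following is an illustrative sketch only: the training script path and configuration keys are assumptions that depend on your NeMo Framework version, so verify them against the NeMo documentation for your release.

# Sketch of a NeMo LoRA training invocation (script path and keys are assumptions).
# adapter_dim sets the LoRA bottleneck (rank); target_modules selects where
# LoRA is applied (here, the fused QKV attention projection).
python examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    model.restore_from_path=<base model .nemo> \
    model.peft.peft_scheme=lora \
    model.peft.lora_tuning.adapter_dim=32 \
    model.peft.lora_tuning.target_modules=[attention_qkv]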

The following example sets up the PEFT directory for NIM using a custom LoRA adapter that has already been trained and exists at <local lora path>.

export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY

# copy a custom LoRA adapter into the local PEFT directory
cp -r <local lora path> $LOCAL_PEFT_DIRECTORY/llama3-lora

chmod -R 777 $LOCAL_PEFT_DIRECTORY

You can enable PEFT in NVIDIA NIM for LLMs by setting the NIM_PEFT_SOURCE environment variable. See Environment Variables for further information.

PEFT Caching and Dynamic Mixed-batch LoRA (Multi-LoRA)

LoRA inference is composed of three levels of PEFT (LoRA) storage and optimized kernels for mixed-batch LoRA inference.

  1. PEFT Source. Configured by NIM_PEFT_SOURCE, this is a directory where all the served LoRAs are stored for a particular model. This environment variable must be set in order for PEFT LoRA to run with NIM. There is no limit on the number of LoRAs that can be stored here. See LoRA Model Directory Structure for details on directory layout and supported formats and modules. NVIDIA NIM for LLMs searches for LoRAs in NIM_PEFT_SOURCE when it starts.

    PEFT Source Dynamic Refreshing. If you set NIM_PEFT_REFRESH_INTERVAL, NVIDIA NIM for LLMs checks NIM_PEFT_SOURCE every NIM_PEFT_REFRESH_INTERVAL seconds and adds any new LoRAs it finds. Because LOCAL_PEFT_DIRECTORY is mounted into the container at NIM_PEFT_SOURCE, a new LoRA adapter, e.g. “new-lora”, that you add to LOCAL_PEFT_DIRECTORY appears in NIM_PEFT_SOURCE. For example, if NIM_PEFT_REFRESH_INTERVAL is set to 10, NVIDIA NIM for LLMs checks NIM_PEFT_SOURCE for new models every 10 seconds; at the next refresh interval it detects that “new-lora” is not in the existing list of LoRAs and adds it to the list of available models, so checking /v1/models after NIM_PEFT_REFRESH_INTERVAL seconds shows “new-lora” in the list of models (see the example after this list). The default value of NIM_PEFT_REFRESH_INTERVAL is None, meaning that once LoRAs are added from NIM_PEFT_SOURCE at start time, NVIDIA NIM for LLMs does not check NIM_PEFT_SOURCE again; you must restart the service to add new LoRAs to LOCAL_PEFT_DIRECTORY and have them show up in the list of available models.

  2. CPU PEFT Cache. This cache holds a subset of the LoRAs in NIM_PEFT_SOURCE in host memory. LoRAs are loaded into the CPU cache when a request is issued for that LoRA. To speed up subsequent requests, LoRAs are held in CPU memory until the cache is full; when more space is needed for a LoRA not in the cache, the least recently used LoRA is evicted. Set the NIM_MAX_CPU_LORAS environment variable to specify the maximum number of LoRAs that can be held in the CPU PEFT cache. The size of the CPU cache is also an upper bound on the number of different LoRAs there can be among all the active requests. If there are more active LoRAs than can fit in the cache, the service returns a 429 error indicating that the cache is full and you should reduce the number of active LoRAs.

  3. GPU PEFT Cache. This cache generally holds a subset of the LoRAs in the CPU PEFT cache and is where LoRAs are held for inference. LoRAs are dynamically loaded into the GPU cache as they are scheduled for execution. As with the CPU cache, LoRAs remain in the GPU cache as long as there is space, and the least recently used LoRA is evicted first. Configure the size of the GPU cache by setting the NIM_MAX_GPU_LORAS environment variable. The number of LoRAs that can fit in the GPU cache is an upper bound on the number of LoRAs that can be executed in the same batch. Note that larger values result in higher GPU memory usage.
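
The following sketch illustrates the dynamic refresh behavior described in item 1 above. The adapter name new-lora and the local adapter path are placeholders:

# Add a new adapter to the mounted LoRA directory while the NIM is running,
# wait one refresh interval, then confirm the adapter is listed.
cp -r <local lora path> $LOCAL_PEFT_DIRECTORY/new-lora
sleep $NIM_PEFT_REFRESH_INTERVAL
curl -s http://0.0.0.0:8000/v1/models | grep new-lora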

Both the CPU and GPU caches are pre-allocated according to NIM_MAX_CPU_LORAS, NIM_MAX_GPU_LORAS, and NIM_MAX_LORA_RANK. NIM_MAX_LORA_RANK sets the maximum supported low rank (adapter size).

PEFT Cache memory requirements

The memory required for caching LoRAs is determined by the rank and number of LoRAs you wish to cache. The size of a LoRA is roughly low_rank * inner_dim * num_modules * num_layers, where inner_dim is the hidden dimension of the layer you are adapting, and num_modules is the number of modules you are adapting per layer (e.g. q, k, v tensors). Note that inner_dim can vary from module to module.
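
As a rough illustration of this approximation, the following sketch estimates the per-LoRA weight size for hypothetical llama3-8b-style numbers; all of the values are illustrative assumptions (in particular, inner_dim varies per module), not NIM defaults:

# Back-of-the-envelope LoRA size using the approximation above.
LOW_RANK=32        # adapter rank
INNER_DIM=4096     # hidden dimension of the adapted layers (varies per module)
NUM_MODULES=7      # adapted modules per layer (e.g. q, k, v, o, gate, up, down)
NUM_LAYERS=32
BYTES_PER_PARAM=2  # fp16/bf16 weights
PARAMS=$((LOW_RANK * INNER_DIM * NUM_MODULES * NUM_LAYERS))
echo "~$((PARAMS * BYTES_PER_PARAM / 1024 / 1024)) MiB per LoRA"   # about 56 MiB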

TensorRT-LLM backend: The cache is pre-allocated with enough memory for all the LoRAs to have NIM_MAX_LORA_RANK. LoRAs are not required to have the same rank, and if LoRAs with lower rank are used at inference time, more than the specified NIM_MAX_GPU_LORAS and/or NIM_MAX_CPU_LORAS will fit into the cache. For example, if the GPU cache is configured for 8 rank 64 LoRAs, NVIDIA NIM for LLMs can run a batch of 32 rank 16 LoRAs.

In addition to cache for weights, the TensorRT-LLM engine preallocates additional memory for LoRA activations. The required space scales relative to max_batch_size * max_lora_rank. NVIDIA NIM for LLMs automatically estimates the memory required for activations and the PEFT cache on start up, and reserves the remaining memory for the key-value cache.

Note

This section includes setup instructions for LoRAs tuned on llama3-8b-instruct. The workflow to set up LoRAs for llama3-70b-instruct is similar, but the larger base model results in larger memory requirements for llama3-70b LoRAs. See PEFT Cache memory requirements.

Export all of your non-default environment variables, and then run the server. If you use the four models downloaded from NGC, you will have one base model and four LoRAs available for inference.

export NIM_PEFT_SOURCE=/home/nvs/loras
export NIM_PEFT_REFRESH_INTERVAL=3600  # check NIM_PEFT_SOURCE for newly added models every hour
export CONTAINER_NAME=meta-llama3-8b-instruct
export NIM_CACHE_PATH=~/nim-cache
mkdir -p $NIM_CACHE_PATH  # ensure the cache directory exists before adjusting permissions
chmod -R 777 $NIM_CACHE_PATH

docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_PEFT_SOURCE \
    -e NIM_PEFT_REFRESH_INTERVAL \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
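
After the container starts, the base model and LoRAs take some time to load. A simple way to wait for readiness before sending traffic, assuming the health endpoint exposed by NVIDIA NIM for LLMs and the port mapping above:

# Poll the readiness endpoint until the service reports ready.
until curl -sf http://0.0.0.0:8000/v1/health/ready > /dev/null; do
  echo "Waiting for NIM to become ready..."
  sleep 5
done
echo "NIM is ready"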

Run Multi-LoRA Inference

The list of models available for inference can be found by running:

curl -X GET 'http://0.0.0.0:8000/v1/models'

Tip

Pipe the results of curl commands into a tool like jq or python -m json.tool to make the output of the API easier to read. For example: curl -s http://0.0.0.0:8000/v1/models | jq.

Output:

{ "object": "list", "data": [ { "id": "meta/llama3-8b-instruct", "object": "model", "created": 1715702314, "owned_by": "vllm", "root": "meta/llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-8d8a74889cfb423c97b1002a0f0a0fa1", "object": "model_permission", "created": 1715702314, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] }, { "id": "llama3-8b-instruct-lora_vnemo-math-v1", "object": "model", "created": 1715702314, "owned_by": "vllm", "root": "meta/llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-7c9916a6ba414093a6befe6e28937a34", "object": "model_permission", "created": 1715702314, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] }, { "id": "llama3-8b-instruct-lora_vhf-math-v1", "object": "model", "created": 1715702314, "owned_by": "vllm", "root": "meta/llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-e88bf7b1b63e4a35b831e17e0b98cb67", "object": "model_permission", "created": 1715702314, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] }, { "id": "llama3-8b-instruct-lora_vnemo-squad-v1", "object": "model", "created": 1715702314, "owned_by": "vllm", "root": "meta/llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-fbfcfd4e59974a0bad146d7ddda23f45", "object": "model_permission", "created": 1715702314, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] }, { "id": "llama3-8b-instruct-hf-squad-v1", "object": "model", "created": 1715702314, "owned_by": "vllm", "root": "meta/llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-7a5509ab60f94e78b0433e7740b05934", "object": "model_permission", "created": 1715702314, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] } ] }

Next, submit a completion or chat completion inference request against the base model or any LoRA. You can make inference requests to any and all models returned by /v1/models. The first time you make an inference request to a LoRA adapter, there may be a loading time, but subsequent requests to that same LoRA adapter will have lower latency, as the weights are streamed from the cache.

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b-instruct-lora_vhf-math-v1",
    "prompt": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
    "max_tokens": 128
  }'

This produces output similar to the following:

{ "id": "cmpl-7996e1f532804a278535a632906bae07", "object": "text_completion", "created": 1715664944, "model": "llama3-8b-instruct-lora_vhf-math-v1", "choices": [ { "index": 0, "text": " (total) 10*20= <<10*20=200>>200\n200*1/4=<<200*1/4=50>>50\n50 of John's cards are uncommon cards.\n#### 50", "logprobs": null, "finish_reason": "stop", "stop_reason": null } ], "usage": { "prompt_tokens": 35, "total_tokens": 82, "completion_tokens": 47 } }
