Large Language Models (RC15)

Parameter-Efficient Fine-Tuning

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to new tasks. NIM supports LoRA PEFT adapters trained with the NeMo Framework and HuggingFace Transformers libraries. At inference time, the server supports dynamic multi-LoRA serving, so simultaneous inference requests can target different LoRA models.

The block diagram below illustrates the architecture of dynamic multi-LoRA with NIM:

[Figure: dynamic multi-LoRA architecture with NIM (multi-lora-diagram.jpg)]

You can extend a NIM to serve LoRA models by using configuration defined in environment variables. The underlying NIM that you use must match the base model of the LoRAs. For example, to set up a LoRA adapter compatible with llama3-8b-instruct, such as llama3-8b-instruct-lora:hf-math-v1, configure the nvcr.io/nim/meta/llama3-8b-instruct NIM. The process of configuring a NIM to serve compatible LoRAs is described in the following sections.

You may download LoRA adapters from NGC or HuggingFace Hub, or use your own custom LoRA adapters. However your LoRAs are sourced, each LoRA adapter must be stored in its own directory, and one or more of these LoRA directories must be placed within a single LOCAL_PEFT_DIRECTORY directory. The name of each loaded LoRA adapter matches the name of its directory. Three LoRA formats are supported: NeMo format, HuggingFace Transformers compatible format, and TensorRT-LLM compatible format.

NeMo Format

A NeMo-formatted LoRA should be a directory containing one .nemo file. The name of the .nemo file does not need to match the name of its parent directory. The supported target modules are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj", "attention_qkv"].

Huggingface Transformers Format

LoRA adapters trained with Huggingface Transformers are supported. The LoRA must contain an adapter_config.json file and one of {adapter_model.safetensors, adapter_model.bin} files. The supported target modules for NIM are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj"]. You may choose to only use a subset of these modules.

TensorRT-LLM Format

A TensorRT-LLM formatted LoRA must contain one of {model.lora_config.npy, model.lora_config.bin} and one of {model.lora_weights.npy, model.lora_weights.bin}. The supported target modules for NIM are ["attn_qkv", "attn_q", "attn_k", "attn_v", "attn_dense", "mlp_h_to_4h", "mlp_gate", "mlp_4h_to_h"]. As with the HuggingFace Transformers LoRAs, you may choose to only use a subset of these modules.

The directory used for storing one or more LoRAs (your LOCAL_PEFT_DIRECTORY) should be organized according to the following example. In this example, loras is the directory you’d pass into the docker container as your LOCAL_PEFT_DIRECTORY. Then the LoRAs that get loaded would be called llama3-8b-math, llama3-8b-math-hf, llama3-8b-squad, llama3-8b-math-trtllm, and llama3-8b-squad-hf.


loras
├── llama3-8b-math
│   └── llama3_8b_math.nemo
├── llama3-8b-math-hf
│   ├── adapter_config.json
│   └── adapter_model.bin
├── llama3-8b-squad
│   └── squad.nemo
├── llama3-8b-math-trtllm
│   ├── model.lora_config.npy
│   └── model.lora_weights.npy
└── llama3-8b-squad-hf
    ├── adapter_config.json
    └── adapter_model.safetensors
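
If you want to sanity-check a directory like this before mounting it, a minimal sketch such as the following (a hypothetical helper, not part of NIM) classifies each subdirectory by the file names listed above:

# Hypothetical helper (not part of NIM): classify each adapter directory by the
# file names required for the three supported LoRA formats.
export LOCAL_PEFT_DIRECTORY=~/loras

for dir in "$LOCAL_PEFT_DIRECTORY"/*/; do
    name=$(basename "$dir")
    if compgen -G "${dir}*.nemo" > /dev/null; then
        echo "$name: NeMo format"
    elif [ -f "${dir}adapter_config.json" ] && { [ -f "${dir}adapter_model.safetensors" ] || [ -f "${dir}adapter_model.bin" ]; }; then
        echo "$name: HuggingFace Transformers format"
    elif compgen -G "${dir}model.lora_config.*" > /dev/null && compgen -G "${dir}model.lora_weights.*" > /dev/null; then
        echo "$name: TensorRT-LLM format"
    else
        echo "$name: no supported LoRA format detected" >&2
    fi
done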

You can download pre-trained adapters from model registries or fine-tune custom adapters with popular frameworks such as HuggingFace Transformers and NVIDIA NeMo, then serve them with NIM. Note that LoRA model weights are tied to a particular base model: only deploy LoRA models that were tuned from the same base model that NIM is serving.

Downloading LoRA adapters from NGC

Hint

If you do not have the ngc CLI tool, please see the instructions for downloading and configuring NGC CLI.


export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY

# download NeMo-format LoRAs
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:nemo-squad-v1"

# download HuggingFace-format LoRAs
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:hf-math-v1"
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:hf-squad-v1"

popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY
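
Each download-version call typically unpacks into its own subdirectory of LOCAL_PEFT_DIRECTORY, named after the model and version (for example, llama3-8b-instruct-lora_vnemo-math-v1), and those directory names become the LoRA model names that NIM serves. A quick listing confirms the layout:

# one directory per downloaded adapter; each directory name becomes a served LoRA name
ls -1 $LOCAL_PEFT_DIRECTORY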

Downloading LoRA Adapters from Huggingface Hub

Hint

If you do not have the huggingface-cli CLI tool, install it via pip install -U "huggingface_hub[cli]".


export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY

# download a LoRA from Huggingface Hub
mkdir $LOCAL_PEFT_DIRECTORY/llama3-lora
huggingface-cli download <Huggingface LoRA name> adapter_config.json adapter_model.safetensors --local-dir $LOCAL_PEFT_DIRECTORY/llama3-lora

chmod -R 777 $LOCAL_PEFT_DIRECTORY

Using Your Own Custom LoRA Adapters

If you’re using your own LoRA adapters that you’ve trained locally, create a $LOCAL_PEFT_DIRECTORY directory and copy each LoRA adapter into its own subdirectory of that directory. Your custom LoRA adapters must follow the file names and target modules specified in the previous sections.

The LoRA adapters in the Downloading LoRA Adapters from NGC example were trained with the NeMo Framework. In the NeMo training framework, you can customize the adapter bottleneck dimension and specify the target modules for applying LoRA. LoRA can be applied to any linear layer within a transformer model, including:

  1. Q, K, V attention projections

  2. Attention output layer

  3. Either or both of the two transformer MLP layers

For QKV projections, NeMo’s attention implementation fuses QKV into a single projection, so the LoRA implementation learns a single low-rank projection for the combined QKV.


export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY

# copy a locally trained LoRA into its own directory
cp -r <local lora path> $LOCAL_PEFT_DIRECTORY/llama3-lora

chmod -R 777 $LOCAL_PEFT_DIRECTORY

PEFT is enabled in NIM by specifying environment variables. For a list of environment variables used for PEFT and what their default values are, see Environment Variables.
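
As a reference, the PEFT-related variables used throughout this guide can be exported before starting the container; the values below are illustrative, not defaults:

# PEFT-related environment variables used in this guide (illustrative values)
export NIM_PEFT_SOURCE=/home/nvs/loras    # path inside the container where the LoRA directory is mounted
export NIM_PEFT_REFRESH_INTERVAL=3600     # re-scan NIM_PEFT_SOURCE every hour; leave unset to disable re-scanning
export NIM_MAX_CPU_LORAS=16               # number of LoRAs held in the CPU PEFT cache
export NIM_MAX_GPU_LORAS=8                # number of LoRAs held in the GPU PEFT cache
export NIM_MAX_LORA_RANK=32               # maximum supported adapter rank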

PEFT Caching and Dynamic Mixed-batch LoRA

As described below, LoRA inference is composed of three levels of PEFT (LoRA) storage, plus optimized kernels for mixed-batch LoRA inference.

  1. PEFT Source. Configured by NIM_PEFT_SOURCE, this is a directory where all the served LoRAs are stored for a particular model. There is no limit on the number of LoRAs that can be stored here. See LoRA Model Directory Structure for details on directory layout and supported formats and modules. NIM will discover the LoRAs in NIM_PEFT_SOURCE when it starts up.

    PEFT Source Dynamic Refreshing. NIM can also re-scan LOCAL_PEFT_DIRECTORY and pick up newly added LoRAs at a time interval (in seconds) configured by NIM_PEFT_REFRESH_INTERVAL. For example, if NIM_PEFT_REFRESH_INTERVAL is set to 10 and you add a new adapter such as “new-lora” to LOCAL_PEFT_DIRECTORY, NIM checks NIM_PEFT_SOURCE every 10 seconds, notices at the next refresh that “new-lora” is not in the existing list of LoRAs, and adds it to the list of available models; querying /v1/models after that point shows “new-lora” (see the polling sketch after this list). The default value of NIM_PEFT_REFRESH_INTERVAL is None, meaning LoRAs are only loaded from NIM_PEFT_SOURCE at start time; in that case the service must be restarted for new LoRAs added to LOCAL_PEFT_DIRECTORY to appear in the list of available models.

  2. CPU PEFT Cache. This cache holds a subset of the LoRAs in NIM_PEFT_SOURCE in host memory. LoRAs are loaded into the CPU cache when a request is issued for that LoRA. To speed up subsequent requests, LoRAs are held in CPU memory until the cache is full; when space is needed for a LoRA not in the cache, the least recently used LoRA is evicted first. The size of the CPU PEFT cache is configured by setting NIM_MAX_CPU_LORAS, which determines the maximum number of LoRAs that can be held in CPU memory. The CPU cache size is also an upper bound on the number of distinct LoRAs among all active requests. If there are more active LoRAs than fit in the cache, the service returns a 429 indicating the cache is full and you should reduce the number of active LoRAs.

  3. GPU PEFT Cache. This cache generally holds a subset of the LoRAs in the CPU PEFT cache. This is where LoRAs are held for inference. LoRAs are dynamically loaded into the GPU cache as they are scheduled for execution.
    Like with the CPU cache, LoRAs will remain in the GPU cache as long as there is space and the least recently used LoRA will be removed first. The size of the GPU cache is configured like the CPU cache by setting NIM_MAX_GPU_LORAS. The number of LoRAs that can fit in the GPU cache is an upper bound on the number of LoRAs that can be executed in the same batch. Note that larger numbers will cause higher memory usage.
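
To see dynamic refreshing in action (item 1 above), you can copy a new adapter into LOCAL_PEFT_DIRECTORY and poll /v1/models until it is discovered. The sketch below assumes a running NIM on port 8000, a small NIM_PEFT_REFRESH_INTERVAL, a hypothetical adapter directory named new-lora, and the jq tool for filtering JSON:

# copy a hypothetical new adapter into the mounted PEFT source directory
cp -r ~/new-lora $LOCAL_PEFT_DIRECTORY/new-lora

# poll the model list until the next refresh picks it up
until curl -s http://0.0.0.0:8000/v1/models | jq -e '.data[] | select(.id == "new-lora")' > /dev/null; do
    echo "waiting for new-lora to be discovered..."
    sleep 10
done
echo "new-lora is now available for inference"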

Both the CPU and GPU caches are preallocated according to NIM_MAX_CPU_LORAS, NIM_MAX_GPU_LORAS, and NIM_MAX_LORA_RANK. NIM_MAX_LORA_RANK sets the maximum supported low rank (adapter size).

PEFT Cache memory requirements

The memory required for caching LoRAs is determined by the rank and number of LoRAs you wish to cache. The size of a LoRA is roughly low_rank * inner_dim * num_modules * num_layers, where inner_dim is the hidden dimension of the layer you are adapting, and num_modules is the number of modules you are adapting per layer (e.g. q, k, v tensors). Note that inner_dim can vary from module to module.
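
As a rough, back-of-the-envelope illustration (the numbers below are assumptions for a Llama-3-8B-scale model with 16-bit adapter weights, not values reported by NIM):

# rank 32, inner_dim 4096, 7 adapted modules per layer, 32 layers, 2 bytes per 16-bit weight
echo $(( 32 * 4096 * 7 * 32 * 2 ))   # ~58.7 million bytes, roughly 56 MiB per cached LoRA
# multiply by NIM_MAX_GPU_LORAS / NIM_MAX_CPU_LORAS to approximate the cache preallocation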

TensorRT-LLM backend: The cache is pre-allocated assuming all LoRA adapters have a rank equal to NIM_MAX_LORA_RANK. LoRAs are not required to have the same rank, and if LoRAs with lower rank are used at inference time, more than the specified NIM_MAX_*_LORAS will fit into the cache. For example, if the GPU cache was configured for 8 LoRAs of rank 64, it could hold a batch of 32 LoRAs of rank 16.

In addition to cache for weights, the TensorRT-LLM engine will preallocate additional memory for LoRA activations. The required space scales relative to max_batch_size * max_lora_rank. NIM will automatically estimate the memory required for activations and the PEFT cache on start up, and reserve remaining memory for the key-value cache.

Export all of your non-default environment variables, then run the server. If you use the four LoRA adapters downloaded from NGC, you will have one base model and four LoRAs available for inference.


export NIM_PEFT_SOURCE=/home/nvs/loras
export NIM_PEFT_REFRESH_INTERVAL=3600  # check NIM_PEFT_SOURCE for newly added models every hour
export CONTAINER_NAME=meta-llama3-8b-instruct
export NIM_CACHE_PATH=~/nim-cache
chmod -R 777 $NIM_CACHE_PATH

docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_PEFT_SOURCE \
    -e NIM_PEFT_REFRESH_INTERVAL \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
    -p 8000:8000 \
    nvcr.io/nvidian/nim-llm-dev/meta-llama3-8b-instruct:24.05.rc15
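
Before sending requests, you can wait for the server to report readiness. The sketch below assumes the standard NIM health endpoint /v1/health/ready is exposed on port 8000:

# poll until the server reports ready (adjust the endpoint if your NIM version differs)
until curl -sf http://0.0.0.0:8000/v1/health/ready > /dev/null; do
    echo "waiting for NIM to become ready..."
    sleep 5
done
echo "NIM is ready"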

The list of models available for inference can be found by running:


curl -X GET 'http://0.0.0.0:8000/v1/models'

Output:


{ "object": "list", "data": [ { "id": "meta-llama3-8b-instruct", "object": "model", "created": 1715702314, "owned_by": "vllm", "root": "meta-llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-8d8a74889cfb423c97b1002a0f0a0fa1", "object": "model_permission", "created": 1715702314, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] }, { "id": "llama3-8b-instruct-lora_vnemo-math-v1", "object": "model", "created": 1715702314, "owned_by": "vllm", "root": "meta-llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-7c9916a6ba414093a6befe6e28937a34", "object": "model_permission", "created": 1715702314, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] }, { "id": "llama3-8b-instruct-lora_vhf-math-v1", "object": "model", "created": 1715702314, "owned_by": "vllm", "root": "meta-llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-e88bf7b1b63e4a35b831e17e0b98cb67", "object": "model_permission", "created": 1715702314, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] }, { "id": "llama3-8b-instruct-lora_vnemo-squad-v1", "object": "model", "created": 1715702314, "owned_by": "vllm", "root": "meta-llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-fbfcfd4e59974a0bad146d7ddda23f45", "object": "model_permission", "created": 1715702314, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] }, { "id": "llama3-8b-instruct-hf-squad-v1", "object": "model", "created": 1715702314, "owned_by": "vllm", "root": "meta-llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-7a5509ab60f94e78b0433e7740b05934", "object": "model_permission", "created": 1715702314, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] } ] }

Next, submit a completion or chat completion inference request against the base model or any LoRA. You can make inference requests to any and all models returned by /v1/models. The first time you make an inference request to a LoRA adapter, there may be a loading time, but subsequent requests to that same LoRA adapter will have lower latency, as the weights are streamed from the cache.


curl -X 'POST' \
    'http://0.0.0.0:8000/v1/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "llama3-8b-instruct-lora_vhf-math-v1",
      "prompt": "John buys 10 packs of magic cards. each pack of 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
      "max_tokens": 128
    }'

This produces output similar to the following:


{ "id": "cmpl-7996e1f532804a278535a632906bae07", "object": "text_completion", "created": 1715664944, "model": "llama3-8b-instruct-lora_vhf-math-v1", "choices": [ { "index": 0, "text": " (total) 10*20= <<10*20=200>>200\n200*1/4=<<200*1/4=50>>50\n50 of John's cards are uncommon cards.\n#### 50", "logprobs": null, "finish_reason": "stop", "stop_reason": null } ], "usage": { "prompt_tokens": 35, "total_tokens": 82, "completion_tokens": 47 } }
