NIM for LLM Benchmarking Guide

Benchmarking LoRA Models

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient fine-tuning of large pretrained models. NIM currently supports Low-Rank Adaptation (LoRA), a simple and efficient way to tailor LLMs to specific domains and use cases. With NIM, you can load and deploy multiple LoRA adapters. Follow the instructions in the Parameter-Efficient Fine-Tuning section to place Hugging Face or NeMo trained adapters in a directory and pass that directory to NIM as the value of an environment variable.
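As a point of reference, a multi-LoRA deployment might look like the following minimal sketch, assuming NIM_PEFT_SOURCE is the environment variable your NIM version reads for the adapter location (verify against the linked Parameter-Efficient Fine-Tuning documentation); the local path and image tag are placeholders.

```bash
# Illustrative only: one subdirectory per adapter, e.g.
#   $LOCAL_PEFT_DIRECTORY/llama3-8b-instruct-lora_vhf-math-v1/
#   $LOCAL_PEFT_DIRECTORY/llama3-8b-instruct-lora_vnemo-math-v1/
export LOCAL_PEFT_DIRECTORY=/path/to/loras   # adapter directory on the host (placeholder)
export NIM_PEFT_SOURCE=/home/nvs/loras       # where the container looks for adapters

docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_PEFT_SOURCE \
  -v "$LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest  # image tag is a placeholder
```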

Once the adapters are added, you can query the LoRA model just like you would the base model, by replacing the base model ID with the LoRA model name, as shown in the following example.

```bash
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b-instruct-lora_vhf-math-v1",
    "prompt": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
    "max_tokens": 128
  }'
```
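Before benchmarking, it can be useful to confirm that the adapters were loaded. The OpenAI-compatible /v1/models endpoint lists everything the deployment is serving, and the response should include the base model plus each LoRA adapter (a minimal check, assuming the server is reachable at the same address as above).

```bash
# List the base model and all loaded LoRA adapters
curl -s 'http://0.0.0.0:8000/v1/models' | python3 -m json.tool
```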

With GenAI-Perf, you can benchmark the deployment metrics for LoRA models by passing the IDs of the LoRA models with the -m argument, as shown in the following example.

The example below benchmarks two LoRA models, provided that you have followed the instructions in [Parameter-Efficient Fine-Tuning](https://docs.nvidia.com/nim/large-language-models/latest/peft.html) to deploy llama3-8b-instruct-lora_vnemo-math-v1 and llama3-8b-instruct-lora_vhf-math-v1. Additionally, --model-selection-strategy {round_robin,random} specifies whether these adapters are called in a round-robin or random fashion.

```bash
genai-perf \
  -m llama3-8b-instruct-lora_vnemo-math-v1 llama3-8b-instruct-lora_vhf-math-v1 \
  --model-selection-strategy random \
  --endpoint-type completions \
  --service-kind openai \
  --streaming
```

Evaluating the latency and throughput of such a multi-LoRA deployment is nontrivial. This section describes several factors to consider when benchmarking the performance of an LLM LoRA inference framework.

  • Base model: Both small and large models, such as [Llama 3 8B](https://build.nvidia.com/meta/llama3-8b) and [Llama 3 70B](https://build.nvidia.com/meta/llama3-70b), respectively, can be used as base models for LoRA fine-tuning and inference. Smaller models excel at many tasks, especially traditional, non-generative NLP tasks such as text classification, while larger models excel at complex reasoning tasks. One of the advantages of LoRA is that even a large 70B model can be tuned on a single [NVIDIA DGX H100](https://www.nvidia.com/en-us/data-center/dgx-h100/) or A100 node with FP16, or even a single [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) or [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPU with 4-bit quantization.

  • Adapters: The LoRA rank determines the size of each adapter. Users typically prefer the flexibility to experiment and select the rank that yields the best accuracy. System operators, on the other hand, might prefer to enforce a fixed size, because uniform LoRAs enable better batching and performance. Common choices for LoRA rank are 8, 16, 32, and 64.

  • Test parameters: Several other test parameters to consider for benchmarking include the following (a combined GenAI-Perf sketch follows this list):

    • Output length control: The ignore_eos parameter directs the inference framework to continue generating text until it reaches the requested maximum output length (max_tokens in the example above). This ensures that the OSL (output sequence length) specified by the use case is met. The parameter is increasingly supported by LLM inference frameworks and significantly simplifies benchmark setup: with ignore_eos, you don't have to train adapters on real tasks to perform profiling, because the output length no longer depends on what the model generates.

    • System load: Concurrency (number of concurrent users) is commonly used to drive load into the system. This parameter value should reflect real use cases, while also taking into account the maximum batch size that the system can effectively serve concurrently. For an 8B model on one GPU, consider up to 250 concurrent users as a realistic server load.

    • Task type: You should consider both generative and non-generative tasks, which differ in ISL (input sequence length) and OSL. An ISL of 200 to 2,000 tokens and an OSL of 1 to 2,000 tokens reflect a wide range of LLM applications, from text classification and summarization to translation and code generation.
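The sketch below shows one way these test parameters map onto a single GenAI-Perf run against the NIM endpoint used earlier. The concurrency, ISL, and OSL values are illustrative, and the flag names reflect a recent GenAI-Perf release; check genai-perf --help for the exact options available in your installed version.

```bash
# Illustrative multi-LoRA benchmark run:
#   --concurrency                   drives system load (number of concurrent users)
#   --synthetic-input-tokens-mean   sets the target ISL of the synthetic prompts
#   --output-tokens-mean            sets the target OSL
#   --extra-inputs ignore_eos:true  forces generation to continue up to max_tokens
genai-perf \
  -m llama3-8b-instruct-lora_vnemo-math-v1 llama3-8b-instruct-lora_vhf-math-v1 \
  --model-selection-strategy round_robin \
  --endpoint-type completions \
  --service-kind openai \
  --streaming \
  -u localhost:8000 \
  --concurrency 50 \
  --synthetic-input-tokens-mean 500 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 250 \
  --extra-inputs max_tokens:250 \
  --extra-inputs ignore_eos:true
```

Repeating the run over a sweep of concurrency values and ISL/OSL combinations yields the latency and throughput curves needed to compare base models, adapter ranks, and selection strategies.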
