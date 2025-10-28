Benchmarking LoRA Models

Parameter-Efficient Fine-Tuning (PEFT) methods allow for efficient fine-tuning of large pretrained models. NIM currently supports Low-Rank Adaptation (LoRA), a simple and efficient way to tailor LLMs to specific domains and use cases. With NIM, you can load and deploy multiple LoRA adapters. Follow the instructions in the Parameter-Efficient Fine-Tuning section to place Hugging Face or NeMo trained adapters in a directory, then pass that directory to NIM as the value of an environment variable.

Once the adapters are added, you can query the LoRA model just like you would the base model, by replacing the base model ID with the LoRA model name, as shown in the following example.

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b-instruct-lora_vhf-math-v1",
    "prompt": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
    "max_tokens": 128
  }'
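
The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; it assumes the endpoint URL and payload fields from the curl example above, and the helper names `build_completion_request` and `complete` are hypothetical, not part of NIM.

```python
import json
import urllib.request

# Assumed endpoint, copied from the curl example; adjust for your deployment.
NIM_URL = "http://0.0.0.0:8000/v1/completions"


def build_completion_request(model: str, prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the completions endpoint.

    `model` can be the base model ID or the name of a deployed LoRA adapter.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        NIM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def complete(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Send the request and return the parsed JSON body (needs a running server)."""
    request = build_completion_request(model, prompt, max_tokens)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `complete("llama3-8b-instruct-lora_vhf-math-v1", "...")` targets the LoRA adapter; passing the base model ID instead targets the base model, with no other change to the request.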

With GenAI-Perf, you can benchmark the deployment metrics for LoRA models by passing the IDs of the LoRA models with the "-m" argument.

The following example benchmarks two LoRA models, assuming that you have followed the instructions in [Parameter-Efficient Fine-Tuning](https://docs.nvidia.com/nim/large-language-models/latest/peft.html) to deploy llama3-8b-instruct-lora_vnemo-math-v1 and llama3-8b-instruct-lora_vhf-math-v1. The "--model-selection-strategy {round_robin,random}" option specifies whether these adapters are called in round-robin or random order.

genai-perf profile \
  -m llama3-8b-instruct-lora_vnemo-math-v1 llama3-8b-instruct-lora_vhf-math-v1 \
  --model-selection-strategy random \
  --endpoint-type completions \
  --service-kind openai \
  --streaming \
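
To make the two selection strategies concrete, here is a conceptual sketch of what round-robin and random selection mean for a list of adapters. It is an illustration only, not GenAI-Perf's implementation; the function `select_models` and its `seed` parameter are hypothetical.

```python
import itertools
import random
from typing import Iterator, List, Optional


def select_models(models: List[str], strategy: str, seed: Optional[int] = None) -> Iterator[str]:
    """Yield model names indefinitely according to the chosen strategy.

    Illustrative only: this mirrors the meaning of the two strategy names,
    not how GenAI-Perf implements them internally.
    """
    if not models:
        raise ValueError("at least one model name is required")
    if strategy == "round_robin":
        # Cycle through the models in the order given: a, b, a, b, ...
        yield from itertools.cycle(models)
    elif strategy == "random":
        # Pick each model independently and uniformly at random.
        rng = random.Random(seed)
        while True:
            yield rng.choice(models)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")


adapters = [
    "llama3-8b-instruct-lora_vnemo-math-v1",
    "llama3-8b-instruct-lora_vhf-math-v1",
]
# First four picks under round-robin alternate between the two adapters.
first_four = list(itertools.islice(select_models(adapters, "round_robin"), 4))
```

With round-robin, each adapter receives an equal share of requests in a fixed order; with random selection, the shares are only approximately equal over a long run, which can matter when comparing per-adapter latency.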