Benchmarking LoRA Models

NIM for LLM Benchmarking Guide (Latest)

Parameter-Efficient Fine-Tuning (PEFT) methods allow for efficient fine-tuning of large pretrained models. NIM currently supports Low-Rank Adaptation (LoRA), a simple and efficient way to tailor LLMs to specific domains and use cases. With NIM, you can load and deploy multiple LoRA adapters. Follow the instructions in the Parameter-Efficient Fine-Tuning section to place Hugging Face- or NeMo-trained adapters in a directory and pass that directory to NIM as the value of an environment variable.
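As a rough illustration, a launch sketch might look like the following. It assumes the adapters are stored in a host directory and exposed to the container through the NIM_PEFT_SOURCE environment variable described in the Parameter-Efficient Fine-Tuning section; the host path, container path, image name, and tag shown here are placeholders, so adjust them to match that section and your environment.

# Illustrative sketch only: paths, image name, and tag are placeholders.
export LOCAL_PEFT_DIRECTORY=/home/user/loras    # host directory containing the LoRA adapters
docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_PEFT_SOURCE=/home/nvs/loras \
  -v "$LOCAL_PEFT_DIRECTORY:/home/nvs/loras" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest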

Once the adapters are added, you can query the LoRA model just like you would the base model, by replacing the base model ID with the LoRA model name, as shown in the following example.


curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b-instruct-lora_vhf-math-v1",
    "prompt": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
    "max_tokens": 128
  }'

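Because NIM exposes an OpenAI-compatible API, you can also list the currently loaded models to confirm the LoRA adapter names before querying them; the host and port below are the same placeholders used in the example above.

curl -s 'http://0.0.0.0:8000/v1/models' \
  -H 'accept: application/json'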
With GenAI-Perf, you can benchmark the deployment metrics for LoRA models by passing the LoRA model IDs to the -m argument, as shown in the following example.


genai-perf \
  -m llama-3-8b-lora_1 llama-3-8b-lora_2 llama-3-8b-lora_3 \
  --model-selection-strategy random \
  --endpoint-type completions \
  --service-kind openai \
  --streaming

This example tests three LoRA models: llama-3-8b-lora_1, llama-3-8b-lora_2, and llama-3-8b-lora_3. The --model-selection-strategy random option specifies that the adapters are called in random order. You can also set this option to round_robin to call the adapters in round-robin order, as sketched below.
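For example, to cycle through the same three adapters in order rather than at random, only the selection strategy needs to change; the other arguments stay the same as in the command above.

genai-perf \
  -m llama-3-8b-lora_1 llama-3-8b-lora_2 llama-3-8b-lora_3 \
  --model-selection-strategy round_robin \
  --endpoint-type completions \
  --service-kind openai \
  --streaming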
