Profile Multiple LoRA Adapters#

GenAI-Perf allows you to profile multiple LoRA adapters on top of a base model.

Select LoRA Adapters#

To do this, list the adapter names after the -m (model name) option:

genai-perf profile \
    -m lora_adapter1 lora_adapter2 lora_adapter3

Choose a Strategy for Selecting Models#

When profiling with multiple models, you can specify how the models should be assigned to prompts using the --model-selection-strategy option:

genai-perf profile \
    -m lora_adapter1 lora_adapter2 lora_adapter3 \
    --model-selection-strategy round_robin

This setup cycles through lora_adapter1, lora_adapter2, and lora_adapter3 in round-robin order, assigning the next adapter in the list to each successive prompt.
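
The option also accepts a random strategy, which assigns a model chosen uniformly at random from the list to each prompt:

genai-perf profile \
    -m lora_adapter1 lora_adapter2 lora_adapter3 \
    --model-selection-strategy random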

For more details on additional options and configurations, refer to the Command Line Options section in the README.

Profile Llama running on OpenAI Completions API-Compatible Server#

Run Llama on OpenAI Completions API-compatible server#

See instructions

Download the adapters:

python3  # start an interactive Python session
from huggingface_hub import snapshot_download
lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
lora_path_2 = snapshot_download(repo_id="monsterapi/llama2-code-generation")
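
The snapshot_download calls place the adapters in the Hugging Face cache and return their local paths; the directory hash in those paths is the value referred to by ${SNAPSHOT_ID} in the command below. Assuming the default cache location, you can list it with:

ls ~/.cache/huggingface/hub/models--monsterapi--llama2-code-generation/snapshots/
ls ~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/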

Run the vLLM inference server:

docker run -it --net=host --rm --gpus=all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-2-7b-hf \
    --dtype float16 \
    --max-model-len 1024 \
    --lora-modules \
    adapter1=/root/.cache/huggingface/hub/models--monsterapi--llama2-code-generation/snapshots/${SNAPSHOT_ID}/ \
    adapter2=/root/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/${SNAPSHOT_ID}/ \
    --enable-lora
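
Optionally, before profiling, you can verify that both adapters were registered; the vLLM OpenAI-compatible server lists them alongside the base model on its models endpoint (assuming the default port 8000):

curl http://localhost:8000/v1/models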

Run GenAI-Perf#

Run GenAI-Perf from the Triton Inference Server SDK container:

export RELEASE="yy.mm" # e.g. export RELEASE="24.08"

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Run GenAI-Perf in the container:
genai-perf profile \
  -m adapter1 adapter2 \
  --service-kind openai \
  --endpoint-type completions \
  --model-selection-strategy round_robin

Example output:

                              NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│              Request latency (ms) │ 442.59 │ 175.95 │ 652.26 │ 608.05 │ 463.43 │ 449.82 │
│            Output sequence length │  16.84 │   2.00 │  19.00 │  19.00 │  17.00 │  17.00 │
│             Input sequence length │ 550.05 │ 550.00 │ 553.00 │ 551.40 │ 550.00 │ 550.00 │
│ Output token throughput (per sec) │  38.04 │    N/A │    N/A │    N/A │    N/A │    N/A │
│      Request throughput (per sec) │   2.26 │    N/A │    N/A │    N/A │    N/A │    N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘

Profile Mistral running on Hugging Face TGI Server#

Run Mistral on Hugging Face TGI server#

See instructions

Run the TGI server:

mkdir data
model=mistralai/Mistral-7B-v0.1
volume=$PWD/data

docker run \
    --gpus all \
    --shm-size 1g \
    -p 8000:80 \
    -v $volume:/data \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    ghcr.io/huggingface/text-generation-inference:2.1.1 \
    --model-id $model \
    --lora-adapters=predibase/customer_support,predibase/magicoder
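
You can optionally send a single request to confirm that the adapters are loaded before profiling; TGI selects an adapter via the adapter_id request parameter (with the port mapping above, the server is reachable on port 8000; the prompt and token count here are only illustrative):

curl http://localhost:8000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20, "adapter_id": "predibase/customer_support"}}'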

Run GenAI-Perf#

Run GenAI-Perf from the Triton Inference Server SDK container:

export RELEASE="yy.mm" # e.g. export RELEASE="24.08"

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Run GenAI-Perf in the container:
genai-perf profile \
  -m predibase/customer_support predibase/magicoder \
  --service-kind openai \
  --endpoint-type completions \
  --model-selection-strategy round_robin

Example output:

                                   NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃                         Statistic ┃      avg ┃    min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│              Request latency (ms) │ 1,655.06 │ 155.95 │ 1,942.88 │ 1,941.54 │ 1,935.78 │ 1,927.85 │
│            Output sequence length │    88.43 │   6.00 │   108.00 │   107.80 │   106.00 │   103.00 │
│             Input sequence length │   550.00 │ 550.00 │   550.00 │   550.00 │   550.00 │   550.00 │
│ Output token throughput (per sec) │    53.43 │    N/A │      N/A │      N/A │      N/A │      N/A │
│      Request throughput (per sec) │     0.60 │    N/A │      N/A │      N/A │      N/A │      N/A │
└───────────────────────────────────┴──────────┴────────┴──────────┴──────────┴──────────┴──────────┘

Profile Mistral running on Lorax Server#

Run Mistral on Lorax server#

See instructions

Run the Lorax server:

mkdir data
model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run \
    --gpus all \
    --shm-size 1g \
    -p 8000:80 \
    -v $volume:/data \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    ghcr.io/predibase/lorax:main \
    --model-id $model
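
Lorax loads adapters dynamically at request time via the adapter_id parameter, so none need to be listed when starting the server. As a quick smoke test before profiling (with the port mapping above, the server is reachable on port 8000; the prompt and adapter choice here are only illustrative):

curl http://localhost:8000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "[INST] What is Deep Learning? [/INST]", "parameters": {"max_new_tokens": 20, "adapter_id": "alignment-handbook/zephyr-7b-dpo-lora"}}'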

Run GenAI-Perf#

Run GenAI-Perf from the Triton Inference Server SDK container:

export RELEASE="yy.mm" # e.g. export RELEASE="24.08"

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Run GenAI-Perf in the container:
genai-perf profile \
  -m alignment-handbook/zephyr-7b-dpo-lora Undi95/Mistral-7B-roleplay_alpaca-lora \
  --service-kind openai \
  --endpoint-type completions \
  --model-selection-strategy round_robin \
  --concurrency=128

Example output:

                                      NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃                         Statistic ┃       avg ┃      min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│              Request latency (ms) │ 20,277.80 │ 1,229.48 │ 33,166.99 │ 33,082.30 │ 32,320.14 │ 31,206.61 │
│            Output sequence length │     24.86 │     4.00 │     96.00 │     91.50 │     51.00 │     19.50 │
│             Input sequence length │    550.00 │   550.00 │    550.00 │    550.00 │    550.00 │    550.00 │
│ Output token throughput (per sec) │      5.25 │      N/A │       N/A │       N/A │       N/A │       N/A │
│      Request throughput (per sec) │      0.21 │      N/A │       N/A │       N/A │       N/A │       N/A │
└───────────────────────────────────┴───────────┴──────────┴───────────┴───────────┴───────────┴───────────┘