Profile Ranking Models with AIPerf

AIPerf supports benchmarking ranking and reranking models, including those served through Hugging Face Text Embeddings Inference (TEI) or Cohere Re-Rank APIs. These models take a query and one or more passages and return a relevance score for each passage.


Section 1. Profile Hugging Face TEI Re-Rank Models

Start a Hugging Face TEI Server

Launch a Hugging Face Text Embeddings Inference (TEI) container in re-ranker mode:

$docker run --gpus all --rm -it \
> -p 8080:80 \
> -e MODEL_ID=BAAI/bge-reranker-base \
> ghcr.io/huggingface/text-embeddings-inference:latest \
> --model-id BAAI/bge-reranker-base --port 80
$# Verify server is running
$curl -s http://localhost:8080/rerank \
> -H "Content-Type: application/json" \
> -d '{"query":"What is AI?", "texts":["AI is artificial intelligence.","Bananas are yellow."]}' | jq

Profile using Synthetic Inputs

Run AIPerf using the following command:

$aiperf profile \
> -m BAAI/bge-reranker-base \
> --endpoint-type hf_tei_rankings \
> --url localhost:8080 \
> --request-count 10 \
> --rankings-passages-mean 5 \
> --rankings-passages-stddev 1 \
> --rankings-passages-prompt-token-mean 32 \
> --rankings-passages-prompt-token-stddev 8 \
> --rankings-query-prompt-token-mean 16 \
> --rankings-query-prompt-token-stddev 4

Sample Output (Successful Run):

INFO Starting AIPerf System
INFO AIPerf System is PROFILING
Profiling: 10/10 |████████████████████████| 100% [00:02<00:00]
INFO Benchmark completed successfully
INFO Results saved to: artifacts/BAAI_bge-reranker-base-rankings/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Metric                     ┃   avg ┃   min ┃   max ┃   p99 ┃   p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ Request Latency (ms)       │ 52.34 │ 45.12 │ 68.45 │ 65.23 │ 51.89 │
│ Request Throughput (req/s) │  5.12 │     - │     - │     - │     - │
└────────────────────────────┴───────┴───────┴───────┴───────┴───────┘
JSON Export: artifacts/BAAI_bge-reranker-base-rankings/profile_export_aiperf.json
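
For a quick look at what was captured, you can inspect the exported file with jq. The top-level field names vary by AIPerf version, so treat this as a sketch rather than a schema:

$jq 'keys' artifacts/BAAI_bge-reranker-base-rankings/profile_export_aiperf.json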

The rankings-specific token options cannot be used together with --prompt-input-tokens-mean or --prompt-input-tokens-stddev. When profiling ranking endpoints, use the rankings-specific options to control the token counts of queries and passages.

Profile using Custom Inputs

Create a file named rankings.jsonl where each line represents a ranking request with a query and one or more passages.

$cat <<EOF > rankings.jsonl
> {"texts":[{"name":"query","contents":["What is AI topic 0?"]},{"name":"passages","contents":["AI passage 0"]}]}
> {"texts":[{"name":"query","contents":["What is AI topic 1?"]},{"name":"passages","contents":["AI passage 1"]}]}
> {"texts":[{"name":"query","contents":["What is AI topic 2?"]},{"name":"passages","contents":["AI passage 2"]}]}
> {"texts":[{"name":"query","contents":["What is AI topic 3?"]},{"name":"passages","contents":["AI passage 3"]}]}
> {"texts":[{"name":"query","contents":["What is AI topic 4?"]},{"name":"passages","contents":["AI passage 4"]}]}
> EOF
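
The heredoc above is convenient for a handful of entries. For a larger dataset, or for entries with several passages per query, the file can also be generated in a loop. This sketch mirrors the entry format shown above and assumes that multiple passages can be listed in the passages contents array; the passage texts are placeholders:

$for i in $(seq 0 49); do
> printf '{"texts":[{"name":"query","contents":["What is AI topic %s?"]},{"name":"passages","contents":["AI passage %s-a","AI passage %s-b","AI passage %s-c"]}]}\n' "$i" "$i" "$i" "$i"
> done > rankings.jsonl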

Run AIPerf using the following command:

$aiperf profile \
> -m BAAI/bge-reranker-base \
> --endpoint-type hf_tei_rankings \
> --url localhost:8080 \
> --input-file ./rankings.jsonl \
> --custom-dataset-type single_turn \
> --request-count 10

Section 2. Profile Cohere Re-Rank API

Start a vLLM Server with a Cohere-Compatible Re-Rank Endpoint

Run vLLM with the --runner pooling flag to enable reranking behavior:

$docker run --gpus all -p 8080:8000 \
> -e HF_TOKEN=<HF_TOKEN> \
> vllm/vllm-openai:latest \
> --model BAAI/bge-reranker-v2-m3 \
> --runner pooling
$# Verify the server
$curl -s http://localhost:8080/v1/rerank \
> -H "Content-Type: application/json" \
> -d '{"query":"What is AI?","documents":["Artificial intelligence overview","Bananas are yellow"]}' | jq

Profile using Synthetic Inputs

Run AIPerf using the following command:

$aiperf profile \
> -m BAAI/bge-reranker-v2-m3 \
> --endpoint-type cohere_rankings \
> --url localhost:8080 \
> --request-count 10
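
The command above uses the default synthetic input sizes. The rankings-specific options from Section 1 should apply to this endpoint type as well (an assumption worth verifying against your AIPerf version); for example:

$aiperf profile \
> -m BAAI/bge-reranker-v2-m3 \
> --endpoint-type cohere_rankings \
> --url localhost:8080 \
> --request-count 10 \
> --rankings-passages-mean 8 \
> --rankings-passages-stddev 2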

Profile using Custom Inputs

Create a file named rankings.jsonl:

$cat <<EOF > rankings.jsonl
> {"texts":[{"name":"query","contents":["What is AI topic 0?"]},{"name":"passages","contents":["AI passage 0"]}]}
> {"texts":[{"name":"query","contents":["What is AI topic 1?"]},{"name":"passages","contents":["AI passage 1"]}]}
> {"texts":[{"name":"query","contents":["What is AI topic 2?"]},{"name":"passages","contents":["AI passage 2"]}]}
> {"texts":[{"name":"query","contents":["What is AI topic 3?"]},{"name":"passages","contents":["AI passage 3"]}]}
> {"texts":[{"name":"query","contents":["What is AI topic 4?"]},{"name":"passages","contents":["AI passage 4"]}]}
> EOF

Run AIPerf:

$aiperf profile \
> -m BAAI/bge-reranker-v2-m3 \
> --endpoint-type cohere_rankings \
> --url localhost:8080 \
> --input-file ./rankings.jsonl \
> --custom-dataset-type single_turn \
> --request-count 10