Profile with VisionArena Dataset

AIPerf supports benchmarking using the VisionArena dataset, a collection of real-world conversations between users and vision language models gathered from Chatbot Arena. Each sample contains a real user image and question, covering tasks like captioning, OCR, diagram interpretation, and visual reasoning.

This guide covers profiling OpenAI-compatible vision language models using the VisionArena public dataset.

Note: VisionArena requires Hugging Face authentication. Set the HF_TOKEN environment variable before running AIPerf.
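A missing token typically surfaces only once the dataset download is attempted, so it can help to fail fast beforehand. The following is a minimal sketch of such a guard; `require_hf_token` is a hypothetical helper name and is not part of AIPerf.

```shell
# Hypothetical helper (not part of AIPerf): fail fast when HF_TOKEN is missing.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export it before running aiperf" >&2
    return 1
  fi
  return 0
}

# Example usage (uncomment and supply your own token):
#   export HF_TOKEN="<your Hugging Face access token>"
#   require_hf_token && echo "HF_TOKEN is set"
```

Only the variable's presence is checked here; whether the token is valid is still decided by Hugging Face when the dataset is requested.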


Start a vLLM Server

Launch a vLLM server with a vision language model:

$ python -m vllm.entrypoints.openai.api_server \
> --model Qwen/Qwen2-VL-2B-Instruct \
> --host 127.0.0.1 \
> --port 8000

Verify the server is ready:

$ curl -s 127.0.0.1:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{"model":"Qwen/Qwen2-VL-2B-Instruct","messages":[{"role":"user","content":"test"}],"max_tokens":1}'
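A single request only succeeds once the model has finished loading, so a script may prefer to poll until the server responds. The following is a minimal sketch, assuming `curl` is installed and that the server answers `GET /v1/models` (an assumption here; the guide itself only shows the chat completions route). `wait_for_server` is a hypothetical helper name, not part of vLLM or AIPerf.

```shell
# Hypothetical helper: poll a URL until it returns a successful HTTP status.
# Arguments: URL, maximum attempts (default 30), delay in seconds (default 2).
wait_for_server() {
  _wfs_url="$1"
  _wfs_attempts="${2:-30}"
  _wfs_delay="${3:-2}"
  _wfs_i=0
  while [ "$_wfs_i" -lt "$_wfs_attempts" ]; do
    # -s: silent, -f: treat HTTP errors (status >= 400) as failures.
    if curl -sf -o /dev/null "$_wfs_url"; then
      return 0
    fi
    _wfs_i=$((_wfs_i + 1))
    sleep "$_wfs_delay"
  done
  return 1
}

# Example usage (assumed route, see the note above):
#   wait_for_server http://127.0.0.1:8000/v1/models 60 5 && echo "server ready"
```

The function returns non-zero when every attempt fails, so it can gate the profiling command in a script.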

Profile with VisionArena Dataset

$ aiperf profile \
> --model Qwen/Qwen2-VL-2B-Instruct \
> --endpoint-type chat \
> --streaming \
> --url 127.0.0.1:8000 \
> --public-dataset vision_arena \
> --request-count 10 \
> --concurrency 4

Sample Output (Successful Run):

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃                   Metric ┃      avg ┃      min ┃       max ┃       p99 ┃      p90 ┃      p50 ┃      std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ Time to First Token (ms) │ 1,230.82 │   480.86 │  2,375.71 │  2,375.70 │ 2,375.62 │   712.68 │   793.32 │
│     Time to Second Token │   286.66 │    83.72 │  1,894.74 │  1,735.21 │   299.41 │   108.29 │   536.11 │
│                     (ms) │          │          │           │           │          │          │          │
│     Time to First Output │ 1,230.82 │   480.86 │  2,375.71 │  2,375.70 │ 2,375.62 │   712.68 │   793.32 │
│               Token (ms) │          │          │           │           │          │          │          │
│     Request Latency (ms) │ 5,847.57 │ 2,735.34 │ 10,756.76 │ 10,502.44 │ 8,213.58 │ 5,800.62 │ 2,337.90 │
│ Inter Token Latency (ms) │   134.88 │    62.38 │    186.61 │    185.08 │   171.30 │   138.40 │    34.96 │
│  Output Token Throughput │     8.13 │     5.36 │     16.03 │     15.44 │    10.09 │     7.23 │     2.95 │
│                 Per User │          │          │           │           │          │          │          │
│        (tokens/sec/user) │          │          │           │           │          │          │          │
│   Output Sequence Length │    37.50 │     9.00 │     96.00 │     91.68 │    52.80 │    35.50 │    23.56 │
│                 (tokens) │          │          │           │           │          │          │          │
│    Input Sequence Length │    28.90 │     4.00 │    167.00 │    157.55 │    72.50 │     8.50 │    48.88 │
│                 (tokens) │          │          │           │           │          │          │          │
│  Output Token Throughput │    22.63 │      N/A │       N/A │       N/A │      N/A │      N/A │      N/A │
│             (tokens/sec) │          │          │           │           │          │          │          │
│       Request Throughput │     0.60 │      N/A │       N/A │       N/A │      N/A │      N/A │      N/A │
│           (requests/sec) │          │          │           │           │          │          │          │
│ Request Count (requests) │    10.00 │      N/A │       N/A │       N/A │      N/A │      N/A │      N/A │
└──────────────────────────┴──────────┴──────────┴───────────┴───────────┴──────────┴──────────┴──────────┘

A higher input sequence length than with text-only datasets is expected, because each request includes an encoded image alongside the question text.
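The aggregate rows of the sample table can be cross-checked against each other with a little arithmetic. The sketch below copies three values from the table and recomputes two quantities: output token throughput as request throughput times average output length, and the average number of in-flight requests via Little's law (throughput times latency). The variable names are illustrative, and `awk` is assumed to be available for the floating-point math.

```shell
# Values copied from the sample table above.
request_throughput=0.60   # requests/sec
avg_output_len=37.50      # tokens per request (avg)
avg_latency_ms=5847.57    # request latency in ms (avg)

# Aggregate output throughput ~= requests/sec * average output tokens per request.
est_tokens_per_sec=$(awk -v r="$request_throughput" -v o="$avg_output_len" \
  'BEGIN { printf "%.2f", r * o }')

# Little's law: in-flight requests ~= requests/sec * average latency in seconds.
est_in_flight=$(awk -v r="$request_throughput" -v l="$avg_latency_ms" \
  'BEGIN { printf "%.2f", r * l / 1000 }')

echo "estimated output tokens/sec: $est_tokens_per_sec"
echo "estimated in-flight requests: $est_in_flight"
```

With these inputs the script prints 22.50 tokens/sec, close to the reported 22.63, and 3.51 in-flight requests, somewhat below the configured concurrency of 4. The gaps are plausibly explained by the rounding of the 0.60 requests/sec figure and by the start and tail of a 10-request run, when fewer than four requests are active.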