Profile with MMStar Dataset
AIPerf supports benchmarking using the MMStar dataset, a multimodal visual question answering benchmark that tests fine-grained visual perception and reasoning. Each sample contains an image and a question that requires understanding the image to answer.
This guide covers profiling OpenAI-compatible vision language models using the MMStar public dataset.
Start a vLLM Server
Launch a vLLM server with a vision language model:
$ docker pull vllm/vllm-openai:latest
$ docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
>   --model Qwen/Qwen2-VL-2B-Instruct
Verify the server is ready:
$ curl -s localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{"model":"Qwen/Qwen2-VL-2B-Instruct","messages":[{"role":"user","content":"test"}],"max_tokens":1}'
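Model loading can take a while, so a single request may fail simply because the server is not up yet. The sketch below polls until the server answers. It is an illustration rather than part of AIPerf: the helper name `wait_for_server` and its parameters are hypothetical, and it assumes the server exposes the OpenAI-style `/v1/models` listing endpoint on the port used above.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout_s=300.0, interval_s=5.0, request_timeout_s=5.0):
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses.

    Hypothetical helper for illustration; returns True when the server
    answered with 200, False when the deadline passed first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or HTTP error: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Assumed endpoint: the OpenAI-compatible model listing route.
    ready = wait_for_server("http://localhost:8000/v1/models")
    print("server ready" if ready else "server did not become ready in time")
```

Polling a lightweight listing endpoint avoids spending GPU time on a generation request while the model is still loading; the chat completion call above remains the end-to-end check.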
Profile with MMStar Dataset
AIPerf loads the MMStar dataset from HuggingFace, attaches the image from each row to the question, and sends each pair as a single-turn vision request.
$ aiperf profile \
>   --model Qwen/Qwen2-VL-2B-Instruct \
>   --endpoint-type chat \
>   --streaming \
>   --url localhost:8000 \
>   --public-dataset mmstar \
>   --request-count 10 \
>   --concurrency 4
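To make "single-turn vision request" concrete, the sketch below builds one request body in the OpenAI chat completions format, with the question as a text part and the image as a base64 data URL. It shows the general shape of such a request, not AIPerf's internal code: the function name `build_vision_request`, the JPEG MIME type, the sample question, and the placeholder image bytes are all assumptions made for illustration.

```python
import base64
import json


def build_vision_request(model, question, image_bytes,
                         mime_type="image/jpeg", stream=True):
    """Build an OpenAI-style chat completion body with one text and one image part.

    Hypothetical helper; the exact encoding AIPerf uses may differ.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {
        "model": model,
        # Single turn: exactly one user message carrying both parts.
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
        "stream": stream,
    }


# Placeholder bytes stand in for a real image from a dataset row.
payload = build_vision_request(
    "Qwen/Qwen2-VL-2B-Instruct",
    "Which object is closest to the camera?",
    b"placeholder-image-bytes",
)

if __name__ == "__main__":
    print(json.dumps(payload, indent=2))
```

Because the image travels inline with the request, the request body is much larger than the question text alone.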
Sample Output (Successful Run):
                                        NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Metric           ┃       avg ┃       min ┃        max ┃        p99 ┃        p90 ┃       p50 ┃       std ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Time to First    │ 36,158.25 │ 10,145.84 │  68,830.92 │  68,830.92 │  68,830.88 │ 28,015.68 │ 22,647.12 │
│ Token (ms)       │           │           │            │            │            │           │           │
│ Time to Second   │  8,736.44 │     89.13 │  53,118.68 │  49,851.97 │  20,451.58 │    113.97 │ 16,181.79 │
│ Token (ms)       │           │           │            │            │            │           │           │
│ Time to First    │ 36,158.25 │ 10,145.84 │  68,830.92 │  68,830.92 │  68,830.88 │ 28,015.68 │ 22,647.12 │
│ Output Token     │           │           │            │            │            │           │           │
│ (ms)             │           │           │            │            │            │           │           │
│ Request Latency  │ 70,665.03 │ 10,972.01 │ 164,879.38 │ 158,399.81 │ 100,083.68 │ 65,599.64 │ 40,227.69 │
│ (ms)             │           │           │            │            │            │           │           │
│ Inter Token      │  1,445.22 │     82.62 │   4,143.54 │   3,985.99 │   2,568.02 │  1,356.06 │  1,277.29 │
│ Latency (ms)     │           │           │            │            │            │           │           │
│ Output Token     │      3.50 │      0.24 │      12.10 │      11.81 │       9.18 │      0.98 │      4.30 │
│ Throughput Per   │           │           │            │            │            │           │           │
│ User             │           │           │            │            │            │           │           │
│ (tokens/sec/use… │           │           │            │            │            │           │           │
│ Output Sequence  │     26.40 │      7.00 │     118.00 │     110.71 │      45.10 │     15.00 │     31.52 │
│ Length (tokens)  │           │           │            │            │            │           │           │
│ Input Sequence   │     41.60 │     25.00 │      59.00 │      58.46 │      53.60 │     41.50 │     11.32 │
│ Length (tokens)  │           │           │            │            │            │           │           │
│ Output Token     │      1.47 │       N/A │        N/A │        N/A │        N/A │       N/A │       N/A │
│ Throughput       │           │           │            │            │            │           │           │
│ (tokens/sec)     │           │           │            │            │            │           │           │
│ Image Throughput │      0.02 │      0.01 │       0.09 │       0.09 │       0.03 │      0.02 │      0.02 │
│ (images/sec)     │           │           │            │            │            │           │           │
│ Image Latency    │ 70,665.03 │ 10,972.01 │ 164,879.38 │ 158,399.81 │ 100,083.68 │ 65,599.64 │ 40,227.69 │
│ (ms/image)       │           │           │            │            │            │           │           │
│ Request          │      0.06 │       N/A │        N/A │        N/A │        N/A │       N/A │       N/A │
│ Throughput       │           │           │            │            │            │           │           │
│ (requests/sec)   │           │           │            │            │            │           │           │
│ Request Count    │     10.00 │       N/A │        N/A │        N/A │        N/A │       N/A │       N/A │
│ (requests)       │           │           │            │            │            │           │           │
└──────────────────┴───────────┴───────────┴────────────┴────────────┴────────────┴───────────┴───────────┘
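The aggregate rows in the table can be cross-checked against each other with a little arithmetic. The sketch below does so under two assumptions that the table itself does not state: that the aggregate throughputs are totals divided by the overall benchmark duration, and that each request carries exactly one image, so a request's image rate is the reciprocal of its latency. All input numbers are copied from the sample output above.

```python
# Values copied from the sample output table.
request_count = 10
avg_output_tokens = 26.40          # Output Sequence Length, avg
output_token_throughput = 1.47     # tokens/sec, aggregate
min_request_latency_ms = 10_972.01
max_request_latency_ms = 164_879.38

# Assumption: aggregate throughput = total output tokens / benchmark duration.
total_output_tokens = request_count * avg_output_tokens        # 264 tokens
implied_duration_s = total_output_tokens / output_token_throughput

# Assumption: request throughput = request count / benchmark duration.
implied_request_throughput = request_count / implied_duration_s

# Assumption: one image per request, so images/sec = 1 / latency.
fastest_image_rate = 1000.0 / min_request_latency_ms
slowest_image_rate = 1000.0 / max_request_latency_ms

if __name__ == "__main__":
    print(f"implied duration:           {implied_duration_s:.1f} s")
    print(f"implied request throughput: {implied_request_throughput:.2f} req/s")
    print(f"fastest image rate:         {fastest_image_rate:.2f} images/s")
    print(f"slowest image rate:         {slowest_image_rate:.2f} images/s")
```

Under these assumptions the implied request throughput rounds to the reported 0.06 requests/sec, and the fastest and slowest image rates round to the reported max (0.09) and min (0.01) of Image Throughput, which also explains why Image Latency repeats the Request Latency row.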