Profile with MMVU Dataset

AIPerf supports benchmarking using the MMVU dataset, an expert-level video understanding benchmark that tests multi-discipline reasoning over video content. Each sample contains a video URL and a question (multiple-choice or open-ended) that requires watching the video to answer.

This guide covers profiling OpenAI-compatible video language models using the MMVU public dataset.

Start a vLLM Server

Launch a vLLM server with a video-capable vision language model:

$ docker pull vllm/vllm-openai:latest
$ docker run --gpus all -p 8000:8000 -e HF_TOKEN vllm/vllm-openai:latest \
>   --model Qwen/Qwen2-VL-2B-Instruct \
>   --enforce-eager

Verify the server is ready:

$ timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen2-VL-2B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }

AIPerf loads the MMVU dataset from HuggingFace, combines each question with its multiple-choice options, attaches the video URL, and sends each pair as a single-turn video request. The prompt format matches vLLM’s own MMVU benchmark format.

$ aiperf profile \
>     --model Qwen/Qwen2-VL-2B-Instruct \
>     --endpoint-type chat \
>     --streaming \
>     --url localhost:8000 \
>     --public-dataset mmvu \
>     --request-count 5 \
>     --concurrency 2 \
>     --output-tokens-mean 128

Sample Output (Successful Run):

                                     NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃           Metric ┃        avg ┃        min ┃        max ┃        p99 ┃        p90 ┃        p50 ┃        std ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│    Time to First │ 236,267.00 │   2,967.98 │ 535,809.00 │ 528,246.99 │ 460,180.00 │ 292,846.13 │ 206,874.00 │
│       Token (ms) │            │            │            │            │            │            │            │
│   Time to Second │ 157,173.00 │     113.08 │ 473,750.00 │ 467,270.27 │ 408,951.00 │     127.74 │ 199,053.00 │
│       Token (ms) │            │            │            │            │            │            │            │
│  Request Latency │ 476,346.00 │ 297,204.97 │ 841,020.00 │ 829,081.39 │ 721,631.00 │ 350,652.38 │ 200,572.00 │
│             (ms) │            │            │            │            │            │            │            │
│      Inter Token │   3,631.07 │     106.31 │  11,204.46 │  11,020.23 │   9,362.19 │     127.14 │   4,543.17 │
│     Latency (ms) │            │            │            │            │            │            │            │
│     Output Token │       5.19 │       0.09 │       9.41 │       9.37 │       9.01 │       7.87 │       4.17 │
│   Throughput Per │            │            │            │            │            │            │            │
│             User │            │            │            │            │            │            │            │
│     (tokens/sec) │            │            │            │            │            │            │            │
│  Output Sequence │      58.00 │      32.00 │     128.00 │     125.04 │      98.40 │      42.00 │      35.84 │
│  Length (tokens) │            │            │            │            │            │            │            │
│   Input Sequence │      26.00 │       9.00 │      67.00 │      65.72 │      54.20 │      10.00 │      22.79 │
│  Length (tokens) │            │            │            │            │            │            │            │
│     Output Token │       0.24 │        N/A │        N/A │        N/A │        N/A │        N/A │        N/A │
│       Throughput │            │            │            │            │            │            │            │
│     (tokens/sec) │            │            │            │            │            │            │            │
│    Request Count │       5.00 │        N/A │        N/A │        N/A │        N/A │        N/A │        N/A │
│       (requests) │            │            │            │            │            │            │            │
└──────────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘

Note: High TTFT variance (3s min, 536s max) is expected — the model server fetches each video URL from HuggingFace during inference, and fetch time varies with video size and network conditions.

Notes

The video column in MMVU contains HTTPS URLs pointing to .mp4 files hosted on HuggingFace. AIPerf passes these URLs directly to the model server, which fetches the video during inference.
For multiple-choice questions, choices are appended to the question in the format A.option B.option .... Open-ended questions use the question text only.
The dataset has a validation split with samples spanning multiple academic disciplines (Art, Science, Engineering, etc.).