AIPerf supports benchmarking using the MMVU dataset, an expert-level video understanding benchmark that tests multi-discipline reasoning over video content. Each sample contains a video URL and a question (multiple-choice or open-ended) that requires watching the video to answer.
This guide covers profiling OpenAI-compatible video language models using the MMVU public dataset.
Launch a vLLM server with a video-capable vision language model:
Verify the server is ready:
AIPerf loads the MMVU dataset from HuggingFace, combines each question with its multiple-choice options, attaches the video URL, and sends each pair as a single-turn video request. The prompt format matches vLLM’s own MMVU benchmark format.
Sample Output (Successful Run):
Note: High TTFT variance (3s min, 536s max) is expected — the model server fetches each video URL from HuggingFace during inference, and fetch time varies with video size and network conditions.
video column in MMVU contains HTTPS URLs pointing to .mp4 files hosted on
HuggingFace. AIPerf passes these URLs directly to the model server, which fetches
the video during inference.A.option B.option .... Open-ended questions use the question text only.validation split with samples spanning multiple academic disciplines
(Art, Science, Engineering, etc.).