Profile with MMVU Dataset
AIPerf supports benchmarking using the MMVU dataset, an expert-level video understanding benchmark that tests multi-discipline reasoning over video content. Each sample contains a video URL and a question (multiple-choice or open-ended) that requires watching the video to answer.
This guide covers profiling OpenAI-compatible video language models using the MMVU public dataset.
Start a vLLM Server
Launch a vLLM server with a video-capable vision language model:
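A typical launch command looks like the following sketch. The model name here is only an example; substitute any video-capable vision language model you intend to benchmark:

```shell
# Serve a video-capable VLM on the default OpenAI-compatible port.
# Qwen/Qwen2.5-VL-7B-Instruct is an example model, not a requirement.
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000
```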
Verify the server is ready:
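One way to check readiness is to query the OpenAI-compatible models endpoint; adjust the host and port to match your launch command:

```shell
# The server is ready once /v1/models responds with the served model list.
curl http://localhost:8000/v1/models
```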
Profile with MMVU Dataset
AIPerf loads the MMVU dataset from HuggingFace, combines each question with its multiple-choice options, attaches the video URL, and sends each question–video pair as a single-turn video request. The prompt format matches vLLM's own MMVU benchmark format.
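A profiling invocation might look like the sketch below. Every flag name here is an assumption and may differ between AIPerf versions, so confirm the options against `aiperf profile --help` before running:

```shell
# Hypothetical AIPerf invocation -- flag names are assumptions, not verified;
# check `aiperf profile --help` for your installed version.
aiperf profile \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --public-dataset mmvu
```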
Sample Output (Successful Run):
Note: High TTFT variance (3s min, 536s max) is expected — the model server fetches each video URL from HuggingFace during inference, and fetch time varies with video size and network conditions.
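If long video fetches cause request failures rather than just high TTFT, the server-side media fetch timeout may be the limit being hit. vLLM exposes a `VLLM_VIDEO_FETCH_TIMEOUT` environment variable (value in seconds; availability and default depend on your vLLM version) that can be raised before launching the server:

```shell
# Allow up to 10 minutes per video download before the server times out.
# vLLM reads this environment variable at startup.
export VLLM_VIDEO_FETCH_TIMEOUT=600
```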
Notes
- The `video` column in MMVU contains HTTPS URLs pointing to `.mp4` files hosted on HuggingFace. AIPerf passes these URLs directly to the model server, which fetches the video during inference.
- For multiple-choice questions, choices are appended to the question in the format `A.option B.option ...`. Open-ended questions use the question text only.
- The dataset has a `validation` split with samples spanning multiple academic disciplines (Art, Science, Engineering, etc.).
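As an illustration of the multiple-choice prompt format described above, the following sketch (the question and options are invented for this example) shows how a question and its options combine into one prompt string:

```shell
# Invented sample question and options, joined the way the note describes:
# choices appended to the question as "A.option B.option ...".
question="What process does the video depict?"
options=("mitosis" "meiosis" "binary fission")
letters=(A B C D E)
prompt="$question"
for i in "${!options[@]}"; do
  prompt="$prompt ${letters[$i]}.${options[$i]}"
done
echo "$prompt"
```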