
Copyright © 2026, NVIDIA Corporation.


Profile with LLaVA-OneVision Dataset


AIPerf supports benchmarking with the LLaVA-OneVision dataset, a large multimodal collection of instruction-tuning examples covering charts, diagrams, scientific figures, natural photos, and more.

This guide uses the sharegpt4o subset: GPT-4o-annotated ShareGPT conversations with natural scene images.


Start a vLLM Server

Launch a vLLM server with a vision language model:

$ python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --host 127.0.0.1 \
    --port 8000

Verify the server is ready:

$ curl -s 127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen/Qwen2-VL-2B-Instruct","messages":[{"role":"user","content":"test"}],"max_tokens":1}'
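A vision model can take a while to load, so a request sent too early simply fails. The following is a minimal sketch of a readiness loop, assuming the vLLM OpenAI-compatible server exposes a `/health` route on the same host and port. The helper name `wait_for_server` and its retry defaults are illustrative and are not part of AIPerf or vLLM.

```shell
# Poll a URL until it returns an HTTP success status, or give up after N attempts.
# wait_for_server is a hypothetical helper, not an AIPerf or vLLM command.
wait_for_server() {
  url="$1"
  retries="${2:-60}"   # assumed default: 60 attempts
  delay="${3:-5}"      # assumed default: 5 seconds between attempts
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
    if curl -sf -o /dev/null "$url"; then
      echo "server ready after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "server not ready after $retries attempt(s)" >&2
  return 1
}

# Example: wait_for_server http://127.0.0.1:8000/health && echo "start profiling"
```

Once the loop reports success, the chat-completions check above should return a one-token response.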

Profile with LLaVA-OneVision Dataset

AIPerf loads the sharegpt4o subset from HuggingFace, extracts the first user message and image from each row, and sends each as a single-turn vision request.

$ aiperf profile \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --endpoint-type chat \
    --streaming \
    --url 127.0.0.1:8000 \
    --public-dataset llava_onevision \
    --request-count 10 \
    --concurrency 4
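To make "single-turn vision request" concrete, the sketch below builds one chat-completions body in the OpenAI-compatible multimodal format: a single user message whose content holds a text part and an image part. The helper name `build_vision_payload`, the prompt, and the image URL are placeholders. The exact payload AIPerf emits is not shown in this guide, so treat this as an approximation of its shape rather than a copy of it.

```shell
# Build a single-turn vision request body (OpenAI-compatible chat format).
# build_vision_payload is a hypothetical helper for illustration only.
# Assumption: the arguments contain no characters that need JSON escaping.
build_vision_payload() {
  model="$1"
  prompt="$2"
  image_url="$3"   # an http(s) URL or a data:image/...;base64,... URI
  printf '{"model":"%s","messages":[{"role":"user","content":[{"type":"text","text":"%s"},{"type":"image_url","image_url":{"url":"%s"}}]}],"max_tokens":16}\n' \
    "$model" "$prompt" "$image_url"
}

# Example with placeholder values:
# build_vision_payload Qwen/Qwen2-VL-2B-Instruct "Describe this image." "https://example.com/cat.jpg" |
#   curl -s 127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d @-
```

Sending one such request by hand is a quick way to confirm the server accepts image input before starting a full profiling run.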

Sample Output (Successful Run):

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Metric                                          ┃        avg ┃        min ┃        max ┃        p99 ┃        p90 ┃        p50 ┃        std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Time to First Token (ms)                        │  42,612.16 │   2,865.14 │ 124,091.22 │ 118,453.32 │  67,712.19 │  39,609.67 │  36,226.35 │
│ Time to Second Token (ms)                       │     117.54 │      92.84 │     149.27 │     147.27 │     129.28 │     115.31 │      13.31 │
│ Time to First Output Token (ms)                 │  42,612.16 │   2,865.14 │ 124,091.22 │ 118,453.32 │  67,712.19 │  39,609.67 │  36,226.35 │
│ Request Latency (ms)                            │ 121,911.45 │  16,030.77 │ 225,154.30 │ 220,635.78 │ 179,969.13 │ 123,267.35 │  61,477.79 │
│ Inter Token Latency (ms)                        │     462.52 │      95.96 │   1,865.22 │   1,803.95 │   1,252.60 │     168.13 │     564.71 │
│ Output Token Throughput Per User (tokens/sec/u… │       5.31 │       0.54 │      10.42 │      10.25 │       8.69 │       5.95 │       3.16 │
│ Output Sequence Length (tokens)                 │     228.60 │      82.00 │     421.00 │     412.99 │     340.90 │     228.00 │     109.17 │
│ Input Sequence Length (tokens)                  │       9.70 │       4.00 │      13.00 │      13.00 │      13.00 │      10.50 │       3.03 │
│ Output Token Throughput (tokens/sec)            │       7.41 │        N/A │        N/A │        N/A │        N/A │        N/A │        N/A │
│ Image Throughput (images/sec)                   │       0.01 │       0.00 │       0.06 │       0.06 │       0.03 │       0.01 │       0.02 │
│ Image Latency (ms/image)                        │ 121,911.45 │  16,030.77 │ 225,154.30 │ 220,635.78 │ 179,969.13 │ 123,267.35 │  61,477.79 │
│ Request Throughput (requests/sec)               │       0.03 │        N/A │        N/A │        N/A │        N/A │        N/A │        N/A │
│ Request Count (requests)                        │      10.00 │        N/A │        N/A │        N/A │        N/A │        N/A │        N/A │
└─────────────────────────────────────────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
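The rows of this table can be cross-checked against each other. The sketch below assumes, based only on these sample numbers and not on anything stated in this guide, that per-user output throughput is the reciprocal of inter-token latency and that aggregate output throughput is total output tokens divided by wall-clock time. The name `itl_to_tps` is illustrative.

```shell
# Convert an inter-token latency in ms to tokens/sec for one request stream.
itl_to_tps() {
  awk -v itl_ms="$1" 'BEGIN { printf "%.2f\n", 1000 / itl_ms }'
}

itl_to_tps 95.96     # min ITL -> 10.42, the max per-user throughput in the table
itl_to_tps 1865.22   # max ITL -> 0.54, the min per-user throughput in the table

# Estimate the run duration from request count, avg output length,
# and aggregate output token throughput, then derive request throughput.
awk 'BEGIN {
  duration_s = 10 * 228.60 / 7.41    # total output tokens / (tokens/sec)
  printf "duration: %.1f s\n", duration_s
  printf "request throughput: %.2f req/s\n", 10 / duration_s
}'
```

Under these assumptions the run lasted roughly five minutes, which is consistent with the reported 0.03 requests/sec and with ten requests averaging about 122 seconds each at a concurrency of 4.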

LLaVA-OneVision’s GPT-4o-annotated responses are detailed and verbose, producing longer output sequences than typical VQA datasets (the sample run above averaged 228.60 output tokens per request). Use --prompt-output-tokens-mean to cap output length if needed.
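As a sketch of that adjustment, the command below repeats the profiling run with a target mean output length. The value 64 is an arbitrary assumption chosen for illustration; the other flags are the ones used earlier in this guide. The arguments are kept in a bash array so the full command can be printed and reviewed before it is run.

```shell
# Assumed target mean output length; choose a value that suits your workload.
OUTPUT_TOKENS_MEAN=64

aiperf_args=(
  profile
  --model Qwen/Qwen2-VL-2B-Instruct
  --endpoint-type chat
  --streaming
  --url 127.0.0.1:8000
  --public-dataset llava_onevision
  --request-count 10
  --concurrency 4
  --prompt-output-tokens-mean "$OUTPUT_TOKENS_MEAN"
)

# Print the full command for review.
echo "aiperf ${aiperf_args[*]}"

# Uncomment to run against the server started earlier:
# aiperf "${aiperf_args[@]}"
```

Comparing the Output Sequence Length row between the two runs shows whether the setting took effect.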