For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
    • Welcome to AIPerf Documentation
  • Getting Started
    • Profiling with AIPerf
    • Comprehensive LLM Benchmarking
    • Migrating from GenAI-Perf
    • GenAI-Perf vs AIPerf CLI Feature Comparison Matrix
  • Tutorials
      • Profile OpenAI-Compatible Text APIs Using AIPerf
      • Profile the OpenAI Responses API with AIPerf
      • Profile Hugging Face TGI Models with AIPerf
      • Profile Vision Language Models with AIPerf
      • Profile Audio Language Models with AIPerf
      • Profile ASR Models with Public Datasets
      • Profile Embedding Models with AIPerf
      • Profile Ranking Models with AIPerf
      • Profile NIM Image Retrieval with AIPerf
      • SGLang Image Generation
      • SGLang Image Edit
      • SGLang Video Generation
      • Synthetic Video Generation
      • Template Endpoint
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
On this page
  • Start a vLLM Server
  • Profile with Synthetic Images
  • Profile with Custom Input File
TutorialsModel & Endpoint Guides

Profile Vision Language Models with AIPerf

||View as Markdown|
Previous

Profile Hugging Face TGI Models with AIPerf

Next

Profile Audio Language Models with AIPerf

AIPerf supports benchmarking Vision Language Models (VLMs) that process both text and images.

This guide covers profiling vision models using OpenAI-compatible chat completions endpoints with vLLM.


Start a vLLM Server

Launch a vLLM server with a vision language model:

$docker pull vllm/vllm-openai:latest
$docker run --gpus all -p 8000:8000 -e HF_TOKEN vllm/vllm-openai:latest \
> --model Qwen/Qwen2-VL-2B-Instruct \
> --enforce-eager

Verify the server is ready:

$timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen2-VL-2B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }

Profile with Synthetic Images

AIPerf can generate synthetic images for benchmarking. By default, images are generated as random noise at the requested dimensions — no on-disk assets required, and the pool is effectively unbounded so servers cannot dedupe on identical inputs. Pass --image-source assets to instead sample and resize the 4 bundled natural images (smaller payload bytes), or --image-source <path> to sample from your own directory.

$aiperf profile \
> --model Qwen/Qwen2-VL-2B-Instruct \
> --endpoint-type chat \
> --image-width-mean 512 \
> --image-height-mean 512 \
> --synthetic-input-tokens-mean 100 \
> --streaming \
> --url localhost:8000 \
> --request-count 20 \
> --concurrency 4

Sample Output (Successful Run):

INFO Starting AIPerf System
INFO Generating synthetic images (512x512 px)
INFO AIPerf System is PROFILING
Profiling: 20/20 |████████████████████████| 100% [01:45<00:00]
INFO Benchmark completed successfully
INFO Results saved to: artifacts/Qwen_Qwen2-VL-2B-Instruct-chat-concurrency4/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms) │ 2345.67 │ 1890.34 │ 2987.12 │ 2923.45 │ 2312.89 │
│ Time to First Token (ms) │ 456.78 │ 378.90 │ 598.45 │ 578.23 │ 445.67 │
│ Inter Token Latency (ms) │ 18.90 │ 14.56 │ 25.34 │ 24.12 │ 18.45 │
│ Output Token Count (tokens) │ 150.00 │ 120.00 │ 180.00 │ 178.00 │ 148.00 │
│ Request Throughput (req/s) │ 4.89 │ - │ - │ - │ - │
└─────────────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
JSON Export: artifacts/Qwen_Qwen2-VL-2B-Instruct-chat-concurrency4/profile_export_aiperf.json

Profile with Custom Input File

Create a JSONL file with text prompts and image URLs:

$cat <<EOF > inputs.jsonl
${"texts": ["Describe this image in detail."], "images": ["https://picsum.photos/512/512?random=1"], "output_length": 200}
${"texts": ["What objects are visible in this image?"], "images": ["https://picsum.photos/512/512?random=2"]}
${"texts": ["Analyze the composition of this photo."], "images": ["https://picsum.photos/512/512?random=3"], "output_length": 300}
${"texts": ["What is the main subject of this image?"], "images": ["https://picsum.photos/512/512?random=4"]}
${"texts": ["Provide a caption for this image."], "images": ["https://picsum.photos/512/512?random=5"], "output_length": 50}
$EOF

Run AIPerf using the custom input file:

$aiperf profile \
> --model Qwen/Qwen2-VL-2B-Instruct \
> --endpoint-type chat \
> --input-file inputs.jsonl \
> --custom-dataset-type single_turn \
> --streaming \
> --url localhost:8000 \
> --request-count 5

Sample Output (Successful Run):

INFO Starting AIPerf System
INFO Loaded 5 entries from inputs.jsonl
INFO Using single_turn dataset type with custom images
INFO AIPerf System is PROFILING
Profiling: 5/5 |████████████████████████| 100% [00:25<00:00]
INFO Benchmark completed successfully
INFO Results saved to: artifacts/Qwen_Qwen2-VL-2B-Instruct-chat-concurrency1/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms) │ 2456.89 │ 2012.45 │ 2890.34 │ 2890.34 │ 2398.12 │
│ Time to First Token (ms) │ 478.90 │ 398.23 │ 567.89 │ 567.89 │ 467.34 │
│ Inter Token Latency (ms) │ 19.45 │ 15.67 │ 24.12 │ 24.12 │ 19.01 │
│ Output Token Count (tokens) │ 156.00 │ 128.00 │ 185.00 │ 185.00 │ 154.00 │
│ Request Throughput (req/s) │ 2.34 │ - │ - │ - │ - │
└─────────────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
JSON Export: artifacts/Qwen_Qwen2-VL-2B-Instruct-chat-concurrency1/profile_export_aiperf.json