
Copyright © 2026, NVIDIA Corporation.


Profile with SpecBench Dataset


AIPerf supports benchmarking using the SpecBench dataset, which contains diverse questions across writing, reasoning, math, and coding categories. This dataset is commonly used for evaluating speculative decoding methods.

This guide covers profiling OpenAI-compatible chat completions endpoints using the SpecBench public dataset.


Start a vLLM Server

Launch a vLLM server with a chat model:

$ docker pull vllm/vllm-openai:latest
$ docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B

Verify the server is ready:

$ curl -s localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"test"}],"max_tokens":1}'

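Model loading can take a while, so a single request may fail simply because the server is not listening yet. A minimal polling sketch follows. The helper name `wait_for_server`, its defaults, and the choice of the `/v1/models` route as the readiness probe are assumptions made for illustration, not part of AIPerf or this tutorial.

```shell
# Poll a URL until it answers with a successful HTTP status, or give up.
# Usage: wait_for_server URL [MAX_ATTEMPTS] [DELAY_SECONDS]
# (hypothetical helper; the name and defaults are illustrative)
wait_for_server() {
  url=$1
  max_attempts=${2:-60}
  delay=${3:-5}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -s: quiet, -f: fail on HTTP status >= 400, -o: discard the response body
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "server ready after $attempt attempt(s)"
      return 0
    fi
    # Wait before the next try, but not after the final one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  echo "server not ready after $max_attempts attempt(s)" >&2
  return 1
}

# Example (assumes the container started above is publishing port 8000):
#   wait_for_server http://localhost:8000/v1/models && echo "ok to start profiling"
```

The function returns a non-zero status on timeout, so it can gate a later command with `&&` in a script.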
Profile with SpecBench Dataset

AIPerf downloads the SpecBench JSONL file from GitHub and uses the first turn of each question as a single-turn prompt.
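
The first-turn selection described above can be approximated with jq against a local copy of the file. This is an illustration rather than AIPerf's own loader: the file name `question.jsonl`, the helper name `first_turns`, and the field names `question_id`, `category`, and `turns` are assumptions about the record layout.

```shell
# Sketch only: mimic "use the first turn of each question" on a local JSONL file.
# ASSUMPTION: every line is a JSON object with a "turns" array of strings,
# plus "question_id" and "category" fields. Requires jq.
first_turns() {
  # -c emits one compact JSON object per line, so multi-line prompts stay intact.
  jq -c '{question_id, category, prompt: .turns[0]}' "$1"
}

# Example (question.jsonl is a hypothetical local copy of the dataset):
#   first_turns question.jsonl | head -n 3
```

Any later turns in a record are ignored, which matches the single-turn behaviour described above.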

$ aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --public-dataset spec_bench \
    --request-count 10 \
    --concurrency 4

Sample Output (Successful Run):

                                        NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃              Metric ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p50 ┃       std ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Time to First Token │  1,184.13 │    385.39 │  2,252.54 │  2,252.54 │  2,252.52 │    570.29 │    862.86 │
│                (ms) │           │           │           │           │           │           │           │
│      Time to Second │     60.70 │     50.20 │     73.47 │     73.21 │     70.93 │     59.97 │      7.68 │
│          Token (ms) │           │           │           │           │           │           │           │
│       Time to First │ 21,668.12 │ 12,749.53 │ 36,041.69 │ 35,432.43 │ 29,949.09 │ 19,877.32 │  6,284.90 │
│   Output Token (ms) │           │           │           │           │           │           │           │
│     Request Latency │ 36,715.17 │ 22,633.27 │ 69,707.26 │ 68,331.92 │ 55,953.83 │ 30,128.68 │ 14,438.18 │
│                (ms) │           │           │           │           │           │           │           │
│ Inter Token Latency │     62.47 │     51.53 │     69.41 │     69.26 │     67.89 │     64.96 │      5.98 │
│                (ms) │           │           │           │           │           │           │           │
│        Output Token │     16.17 │     14.41 │     19.40 │     19.32 │     18.58 │     15.40 │      1.67 │
│ Throughput Per User │           │           │           │           │           │           │           │
│   (tokens/sec/user) │           │           │           │           │           │           │           │
│     Output Sequence │    572.20 │    326.00 │  1,004.00 │  1,000.31 │    967.10 │    501.50 │    221.81 │
│     Length (tokens) │           │           │           │           │           │           │           │
│      Input Sequence │     41.50 │     22.00 │     96.00 │     92.49 │     60.90 │     35.50 │     20.86 │
│     Length (tokens) │           │           │           │           │           │           │           │
│        Output Token │     58.14 │       N/A │       N/A │       N/A │       N/A │       N/A │       N/A │
│          Throughput │           │           │           │           │           │           │           │
│        (tokens/sec) │           │           │           │           │           │           │           │
│  Request Throughput │      0.10 │       N/A │       N/A │       N/A │       N/A │       N/A │       N/A │
│      (requests/sec) │           │           │           │           │           │           │           │
│       Request Count │     10.00 │       N/A │       N/A │       N/A │       N/A │       N/A │       N/A │
│          (requests) │           │           │           │           │           │           │           │
└─────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
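
The averages in the sample table can be cross-checked with a little arithmetic. The sketch below makes two simplifying assumptions that are not stated in the source: request latency is roughly the time to first token plus one inter-token gap per remaining output token, and the aggregate throughput figures share one benchmark duration. The function name `sanity_check` is hypothetical.

```shell
# Rough consistency check on the averages from the sample table.
# ASSUMPTION: latency ~= TTFT + ITL * (output tokens - 1); this is an
# approximation for illustration, not AIPerf's metric definition.
sanity_check() {
  awk -v ttft=1184.13 -v itl=62.47 -v osl=572.20 -v n=10 -v tps=58.14 'BEGIN {
    # First token, then (osl - 1) inter-token gaps, all in milliseconds.
    printf "estimated request latency: %.0f ms\n", ttft + itl * (osl - 1)
    # Duration implied by total output tokens and aggregate token throughput.
    duration = n * osl / tps
    printf "implied duration: %.1f s\n", duration
    printf "implied request throughput: %.2f requests/sec\n", n / duration
    # Per-user decode rate implied by the average inter-token latency.
    printf "per-user rate from ITL: %.2f tokens/sec\n", 1000 / itl
  }'
}

sanity_check
```

The estimated latency comes out near 36,867 ms, against the 36,715.17 ms reported, and the implied request throughput rounds to the reported 0.10 requests/sec. The per-user rate derived from the average inter-token latency (about 16.01 tokens/sec) is slightly below the reported 16.17 because an average of per-request rates is not the same as the rate computed from an average latency.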