Profile with SPEED-Bench Dataset
AIPerf supports benchmarking using SPEED-Bench (SPEculative Evaluation Dataset), a benchmark designed for evaluating speculative decoding across diverse semantic domains and input sequence lengths.
This guide covers profiling speculative-decoding-enabled inference servers using SPEED-Bench prompts and collecting server-side acceptance rate metrics per category.
Available Dataset Variants
Aggregate Datasets
These variants load all categories combined into a single dataset:
Per-Category Qualitative Datasets (80 prompts each)
For per-category acceptance rate measurement, each of the 11 qualitative domains is registered separately:
Per-Entropy-Tier Throughput Datasets (512 prompts each)
Each throughput ISL bucket is also available filtered by entropy tier:
Where {ISL} is one of: 1k, 2k, 8k, 16k, 32k.
Start a Server with Speculative Decoding
Launch an inference server with speculative decoding enabled. For example, with vLLM:
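A minimal sketch of such a launch. The model names and the speculative-config syntax below are placeholders, and the exact flags vary by vLLM version:

```shell
# Serve a target model with a draft-model speculative config.
# <target-model> and <draft-model> are placeholders; check
# `vllm serve --help` for your version's speculative-decoding flags.
vllm serve <target-model> \
  --port 8000 \
  --speculative-config '{"model": "<draft-model>", "num_speculative_tokens": 5}'
```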
Verify the server is ready:
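For example, assuming the server exposes the usual OpenAI-compatible endpoints on port 8000:

```shell
# Wait for the health endpoint to respond, then list the served models.
curl -sf http://localhost:8000/health && echo "server is up"
curl -s http://localhost:8000/v1/models
```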
Server Metrics Endpoint
AIPerf auto-discovers the Prometheus endpoint at {url}/metrics. If your server uses a different path, pass it explicitly with --server-metrics:
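A sketch of the override, assuming a `profile` subcommand and a server on port 8000; the custom metrics path is illustrative:

```shell
# Override the auto-discovered {url}/metrics path with an explicit endpoint.
aiperf profile \
  --url http://localhost:8000 \
  --server-metrics http://localhost:8000/custom/metrics
```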
Recommended Defaults
Non-Reasoning Models
For standard (non-reasoning) models, use temperature=0 and a 4K output length cap:
Do not set ignore_eos — let the model stop naturally at its end-of-sequence token.
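A sketch of these defaults, assuming AIPerf forwards sampling parameters through an `--extra-inputs` style flag; the flag names here are assumptions, so confirm them against `aiperf profile --help`:

```shell
# Greedy decoding with a 4K output-length cap for a standard model.
# Note: ignore_eos is deliberately NOT set, so generation stops at EOS.
aiperf profile \
  --model <model-name> \
  --url http://localhost:8000 \
  --extra-inputs temperature:0 \
  --extra-inputs max_tokens:4096
```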
Reasoning Models
For reasoning models (e.g., DeepSeek-R1, QwQ), follow the model card’s recommended settings for temperature, top_p, and output length. Reasoning models typically require higher output limits and specific sampling parameters.
Per-Category Acceptance Rate Benchmarking
To measure acceptance rates per category (matching the SPEED-Bench paper methodology), run each category separately. Each run collects speculative decoding metrics from the server’s Prometheus endpoint.
Single Category
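A sketch of one per-category run; the dataset flag and the variant name are placeholders for one of the registered per-category datasets:

```shell
# Profile a single qualitative category (80 prompts).
# <speed_bench_category> stands in for an actual registered variant name.
aiperf profile \
  --model <model-name> \
  --url http://localhost:8000 \
  --public-dataset <speed_bench_category> \
  --extra-inputs temperature:0
```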
All 11 Categories with Matrix Report
Loop through all categories, then assemble results into a per-category matrix:
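A sketch of the loop, with placeholder category names and an assumed artifact-directory flag:

```shell
# One run per category; each run writes artifacts to its own directory
# so the report script can assemble the matrix afterwards.
# The category list and the "speed_bench_${cat}" naming are placeholders.
for cat in <category-1> <category-2> <category-3>; do
  aiperf profile \
    --model <model-name> \
    --url http://localhost:8000 \
    --public-dataset "speed_bench_${cat}" \
    --artifact-dir "artifacts/${cat}"
done
```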
This produces a CSV (speed_bench_report.csv) and a console table:
The report script computes acceptance length from vLLM counter metrics (accepted_tokens / num_drafts + 1) and also supports SGLang’s direct spec_accept_length gauge.
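The counter-based formula can be sanity-checked by hand. With illustrative (made-up) counter values:

```shell
# Illustrative values for the two vLLM counters (not real measurements).
accepted_tokens=4200   # total draft tokens accepted by the target model
num_drafts=1400        # total draft proposals issued
# acceptance length = accepted_tokens / num_drafts + 1
awk -v a="$accepted_tokens" -v d="$num_drafts" \
  'BEGIN { printf "acceptance length: %.2f\n", a / d + 1 }'
```

Here each draft averaged 3 accepted tokens, so every verification step emits about 4 tokens: the accepted draft tokens plus the target model's own token.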
Additional report metrics:
Profile with Aggregate Qualitative Split
To run all 880 prompts in a single benchmark (without per-category breakdown):
Profile with Throughput Splits
The throughput splits benchmark end-to-end performance at fixed input sequence lengths:
Replace speed_bench_throughput_1k with any throughput variant (_2k, _8k, _16k, _32k) to test at different input lengths.
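A sketch using the 1K variant named above; the remaining flags are assumptions, so confirm them against `aiperf profile --help`:

```shell
# End-to-end throughput at a fixed 1K input sequence length.
aiperf profile \
  --model <model-name> \
  --url http://localhost:8000 \
  --public-dataset speed_bench_throughput_1k \
  --concurrency 32
```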
Per-Entropy-Tier Throughput
To isolate entropy effects on acceptance rate at a given ISL:
Disable Server Metrics
Server metrics collection is enabled by default. To disable it:
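The exact toggle is an assumption here (a `--no-server-metrics` style boolean is typical; confirm with `aiperf profile --help`):

```shell
# Skip Prometheus scraping entirely; only client-side metrics are reported.
# The --no-server-metrics flag name is an assumption.
aiperf profile \
  --model <model-name> \
  --url http://localhost:8000 \
  --no-server-metrics
```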
Pre-download Dataset for Offline Use
AIPerf automatically downloads and caches the dataset on first use. To pre-download for container builds or air-gapped environments:
Or selectively download specific splits:
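One way to do this with the standard Hugging Face CLI; the dataset repo id and the include pattern below are placeholders:

```shell
# Full pre-download into a fixed cache directory.
HF_HOME=/opt/hf_cache huggingface-cli download --repo-type dataset <org>/<repo>

# Selective: fetch only files matching a pattern (pattern is illustrative).
HF_HOME=/opt/hf_cache huggingface-cli download --repo-type dataset <org>/<repo> \
  --include "throughput_1k/*"
```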
Set HF_HOME to control the cache location (e.g., ENV HF_HOME=/opt/hf_cache in a Dockerfile).