Profile with SPEED-Bench Dataset
Profile with SPEED-Bench Dataset
Profile with SPEED-Bench Dataset
AIPerf supports benchmarking using SPEED-Bench (SPEculative Evaluation Dataset), a benchmark designed for evaluating speculative decoding across diverse semantic domains and input sequence lengths.
This guide covers profiling speculative-decoding-enabled inference servers using SPEED-Bench prompts and collecting server-side acceptance rate metrics per category.
These load all categories combined in a single dataset:
For per-category acceptance rate measurement, each of the 11 qualitative domains is registered separately:
Each throughput ISL bucket is also available filtered by entropy tier:
Where {ISL} is one of: 1k, 2k, 8k, 16k, 32k.
Launch an inference server with speculative decoding enabled. For example, with vLLM:
Verify the server is ready:
AIPerf auto-discovers the Prometheus endpoint at {url}/metrics. If your server uses a different path, pass it explicitly with --server-metrics:
For standard (non-reasoning) models, use temperature=0 and a 4K output length cap:
Do not set ignore_eos — let the model stop naturally at its end-of-sequence token.
For reasoning models (e.g., DeepSeek-R1, QwQ), follow the model card’s recommended settings for temperature, top_p, and output length. Reasoning models typically require higher output limits and specific sampling parameters.
To measure acceptance rates per category (matching the SPEED-Bench paper methodology), run each category separately. Each run collects speculative decoding metrics from the server’s Prometheus endpoint.
Loop through all categories, then assemble results into a per-category matrix:
This produces a CSV (speed_bench_report.csv) and console table:
The report script computes acceptance length from vLLM counter metrics (accepted_tokens / num_drafts + 1) and also supports SGLang’s direct spec_accept_length gauge.
Additional report metrics:
To run all 880 prompts in a single benchmark (without per-category breakdown):
The throughput splits benchmark end-to-end performance at fixed input sequence lengths:
Replace speed_bench_throughput_1k with any throughput variant (_2k, _8k, _16k, _32k) to test at different input lengths.
To isolate entropy effects on acceptance rate at a given ISL:
Server metrics collection is enabled by default. To disable it:
AIPerf automatically downloads and caches the dataset on first use. To pre-download for container builds or air-gapped environments:
Or selectively download specific splits:
Set HF_HOME to control the cache location (e.g., ENV HF_HOME=/opt/hf_cache in a Dockerfile).