For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
    • Welcome to AIPerf Documentation
  • Getting Started
    • Profiling with AIPerf
    • Comprehensive LLM Benchmarking
    • Migrating from GenAI-Perf
    • GenAI-Perf vs AIPerf CLI Feature Comparison Matrix
  • Tutorials
      • Custom Dataset Guide
      • Inline Datasets
      • Custom Prompt Benchmarking
      • Profile with ShareGPT Dataset
      • Synthetic Dataset Generation
      • Profile with InstructCoder Dataset
      • Profile with AIMO Dataset
      • Profile with MMStar Dataset
      • Profile with MMVU Dataset
      • Profile with LLaVA-OneVision Dataset
      • Profile with VisionArena Dataset
      • Profile with Blazedit Dataset
      • Profile with SpecBench Dataset
      • Profile with SPEED-Bench Dataset
      • Profile with Bailian Traces
      • Profile with BurstGPT Traces
      • Replay SageMaker Data Capture Traces
      • Raw Payload Replay
      • Inputs JSON Replay
      • Multi-Turn Conversations
      • Sequence Length Distributions for Advanced Benchmarking
      • Prefix Data Synthesis Tutorial
      • Agentic Code Dataset Generator
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
On this page
  • Available Dataset Variants
  • Aggregate Datasets
  • Per-Category Qualitative Datasets (80 prompts each)
  • Per-Entropy-Tier Throughput Datasets (512 prompts each)
  • Start a Server with Speculative Decoding
  • Server Metrics Endpoint
  • Recommended Defaults
  • Non-Reasoning Models
  • Reasoning Models
  • Per-Category Acceptance Rate Benchmarking
  • Single Category
  • All 11 Categories with Matrix Report
  • Profile with Aggregate Qualitative Split
  • Profile with Throughput Splits
  • Per-Entropy-Tier Throughput
  • Disable Server Metrics
  • Pre-download Dataset for Offline Use
TutorialsDatasets & Inputs

Profile with SPEED-Bench Dataset

||View as Markdown|
Previous

Profile with SpecBench Dataset

Next

Profile with Bailian Traces

AIPerf supports benchmarking using SPEED-Bench (SPEculative Evaluation Dataset), a benchmark designed for evaluating speculative decoding across diverse semantic domains and input sequence lengths.

This guide covers profiling speculative-decoding-enabled inference servers using SPEED-Bench prompts and collecting server-side acceptance rate metrics per category.


Available Dataset Variants

Aggregate Datasets

These load all categories combined in a single dataset:

Dataset NameSamplesDescription
speed_bench_qualitative880All 11 semantic domains combined
speed_bench_throughput_1k1,536~1K input tokens, all 3 entropy tiers
speed_bench_throughput_2k1,536~2K input tokens, all 3 entropy tiers
speed_bench_throughput_8k1,536~8K input tokens, all 3 entropy tiers
speed_bench_throughput_16k1,536~16K input tokens, all 3 entropy tiers
speed_bench_throughput_32k1,536~32K input tokens, all 3 entropy tiers

Per-Category Qualitative Datasets (80 prompts each)

For per-category acceptance rate measurement, each of the 11 qualitative domains is registered separately:

Dataset NameCategory
speed_bench_codingCode generation and programming
speed_bench_humanitiesHistory, philosophy, liberal arts
speed_bench_mathMathematical reasoning
speed_bench_multilingualTasks across 23 languages
speed_bench_qaQuestion answering
speed_bench_ragRetrieval-augmented generation
speed_bench_reasoningLogical and analytical reasoning
speed_bench_roleplayCreative roleplay and dialogue
speed_bench_stemScience, technology, engineering
speed_bench_summarizationText summarization
speed_bench_writingCreative and technical writing

Per-Entropy-Tier Throughput Datasets (512 prompts each)

Each throughput ISL bucket is also available filtered by entropy tier:

PatternTiersDescription
speed_bench_throughput_{ISL}_low_entropyCode, sortingPredictable output patterns
speed_bench_throughput_{ISL}_mixedNeedle-in-a-haystack, examsModerate unpredictability
speed_bench_throughput_{ISL}_high_entropyCreative writing, dialogueHighly unpredictable output

Where {ISL} is one of: 1k, 2k, 8k, 16k, 32k.


Start a Server with Speculative Decoding

Launch an inference server with speculative decoding enabled. For example, with vLLM:

$docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
> --model meta-llama/Llama-3.1-8B-Instruct \
> --speculative-model meta-llama/Llama-3.2-1B-Instruct \
> --num-speculative-tokens 5

Verify the server is ready:

$curl -s localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"test"}],"max_tokens":1}'

Server Metrics Endpoint

AIPerf auto-discovers the Prometheus endpoint at {url}/metrics. If your server uses a different path, pass it explicitly with --server-metrics:

Server TypeMetrics PathFlag Needed
Standalone vLLM / SGLang/metrics (default)None (auto-discovered)
NIM-LLM containers/v1/metrics--server-metrics http://localhost:8000/v1/metrics

Recommended Defaults

Non-Reasoning Models

For standard (non-reasoning) models, use temperature=0 and a 4K output length cap:

$aiperf profile \
> --model meta/llama-3.1-8b-instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset speed_bench_coding \
> --osl 4096 \
> --extra-inputs temperature:0 \
> --concurrency 16

Do not set ignore_eos — let the model stop naturally at its end-of-sequence token.

Reasoning Models

For reasoning models (e.g., DeepSeek-R1, QwQ), follow the model card’s recommended settings for temperature, top_p, and output length. Reasoning models typically require higher output limits and specific sampling parameters.


Per-Category Acceptance Rate Benchmarking

To measure acceptance rates per category (matching the SPEED-Bench paper methodology), run each category separately. Each run collects speculative decoding metrics from the server’s Prometheus endpoint.

Single Category

$aiperf profile \
> --model meta/llama-3.1-8b-instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset speed_bench_coding \
> --server-metrics http://localhost:8000/metrics \
> --osl 4096 \
> --extra-inputs temperature:0 \
> --concurrency 16 \
> --output-artifact-dir ./artifacts/speed_bench_coding

All 11 Categories with Matrix Report

Loop through all categories, then assemble results into a per-category matrix:

$CATEGORIES="coding humanities math multilingual qa rag reasoning roleplay stem summarization writing"
$MODEL="meta/llama-3.1-8b-instruct"
$
$for cat in $CATEGORIES; do
$ echo "=== Running category: $cat ==="
$ aiperf profile \
> --model "$MODEL" \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset "speed_bench_${cat}" \
> --server-metrics http://localhost:8000/metrics \
> --osl 4096 \
> --extra-inputs temperature:0 \
> --concurrency 16 \
> --output-artifact-dir "./artifacts/speed_bench_${cat}"
$done
$
$# Assemble the matrix report
$aiperf speed-bench-report ./artifacts/ --format both

This produces a CSV (speed_bench_report.csv) and console table:

SPEED-Bench Acceptance Length Report
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Model ┃ coding ┃ humanities ┃ math ┃ writing ┃ Overall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ meta/llama-3.1-8b-instruct │ 1.80 │ 1.84 │ 1.78 │ 1.76 │ 1.78 │
└────────────────────────────┴────────┴────────────┴──────┴─────────┴─────────┘

The report script computes acceptance length from vLLM counter metrics (accepted_tokens / num_drafts + 1) and also supports SGLang’s direct spec_accept_length gauge.

Additional report metrics:

$# Acceptance rate matrix (accepted / draft tokens)
$aiperf speed-bench-report ./artifacts/ --metric accept_rate
$
$# Throughput matrix (output tokens/sec per category)
$aiperf speed-bench-report ./artifacts/ --metric throughput

Profile with Aggregate Qualitative Split

To run all 880 prompts in a single benchmark (without per-category breakdown):

$aiperf profile \
> --model meta/llama-3.1-8b-instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset speed_bench_qualitative \
> --server-metrics http://localhost:8000/metrics \
> --concurrency 16

Profile with Throughput Splits

The throughput splits benchmark end-to-end performance at fixed input sequence lengths:

$aiperf profile \
> --model meta/llama-3.1-8b-instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset speed_bench_throughput_1k \
> --server-metrics http://localhost:8000/metrics \
> --concurrency 64 \
> --benchmark-duration 120

Replace speed_bench_throughput_1k with any throughput variant (_2k, _8k, _16k, _32k) to test at different input lengths.

Per-Entropy-Tier Throughput

To isolate entropy effects on acceptance rate at a given ISL:

$for tier in low_entropy mixed high_entropy; do
$ echo "=== Running throughput_1k tier: $tier ==="
$ aiperf profile \
> --model meta/llama-3.1-8b-instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset "speed_bench_throughput_1k_${tier}" \
> --server-metrics http://localhost:8000/metrics \
> --concurrency 64 \
> --benchmark-duration 60
$done

Disable Server Metrics

Server metrics collection is enabled by default. To disable it:

$aiperf profile \
> --model meta/llama-3.1-8b-instruct \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset speed_bench_qualitative \
> --no-server-metrics \
> --concurrency 16

Pre-download Dataset for Offline Use

AIPerf automatically downloads and caches the dataset on first use. To pre-download for container builds or air-gapped environments:

$huggingface-cli download nvidia/SPEED-Bench --repo-type dataset

Or selectively download specific splits:

1from datasets import load_dataset
2
3for subset in ["qualitative", "throughput_1k", "throughput_2k",
4 "throughput_8k", "throughput_16k", "throughput_32k"]:
5 load_dataset("nvidia/SPEED-Bench", name=subset, split="test",
6 trust_remote_code=False)

Set HF_HOME to control the cache location (e.g., ENV HF_HOME=/opt/hf_cache in a Dockerfile).