Accuracy Benchmarking

Run accuracy evaluation alongside performance profiling using the --accuracy-benchmark flag.

Quick Start

$# MMLU benchmark with 5-shot prompting (chat endpoint, aligned with lighteval)
$aiperf profile Qwen/Qwen2.5-1.5B-Instruct \
> --url http://localhost:8000 \
> --endpoint-type chat \
> --accuracy-benchmark mmlu \
> --accuracy-n-shots 5 \
> --num-requests 15000 \
> --concurrency 10 \
> --extra-inputs '{"temperature": 0, "stop": ["\n"]}'

CLI Flags

| Flag | Description | Default |
| --- | --- | --- |
| --accuracy-benchmark | Benchmark name (mmlu, aime, hellaswag, …) | |
| --accuracy-tasks | Specific subtasks (e.g., MMLU subjects). Accepts comma-separated values (abstract_algebra,anatomy) or repeated flags; see the example below. Omit for all. | all |
| --accuracy-n-shots | Few-shot example count (0–32). If omitted, the benchmark default is used (e.g., MMLU=5). | benchmark default |
| --accuracy-enable-cot | Enable chain-of-thought prompting | false |
| --accuracy-grader | Override the default grader (multiple_choice, exact_match, …) | auto |
| --accuracy-system-prompt | Custom system prompt | |
| --accuracy-verbose | Show per-problem grading details | false |
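
For example, these two invocations select the same two MMLU subjects; a minimal sketch that mirrors the single-subject example further below:

$# Comma-separated values
$aiperf profile my-model --url http://localhost:8000 \
> --endpoint-type chat \
> --accuracy-benchmark mmlu \
> --accuracy-tasks abstract_algebra,anatomy \
> --num-requests 200 \
> --concurrency 10 \
> --extra-inputs '{"temperature": 0, "stop": ["\n"]}'
$
$# Repeated flags (equivalent)
$aiperf profile my-model --url http://localhost:8000 \
> --endpoint-type chat \
> --accuracy-benchmark mmlu \
> --accuracy-tasks abstract_algebra \
> --accuracy-tasks anatomy \
> --num-requests 200 \
> --concurrency 10 \
> --extra-inputs '{"temperature": 0, "stop": ["\n"]}'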

Endpoint Type: completions vs chat

Both endpoint types are supported. The choice affects prompt format and alignment with reference frameworks:

| Endpoint | Prompt format | Best for |
| --- | --- | --- |
| completions | Single flat text to /v1/completions | Traditional MMLU evaluation |
| chat | Multi-turn user/assistant messages to /v1/chat/completions | Aligning with lighteval |

When --endpoint-type chat is used, MMLU few-shot examples are structured as separate user/assistant message turns (matching lighteval’s PromptManager._prepare_chat_template()). The completions endpoint sends the entire prompt as a single text block.
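
As a sketch of the difference (illustrative payloads only, with a made-up question; not captured from aiperf):

$# chat: few-shot examples become separate user/assistant turns
$curl -s http://localhost:8000/v1/chat/completions \
> -H 'Content-Type: application/json' \
> -d '{
>   "model": "my-model",
>   "messages": [
>     {"role": "user", "content": "Question: ...\nA. ...\nB. ...\nAnswer:"},
>     {"role": "assistant", "content": "B"},
>     {"role": "user", "content": "Question: <problem being graded>\nAnswer:"}
>   ]
> }'
$
$# completions: the same shots concatenated into one flat text block
$curl -s http://localhost:8000/v1/completions \
> -H 'Content-Type: application/json' \
> -d '{
>   "model": "my-model",
>   "prompt": "Question: ...\nA. ...\nB. ...\nAnswer: B\n\nQuestion: <problem being graded>\nAnswer:"
> }'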

Temperature: Must be explicitly set to 0 via --extra-inputs '{"temperature": 0}' for deterministic (greedy) decoding. Most LLM servers default to temperature=1.0 when not specified, which introduces random sampling and causes run-to-run variance. lighteval defaults to temperature=0 internally.

Stop sequence: Use --extra-inputs '{"stop": ["\n"]}' to match lighteval’s MMLU behavior (stop at first newline). Can be combined with temperature: --extra-inputs '{"temperature": 0, "stop": ["\n"]}'.
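
To sanity-check greedy decoding outside aiperf, send the same request twice and compare the outputs; a sketch assuming an OpenAI-compatible server and jq (per the note below, occasional differences from GPU non-determinism can still occur):

$# Sketch: at temperature 0 the two completions should (almost always) match
$BODY='{"model": "my-model", "prompt": "Question: ...\nAnswer:", "temperature": 0, "stop": ["\n"], "max_tokens": 8}'
$curl -s http://localhost:8000/v1/completions -H 'Content-Type: application/json' -d "$BODY" | jq -r '.choices[0].text'
$curl -s http://localhost:8000/v1/completions -H 'Content-Type: application/json' -d "$BODY" | jq -r '.choices[0].text'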

Concurrency: Higher concurrency finishes the run faster; --concurrency 10 or above is recommended. Minor run-to-run variance (~0.2% macro) is expected from GPU floating-point non-determinism and is independent of the concurrency level.

--num-requests: Set to at least the total number of benchmark problems (MMLU: 14,042 across 57 subjects).

Examples

$# Single subject, quick test
$aiperf profile my-model --url http://localhost:8000 \
> --endpoint-type chat \
> --accuracy-benchmark mmlu \
> --accuracy-n-shots 5 \
> --accuracy-tasks abstract_algebra \
> --num-requests 100 \
> --concurrency 10 \
> --extra-inputs '{"temperature": 0, "stop": ["\n"]}'
$
$# Full MMLU (57 subjects, 14042 problems)
$aiperf profile my-model --url http://localhost:8000 \
> --endpoint-type chat \
> --accuracy-benchmark mmlu \
> --accuracy-n-shots 5 \
> --num-requests 15000 \
> --concurrency 50 \
> --extra-inputs '{"temperature": 0, "stop": ["\n"]}'
$
$# Completions endpoint (traditional flat-text format)
$aiperf profile my-model --url http://localhost:8000 \
> --endpoint-type completions \
> --accuracy-benchmark mmlu \
> --accuracy-n-shots 5 \
> --num-requests 15000 \
> --concurrency 50 \
> --extra-inputs '{"temperature": 0, "stop": ["\n"]}'

Output

Accuracy results are displayed in the console and exported to CSV:

Accuracy Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ Task ┃ Correct ┃ Total ┃ Accuracy ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ abstract_algebra │ 35 │ 100 │ 35.00% │
│ ... │ ... │ ... │ ... │
│ OVERALL │ 8368 │ 14042 │ 59.59% │
└─────────────────────────┴─────────┴───────┴──────────┘

CSV file: <artifact_dir>/accuracy_results.csv
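
The OVERALL row is a micro average (total correct over total problems); a macro average weights each subject equally. A sketch for computing it from the CSV, assuming the columns mirror the console table (Task, Correct, Total, Accuracy) with a header row; replace <artifact_dir> with your run's artifact directory:

$# Sketch: unweighted mean of per-task accuracy, skipping the header and OVERALL rows
$awk -F, 'NR > 1 && $1 != "OVERALL" { sum += $4; n++ } END { printf "macro: %.2f%%\n", sum / n }' \
> <artifact_dir>/accuracy_results.csv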

Architecture

AccuracyDatasetLoader → Conversation/Turn objects (dataset pipeline)
AccuracyRecordProcessor → grades each response (record pipeline)
AccuracyResultsProcessor → aggregates per-task accuracy (results pipeline)
AccuracyConsoleExporter → Rich table output
AccuracyDataExporter → CSV export

All components self-disable when --accuracy-benchmark is not set.