Accuracy Benchmarking
Run accuracy evaluation alongside performance profiling using the --accuracy-benchmark flag.
Quick Start
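A minimal invocation looks like the following. The binary name benchmark-cli is a placeholder (substitute your installation's executable), and endpoint/model connection flags are omitted; only flags documented in this section are shown:

```sh
# Placeholder binary name; add your usual endpoint/model connection flags.
benchmark-cli \
  --accuracy-benchmark \
  --endpoint-type completions
```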
CLI Flags
Endpoint Type: completions vs chat
Both endpoint types are supported. The choice affects prompt format and alignment with reference frameworks:
- chat: MMLU few-shot examples are structured as separate user/assistant message turns (matching lighteval's PromptManager._prepare_chat_template()).
- completions: the entire prompt is sent as a single text block.
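As an illustration, the same few-shot item is framed roughly as follows for each endpoint type. This is a sketch, not the tool's actual serialization code; the field names assume an OpenAI-style API, and the question text is abbreviated:

```python
# Illustrative request bodies only; the real payloads are built by the tool.

# --endpoint-type chat: few-shot examples become alternating user/assistant turns.
chat_payload = {
    "messages": [
        {"role": "user", "content": "Question: <few-shot example 1>\nAnswer:"},
        {"role": "assistant", "content": "B"},
        {"role": "user", "content": "Question: <target question>\nAnswer:"},
    ],
}

# --endpoint-type completions: the same material is one flat text block.
completions_payload = {
    "prompt": (
        "Question: <few-shot example 1>\nAnswer: B\n\n"
        "Question: <target question>\nAnswer:"
    ),
}
```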
Temperature: Must be explicitly set to 0 via --extra-inputs '{"temperature": 0}' for deterministic (greedy) decoding. Most LLM servers default to temperature=1.0 when not specified, which introduces random sampling and causes run-to-run variance. lighteval defaults to temperature=0 internally.
Stop sequence: Use --extra-inputs '{"stop": ["\n"]}' to match lighteval's MMLU behavior (stop at the first newline). This can be combined with the temperature setting: --extra-inputs '{"temperature": 0, "stop": ["\n"]}' (see the full invocation under Examples below).
Concurrency: Higher concurrency shortens the total run time; --concurrency 10 or above is recommended. Minor run-to-run variance (~0.2% macro accuracy) is expected due to GPU floating-point non-determinism; this variance is independent of the concurrency level.
num-requests: Set to at least the total number of benchmark problems (MMLU: 14,042 questions across 57 subjects).
Examples
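A full MMLU run with greedy decoding, lighteval-matching stop behavior, and the recommended concurrency combines the flags above. The binary name is again a placeholder, and --num-requests is assumed to be the spelling of the request-count flag described earlier:

```sh
benchmark-cli \
  --accuracy-benchmark \
  --endpoint-type chat \
  --extra-inputs '{"temperature": 0, "stop": ["\n"]}' \
  --concurrency 10 \
  --num-requests 14042
```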
Output
Accuracy results are displayed in the console and exported to CSV:
CSV file: <artifact_dir>/accuracy_results.csv
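For quick inspection of the export, a small script like this works; it is a sketch that assumes nothing about the column names and reuses the <artifact_dir> placeholder from above:

```python
import csv
from pathlib import Path

# Substitute your run's actual artifact directory for the placeholder.
results = Path("<artifact_dir>") / "accuracy_results.csv"

with results.open(newline="") as f:
    for row in csv.DictReader(f):
        print(row)  # one dict per result row, keyed by the CSV header
```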
Architecture
All components self-disable when --accuracy-benchmark is not set.
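A minimal sketch of that gating pattern, assuming nothing about the project's real class or attribute names (AccuracyScorer and accuracy_benchmark below are hypothetical):

```python
import argparse


class AccuracyScorer:
    """Hypothetical component illustrating the self-disable pattern."""

    def __init__(self, args: argparse.Namespace) -> None:
        # Inactive unless --accuracy-benchmark was passed on the CLI.
        self.enabled = getattr(args, "accuracy_benchmark", False)

    def score(self, expected: str, generated: str):
        if not self.enabled:
            return None  # no-op during profiling-only runs
        return expected.strip() == generated.strip()
```

Gating inside each component keeps the profiling code path free of accuracy-specific branches.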