Accuracy Benchmarking
Run accuracy evaluation alongside performance profiling using the --accuracy-benchmark flag.
Quick Start
trt-llm reference alignment
The aime benchmark is aligned with the trt-llm benchmark recipe’s
DeepEval-backed AIME path
(trt-llm-benchmark-recipe/src/accuracy/aime/):
- Dataset:
Maxwell-Jia/AIME_2024, train split. - Defaults:
n_shots=8,enable_cot=True(the recipe enforcesn_shots <= 8and aiperf raisesValueErrorif you exceed it). - Prompt format: byte-equal to
AIMETemplate.generate_output—**Problem**: ... **Solution**: ... **Answer**: ...blocks for few-shots (Solution only when CoT is on), trailingLet's think step-by-step.after the final**Answer**:. - System prompt (auto-injected):
"Please reason step by step, and put your final answer within \\boxed{}."This default lives inplugins.yamlunder theaimebenchmark’sdefault_system_promptmetadata. Override it with--accuracy-system-prompt 'your prompt here'. Pass--accuracy-system-prompt ''to disable injection. - Grader:
MathGraderwith_math_strip.strip_string+ sympy/ latex2sympy2-extendedmath_equal. Requires the[accuracy]extra:uv pip install 'aiperf[accuracy]'. Without those packages installed the grader falls back to a stdlib normalize+Fraction comparison and emits a one-time warning; reference parity is only achieved with the full sympy stack.
Per-benchmark default system prompts
The CLI’s --accuracy-system-prompt flag always wins; the per-benchmark
default is only consulted when the flag is unset. An empty-string default
in metadata is treated as no default (aiperf doesn’t inject a zero-length
system message).
Available Benchmarks
CLI Flags
Endpoint Type: completions vs chat
Both endpoint types are supported. The choice affects prompt format and alignment with reference frameworks:
When --endpoint-type chat is used, MMLU few-shot examples are structured as separate user/assistant message turns (matching lighteval’s PromptManager._prepare_chat_template()). The completions endpoint sends the entire prompt as a single text block.
Temperature: Must be explicitly set to 0 via --extra-inputs '{"temperature": 0}' for deterministic (greedy) decoding. Most LLM servers default to temperature=1.0 when not specified, which introduces random sampling and causes run-to-run variance. lighteval defaults to temperature=0 internally.
Stop sequence: Use --extra-inputs '{"stop": ["\n"]}' to match lighteval’s MMLU behavior (stop at first newline). Can be combined with temperature: --extra-inputs '{"temperature": 0, "stop": ["\n"]}'.
Concurrency: Higher concurrency is faster. --concurrency 10 or above is recommended. Minor run-to-run variance (~0.2% macro) is expected due to GPU floating-point non-determinism; this is independent of concurrency level.
num-requests: Set to at least the total number of benchmark problems (MMLU: 14,042 across 57 subjects).
Examples
Graders
The math grader pipeline (aligned with trt-llm-benchmark-recipe/src/accuracy/aime/):
- Extract the model’s final answer by priority:
- The contents of the last
\boxed{...}in the response (canonical MATH/AIME format). - The tail of an “the answer is X” / “answer: X” / “final answer X” phrase, recursively re-parsed for boxed/numeric content.
- The last numeric literal in the response.
- The contents of the last
- Normalize both prediction and gold via the recipe’s
strip_string: linebreaks/spacing/quote-style braces collapsed,\dfrac/\tfrac→\frac,\left/\rightremoved,\text{...}unwrapped, MathQA-derived unit tokens dropped, infinity/percent/months/dollar-sign normalization, trailing.0decimals trimmed, simplea/brewritten as\frac{a}{b}. - Compare with
math_equal(lowercase string equality → choice-prefix unwrap → numericalisclose(abs_tol=1e-4) with percentage variants → brace/paren strip + lowercase compare → equation-form rewrite (f(x) = y↔y) → symbolic equivalence viasympy.parsing.sympy_parser.parse_exprandlatex2sympy2_extended.latex2sympy).
Symbolic equivalence (e.g. \sqrt{2} ↔ 2^{1/2}, \frac{1}{3} ↔ 0.333333, 1,2,3 ↔ 3,2,1) requires the [accuracy] install:
Without those optional dependencies (sympy, latex2sympy2-extended) the grader falls back to a stdlib normalize + Fraction comparison and emits a single warning the first time it runs. Reference parity with the trt-llm recipe requires the full sympy stack.
When extraction fell back past the \boxed{} step (i.e. the model didn’t follow the boxed-answer instruction), the response is flagged unparsed=True in the per-record output. A correct unparsed response is still scored correct, mirroring multiple_choice’s convention.
Output
Accuracy results are displayed in the console and exported to CSV:
CSV file: <artifact_dir>/accuracy_results.csv
Architecture
All components self-disable when --accuracy-benchmark is not set.