Benchmarks
Standardized evaluation suites for measuring research quality.
| Benchmark | What it measures | Dataset size | Agents tested |
|---|---|---|---|
| | Factual accuracy on current knowledge | 600 questions | Shallow, Full pipeline |
| | Report quality (RACE + FACT metrics) | 100 topics | Deep researcher |
| | Document QA across categories | 900 problems | Deep researcher |