Is this page helpful?

DeepSearchQA Evaluation for AI-Q Deep Researcher#

This directory contains the evaluation setup for running DeepSearchQA benchmark from DeepMind on the deep researcher agent.

Overview#

DeepSearchQA is a question-answering benchmark designed to evaluate deep research capabilities. It requires models to search for information and provide accurate answers to complex questions across diverse categories.

Dataset Statistics#

Total Problems: 900
Answer Types: Single Answer, Set Answer
Categories: Politics & Government, Media & Entertainment, Education, Geography, Health, Science, Finance & Economics, Sports, Travel, History, Other

Evaluation Methodology#

Uses the official DeepMind LLM-as-judge methodology from the starter code:

Correctness Details: Per-answer-component correctness assessment
Excessive Answers: Detection of answers not in the ground truth
Metrics:
- Precision: Correct answers / (Correct + Excessive answers)
- Recall: Correct answers / Expected answers
- F1 Score: Harmonic mean of precision and recall
- Accuracy: Percentage of problems with all correct answers and no excessive answers

Prerequisites#

Dataset Setup#

The dataset is not included in the repository. Download it before running evaluation:

Download from Kaggle - DeepSearchQA
Place DSQA-full.csv in frontends/benchmarks/deepsearch_qa/data/

Judge model and API key#

The evaluator uses an LLM judge to score answers. The default config (config_deepsearch_qa.yml) uses OpenAI GPT-4o as the judge.

Choose a judge model (for example, OpenAI GPT-4o or Gemini 2.5 Flash).
Obtain an API key for the provider you chose.
Set the key in deploy/.env (for example, OPENAI_API_KEY=your_key) or export it.
To use a different judge (for example, Gemini), add a Gemini LLM under llms: in the config and set eval.evaluators.deepsearchqa.llm_name to that LLM name.

Other API keys#

Set in deploy/.env: NVIDIA_API_KEY (agent), TAVILY_API_KEY (web search).

Quick Start#

dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/deepsearch_qa/configs/config_deepsearch_qa.yml

Results are written to frontends/benchmarks/deepsearch_qa/results (or the output_dir in the config).

Scoring#

100 points: All expected answers correct, no excessive answers
75 points: All expected answers correct, but has excessive answers
0-50 points: Partial correctness (scaled by proportion correct)
0 points: No correct answers or empty response

References#

Configuration Files#

Config	Description
`configs/config_deepsearch_qa.yml`	Default: Nemotron on integrate.api, OpenAI judge. Use for quickstart.