FreshQA Evaluator#
A custom NeMo Agent Toolkit evaluator implementing the FreshEval Relaxed methodology for evaluating factual accuracy of model responses.
Overview#
This evaluator implements the FreshEval Relaxed evaluation methodology from FreshLLMs. It evaluates model responses under relaxed criteria where hallucinations, outdated information, and ill-formed answers are allowed, as long as the primary answer is accurate.
Installation#
# From the repository root
uv pip install -e ./frontends/benchmarks/freshqa
Prerequisites#
Judge model and API key#
The FreshQA evaluator uses an LLM judge. The default configs use OpenAI GPT-4o as the judge.
Choose a judge model (for example, OpenAI GPT-4o or Gemini 2.5 Flash).
Obtain an API key and set it in deploy/.env (for example, OPENAI_API_KEY=your_key).
To use a different judge, add that LLM under llms: and set eval.evaluators.freshqa.llm_name to its name.
Other API keys#
Set in deploy/.env: NVIDIA_API_KEY (agent), TAVILY_API_KEY (web search).
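Before running an evaluation, it can help to confirm that the keys are actually present. The following is a minimal sketch and not part of the toolkit: `missing_keys` is a hypothetical helper that reads simple KEY=VALUE lines and reports which of the keys named above are absent or empty. It assumes the default OpenAI judge and does not handle the full dotenv syntax (quoting, `export` prefixes, and so on).

```python
# Keys named in this section; OPENAI_API_KEY applies only when the judge is an OpenAI model.
REQUIRED_KEYS = ("OPENAI_API_KEY", "NVIDIA_API_KEY", "TAVILY_API_KEY")


def missing_keys(env_text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in .env-style text."""
    found = {}
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        found[key.strip()] = value.strip()
    return [key for key in required if not found.get(key)]


example = "OPENAI_API_KEY=your_key\n# NVIDIA key not set yet\nNVIDIA_API_KEY=\n"
print(missing_keys(example))  # ['NVIDIA_API_KEY', 'TAVILY_API_KEY']
```

To check the real file, pass the contents of deploy/.env to the same function, for example `missing_keys(open("deploy/.env").read())`.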
Quick Start#
# Shallow research only
dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/freshqa/configs/config_shallow_research_only.yml
# Full workflow (orchestration + research agents)
dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/freshqa/configs/config_full_workflow.yml
Results go to frontends/benchmarks/freshqa/results (or the config’s output_dir).
Evaluation Methodology#
The FreshEval Relaxed methodology:
Relaxed Criteria: Allows hallucinations, outdated information, and ill-formed answers as long as the primary answer is accurate.
Confident Answers Required: Credits responses only if they provide a confident and definitive answer, or the correct answer can be obviously inferred.
False Premise Handling: For false-premise questions, the response must explicitly point out the presence of a false premise to receive credit.
Name Accuracy: For answers involving names of entities (for example, people), complete names or commonly recognized names are expected.
Numerical Precision: Approximate numbers are generally not accepted unless explicitly included in the ground-truth answers.
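To make the criteria concrete, here is a rough sketch of how they could be packed into a judge prompt and how a TRUE/FALSE verdict could be read back. This is an illustration only: `build_judge_prompt` and `parse_rating` are hypothetical names, and the prompt wording paraphrases the list above rather than reproducing the actual FreshEval prompt.

```python
import re

# Paraphrase of the relaxed criteria listed above (not the official prompt text).
RELAXED_RULES = [
    "Credit the response if the primary answer is accurate, even if it also "
    "contains hallucinated, outdated, or ill-formed content.",
    "Credit only confident, definitive answers, or answers that can be obviously inferred.",
    "For a false-premise question, the response must explicitly point out the false premise.",
    "Names of entities must be complete or commonly recognized.",
    "Approximate numbers are not accepted unless a ground-truth answer includes them.",
]


def build_judge_prompt(question, correct_answers, model_response):
    """Assemble a plain-text prompt for an LLM judge (hypothetical layout)."""
    rules = "\n".join(f"- {rule}" for rule in RELAXED_RULES)
    answers = " | ".join(correct_answers)
    return (
        "Evaluate the response under these rules:\n"
        f"{rules}\n\n"
        f"question: {question}\n"
        f"correct answer(s): {answers}\n"
        f"response: {model_response}\n"
        "Give a brief explanation, then end with 'rating: TRUE' or 'rating: FALSE'."
    )


def parse_rating(judge_output):
    """Return True/False from the last standalone TRUE or FALSE token, or None if absent."""
    tokens = re.findall(r"\b(TRUE|FALSE)\b", judge_output)
    if not tokens:
        return None
    return tokens[-1] == "TRUE"
```

Reading the last TRUE/FALSE token, rather than the first, is a design choice in this sketch: it tolerates explanations that mention either word before the final verdict.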
Output Metrics#
The evaluator produces:
accuracy: An AccuracyBreakdown object containing per-category accuracy breakdowns (by fact type, number of hops, and false premise status)
total_correct: Number of correctly answered questions
total_evaluated: Total number of items evaluated
average_score: Overall accuracy as a 0-1 ratio (for example, 0.75 means 75% correct)
Each item includes detailed reasoning with:
is_correct: Boolean indicating if the response was correct
rating: “TRUE” or “FALSE”
explanation: LLM’s explanation for the rating
question, model_response, correct_answers: Context for the evaluation
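As an illustration of how the summary fields relate to the per-item results, the sketch below aggregates a list of item records. It assumes each item is a plain dict carrying the `is_correct` field described above; `summarize` is a hypothetical helper, not the evaluator's actual code.

```python
def summarize(items):
    """Compute total_correct, total_evaluated, and average_score from per-item results."""
    total_evaluated = len(items)
    total_correct = sum(1 for item in items if item["is_correct"])
    # Guard against an empty run to avoid division by zero.
    average_score = total_correct / total_evaluated if total_evaluated else 0.0
    return {
        "total_correct": total_correct,
        "total_evaluated": total_evaluated,
        "average_score": average_score,
    }


items = [
    {"question": "Who founded Amazon?", "is_correct": True, "rating": "TRUE"},
    {"question": "Who painted The Starry Night?", "is_correct": True, "rating": "TRUE"},
    {"question": "When did the UK adopt the Euro?", "is_correct": False, "rating": "FALSE"},
    {"question": "Who is the CEO of Google?", "is_correct": True, "rating": "TRUE"},
]
print(summarize(items))
# {'total_correct': 3, 'total_evaluated': 4, 'average_score': 0.75}
```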
Dataset Format#
The evaluator expects a CSV file with the following columns:
question: The question to be answered
answer_0 through answer_9: Acceptable correct answers (can have multiple)
split: Optional filter column (for example, “TEST”, “DEV”)
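A file in this layout can be read with the standard library alone. The sketch below is one possible reading of the format, not the evaluator's loader: `load_questions` is a hypothetical helper, and the exact-match comparison on the split column is an assumption.

```python
import csv
import io

ANSWER_COLUMNS = [f"answer_{i}" for i in range(10)]  # answer_0 through answer_9


def load_questions(csv_file, split=None):
    """Read rows from a file-like object, keeping non-empty answers.

    If split is given, only rows whose split column equals it are returned.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        if split is not None and (row.get("split") or "").strip() != split:
            continue
        # Columns that are missing or blank are simply skipped.
        answers = [
            row[col].strip()
            for col in ANSWER_COLUMNS
            if (row.get(col) or "").strip()
        ]
        rows.append({"question": row["question"].strip(), "correct_answers": answers})
    return rows


sample = io.StringIO(
    "question,answer_0,answer_1,split\n"
    "Who founded Amazon?,Jeff Bezos,Bezos,TEST\n"
    "Who painted The Starry Night?,Vincent van Gogh,,DEV\n"
)
print(load_questions(sample, split="TEST"))
# [{'question': 'Who founded Amazon?', 'correct_answers': ['Jeff Bezos', 'Bezos']}]
```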
References#
Configuration Files#
| Config | Description |
|---|---|
| config_full_workflow.yml | Full workflow with OpenAI judge. Use for quickstart. |
| config_shallow_research_only.yml | Shallow research only with OpenAI judge. |
FreshQA Dataset Intro
The FreshQA benchmark dataset is designed to evaluate how well language models handle questions requiring up-to-date world knowledge.
Dataset Overview#
FreshQA categorizes questions along three key dimensions:
| Dimension | Values | Description |
|---|---|---|
| Fact Type | never-changing, slow-changing, fast-changing | How frequently the answer changes over time |
| Num Hops | one-hop, multi-hop | Whether the question requires single or chained reasoning |
| False Premise | | Whether the question contains an incorrect assumption |
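The per-category accuracy breakdown reported by the evaluator groups results along these three dimensions. A minimal sketch of that grouping follows, assuming each result is a dict with hypothetical `fact_type`, `num_hops`, and `false_premise` keys; the actual dataset column names, value labels, and `AccuracyBreakdown` fields may differ.

```python
from collections import defaultdict


def accuracy_by(results, dimension):
    """Return {value: accuracy} for one dimension, for example 'fact_type'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for result in results:
        value = result[dimension]
        total[value] += 1
        correct[value] += int(result["is_correct"])
    return {value: correct[value] / total[value] for value in total}


# Illustrative records only; labels are assumptions, not dataset values.
results = [
    {"fact_type": "never-changing", "num_hops": "one-hop", "false_premise": False, "is_correct": True},
    {"fact_type": "fast-changing", "num_hops": "one-hop", "false_premise": False, "is_correct": False},
    {"fact_type": "fast-changing", "num_hops": "multi-hop", "false_premise": False, "is_correct": True},
    {"fact_type": "never-changing", "num_hops": "one-hop", "false_premise": True, "is_correct": True},
]
for dimension in ("fact_type", "num_hops", "false_premise"):
    print(dimension, accuracy_by(results, dimension))
```

Because every value in a dimension is counted before dividing, each group is guaranteed at least one item and the division is safe.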
Never-Changing Facts#
These questions have answers that remain constant over time.
One-Hop Examples#
Q: What is the largest mammal in the world? A: Blue whale
Q: Who founded Amazon? A: Jeff Bezos
Q: What is the capital of the commonwealth of Massachusetts? A: Boston
Q: On what date did the Berlin Wall fall? A: November 9, 1989
Q: Who painted The Starry Night? A: Vincent van Gogh
Multi-Hop Examples#
Q: What’s the capital of the largest state in America? A: Juneau (Alaska -> capital is Juneau)
Q: Which member of The Beatles was born first? A: Ringo Starr
Q: Where was the primary designer of AlexNet born? A: Ukraine (Alex Krizhevsky -> born in Ukraine)
Slow-Changing Facts#
These questions have answers that change infrequently, perhaps yearly or less often.
One-Hop Examples#
Q: Who is the CEO of Google? A: Sundar Pichai
Q: How many grand slams does Rafael Nadal have? A: 22
Q: What is the best-selling video game franchise of all time? A: Mario
Q: What is the top-ranked university in the US according to the US News Ranking? A: Princeton University
Q: Who is the current German chancellor? A: Friedrich Merz
Multi-Hop Examples#
Q: Which country has President Joe Biden visited the most during his presidency so far? A: United Kingdom
Q: Who has the most liked post on Instagram? A: Lionel Messi
Fast-Changing Facts#
These questions have answers that can change frequently, sometimes daily or weekly.
One-Hop Examples#
Q: Who is the richest man on earth? A: Elon Musk (as of dataset creation)
Q: Who is the current No. 1 on the Billboard Artist 100? A: Taylor Swift (as of dataset creation)
Q: What is the latest highest-grossing movie of the week at the Box office? A: (Changes weekly)
Q: What is Geoff Hinton’s h-index? A: 192 (as of dataset creation)
Multi-Hop Examples#
Q: How old is the richest person in the world? A: 54 years old (Elon Musk’s age at dataset creation)
Q: In what country was the most recent Time Person of the Year born? A: USA (Donald Trump -> born in USA)
False Premise Questions#
These questions contain incorrect assumptions that the model must identify and address.
Factual Corrections#
Q: What is the name of Leonardo DiCaprio’s third child? A: Leonardo DiCaprio does not have any children.
Q: When did the UK adopt the Euro? A: The United Kingdom has never adopted the Euro.
Q: How many home runs did Real Madrid score? A: Real Madrid is a professional soccer club and thus cannot score home runs. In soccer, players score goals, not home runs.
Temporal Corrections#
Q: When was the iPhone 9 released? A: Apple has never released an iPhone 9 nor does it plan to.
Q: When did the first human land on Mars? A: No humans have been to Mars yet.
Logical Corrections#
Q: By how much is 3 bigger than 4? A: 3 is smaller than 4 by 1.
Q: Which antibiotics are most effective against the flu? A: Antibiotics are only effective against bacteria while the flu is a virus.
Dataset Statistics#
| Category | Count |
|---|---|
| Total Questions | 600 |
| TEST Split | 500 |
| DEV Split | 100 |
By Fact Type#
Never-changing: Questions with permanent answers
Slow-changing: Questions reviewed occasionally or yearly
Fast-changing: Questions requiring frequent updates
By Reasoning Complexity#
One-hop: Direct factual lookups
Multi-hop: Requires chaining multiple facts together
Source#
FreshQA benchmark dataset: FreshLLMs GitHub
For more information about the FreshQA benchmark methodology, refer to the original research paper.