Is this page helpful?

FreshQA Evaluator#

A custom NeMo Agent Toolkit evaluator implementing the FreshEval Relaxed methodology for evaluating factual accuracy of model responses.

Overview#

This evaluator implements the FreshEval Relaxed evaluation methodology from FreshLLMs. It evaluates model responses under relaxed criteria where hallucinations, outdated information, and ill-formed answers are allowed, as long as the primary answer is accurate.

Installation#

# From the repository root
uv pip install -e ./frontends/benchmarks/freshqa

Prerequisites#

Judge model and API key#

The FreshQA evaluator uses an LLM judge. The default configs use OpenAI GPT-4o as the judge.

Choose a judge model (for example, OpenAI GPT-4o or Gemini 2.5 Flash).
Obtain an API key and set it in deploy/.env (for example, OPENAI_API_KEY=your_key).
To use a different judge, add that LLM under llms: and set eval.evaluators.freshqa.llm_name to its name.

Other API keys#

Set in deploy/.env: NVIDIA_API_KEY (agent), TAVILY_API_KEY (web search).

Quick Start#

# Shallow research only
dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/freshqa/configs/config_shallow_research_only.yml

# Full workflow (orchestration + research agents)
dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/freshqa/configs/config_full_workflow.yml

Results go to frontends/benchmarks/freshqa/results (or the config’s output_dir).

Evaluation Methodology#

The FreshEval Relaxed methodology:

Relaxed Criteria: Allows hallucinations, outdated information, and ill-formed answers as long as the primary answer is accurate.
Confident Answers Required: Credits responses only if they provide a confident and definitive answer, or the correct answer can be obviously inferred.
False Premise Handling: For false-premise questions, the response must explicitly point out the presence of a false premise to receive credit.
Name Accuracy: For answers involving names of entities (for example, people), complete names or commonly recognized names are expected.
Numerical Precision: Approximate numbers are generally not accepted unless explicitly included in the ground-truth answers.

Output Metrics#

The evaluator produces:

accuracy: An AccuracyBreakdown object containing per-category accuracy breakdowns (by fact type, number of hops, and false premise status)
total_correct: Number of correctly answered questions
total_evaluated: Total number of items evaluated
average_score: Overall accuracy as a 0-1 ratio (for example, 0.75 means 75% correct)

Each item includes detailed reasoning with:

is_correct: Boolean indicating if the response was correct
rating: “TRUE” or “FALSE”
explanation: LLM’s explanation for the rating
question, model_response, correct_answers: Context for the evaluation

Dataset Format#

The evaluator expects a CSV file with the following columns:

question: The question to be answered
answer_0 through answer_9: Acceptable correct answers (can have multiple)
split: Optional filter column (for example, “TEST”, “DEV”)

References#

Configuration Files#

Config	Description
`configs/config_full_workflow.yml`	Full workflow with OpenAI judge. Use for quickstart.
`configs/config_shallow_research_only.yml`	Shallow research only with OpenAI judge.

FreshQA Dataset Intro

The FreshQA benchmark dataset is designed to evaluate how well language models handle questions requiring up-to-date world knowledge.

Dataset Overview#

FreshQA categorizes questions along three key dimensions:

Dimension	Values	Description
Fact Type	`never-changing`, `slow-changing`, `fast-changing`	How frequently the answer changes over time
Num Hops	`one-hop`, `multi-hop`	Whether the question requires single or chained reasoning
False Premise	`True`, `False`	Whether the question contains an incorrect assumption

Never-Changing Facts#

These questions have answers that remain constant over time.

One-Hop Examples#

Q: What is the largest mammal in the world? A: Blue whale

Q: Who founded Amazon? A: Jeff Bezos

Q: What is the capital of the commonwealth of Massachusetts? A: Boston

Q: On what date did the Berlin Wall fall? A: November 9, 1989

Q: Who painted The Starry Night? A: Vincent van Gogh

Multi-Hop Examples#

Q: What’s the capital of the largest state in America? A: Juneau (Alaska -> capital is Juneau)

Q: Which member of The Beatles was born first? A: Ringo Starr

Q: Where was the primary designer of AlexNet born? A: Ukraine (Alex Krizhevsky -> born in Ukraine)

Slow-Changing Facts#

These questions have answers that change infrequently, perhaps yearly or less often.

One-Hop Examples#

Q: Who is the CEO of Google? A: Sundar Pichai

Q: How many grand slams does Rafael Nadal have? A: 22

Q: What is the best-selling video game franchise of all time? A: Mario

Q: What is the top-ranked university in the US according to the US News Ranking? A: Princeton University

Q: Who is the current German chancellor? A: Friedrich Merz

Multi-Hop Examples#

Q: Which country has President Joe Biden visited the most during his presidency so far? A: United Kingdom

Q: Who has the most liked post on Instagram? A: Lionel Messi

Fast-Changing Facts#

These questions have answers that can change frequently, sometimes daily or weekly.

One-Hop Examples#

Q: Who is the richest man on earth? A: Elon Musk (as of dataset creation)

Q: Who is the current No. 1 on the Billboard Artist 100? A: Taylor Swift (as of dataset creation)

Q: What is the latest highest-grossing movie of the week at the Box office? A: (Changes weekly)

Q: What is Geoff Hinton’s h-index? A: 192 (as of dataset creation)

Multi-Hop Examples#

Q: How old is the richest person in the world? A: 54 years old (Elon Musk’s age at dataset creation)

Q: In what country was the most recent Time Person of the Year born? A: USA (Donald Trump -> born in USA)

False Premise Questions#

These questions contain incorrect assumptions that the model must identify and address.

Factual Corrections#

Q: What is the name of Leonardo DiCaprio’s third child? A: Leonardo DiCaprio does not have any children.

Q: When did the UK adopt the Euro? A: The United Kingdom has never adopted the Euro.

Q: How many home runs did Real Madrid score? A: Real Madrid is a professional soccer club and thus cannot score home runs. In soccer, players score goals, not home runs.

Temporal Corrections#

Q: When was the iPhone 9 released? A: Apple has never released an iPhone 9 nor does it plan to.

Q: When did the first human land on Mars? A: No humans have been to Mars yet.

Logical Corrections#

Q: By how much is 3 bigger than 4? A: 3 is smaller than 4 by 1.

Q: Which antibiotics are most effective against the flu? A: Antibiotics are only effective against bacteria while the flu is a virus.

Dataset Statistics#

Category	Count
Total Questions	600
TEST Split	500
DEV Split	100

By Fact Type#

Never-changing: Questions with permanent answers
Slow-changing: Questions reviewed occasionally or yearly
Fast-changing: Questions requiring frequent updates

By Reasoning Complexity#

One-hop: Direct factual lookups
Multi-hop: Requires chaining multiple facts together

Source#

FreshQA benchmark dataset: FreshLLMs GitHub

For more information about the FreshQA benchmark methodology, refer to the original research paper.