RAG Evaluation Type#
RAG (Retrieval-Augmented Generation) evaluation types measure the effectiveness of pipelines that retrieve relevant documents and generate answers grounded in the retrieved content. Use this evaluation type to assess both retrieval quality and answer quality in RAG systems.
Options#
Retrieval + Answer Generation + Answer Evaluation (Standard Data)#
{
  "type": "rag",
  "name": "rag-standard",
  "namespace": "my-organization",
  "tasks": {
    "my-beir-task": {
      "type": "beir",
      "dataset": {
        "files_url": "file://nfcorpus/"
      },
      "params": {
        "judge_llm": {
          "api_endpoint": {
            "url": "<my-judge-llm-url>",
            "model_id": "<my-judge-llm-model>"
          }
        },
        "judge_embeddings": {
          "api_endpoint": {
            "url": "<my-judge-embedding-url>",
            "model_id": "<my-judge-embedding-model>"
          }
        },
        "judge_timeout": 300,
        "judge_max_retries": 5,
        "judge_max_workers": 16
      },
      "metrics": {
        "recall_5": {"type": "recall_5"},
        "ndcg_cut_5": {"type": "ndcg_cut_5"},
        "recall_10": {"type": "recall_10"},
        "ndcg_cut_10": {"type": "ndcg_cut_10"},
        "faithfulness": {"type": "faithfulness"},
        "answer_relevancy": {"type": "answer_relevancy"}
      }
    }
  }
}
{
  "query": "What is the capital of France?",
  "retrieved_docs": [
    {"title": "France", "text": "Paris is the capital of France."}
  ],
  "reference": "Paris",
  "output": "Paris"
}
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.9},
            "faithfulness": {"value": 1.0},
            "answer_relevancy": {"value": 1.0}
          }
        }
      }
    }
  }
}
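Once an evaluation job finishes, the scores can be read straight out of the nested results payload. The following illustrative Python helper (the results.json file name is hypothetical) walks the structure shown above and prints each metric value:

import json

def extract_scores(results: dict) -> dict[str, float]:
    # Follows the sample output shape: groups -> evaluation -> metrics -> evaluation -> scores.
    scores = results["groups"]["evaluation"]["metrics"]["evaluation"]["scores"]
    return {name: entry["value"] for name, entry in scores.items()}

with open("results.json") as f:  # hypothetical file holding the results payload
    results = json.load(f)

for name, value in extract_scores(results).items():
    print(f"{name}: {value:.2f}")
# e.g. recall_5: 1.00, ndcg_cut_5: 0.90, faithfulness: 1.00, answer_relevancy: 1.00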
Retrieval + Answer Generation + Answer Evaluation (Custom Data)#
{
  "type": "rag",
  "name": "rag-custom",
  "namespace": "my-organization",
  "tasks": {
    "my-beir-task": {
      "type": "beir",
      "dataset": {
        "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
      },
      "params": {
        "judge_llm": {
          "api_endpoint": {
            "url": "<my-judge-llm-url>",
            "model_id": "<my-judge-llm-model>"
          }
        },
        "judge_embeddings": {
          "api_endpoint": {
            "url": "<my-judge-embedding-url>",
            "model_id": "<my-judge-embedding-model>"
          }
        },
        "judge_timeout": 300,
        "judge_max_retries": 5,
        "judge_max_workers": 16
      },
      "metrics": {
        "recall_5": {"type": "recall_5"},
        "ndcg_cut_5": {"type": "ndcg_cut_5"},
        "recall_10": {"type": "recall_10"},
        "ndcg_cut_10": {"type": "ndcg_cut_10"},
        "faithfulness": {"type": "faithfulness"},
        "answer_relevancy": {"type": "answer_relevancy"}
      }
    }
  }
}
{
  "query": "Who wrote Les Misérables?",
  "retrieved_docs": [
    {"title": "Les Misérables", "text": "Victor Hugo wrote Les Misérables."}
  ],
  "reference": "Victor Hugo",
  "output": "Victor Hugo"
}
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.95},
            "faithfulness": {"value": 1.0},
            "answer_relevancy": {"value": 1.0}
          }
        }
      }
    }
  }
}
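The hf:// files_url above points at a dataset hosted on the Hugging Face Hub. As a minimal sketch (assuming the huggingface_hub package, an HF_TOKEN environment variable, and a local folder my-beir-dataset/ containing the BEIR files described under Custom Dataset Format below), the dataset could be published like this:

import os
from huggingface_hub import HfApi

# Hypothetical repo and folder names; the evaluation config then references the files as
# hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>.
api = HfApi(token=os.environ["HF_TOKEN"])
api.create_repo("my-dataset-namespace/my-dataset-name", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="my-beir-dataset",
    repo_id="my-dataset-namespace/my-dataset-name",
    repo_type="dataset",
)

Keep in mind that the evaluation service needs read access to the repository, so private repositories require appropriate credentials.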
Answer Evaluation (Pre-generated Answers)#
{
  "type": "rag",
  "name": "rag-ans-eval-pregen",
  "namespace": "my-organization",
  "tasks": {
    "my-ragas-task": {
      "type": "ragas",
      "dataset": {
        "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
      },
      "params": {
        "judge_llm": {
          "api_endpoint": {
            "url": "<my-judge-llm-url>",
            "model_id": "<my-judge-llm-model>"
          }
        },
        "judge_embeddings": {
          "api_endpoint": {
            "url": "<my-query-embedding-url>",
            "model_id": "<my-query-embedding-model>"
          }
        },
        "judge_timeout": 300,
        "judge_max_retries": 5,
        "judge_max_workers": 16
      },
      "metrics": {
        "recall_5": {"type": "recall_5"},
        "ndcg_cut_5": {"type": "ndcg_cut_5"},
        "recall_10": {"type": "recall_10"},
        "ndcg_cut_10": {"type": "ndcg_cut_10"},
        "faithfulness": {"type": "faithfulness"}
      }
    }
  }
}
{
  "query": "What is the main theme of Les Misérables?",
  "reference": "Redemption",
  "output": "Redemption"
}
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.8},
            "faithfulness": {"value": 1.0}
          }
        }
      }
    }
  }
}
RAG (OpenAI-compatible Judge LLM)#
{
  "type": "rag",
  "name": "rag-openai-judge",
  "namespace": "my-organization",
  "tasks": {
    "my-beir-task": {
      "type": "beir",
      "dataset": {
        "files_url": "file://nfcorpus/"
      },
      "params": {
        "judge_llm": {
          "api_endpoint": {
            "url": "<my-nim-deployment-base-url>/completions",
            "model_id": "<my-model>",
            "api_key": "<openai-api-key>"
          }
        },
        "judge_embeddings": {
          "api_endpoint": {
            "url": "<my-query-embedding-url>",
            "model_id": "<my-query-embedding-model>",
            "api_key": "<openai-api-key>"
          }
        },
        "judge_timeout": 300,
        "judge_max_retries": 5,
        "judge_max_workers": 16
      },
      "metrics": {
        "recall_5": {"type": "recall_5"},
        "ndcg_cut_5": {"type": "ndcg_cut_5"},
        "recall_10": {"type": "recall_10"},
        "ndcg_cut_10": {"type": "ndcg_cut_10"},
        "faithfulness": {"type": "faithfulness"},
        "answer_relevancy": {"type": "answer_relevancy"}
      }
    }
  }
}
{
  "query": "What is the population of Paris?",
  "retrieved_docs": [
    {"title": "Paris", "text": "The population of Paris is over 2 million."}
  ],
  "reference": "over 2 million",
  "output": "over 2 million"
}
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.92},
            "faithfulness": {"value": 1.0},
            "answer_relevancy": {"value": 1.0}
          }
        }
      }
    }
  }
}
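Because the judge endpoints are OpenAI-compatible, a quick smoke test with the openai Python client can confirm that the judge LLM accepts requests before launching an evaluation. The sketch below is purely illustrative and assumes that <my-nim-deployment-base-url> is the API root that the /completions path in the config hangs off of:

import os
from openai import OpenAI

# The client appends /completions to base_url, matching the judge_llm URL in the config above.
client = OpenAI(
    base_url="<my-nim-deployment-base-url>",
    api_key=os.environ["OPENAI_API_KEY"],
)
response = client.completions.create(
    model="<my-model>",
    prompt="Reply with the single word OK.",
    max_tokens=5,
)
print(response.choices[0].text)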
Metrics#
The RAG pipeline evaluation includes two categories of metrics: document retrieval and answer generation.
Document Retrieval#
The following table summarizes the available document retrieval metrics for RAG evaluation:
| Metric Name | Description | How k is set | Notes |
|---|---|---|---|
| recall_k (e.g., recall_5, recall_10) | Fraction of relevant documents retrieved in the top k results (higher is better). | User sets k (1 to top_k). | top_k is the retriever's configured value. |
| ndcg_cut_k (e.g., ndcg_cut_5, ndcg_cut_10) | Normalized discounted cumulative gain (nDCG) at k, a measure of ranking quality in information retrieval (higher is better). | User sets k (1 to top_k). | nDCG is normalized for comparability across queries. |
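As a point of reference, both retrieval metrics can be sketched in a few lines. The Python below is illustrative only, using binary relevance judgments as in qrels.tsv (see Custom Dataset Format below); the service's own implementation may differ in edge-case handling.

import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Normalized discounted cumulative gain at k with binary gains."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # ranks are 0-based here
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one relevant document ("doc1") ranked second in the top 5.
print(recall_at_k(["doc9", "doc1", "doc4"], {"doc1"}, k=5))  # 1.0
print(ndcg_at_k(["doc9", "doc1", "doc4"], {"doc1"}, k=5))    # ~0.63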
Answer Generation#
The following table summarizes the available answer generation metrics for RAG evaluation, including their requirements and dependencies:
| Metric Name | Description | Dataset Format(s) | Required Columns | Eval Config Model Endpoints |
|---|---|---|---|---|
| faithfulness | Factual consistency of the answer against the retrieved context (0–1, higher is better). | beir, squad, ragas | question, answer, contexts | judge_llm |
| answer_relevancy | Relevancy of the answer to the prompt; penalizes incomplete or redundant answers. | beir, squad, ragas | question, answer | judge_llm, judge_embeddings |
| answer_correctness | Accuracy of the answer compared to the ground truth (0–1, higher is better). | ragas | question, answer, ground_truth | judge_llm, judge_embeddings |
| answer_similarity | Semantic similarity of the answer to the ground truth (0–1, higher is better). | ragas | ground_truth, answer | judge_llm, judge_embeddings |
| context_precision | Precision of context ranking for ground-truth items (0–1, higher is better). | ragas | question, contexts, ground_truth | judge_llm |
| context_recall | How well the retrieved context aligns with the ground-truth answer (0–1, higher is better). | ragas | question, contexts, ground_truth | judge_llm |
Legend:
judge_llm: The metric uses a large language model as a judge.
judge_embeddings: The metric uses embedding-based similarity.
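To make the two judge roles concrete, the sketch below uses an OpenAI-compatible client to ask a judge LLM whether an answer is supported by its context, and to compare two texts by cosine similarity of their embeddings. It is illustrative only, not how the service computes these metrics, and the endpoint URLs and model IDs are placeholders:

import math
from openai import OpenAI

llm_judge = OpenAI(base_url="<my-judge-llm-url>", api_key="<optional-api-key>")
emb_judge = OpenAI(base_url="<my-judge-embedding-url>", api_key="<optional-api-key>")

# judge_llm: ask a language model to grade the answer against the retrieved context.
verdict = llm_judge.chat.completions.create(
    model="<my-judge-llm-model>",
    messages=[{
        "role": "user",
        "content": "Context: Paris is the capital of France.\n"
                   "Answer: Paris\n"
                   "Is the answer supported by the context? Reply yes or no.",
    }],
)
print(verdict.choices[0].message.content)

# judge_embeddings: compare texts by cosine similarity of their embeddings.
vectors = emb_judge.embeddings.create(
    model="<my-judge-embedding-model>",
    input=["What is the capital of France?", "Paris is the capital of France."],
).data
a, b = vectors[0].embedding, vectors[1].embedding
cosine = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)
print(f"cosine similarity: {cosine:.3f}")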
Custom Dataset Format#
BEIR#
corpus.jsonl (BEIR)#
For BEIR, the corpus.jsonl file contains a list of dictionaries (one per line) with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| _id | string | Yes | Unique document identifier. |
| title | string | No | Document title. |
| text | string | Yes | Document paragraph or passage. |
{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
queries.jsonl (BEIR)#
The queries.jsonl file contains a list of dictionaries (one per line) with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| _id | string | Yes | Unique query identifier. |
| text | string | Yes | Query text. |
{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
qrels.tsv (BEIR)#
The qrels.tsv file is a tab-separated file with three columns: query-id, corpus-id, and score. The first row should be a header.
| Field | Type | Required | Description |
|---|---|---|---|
| query-id | string | Yes | Query identifier (matches _id in queries.jsonl). |
| corpus-id | string | Yes | Document identifier (matches _id in corpus.jsonl). |
| score | integer | Yes | Relevance score (typically 1 for relevant, 0 for not relevant). |
query-id corpus-id score
q1 doc1 1
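A minimal sketch for producing all three BEIR files described above; the example rows are illustrative, and the file and field names follow the tables:

import json

corpus = [{"_id": "doc1", "title": "Albert Einstein",
           "text": "Albert Einstein was a German-born theoretical physicist."}]
queries = [{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}]
qrels = [("q1", "doc1", 1)]  # (query-id, corpus-id, score)

with open("corpus.jsonl", "w") as f:
    f.writelines(json.dumps(doc) + "\n" for doc in corpus)

with open("queries.jsonl", "w") as f:
    f.writelines(json.dumps(q) + "\n" for q in queries)

with open("qrels.tsv", "w") as f:
    f.write("query-id\tcorpus-id\tscore\n")  # header row comes first
    f.writelines(f"{qid}\t{did}\t{score}\n" for qid, did, score in qrels)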
SQuAD#
squad.json (SQuAD)#
For SQuAD, the squad.json file contains question-answer pairs with their corresponding context passages in a structured format. It follows the official SQuAD dataset structure, with nested fields for data, paragraphs, and question-answer pairs.
| Field | Type | Required | Description |
|---|---|---|---|
| data | list of objects | Yes | List of data entries, each with paragraphs. |
| paragraphs | list of objects | Yes | List of paragraphs for each data entry. |
| context | string | Yes | Context passage for the questions. |
| document_id | string | Yes | Document identifier. |
| qas | list of objects | Yes | List of question-answer pairs. |
| question | string | Yes | The question being asked. |
| id | string | Yes | Unique identifier for the question. |
| answers | list of objects | Yes | List of answers, each with a text field. |
| text | string | Yes | The answer text (inside each answers entry). |
{
  "data": [
    {
      "paragraphs": [
        {
          "context": "my context",
          "document_id": "my id",
          "qas": [
            {
              "question": "my question",
              "id": "my id",
              "answers": [
                {"text": "my answer"}
              ]
            }
          ]
        }
      ]
    }
  ]
}
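A minimal sketch that assembles this nested structure from flat question, answer, and context records and writes it to squad.json (the record values are illustrative):

import json

records = [
    {"context": "my context", "document_id": "my id",
     "question": "my question", "qa_id": "my id", "answer": "my answer"},
]

squad = {"data": []}
for rec in records:
    squad["data"].append({
        "paragraphs": [{
            "context": rec["context"],
            "document_id": rec["document_id"],
            "qas": [{
                "question": rec["question"],
                "id": rec["qa_id"],
                "answers": [{"text": rec["answer"]}],
            }],
        }]
    })

with open("squad.json", "w") as f:
    json.dump(squad, f, indent=2)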
Ragas#
ragas.json (Ragas)#
For Ragas, the ragas.json file contains questions, contexts, answers, and ground truths for evaluating RAG systems. This format allows for comprehensive assessment of retrieval and generation quality.
| Field | Type | Required | Description |
|---|---|---|---|
| question | list of strings | Yes | List of questions. |
| contexts | list of lists of strings | No | List of context passages for each question; used for answer generation and some answer evaluation metrics. |
| answer | list of strings | No | List of predicted answers for each question; used for some answer evaluation metrics. |
| ground_truths | list of strings | No | List of ground truth answers for each question; used for some answer evaluation metrics. |
{
  "question": ["question #1", "question #2"],
  "contexts": [
    ["context #1 for question #1", "context #2 for question #1"],
    ["context #1 for question #2", "context #2 for question #2"]
  ],
  "answer": ["predicted answer for question #1", "predicted answer for question #2"],
  "ground_truths": ["ground truth answer for question #1", "ground truth answer for question #2"]
}
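A minimal sketch that builds ragas.json from per-question records and checks that every provided column stays aligned with the question list (the format is column-oriented):

import json

records = [
    {
        "question": "question #1",
        "contexts": ["context #1 for question #1", "context #2 for question #1"],
        "answer": "predicted answer for question #1",
        "ground_truth": "ground truth answer for question #1",
    },
]

ragas_data = {
    "question": [r["question"] for r in records],
    "contexts": [r["contexts"] for r in records],            # optional
    "answer": [r["answer"] for r in records],                # optional
    "ground_truths": [r["ground_truth"] for r in records],   # optional
}

# Every provided column must have exactly one entry per question.
for key, values in ragas_data.items():
    assert len(values) == len(ragas_data["question"]), f"length mismatch in '{key}'"

with open("ragas.json", "w") as f:
    json.dump(ragas_data, f, indent=2)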