RAG Evaluation Type#
RAG (Retrieval-Augmented Generation) evaluation types measure the effectiveness of pipelines that retrieve relevant documents and generate answers grounded in the retrieved content. Use this evaluation type to assess both retrieval quality and answer quality in RAG systems.
Options#
Retrieval + Answer Generation + Answer Evaluation (Standard Data)#
{
  "type": "rag",
  "name": "rag-standard",
  "namespace": "my-organization",
  "tasks": {
    "my-beir-task": {
      "type": "beir",
      "dataset": {
        "files_url": "file://nfcorpus/"
      },
      "params": {
        "judge_llm": {
          "api_endpoint": {
            "url": "<my-judge-llm-url>",
            "model_id": "<my-judge-llm-model>"
          }
        },
        "judge_embeddings": {
          "api_endpoint": {
            "url": "<my-judge-embedding-url>",
            "model_id": "<my-judge-embedding-model>"
          }
        },
        "judge_timeout": 300,
        "judge_max_retries": 5,
        "judge_max_workers": 16
      },
      "metrics": {
        "recall_5": {"type": "recall_5"},
        "ndcg_cut_5": {"type": "ndcg_cut_5"},
        "recall_10": {"type": "recall_10"},
        "ndcg_cut_10": {"type": "ndcg_cut_10"},
        "faithfulness": {"type": "faithfulness"},
        "answer_relevancy": {"type": "answer_relevancy"}
      }
    }
  }
}
{
  "query": "What is the capital of France?",
  "retrieved_docs": [
    {"title": "France", "text": "Paris is the capital of France."}
  ],
  "reference": "Paris",
  "output": "Paris"
}
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.9},
            "faithfulness": {"value": 1.0},
            "answer_relevancy": {"value": 1.0}
          }
        }
      }
    }
  }
}
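Once an evaluation job finishes, the scores can be read straight out of the nested results payload. The following illustrative Python helper (the results.json file name is hypothetical) walks the structure shown above and prints each metric value:

import json

def extract_scores(results: dict) -> dict[str, float]:
    # Follows the sample output shape: groups -> evaluation -> metrics -> evaluation -> scores.
    scores = results["groups"]["evaluation"]["metrics"]["evaluation"]["scores"]
    return {name: entry["value"] for name, entry in scores.items()}

with open("results.json") as f:  # hypothetical file holding the results payload
    results = json.load(f)

for name, value in extract_scores(results).items():
    print(f"{name}: {value:.2f}")
# e.g. recall_5: 1.00, ndcg_cut_5: 0.90, faithfulness: 1.00, answer_relevancy: 1.00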
Retrieval + Answer Generation + Answer Evaluation (Custom Data)#
{
  "type": "rag",
  "name": "rag-custom",
  "namespace": "my-organization",
  "tasks": {
    "my-beir-task": {
      "type": "beir",
      "dataset": {
        "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
      },
      "params": {
        "judge_llm": {
          "api_endpoint": {
            "url": "<my-judge-llm-url>",
            "model_id": "<my-judge-llm-model>"
          }
        },
        "judge_embeddings": {
          "api_endpoint": {
            "url": "<my-judge-embedding-url>",
            "model_id": "<my-judge-embedding-model>"
          }
        },
        "judge_timeout": 300,
        "judge_max_retries": 5,
        "judge_max_workers": 16
      },
      "metrics": {
        "recall_5": {"type": "recall_5"},
        "ndcg_cut_5": {"type": "ndcg_cut_5"},
        "recall_10": {"type": "recall_10"},
        "ndcg_cut_10": {"type": "ndcg_cut_10"},
        "faithfulness": {"type": "faithfulness"},
        "answer_relevancy": {"type": "answer_relevancy"}
      }
    }
  }
}
{
  "query": "Who wrote Les Misérables?",
  "retrieved_docs": [
    {"title": "Les Misérables", "text": "Victor Hugo wrote Les Misérables."}
  ],
  "reference": "Victor Hugo",
  "output": "Victor Hugo"
}
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.95},
            "faithfulness": {"value": 1.0},
            "answer_relevancy": {"value": 1.0}
          }
        }
      }
    }
  }
}
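The hf:// files_url above points at a dataset hosted on the Hugging Face Hub. As a minimal sketch (assuming the huggingface_hub package, an HF_TOKEN environment variable, and a local folder my-beir-dataset/ containing the BEIR files described under Custom Dataset Format below), the dataset could be published like this:

import os
from huggingface_hub import HfApi

# Hypothetical repo and folder names; the evaluation config then references the files as
# hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>.
api = HfApi(token=os.environ["HF_TOKEN"])
api.create_repo("my-dataset-namespace/my-dataset-name", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="my-beir-dataset",
    repo_id="my-dataset-namespace/my-dataset-name",
    repo_type="dataset",
)

Keep in mind that the evaluation service needs read access to the repository, so private repositories require appropriate credentials.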
Answer Evaluation (Pre-generated Answers)#
{
  "type": "rag",
  "name": "rag-ans-eval-pregen",
  "namespace": "my-organization",
  "tasks": {
    "my-ragas-task": {
      "type": "ragas",
      "dataset": {
        "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
      },
      "params": {
        "judge_llm": {
          "api_endpoint": {
            "url": "<my-judge-llm-url>",
            "model_id": "<my-judge-llm-model>"
          }
        },
        "judge_embeddings": {
          "api_endpoint": {
            "url": "<my-query-embedding-url>",
            "model_id": "<my-query-embedding-model>"
          }
        },
        "judge_timeout": 300,
        "judge_max_retries": 5,
        "judge_max_workers": 16
      },
      "metrics": {
        "recall_5": {"type": "recall_5"},
        "ndcg_cut_5": {"type": "ndcg_cut_5"},
        "recall_10": {"type": "recall_10"},
        "ndcg_cut_10": {"type": "ndcg_cut_10"},
        "faithfulness": {"type": "faithfulness"}
      }
    }
  }
}
{
  "query": "What is the main theme of Les Misérables?",
  "reference": "Redemption",
  "output": "Redemption"
}
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.8},
            "faithfulness": {"value": 1.0}
          }
        }
      }
    }
  }
}
RAG (OpenAI-compatible Judge LLM)#
{
  "type": "rag",
  "name": "rag-openai-judge",
  "namespace": "my-organization",
  "tasks": {
    "my-beir-task": {
      "type": "beir",
      "dataset": {
        "files_url": "file://nfcorpus/"
      },
      "params": {
        "judge_llm": {
          "api_endpoint": {
            "url": "<my-nim-deployment-base-url>/completions",
            "model_id": "<my-model>",
            "api_key": "<openai-api-key>"
          }
        },
        "judge_embeddings": {
          "api_endpoint": {
            "url": "<my-query-embedding-url>",
            "model_id": "<my-query-embedding-model>",
            "api_key": "<openai-api-key>"
          }
        },
        "judge_timeout": 300,
        "judge_max_retries": 5,
        "judge_max_workers": 16
      },
      "metrics": {
        "recall_5": {"type": "recall_5"},
        "ndcg_cut_5": {"type": "ndcg_cut_5"},
        "recall_10": {"type": "recall_10"},
        "ndcg_cut_10": {"type": "ndcg_cut_10"},
        "faithfulness": {"type": "faithfulness"},
        "answer_relevancy": {"type": "answer_relevancy"}
      }
    }
  }
}
{
  "query": "What is the population of Paris?",
  "retrieved_docs": [
    {"title": "Paris", "text": "The population of Paris is over 2 million."}
  ],
  "reference": "over 2 million",
  "output": "over 2 million"
}
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.92},
            "faithfulness": {"value": 1.0},
            "answer_relevancy": {"value": 1.0}
          }
        }
      }
    }
  }
}
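Because the judge endpoints are OpenAI-compatible, a quick smoke test with the openai Python client can confirm that the judge LLM accepts requests before launching an evaluation. The sketch below is purely illustrative and assumes that <my-nim-deployment-base-url> is the API root that the /completions path in the config hangs off of:

import os
from openai import OpenAI

# The client appends /completions to base_url, matching the judge_llm URL in the config above.
client = OpenAI(
    base_url="<my-nim-deployment-base-url>",
    api_key=os.environ["OPENAI_API_KEY"],
)
response = client.completions.create(
    model="<my-model>",
    prompt="Reply with the single word OK.",
    max_tokens=5,
)
print(response.choices[0].text)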
Metrics#
The RAG pipeline evaluation includes two categories of metrics: document retrieval and answer generation.
Document Retrieval#
The following table summarizes the available document retrieval metrics for RAG evaluation:
| Metric Name | Description | How k is set | Notes |
|---|---|---|---|
| recall_k (e.g., recall_5, recall_10) | Fraction of relevant documents retrieved in the top k results (higher is better). | User sets k (1 to top_k). | top_k is the retriever's configured value. |
| ndcg_cut_k (e.g., ndcg_cut_5, ndcg_cut_10) | Normalized discounted cumulative gain (nDCG) at k, a measure of ranking quality in information retrieval (higher is better). | User sets k (1 to top_k). | nDCG is normalized for comparability across queries. |
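As a point of reference, both retrieval metrics can be sketched in a few lines. The Python below is illustrative only, using binary relevance judgments as in qrels.tsv (see Custom Dataset Format below); the service's own implementation may differ in edge-case handling.

import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Normalized discounted cumulative gain at k with binary gains."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # ranks are 0-based here
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one relevant document ("doc1") ranked second in the top 5.
print(recall_at_k(["doc9", "doc1", "doc4"], {"doc1"}, k=5))  # 1.0
print(ndcg_at_k(["doc9", "doc1", "doc4"], {"doc1"}, k=5))    # ~0.63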
Answer Generation#
The following table summarizes the available answer generation metrics for RAG evaluation, including their requirements and dependencies:
| Metric Name | Description | Dataset Format(s) | Required Columns | Eval Config Model Endpoints |
|---|---|---|---|---|
| faithfulness | Factual consistency of the answer against the retrieved context (0–1, higher is better). | beir, squad, ragas | question, answer, contexts | judge_llm |
| answer_relevancy | Relevancy of the answer to the prompt; penalizes incomplete or redundant answers. | beir, squad, ragas | question, answer | judge_llm, judge_embeddings |
| answer_correctness | Accuracy of the answer compared to the ground truth (0–1, higher is better). | ragas | question, answer, ground_truth | judge_llm, judge_embeddings |
| answer_similarity | Semantic similarity of the answer to the ground truth (0–1, higher is better). | ragas | ground_truth, answer | judge_llm, judge_embeddings |
| context_precision | Precision of context ranking for ground-truth items (0–1, higher is better). | ragas | question, contexts, ground_truth | judge_llm |
| context_recall | How well the retrieved context aligns with the ground-truth answer (0–1, higher is better). | ragas | question, contexts, ground_truth | judge_llm |
Legend:
judge_llm: The metric uses a large language model as a judge.
judge_embeddings: The metric uses embedding-based similarity.
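To make the two judge roles concrete, the sketch below uses an OpenAI-compatible client to ask a judge LLM whether an answer is supported by its context, and to compare two texts by cosine similarity of their embeddings. It is illustrative only, not how the service computes these metrics, and the endpoint URLs and model IDs are placeholders:

import math
from openai import OpenAI

llm_judge = OpenAI(base_url="<my-judge-llm-url>", api_key="<optional-api-key>")
emb_judge = OpenAI(base_url="<my-judge-embedding-url>", api_key="<optional-api-key>")

# judge_llm: ask a language model to grade the answer against the retrieved context.
verdict = llm_judge.chat.completions.create(
    model="<my-judge-llm-model>",
    messages=[{
        "role": "user",
        "content": "Context: Paris is the capital of France.\n"
                   "Answer: Paris\n"
                   "Is the answer supported by the context? Reply yes or no.",
    }],
)
print(verdict.choices[0].message.content)

# judge_embeddings: compare texts by cosine similarity of their embeddings.
vectors = emb_judge.embeddings.create(
    model="<my-judge-embedding-model>",
    input=["What is the capital of France?", "Paris is the capital of France."],
).data
a, b = vectors[0].embedding, vectors[1].embedding
cosine = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)
print(f"cosine similarity: {cosine:.3f}")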
Custom Dataset Format#
BEIR#
corpus.jsonl (BEIR)#
For BEIR, the corpus.jsonl file contains a list of dictionaries (one per line) with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| _id | string | Yes | Unique document identifier. |
| title | string | No | Document title. |
| text | string | Yes | Document paragraph or passage. |
{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
queries.jsonl (BEIR)#
The queries.jsonl file contains a list of dictionaries (one per line) with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| _id | string | Yes | Unique query identifier. |
| text | string | Yes | Query text. |
{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
qrels.tsv (BEIR)#
The qrels.tsv file is a tab-separated file with three columns: query-id, corpus-id, and score. The first row should be a header.
| Field | Type | Required | Description |
|---|---|---|---|
| query-id | string | Yes | Query identifier (matches _id in queries.jsonl). |
| corpus-id | string | Yes | Document identifier (matches _id in corpus.jsonl). |
| score | integer | Yes | Relevance score (typically 1 for relevant, 0 for not relevant). |
query-id corpus-id score
q1 doc1 1
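A minimal sketch for producing all three BEIR files described above; the example rows are illustrative, and the file and field names follow the tables:

import json

corpus = [{"_id": "doc1", "title": "Albert Einstein",
           "text": "Albert Einstein was a German-born theoretical physicist."}]
queries = [{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}]
qrels = [("q1", "doc1", 1)]  # (query-id, corpus-id, score)

with open("corpus.jsonl", "w") as f:
    f.writelines(json.dumps(doc) + "\n" for doc in corpus)

with open("queries.jsonl", "w") as f:
    f.writelines(json.dumps(q) + "\n" for q in queries)

with open("qrels.tsv", "w") as f:
    f.write("query-id\tcorpus-id\tscore\n")  # header row comes first
    f.writelines(f"{qid}\t{did}\t{score}\n" for qid, did, score in qrels)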
SQuAD#
squad.json (SQuAD)#
For SQuAD, the squad.json file contains question-answer pairs with their corresponding context passages in a structured format. It follows the official SQuAD dataset structure, with nested fields for data, paragraphs, and question-answer pairs.
| Field | Type | Required | Description |
|---|---|---|---|
| data | list of objects | Yes | List of data entries, each with paragraphs. |
| paragraphs | list of objects | Yes | List of paragraphs for each data entry. |
| context | string | Yes | Context passage for the questions. |
| document_id | string | Yes | Document identifier. |
| qas | list of objects | Yes | List of question-answer pairs. |
| question | string | Yes | The question being asked. |
| id | string | Yes | Unique identifier for the question. |
| answers | list of objects | Yes | List of answers, each with a text field. |
| text | string | Yes | The answer text (inside each answers entry). |
{
  "data": [
    {
      "paragraphs": [
        {
          "context": "my context",
          "document_id": "my id",
          "qas": [
            {
              "question": "my question",
              "id": "my id",
              "answers": [
                {"text": "my answer"}
              ]
            }
          ]
        }
      ]
    }
  ]
}
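A minimal sketch that assembles this nested structure from flat question, answer, and context records and writes it to squad.json (the record values are illustrative):

import json

records = [
    {"context": "my context", "document_id": "my id",
     "question": "my question", "qa_id": "my id", "answer": "my answer"},
]

squad = {"data": []}
for rec in records:
    squad["data"].append({
        "paragraphs": [{
            "context": rec["context"],
            "document_id": rec["document_id"],
            "qas": [{
                "question": rec["question"],
                "id": rec["qa_id"],
                "answers": [{"text": rec["answer"]}],
            }],
        }]
    })

with open("squad.json", "w") as f:
    json.dump(squad, f, indent=2)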
Ragas#
ragas.json (Ragas)#
For Ragas, the ragas.json file contains questions, contexts, answers, and ground truths for evaluating RAG systems. This format allows for comprehensive assessment of retrieval and generation quality.
| Field | Type | Required | Description |
|---|---|---|---|
| question | list of strings | Yes | List of questions. |
| contexts | list of lists of strings | No | List of context passages for each question; used for answer generation and some answer evaluation metrics. |
| answer | list of strings | No | List of predicted answers for each question; used for some answer evaluation metrics. |
| ground_truths | list of strings | No | List of ground truth answers for each question; used for some answer evaluation metrics. |
{
  "question": ["question #1", "question #2"],
  "contexts": [
    ["context #1 for question #1", "context #2 for question #1"],
    ["context #1 for question #2", "context #2 for question #2"]
  ],
  "answer": ["predicted answer for question #1", "predicted answer for question #2"],
  "ground_truths": ["ground truth answer for question #1", "ground truth answer for question #2"]
}
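A minimal sketch that builds ragas.json from per-question records and checks that every provided column stays aligned with the question list (the format is column-oriented):

import json

records = [
    {
        "question": "question #1",
        "contexts": ["context #1 for question #1", "context #2 for question #1"],
        "answer": "predicted answer for question #1",
        "ground_truth": "ground truth answer for question #1",
    },
]

ragas_data = {
    "question": [r["question"] for r in records],
    "contexts": [r["contexts"] for r in records],            # optional
    "answer": [r["answer"] for r in records],                # optional
    "ground_truths": [r["ground_truth"] for r in records],   # optional
}

# Every provided column must have exactly one entry per question.
for key, values in ragas_data.items():
    assert len(values) == len(ragas_data["question"]), f"length mismatch in '{key}'"

with open("ragas.json", "w") as f:
    json.dump(ragas_data, f, indent=2)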