RAG Evaluation Type#

RAG (Retrieval-Augmented Generation) evaluation types measure the effectiveness of pipelines that retrieve relevant documents and generate answers grounded in the retrieved content. Use this evaluation type to assess both retrieval quality and answer quality in RAG systems.

Options#

Retrieval + Answer Generation + Answer Evaluation (Standard Data)#

Example config:

{
    "type": "rag",
    "name": "rag-standard",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://nfcorpus/"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-judge-embedding-url>",
                        "model_id": "<my-judge-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}

Example data:

{
  "query": "What is the capital of France?",
  "retrieved_docs": [
    {"title": "France", "text": "Paris is the capital of France."}
  ],
  "reference": "Paris",
  "output": "Paris"
}

Example results:

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.9},
            "faithfulness": {"value": 1.0},
            "answer_relevancy": {"value": 1.0}
          }
        }
      }
    }
  }
}

Retrieval + Answer Generation + Answer Evaluation (Custom Data)#

Example config:

{
    "type": "rag",
    "name": "rag-custom",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-judge-embedding-url>",
                        "model_id": "<my-judge-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}

Example data:

{
  "query": "Who wrote Les Misérables?",
  "retrieved_docs": [
    {"title": "Les Misérables", "text": "Victor Hugo wrote Les Misérables."}
  ],
  "reference": "Victor Hugo",
  "output": "Victor Hugo"
}

Example results:

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.95},
            "faithfulness": {"value": 1.0},
            "answer_relevancy": {"value": 1.0}
          }
        }
      }
    }
  }
}

Answer Evaluation (Pre-generated Answers)#

Example config:

{
    "type": "rag",
    "name": "rag-ans-eval-pregen",
    "namespace": "my-organization",
    "tasks": {
        "my-ragas-task": {
            "type": "ragas",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-query-embedding-url>",
                        "model_id": "<my-query-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"}
            }
        }
    }
}

Example data:

{
  "query": "What is the main theme of Les Misérables?",
  "reference": "Redemption",
  "output": "Redemption"
}

Example results:

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.8},
            "faithfulness": {"value": 1.0}
          }
        }
      }
    }
  }
}

RAG (OpenAI-compatible Judge LLM)#

Example config:

{
    "type": "rag",
    "name": "rag-openai-judge",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://nfcorpus/"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-nim-deployment-base-url>/completions",
                        "model_id": "<my-model>",
                        "api_key": "<openai-api-key>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-query-embedding-url>",
                        "model_id": "<my-query-embedding-model>",
                        "api_key": "<openai-api-key>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}

Example data:

{
  "query": "What is the population of Paris?",
  "retrieved_docs": [
    {"title": "Paris", "text": "The population of Paris is over 2 million."}
  ],
  "reference": "over 2 million",
  "output": "over 2 million"
}

Example results:

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.92},
            "faithfulness": {"value": 1.0},
            "answer_relevancy": {"value": 1.0}
          }
        }
      }
    }
  }
}
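
Before launching an evaluation against an OpenAI-compatible judge, it can help to confirm that the judge endpoint is reachable and accepts the configured model ID. The sketch below is a minimal connectivity check, assuming the endpoint implements the standard OpenAI /completions request schema; the URL and model placeholders mirror the judge_llm.api_endpoint values in the config above, and the API key is read from an environment variable purely for illustration.

import os
import requests

# Placeholders mirror the judge_llm.api_endpoint values in the config above.
url = "<my-nim-deployment-base-url>/completions"
payload = {
    "model": "<my-model>",
    "prompt": "Answer in one word: what is the capital of France?",
    "max_tokens": 8,
}
headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

# Assumes an OpenAI-compatible /completions endpoint; adjust the path and
# payload if your judge exposes /chat/completions instead.
response = requests.post(url, json=payload, headers=headers, timeout=30)
response.raise_for_status()
print(response.json()["choices"][0]["text"])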

Metrics#

The RAG pipeline evaluation includes two categories of metrics: document retrieval and answer generation.

Document Retrieval#

The following table summarizes the available document retrieval metrics for RAG evaluation:

| Metric Name | Description | How k is set | Notes |
|---|---|---|---|
| recall@k | Fraction of relevant documents retrieved in the top k results (higher is better) | User sets k (1 to top_k) | top_k is the Retriever’s configured value |
| ndcg@k / ndcg_cut_k | Normalized discounted cumulative gain (nDCG): ranking quality in information retrieval (higher is better) | User sets k (1 to top_k) | nDCG is normalized for comparability |
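
As a concrete illustration of these two metrics, the sketch below scores a single query with binary relevance judgments, using the common DCG formulation rel_i / log2(i + 1). It is meant only to convey the arithmetic; the evaluator's own implementation (for example, via a trec_eval-style library) may differ in edge-case handling.

import math

def recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_doc_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_doc_ids)
    return hits / len(relevant_doc_ids)

def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """nDCG@k with binary relevance: DCG of the ranking over DCG of the ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # 0-based rank, so the discount is log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in relevant_doc_ids
    )
    ideal_hits = min(len(relevant_doc_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["doc3", "doc1", "doc7", "doc2", "doc9"]  # retriever output, best first
relevant = {"doc1", "doc2"}                        # relevant ids from qrels
print(recall_at_k(ranked, relevant, 5))  # 1.0: both relevant docs are in the top 5
print(ndcg_at_k(ranked, relevant, 5))    # ~0.65: relevant docs are not ranked first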

Answer Generation#

The following table summarizes the available answer generation metrics for RAG evaluation, including their requirements and dependencies:

| Metric Name | Description | Dataset Format(s) | Required Columns | Eval Config Model Endpoints |
|---|---|---|---|---|
| faithfulness | Factual consistency of the answer vs. context (0–1, higher is better) | beir, squad, ragas | question, answer, contexts | judge_llm |
| answer_relevancy | Relevancy of answer to prompt; penalizes incomplete/redundant answers | beir, squad, ragas | question, answer | judge_llm, judge_embeddings |
| answer_correctness | Accuracy vs. ground truth (0–1, higher is better) | ragas | question, answer, ground_truth | judge_llm, judge_embeddings |
| answer_similarity | Semantic similarity to ground truth (0–1, higher is better) | ragas | ground_truth, answer | judge_llm, judge_embeddings |
| context_precision | Precision of context ranking for ground-truth items (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |
| context_recall | Recall: does context align with ground-truth answer? (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |

Legend:

  • judge_llm: Metric uses a large language model as a judge.

  • judge_embeddings: Metric uses embedding-based similarity.
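
For intuition about the metrics that list judge_embeddings, the judge embedding model maps the answer and the reference text to vectors, and the score is driven by their cosine similarity. The sketch below is illustrative only, with made-up vectors standing in for embeddings returned by the configured judge_embeddings endpoint; the actual metric implementations add further prompting and normalization on top of this idea.

import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of a generated answer and a ground-truth answer.
answer_vec = [0.12, 0.48, 0.33, 0.80]
ground_truth_vec = [0.11, 0.50, 0.30, 0.81]
print(round(cosine_similarity(answer_vec, ground_truth_vec), 3))  # close to 1.0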

Custom Dataset Format#

BEIR#

corpus.jsonl (BEIR)#

For BEIR, the corpus.jsonl file contains one JSON object per line, each with the following fields:

| Field | Type | Required | Description |
|---|---|---|---|
| _id | string | Yes | Unique document identifier. |
| title | string | No | Document title (optional). |
| text | string | Yes | Document paragraph or passage. |

{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}

queries.jsonl (BEIR)#

The queries.jsonl file contains one JSON object per line, each with the following fields:

| Field | Type | Required | Description |
|---|---|---|---|
| _id | string | Yes | Unique query identifier. |
| text | string | Yes | Query text. |

{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}

qrels.tsv (BEIR)#

The qrels.tsv file is a tab-separated file with three columns: query-id, corpus-id, and score. The first row should be a header.

| Field | Type | Required | Description |
|---|---|---|---|
| query-id | string | Yes | Query identifier (matches _id in queries.jsonl). |
| corpus-id | string | Yes | Document identifier (matches _id in corpus.jsonl). |
| score | integer | Yes | Relevance score (typically 1 for relevant, 0 for not relevant). |

query-id	corpus-id	score
q1	doc1	1
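
Putting the three BEIR files together, the sketch below writes a minimal custom dataset in the layout described above. The output directory name is a placeholder, and the exact directory structure expected for files_url (for example, whether the qrels file lives in a subfolder) should match your deployment; the field names and the qrels header row follow the tables on this page.

import csv
import json
from pathlib import Path

out_dir = Path("my-beir-dataset")  # placeholder directory to point files_url at
out_dir.mkdir(parents=True, exist_ok=True)

corpus = [{"_id": "doc1", "title": "Albert Einstein",
           "text": "Albert Einstein was a German-born theoretical physicist."}]
queries = [{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}]
qrels = [("q1", "doc1", 1)]

# corpus.jsonl and queries.jsonl: one JSON object per line.
with (out_dir / "corpus.jsonl").open("w", encoding="utf-8") as f:
    for doc in corpus:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")

with (out_dir / "queries.jsonl").open("w", encoding="utf-8") as f:
    for query in queries:
        f.write(json.dumps(query, ensure_ascii=False) + "\n")

# qrels.tsv: tab-separated, with a header row as the first line.
with (out_dir / "qrels.tsv").open("w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["query-id", "corpus-id", "score"])
    writer.writerows(qrels)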

SQuAD#

squad.json (SQuAD)#

For SQuAD, the squad.json file contains question-answer pairs with their corresponding context passages in a structured format. It follows the official SQuAD dataset structure with nested fields for data, paragraphs, and question-answer pairs.

| Field | Type | Required | Description |
|---|---|---|---|
| data | list of objects | Yes | List of data entries, each with paragraphs. |
| paragraphs | list of objects | Yes | List of paragraphs for each data entry. |
| context | string | Yes | Context passage for the questions. |
| document_id | string | Yes | Document identifier. |
| qas | list of objects | Yes | List of question-answer pairs. |
| question | string | Yes | The question being asked. |
| id | string | Yes | Unique identifier for the question. |
| answers | list of objects | Yes | List of answers, each with a text field. |
| text | string | Yes | The answer text (inside answers). |

{
   "data": [
      {
         "paragraphs": [
            {
               "context": "my context", 
               "document_id": "my id", 
               "qas": [
                  {
                     "question": "my question", 
                     "id": "my id", 
                     "answers": [
                        {"text": "my answer"}
                     ]
                  }
               ]
            }
         ]
      }
   ]
}
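
As a convenience, the sketch below assembles the nested structure above from flat (document_id, context, question, answer) records and writes it to squad.json. The field names follow the table on this page; the records and question IDs are illustrative.

import json

# Illustrative flat records: (document_id, context, question, answer).
records = [
    ("doc1", "Albert Einstein was a German-born theoretical physicist.",
     "Where was Albert Einstein born?", "Germany"),
]

data = []
for document_id, context, question, answer in records:
    data.append({
        "paragraphs": [{
            "context": context,
            "document_id": document_id,
            "qas": [{
                "question": question,
                "id": f"{document_id}-q1",  # any unique question identifier
                "answers": [{"text": answer}],
            }],
        }]
    })

with open("squad.json", "w", encoding="utf-8") as f:
    json.dump({"data": data}, f, ensure_ascii=False, indent=2)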

Ragas#

ragas.json (Ragas)#

For Ragas, the ragas.json file contains questions, contexts, answers, and ground truths for evaluating RAG systems. This format allows for comprehensive assessment of retrieval and generation quality.

| Field | Type | Required | Description |
|---|---|---|---|
| question | list of strings | Yes | List of questions. |
| contexts | list of list of strings | No | List of context passages for each question. |
| answer | list of strings | No | List of predicted answers for each question. |
| ground_truths | list of strings | No | List of ground truth answers for each question. |

{
    "question": ["question #1", "question #2"],
    # Optional. Used for Answer Generation and Answer Evaluation (for some specific RAG metrics)
    "contexts": [["context #1 for question #1", "context #2 for question #1"], ["context #1 for question #2", "context #2 for question #2"]],  
    # Optional. Used for Answer Evaluation (for some specific RAG metrics)
    "answer": ["predicted answer for question #1", "predicted answer for question #2"],  
    # Optional. Used for Answer Evaluation (for some specific RAG metrics)
    "ground_truths": ["ground truth answer for question #1", "ground truth answer for question #2"]  
}
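
Note that the comments in the example above are annotations only; JSON itself does not allow comments. The sketch below writes a valid ragas.json with the same column-oriented layout, where question is required and the remaining fields are included only when the metrics you run need them. The example values are illustrative.

import json

ragas_data = {
    "question": ["Who wrote Les Misérables?"],
    # Optional fields; include them only when the selected metrics require them.
    "contexts": [["Victor Hugo wrote Les Misérables, published in 1862."]],
    "answer": ["Victor Hugo"],
    "ground_truths": ["Victor Hugo"],
}

with open("ragas.json", "w", encoding="utf-8") as f:
    json.dump(ragas_data, f, ensure_ascii=False, indent=2)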