RAG Evaluation Type#
The RAG (Retrieval-Augmented Generation) evaluation type measures the effectiveness of pipelines that retrieve relevant documents and generate answers from the retrieved content. Use this evaluation type to assess both retrieval quality and answer quality in RAG systems.
Prerequisites#
Before running RAG evaluations, ensure you have:
For custom datasets:
Uploaded your dataset to NeMo Data Store using the Hugging Face CLI or SDK
Registered your dataset in NeMo Entity Store using the Dataset APIs (a scripted sketch of both steps follows the tip below)
Formatted your data according to the RAG data format requirements (BEIR, SQuAD, or RAGAS)
For all RAG evaluations:
Access to embedding models for retrieval and evaluation
Judge LLM and embedding models for answer evaluation metrics
Proper API endpoints configured for your pipeline components
Tip
For a complete dataset creation walkthrough, see the dataset management tutorials or follow the end-to-end evaluation example.
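The two custom-dataset steps (upload, then register) can be scripted. The sketch below is illustrative only: it assumes a Hugging-Face-compatible Data Store endpoint and a /v1/datasets route on the Entity Store, and it uses placeholder URLs and names, so adjust it to your deployment and the Dataset API reference.
# Illustrative sketch only: the base URLs, token handling, and the /v1/datasets
# payload below are assumptions; confirm them against your deployment's
# Data Store and Entity Store API references.
import requests
from huggingface_hub import HfApi

DATASTORE_ENDPOINT = "http://nemo-data-store:3000/v1/hf"   # assumed Data Store HF-compatible endpoint
ENTITY_STORE_URL = "http://nemo-entity-store:8000"         # assumed Entity Store base URL
NAMESPACE, DATASET = "my-organization", "my-rag-dataset"

# 1. Upload the dataset files (for example corpus.jsonl, queries.jsonl, qrels.tsv).
hf_api = HfApi(endpoint=DATASTORE_ENDPOINT, token="")
hf_api.create_repo(repo_id=f"{NAMESPACE}/{DATASET}", repo_type="dataset", exist_ok=True)
hf_api.upload_folder(repo_id=f"{NAMESPACE}/{DATASET}", repo_type="dataset", folder_path="./my-beir-data")

# 2. Register the dataset so evaluation configs can reference it by files_url.
resp = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET,
        "namespace": NAMESPACE,
        "files_url": f"hf://datasets/{NAMESPACE}/{DATASET}",
    },
)
resp.raise_for_status()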
Authentication for External Services#
RAG evaluations support API key authentication for external services used in your pipeline components. This enables secure integration with third-party embedding models, reranking services, and LLMs.
Tip
For comprehensive authentication configuration examples and security best practices, refer to API Key Authentication.
Common Authentication Scenarios#
External embedding models (OpenAI, Cohere, etc.)
Third-party reranking services
External LLMs for answer generation
Judge models for evaluation metrics
Add the api_key field to any api_endpoint configuration:
{
"api_endpoint": {
"url": "https://api.openai.com/v1/embeddings",
"model_id": "text-embedding-3-large",
"api_key": "sk-your-openai-key"
}
}
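Avoid committing literal keys with your configuration files. One simple option (not required by the API) is to read the key from an environment variable when you build the payload, as in this sketch; the variable name is an assumption.
# Sketch: pull the key from the environment instead of hard-coding it.
import os

embedding_endpoint = {
    "api_endpoint": {
        "url": "https://api.openai.com/v1/embeddings",
        "model_id": "text-embedding-3-large",
        "api_key": os.environ["OPENAI_API_KEY"],  # assumed environment variable name
    }
}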
Options#
Retrieval + Answer Generation + Answer Evaluation (Standard Data)#
{
"type": "rag",
"name": "rag-standard",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "file://nfcorpus/"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-judge-embedding-url>",
"model_id": "<my-judge-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"retriever_recall_5": {"type": "pytrec_eval"},
"retriever_ndcg_cut_5": {"type": "pytrec_eval"},
"retriever_recall_10": {"type": "pytrec_eval"},
"retriever_ndcg_cut_10": {"type": "pytrec_eval"},
"rag_faithfulness": {"type": "ragas"},
"rag_answer_relevancy": {"type": "ragas"}
}
}
}
}
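If you drive the Evaluator API directly, a configuration like the one above is created with a single POST. The sketch below is a rough illustration; the base URL and the /v1/evaluation/configs route are assumptions to verify against your deployment's API reference.
# Sketch only: the base URL and route are assumptions.
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # assumed Evaluator base URL

config = {
    "type": "rag",
    "name": "rag-standard",
    "namespace": "my-organization",
    "tasks": {
        # ... tasks block from the example above ...
    },
}

resp = requests.post(f"{EVALUATOR_URL}/v1/evaluation/configs", json=config)
resp.raise_for_status()
print(resp.json())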
Example data:
{
"query": "What is the capital of France?",
"retrieved_docs": [
{"title": "France", "text": "Paris is the capital of France."}
],
"reference": "Paris",
"output": "Paris"
}
Example results:
{
"job": "eval-abc123def456",
"files_url": "hf://datasets/evaluation-results/eval-abc123def456",
"tasks": {
"my-beir-task": {
"metrics": {
"rag_answer_relevancy": {
"scores": {
"answer_relevancy": {
"value": 1.0,
"stats": {}
}
}
},
"rag_faithfulness": {
"scores": {
"faithfulness": {
"value": 1.0,
"stats": {}
}
}
},
"retriever_retriever.ndcg_cut_5": {
"scores": {
"ndcg_cut_5": {
"value": 0.9,
"stats": {}
}
}
},
"retriever_retriever.recall_5": {
"scores": {
"recall_5": {
"value": 1.0,
"stats": {}
}
}
}
}
}
},
"namespace": "my-organization",
"custom_fields": {}
}
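The results payload has the same shape for every task and metric, so a small helper can flatten it for reporting. The sketch below relies only on the structure shown in the example response above.
# Flatten a results payload like the one above into (task, metric, score, value) rows.
def iter_scores(results: dict):
    for task_name, task in results.get("tasks", {}).items():
        for metric_name, metric in task.get("metrics", {}).items():
            for score_name, score in metric.get("scores", {}).items():
                yield task_name, metric_name, score_name, score["value"]

# Usage: for row in iter_scores(results): print(*row)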
Retrieval + Answer Generation + Answer Evaluation (Custom Data)#
{
"type": "rag",
"name": "rag-custom",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-judge-embedding-url>",
"model_id": "<my-judge-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"retriever_recall_5": {"type": "pytrec_eval"},
"retriever_ndcg_cut_5": {"type": "pytrec_eval"},
"retriever_recall_10": {"type": "pytrec_eval"},
"retriever_ndcg_cut_10": {"type": "pytrec_eval"},
"rag_faithfulness": {"type": "ragas"},
"rag_answer_relevancy": {"type": "ragas"}
}
}
}
}
Example data:
{
"query": "Who wrote Les Misérables?",
"retrieved_docs": [
{"title": "Les Misérables", "text": "Victor Hugo wrote Les Misérables."}
],
"reference": "Victor Hugo",
"output": "Victor Hugo"
}
Example results:
{
"job": "eval-def789ghi012",
"files_url": "hf://datasets/evaluation-results/eval-def789ghi012",
"tasks": {
"my-beir-task": {
"metrics": {
"rag_answer_relevancy": {
"scores": {
"answer_relevancy": {
"value": 1.0,
"stats": {}
}
}
},
"rag_faithfulness": {
"scores": {
"faithfulness": {
"value": 1.0,
"stats": {}
}
}
},
"retriever_retriever.ndcg_cut_5": {
"scores": {
"ndcg_cut_5": {
"value": 0.95,
"stats": {}
}
}
},
"retriever_retriever.recall_5": {
"scores": {
"recall_5": {
"value": 1.0,
"stats": {}
}
}
}
}
}
},
"namespace": "my-organization",
"custom_fields": {}
}
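After a job is created against a config like this one, its progress and output are read back from the jobs API. The routes in the sketch below (/v1/evaluation/jobs/{id}/status and .../results) are assumptions; confirm them against your deployment's API reference.
# Sketch only: the base URL and job routes are assumptions.
import time
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # assumed Evaluator base URL
job_id = "eval-def789ghi012"                  # id returned when the job was created

while True:
    status = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}/status").json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(30)

results = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}/results").json()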
Answer Evaluation (Pre-generated Answers)#
{
"type": "rag",
"name": "rag-ans-eval-pregen",
"namespace": "my-organization",
"tasks": {
"my-ragas-task": {
"type": "ragas",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-query-embedding-url>",
"model_id": "<my-query-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"retriever_recall_5": {"type": "pytrec_eval"},
"retriever_ndcg_cut_5": {"type": "pytrec_eval"},
"retriever_recall_10": {"type": "pytrec_eval"},
"retriever_ndcg_cut_10": {"type": "pytrec_eval"},
"rag_faithfulness": {"type": "ragas"}
}
}
}
}
Example data:
{
"query": "What is the main theme of Les Misérables?",
"reference": "Redemption",
"output": "Redemption"
}
Example results:
{
"job": "eval-ghi345jkl678",
"files_url": "hf://datasets/evaluation-results/eval-ghi345jkl678",
"tasks": {
"my-ragas-task": {
"metrics": {
"rag_faithfulness": {
"scores": {
"faithfulness": {
"value": 1.0,
"stats": {}
}
}
},
"retriever_retriever.ndcg_cut_5": {
"scores": {
"ndcg_cut_5": {
"value": 0.8,
"stats": {}
}
}
},
"retriever_retriever.recall_5": {
"scores": {
"recall_5": {
"value": 1.0,
"stats": {}
}
}
}
}
}
},
"namespace": "my-organization",
"custom_fields": {}
}
RAG (OpenAI-compatible Judge LLM)#
{
"type": "rag",
"name": "rag-openai-judge",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "file://nfcorpus/"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-nim-deployment-base-url>/v1/completions",
"model_id": "<my-model>",
"api_key": "<openai-api-key>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-query-embedding-url>",
"model_id": "<my-query-embedding-model>",
"api_key": "<openai-api-key>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"retriever_recall_5": {"type": "pytrec_eval"},
"retriever_ndcg_cut_5": {"type": "pytrec_eval"},
"retriever_recall_10": {"type": "pytrec_eval"},
"retriever_ndcg_cut_10": {"type": "pytrec_eval"},
"rag_faithfulness": {"type": "ragas"},
"rag_answer_relevancy": {"type": "ragas"}
}
}
}
}
Example data:
{
"query": "What is the population of Paris?",
"retrieved_docs": [
{"title": "Paris", "text": "The population of Paris is over 2 million."}
],
"reference": "over 2 million",
"output": "over 2 million"
}
Example results:
{
"job": "eval-jkl901mno234",
"files_url": "hf://datasets/evaluation-results/eval-jkl901mno234",
"tasks": {
"my-beir-task": {
"metrics": {
"rag_answer_relevancy": {
"scores": {
"answer_relevancy": {
"value": 1.0,
"stats": {}
}
}
},
"rag_faithfulness": {
"scores": {
"faithfulness": {
"value": 1.0,
"stats": {}
}
}
},
"retriever_retriever.ndcg_cut_5": {
"scores": {
"ndcg_cut_5": {
"value": 0.92,
"stats": {}
}
}
},
"retriever_retriever.recall_5": {
"scores": {
"recall_5": {
"value": 1.0,
"stats": {}
}
}
}
}
}
},
"namespace": "my-organization",
"custom_fields": {}
}
RAG (using RAGAS NVIDIA Metrics)#
{
"target": {
"type": "rag",
"rag": {
"pipeline": {
"retriever": {
"pipeline": {
"query_embedding_model": {
"api_endpoint": {
"url": "<my-query-embedding-url>",
"model_id": "<my-query-embedding-model>",
"format": "nim"
}
},
"index_embedding_model": {
"api_endpoint": {
"url": "<my-index-embedding-url>",
"model_id": "<my-index-embedding-model>",
"format": "nim"
}
},
"top_k": 1
}
},
"model": {
"api_endpoint": {
"url": "<my-model-url>",
"model_id": "<my-model-id>",
"format": "nim"
}
}
}
}
},
"config": {
"type": "rag",
"tasks": {
"rag": {
"type": "ragas",
"dataset": {
"files_url": "<my-dataset-url>",
"format": "ragas"
},
"metrics": {
"retriever_recall_5": {
"type": "pytrec_eval"
},
"rag_answer_accuracy": {
"type": "ragas"
},
"rag_context_relevance": {
"type": "ragas"
},
"rag_response_groundedness": {
"type": "ragas"
}
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-judge-embedding-url>",
"model_id": "<my-judge-embedding-model>"
}
},
"judge_timeout": 120,
"judge_max_retries": 5,
"judge_max_workers": 24,
"judge_max_token": 2048,
"judge_llm_top_p": 1.0
}
}
}
}
}
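Unlike the earlier examples, this document bundles the target and the config in one payload. Assuming your deployment accepts that combined document as the body of POST /v1/evaluation/jobs (verify against the Evaluator API reference), submitting it could look like the following sketch.
# Sketch only: the base URL, route, and payload shape are assumptions.
import json
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # assumed Evaluator base URL

with open("rag_nv_metrics_job.json") as f:    # the combined target + config document shown above
    payload = json.load(f)

resp = requests.post(f"{EVALUATOR_URL}/v1/evaluation/jobs", json=payload)
resp.raise_for_status()
print(resp.json())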
Example data:
{
"question": "What are the key features of NVIDIA NIMs?",
"contexts": [
"NVIDIA NIMs are containerized microservices that provide optimized inference for AI models."
],
"answer": "NVIDIA NIMs offer containerized AI inference with optimized performance and scalability.",
"ground_truths": ["NVIDIA NIMs are optimized containerized inference services for AI models."]
}
Example results:
{
"tasks": {
"rag": {
"metrics": {
"rag_nv_accuracy": {
"scores": {
"nv_accuracy": {
"value": 0.95,
"stats": {}
}
}
},
"rag_nv_context_relevance": {
"scores": {
"nv_context_relevance": {
"value": 0.92,
"stats": {}
}
}
},
"rag_nv_response_groundedness": {
"scores": {
"nv_response_groundedness": {
"value": 0.98,
"stats": {}
}
}
},
"retriever_retriever.recall_5": {
"scores": {
"recall_5": {
"value": 1.0,
"stats": {}
}
}
}
}
}
}
}
Metrics#
The RAG pipeline evaluation includes two categories of metrics: document retrieval and answer generation.
Note
All metrics follow the rag_ prefix convention and use the ragas type unless otherwise specified. For retrieval metrics, use the retriever_ prefix with the pytrec_eval type.
Document Retrieval#
The following table summarizes the available document retrieval metrics for RAG evaluation:
| Metric Name | Description | How k is set | Notes |
|---|---|---|---|
| retriever_recall_k | Fraction of relevant documents retrieved in the top k results (higher is better) | User sets k (1 to top_k) | top_k is the retriever's configured value |
| retriever_ndcg_cut_k | Normalized discounted cumulative gain (nDCG): ranking quality in information retrieval (higher is better) | User sets k (1 to top_k) | nDCG is normalized for comparability |
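These retrieval scores are standard trec_eval measures. For reference, the sketch below reproduces recall@5 and nDCG@5 on a toy qrels/run pair with the pytrec_eval library; the toy data is made up, and the snippet is only an offline illustration of the measures, not how the service computes them.
# Standalone sketch: compute recall@5 and nDCG@5 for a toy example with pytrec_eval.
import pytrec_eval

# Relevance judgments (qrels): query id -> {document id: relevance}.
qrels = {"q1": {"doc1": 1, "doc2": 0}}

# Retriever output (run): query id -> {document id: retrieval score}.
run = {"q1": {"doc1": 0.9, "doc2": 0.4, "doc3": 0.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"recall", "ndcg_cut"})
scores = evaluator.evaluate(run)["q1"]
print(scores["recall_5"], scores["ndcg_cut_5"])  # 1.0 1.0 for this toy example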
Answer Generation#
The following table summarizes the available answer generation metrics for RAG evaluation, including their requirements and dependencies:
| Metric Name | Description | Dataset Format(s) | Required Columns | Eval Config Model Endpoints |
|---|---|---|---|---|
| rag_faithfulness | Factual consistency of the answer vs. context (0–1, higher is better) | beir, squad, ragas | question, answer, contexts | judge_llm |
| rag_answer_relevancy | Relevancy of the answer to the prompt; penalizes incomplete or redundant answers | beir, squad, ragas | question, answer | judge_llm, judge_embeddings |
| rag_answer_correctness | Accuracy vs. ground truth (0–1, higher is better) | ragas | question, answer, ground_truth | judge_llm, judge_embeddings |
| rag_answer_similarity | Semantic similarity to ground truth (0–1, higher is better) | ragas | ground_truth, answer | judge_llm, judge_embeddings |
| rag_context_precision | Precision of context ranking for ground-truth items (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |
| rag_context_recall | Recall: does the context align with the ground-truth answer? (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |
| rag_answer_accuracy | Agreement between model response and reference ground truth via dual LLM-as-a-judge evaluation (0, 1, 2; higher is better) | ragas | question, answer, ground_truth | judge_llm, judge_embeddings |
| rag_context_relevance | Evaluates whether retrieved contexts are pertinent to the user input via dual LLM-as-a-judge assessment (0, 1, 2; higher is better) | ragas | question, contexts | judge_llm |
| rag_response_groundedness | Measures how well response claims are supported by the retrieved contexts and can be found within them (0, 1, 2; higher is better) | ragas | question, answer, contexts | judge_llm |
| rag_context_entity_recall | Recall of entities in the context compared to ground truth (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |
| rag_noise_sensitivity | Measures robustness to irrelevant context (0–1, lower is better) | ragas | question, answer, contexts | judge_llm |
| rag_response_relevancy | Overall relevancy of the response to the query (0–1, higher is better) | ragas | question, answer | judge_llm, judge_embeddings |
Legend:
judge_llm: Metric uses a large language model as a judge.
judge_embeddings: Metric uses embedding-based similarity.
Custom Dataset Format#
BEIR#
The BEIR (Benchmarking Information Retrieval) framework supports various datasets for evaluating retrieval systems. Supported BEIR datasets include:
fiqa - Financial question answering dataset
nfcorpus - Natural language corpus for biomedical information retrieval
scidocs - Scientific document retrieval and citation recommendation
scifact - Scientific fact verification dataset
Note
For a complete list of available BEIR datasets, refer to the BEIR repository.
corpus.jsonl (BEIR)#
For BEIR, the corpus.jsonl file contains a list of dictionaries with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| _id | string | Yes | Unique document identifier. |
| title | string | No | Document title (optional). |
| text | string | Yes | Document paragraph or passage. |
{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
queries.jsonl (BEIR)#
The queries.jsonl file contains a list of dictionaries with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| _id | string | Yes | Unique query identifier. |
| text | string | Yes | Query text. |
{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
qrels.tsv (BEIR)#
The qrels.tsv file is a tab-separated file with three columns: query-id, corpus-id, and score. The first row should be a header.
| Field | Type | Required | Description |
|---|---|---|---|
| query-id | string | Yes | Query identifier (matches _id in queries.jsonl). |
| corpus-id | string | Yes | Document identifier (matches _id in corpus.jsonl). |
| score | integer | Yes | Relevance score (typically 1 for relevant, 0 for not relevant). |
query-id corpus-id score
q1 doc1 1
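Putting the three files together, the sketch below writes a minimal BEIR-formatted dataset directory (corpus.jsonl, queries.jsonl, and qrels.tsv as described above) that can then be uploaded to the Data Store; the directory name is an arbitrary choice.
# Write a minimal BEIR-formatted dataset directory using the file layouts above.
import json
from pathlib import Path

out = Path("my-beir-data")  # arbitrary output directory
out.mkdir(exist_ok=True)

corpus = [{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born..."}]
queries = [{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}]
qrels = [("q1", "doc1", 1)]

with (out / "corpus.jsonl").open("w") as f:
    f.writelines(json.dumps(doc) + "\n" for doc in corpus)
with (out / "queries.jsonl").open("w") as f:
    f.writelines(json.dumps(query) + "\n" for query in queries)
with (out / "qrels.tsv").open("w") as f:
    f.write("query-id\tcorpus-id\tscore\n")
    f.writelines(f"{q}\t{d}\t{s}\n" for q, d, s in qrels)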
SQuAD#
squad.json (SQuAD)#
For SQuAD, the squad.json file contains question-answer pairs with their corresponding context passages in a structured format. It follows the official SQuAD dataset structure with nested fields for data, paragraphs, and question-answer pairs.
| Field | Type | Required | Description |
|---|---|---|---|
| data | list of objects | Yes | List of data entries, each with paragraphs. |
| paragraphs | list of objects | Yes | List of paragraphs for each data entry. |
| context | string | Yes | Context passage for the questions. |
| document_id | string | Yes | Document identifier. |
| qas | list of objects | Yes | List of question-answer pairs. |
| question | string | Yes | The question being asked. |
| id | string | Yes | Unique identifier for the question. |
| answers | list of objects | Yes | List of answers, each with a text field. |
| text | string | Yes | The answer text (inside answers). |
{
"data": [
{
"paragraphs": [
{
"context": "my context",
"document_id": "my id",
"qas": [
{
"question": "my question",
"id": "my id",
"answers": [
{"text": "my answer"}
]
}
]
}
]
}
]
}
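If your question-answer pairs start out as flat rows, a small helper can nest them into this structure. The sketch below is hypothetical; its row layout and output file name are assumptions.
# Nest flat (document_id, context, question, answer) rows into the SQuAD structure above.
import json
from collections import defaultdict

rows = [
    ("my id", "my context", "my question", "my answer"),
]

paragraphs = defaultdict(list)
for i, (doc_id, context, question, answer) in enumerate(rows):
    paragraphs[(doc_id, context)].append(
        {"question": question, "id": f"q{i}", "answers": [{"text": answer}]}
    )

squad = {
    "data": [
        {"paragraphs": [{"context": context, "document_id": doc_id, "qas": qas}]}
        for (doc_id, context), qas in paragraphs.items()
    ]
}

with open("squad.json", "w") as f:
    json.dump(squad, f, indent=2, ensure_ascii=False)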
Ragas#
ragas.json (Ragas)#
For Ragas, the ragas.json file contains questions, contexts, answers, and ground truths for evaluating RAG systems. This format allows for comprehensive assessment of retrieval and generation quality.
| Field | Type | Required | Description |
|---|---|---|---|
| question | list of strings | Yes | List of questions. |
| contexts | list of list of strings | No | List of context passages for each question. |
| answer | list of strings | No | List of predicted answers for each question. |
| ground_truths | list of strings | No | List of ground truth answers for each question. |
{
"question": ["question #1", "question #2"],
# Optional. Used for Answer Generation and Answer Evaluation (for some specific RAG metrics)
"contexts": [["context #1 for question #1", "context #2 for question #1"], ["context #1 for question #2", "context #2 for question #2"]],
# Optional. Used for Answer Evaluation (for some specific RAG metrics)
"answer": ["predicted answer for question #1", "predicted answer for question #2"],
# Optional. Used for Answer Evaluation (for some specific RAG metrics)
"ground_truths": ["ground truth answer for question #1", "ground truth answer for question #2"]
}
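Because every column is a parallel list indexed by question, a quick length check before writing the file catches most formatting mistakes. A minimal sketch (the output file name is an assumption):
# Assemble ragas.json from parallel lists and check that optional columns line up.
import json

record = {
    "question": ["question #1", "question #2"],
    "contexts": [["context #1 for question #1"], ["context #1 for question #2"]],
    "answer": ["predicted answer for question #1", "predicted answer for question #2"],
    "ground_truths": ["ground truth answer for question #1", "ground truth answer for question #2"],
}

n_questions = len(record["question"])
for column in ("contexts", "answer", "ground_truths"):
    if column in record:
        assert len(record[column]) == n_questions, f"{column} must have one entry per question"

with open("ragas.json", "w") as f:
    json.dump(record, f, indent=2, ensure_ascii=False)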