RAG Evaluation Type#

RAG (Retrieval Augmented Generation) evaluation types are designed to measure the effectiveness of pipelines that retrieve relevant documents and generate answers based on retrieved content. Use this evaluation type to assess both retrieval and answer quality in RAG systems.

Prerequisites#

Before running RAG evaluations, ensure you have:

For custom datasets:

  • A dataset uploaded and reachable through a files_url (for example, an hf://datasets/<namespace>/<name>/<path> location or a local file:// path)

For all RAG evaluations:

  • Access to embedding models for retrieval and evaluation

  • Judge LLM and embedding models for answer evaluation metrics

  • Proper API endpoints configured for your pipeline components

Tip

For a complete dataset creation walkthrough, see the dataset management tutorials or follow the end-to-end evaluation example.


Authentication for External Services#

RAG evaluations support API key authentication for external services used in your pipeline components. This enables secure integration with third-party embedding models, reranking services, and LLMs.

Tip

For comprehensive authentication configuration examples and security best practices, refer to API Key Authentication.

Common Authentication Scenarios#

  • External embedding models (OpenAI, Cohere, etc.)

  • Third-party reranking services

  • External LLMs for answer generation

  • Judge models for evaluation metrics

Add the api_key field to any api_endpoint configuration:

{
  "api_endpoint": {
    "url": "https://api.openai.com/v1/embeddings",
    "model_id": "text-embedding-3-large",
    "api_key": "sk-your-openai-key"
  }
}
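
The same api_key field can be supplied on the judge_llm and judge_embeddings endpoints, as shown in the OpenAI-compatible judge example below.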

Options#

Retrieval + Answer Generation + Answer Evaluation (Standard Data)#
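
The following configuration runs retrieval and answer generation over the standard nfcorpus BEIR dataset, scoring retrieval with pytrec_eval metrics and answers with Ragas metrics through the configured judge models: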

{
    "type": "rag",
    "name": "rag-standard",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://nfcorpus/"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-judge-embedding-url>",
                        "model_id": "<my-judge-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "retriever_recall_5": {"type": "pytrec_eval"},
                "retriever_ndcg_cut_5": {"type": "pytrec_eval"},
                "retriever_recall_10": {"type": "pytrec_eval"},
                "retriever_ndcg_cut_10": {"type": "pytrec_eval"},
                "rag_faithfulness": {"type": "ragas"},
                "rag_answer_relevancy": {"type": "ragas"}
            }
        }
    }
}
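
An example record, showing the query, the retrieved documents, the reference answer, and the generated output:
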
{
  "query": "What is the capital of France?",
  "retrieved_docs": [
    {"title": "France", "text": "Paris is the capital of France."}
  ],
  "reference": "Paris",
  "output": "Paris"
}
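
Example results for a completed job:
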
{
  "job": "eval-abc123def456",
  "files_url": "hf://datasets/evaluation-results/eval-abc123def456",
  "tasks": {
    "my-beir-task": {
      "metrics": {
        "rag_answer_relevancy": {
          "scores": {
            "answer_relevancy": {
              "value": 1.0,
              "stats": {}
            }
          }
        },
        "rag_faithfulness": {
          "scores": {
            "faithfulness": {
              "value": 1.0,
              "stats": {}
            }
          }
        },
        "retriever_retriever.ndcg_cut_5": {
          "scores": {
            "ndcg_cut_5": {
              "value": 0.9,
              "stats": {}
            }
          }
        },
        "retriever_retriever.recall_5": {
          "scores": {
            "recall_5": {
              "value": 1.0,
              "stats": {}
            }
          }
        }
      }
    }
  },
  "namespace": "my-organization",
  "custom_fields": {}
}

Retrieval + Answer Generation + Answer Evaluation (Custom Data)#
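
This configuration mirrors the standard-data example, but points the BEIR task at a custom dataset referenced by an hf://datasets/... files_url: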

{
    "type": "rag",
    "name": "rag-custom",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-judge-embedding-url>",
                        "model_id": "<my-judge-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "retriever_recall_5": {"type": "pytrec_eval"},
                "retriever_ndcg_cut_5": {"type": "pytrec_eval"},
                "retriever_recall_10": {"type": "pytrec_eval"},
                "retriever_ndcg_cut_10": {"type": "pytrec_eval"},
                "rag_faithfulness": {"type": "ragas"},
                "rag_answer_relevancy": {"type": "ragas"}
            }
        }
    }
}
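
An example record from the custom dataset, showing the query, retrieved documents, reference answer, and generated output:
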
{
  "query": "Who wrote Les Misérables?",
  "retrieved_docs": [
    {"title": "Les Misérables", "text": "Victor Hugo wrote Les Misérables."}
  ],
  "reference": "Victor Hugo",
  "output": "Victor Hugo"
}
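
Example results for a completed job:
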
{
  "job": "eval-def789ghi012",
  "files_url": "hf://datasets/evaluation-results/eval-def789ghi012",
  "tasks": {
    "my-beir-task": {
      "metrics": {
        "rag_answer_relevancy": {
          "scores": {
            "answer_relevancy": {
              "value": 1.0,
              "stats": {}
            }
          }
        },
        "rag_faithfulness": {
          "scores": {
            "faithfulness": {
              "value": 1.0,
              "stats": {}
            }
          }
        },
        "retriever_retriever.ndcg_cut_5": {
          "scores": {
            "ndcg_cut_5": {
              "value": 0.95,
              "stats": {}
            }
          }
        },
        "retriever_retriever.recall_5": {
          "scores": {
            "recall_5": {
              "value": 1.0,
              "stats": {}
            }
          }
        }
      }
    }
  },
  "namespace": "my-organization",
  "custom_fields": {}
}

Answer Evaluation (Pre-generated Answers)#
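
This configuration uses a ragas task to evaluate answers that were generated ahead of time and stored in the dataset: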

{
    "type": "rag",
    "name": "rag-ans-eval-pregen",
    "namespace": "my-organization",
    "tasks": {
        "my-ragas-task": {
            "type": "ragas",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-query-embedding-url>",
                        "model_id": "<my-query-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "retriever_recall_5": {"type": "pytrec_eval"},
                "retriever_ndcg_cut_5": {"type": "pytrec_eval"},
                "retriever_recall_10": {"type": "pytrec_eval"},
                "retriever_ndcg_cut_10": {"type": "pytrec_eval"},
                "rag_faithfulness": {"type": "ragas"}
            }
        }
    }
}
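
An example record with a query, a reference answer, and a pre-generated output:
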
{
  "query": "What is the main theme of Les Misérables?",
  "reference": "Redemption",
  "output": "Redemption"
}
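
Example results for a completed job:
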
{
  "job": "eval-ghi345jkl678",
  "files_url": "hf://datasets/evaluation-results/eval-ghi345jkl678",
  "tasks": {
    "my-ragas-task": {
      "metrics": {
        "rag_faithfulness": {
          "scores": {
            "faithfulness": {
              "value": 1.0,
              "stats": {}
            }
          }
        },
        "retriever_retriever.ndcg_cut_5": {
          "scores": {
            "ndcg_cut_5": {
              "value": 0.8,
              "stats": {}
            }
          }
        },
        "retriever_retriever.recall_5": {
          "scores": {
            "recall_5": {
              "value": 1.0,
              "stats": {}
            }
          }
        }
      }
    }
  },
  "namespace": "my-organization",
  "custom_fields": {}
}

RAG (OpenAI-compatible Judge LLM)#
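
This configuration supplies an api_key on both the judge LLM and judge embedding endpoints so that OpenAI-compatible services can act as judges: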

{
    "type": "rag",
    "name": "rag-openai-judge",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://nfcorpus/"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-nim-deployment-base-url>/v1/completions",
                        "model_id": "<my-model>",
                        "api_key": "<openai-api-key>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-query-embedding-url>",
                        "model_id": "<my-query-embedding-model>",
                        "api_key": "<openai-api-key>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "retriever_recall_5": {"type": "pytrec_eval"},
                "retriever_ndcg_cut_5": {"type": "pytrec_eval"},
                "retriever_recall_10": {"type": "pytrec_eval"},
                "retriever_ndcg_cut_10": {"type": "pytrec_eval"},
                "rag_faithfulness": {"type": "ragas"},
                "rag_answer_relevancy": {"type": "ragas"}
            }
        }
    }
}
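
An example record, showing the query, retrieved documents, reference answer, and generated output:
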
{
  "query": "What is the population of Paris?",
  "retrieved_docs": [
    {"title": "Paris", "text": "The population of Paris is over 2 million."}
  ],
  "reference": "over 2 million",
  "output": "over 2 million"
}
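
Example results for a completed job:
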
{
  "job": "eval-jkl901mno234",
  "files_url": "hf://datasets/evaluation-results/eval-jkl901mno234",
  "tasks": {
    "my-beir-task": {
      "metrics": {
        "rag_answer_relevancy": {
          "scores": {
            "answer_relevancy": {
              "value": 1.0,
              "stats": {}
            }
          }
        },
        "rag_faithfulness": {
          "scores": {
            "faithfulness": {
              "value": 1.0,
              "stats": {}
            }
          }
        },
        "retriever_retriever.ndcg_cut_5": {
          "scores": {
            "ndcg_cut_5": {
              "value": 0.92,
              "stats": {}
            }
          }
        },
        "retriever_retriever.recall_5": {
          "scores": {
            "recall_5": {
              "value": 1.0,
              "stats": {}
            }
          }
        }
      }
    }
  },
  "namespace": "my-organization",
  "custom_fields": {}
}

RAG (using RAGAS NVIDIA Metrics)#
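
This example shows a full evaluation payload: the target describes the RAG pipeline (query and index embedding models plus the answer-generation model), and the config requests the Ragas NVIDIA metrics (rag_answer_accuracy, rag_context_relevance, rag_response_groundedness) alongside a retrieval metric: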

{
  "target": {
    "type": "rag",
    "rag": {
      "pipeline": {
        "retriever": {
          "pipeline": {
            "query_embedding_model": {
              "api_endpoint": {
                "url": "<my-query-embedding-url>",
                "model_id": "<my-query-embedding-model>",
                "format": "nim"
              }
            },
            "index_embedding_model": {
              "api_endpoint": {
                "url": "<my-index-embedding-url>",
                "model_id": "<my-index-embedding-model>",
                "format": "nim"
              }
            },
            "top_k": 1
          }
        },
        "model": {
          "api_endpoint": {
            "url": "<my-model-url>",
            "model_id": "<my-model-id>",
            "format": "nim"
          }
        }
      }
    }
  },
  "config": {
    "type": "rag",
    "tasks": {
      "rag": {
        "type": "ragas",
        "dataset": {
          "files_url": "<my-dataset-url>",
          "format": "ragas"
        },
        "metrics": {
           "retriever_recall_5": {
             "type": "pytrec_eval"
           },
           "rag_answer_accuracy": {
             "type": "ragas"
           },
           "rag_context_relevance": {
             "type": "ragas"
           },
           "rag_response_groundedness": {
             "type": "ragas"
           }
         },
        "params": {
          "judge_llm": {
            "api_endpoint": {
              "url": "<my-judge-llm-url>",
              "model_id": "<my-judge-llm-model>"
            }
          },
          "judge_embeddings": {
            "api_endpoint": {
              "url": "<my-judge-embedding-url>",
              "model_id": "<my-judge-embedding-model>"
            }
          },
          "judge_timeout": 120,
          "judge_max_retries": 5,
          "judge_max_workers": 24,
          "judge_max_token": 2048,
          "judge_llm_top_p": 1.0
        }
      }
    }
  }
}
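
An example record with a question, its retrieved contexts, the generated answer, and the ground truth:
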
{
  "question": "What are the key features of NVIDIA NIMs?",
  "contexts": [
    "NVIDIA NIMs are containerized microservices that provide optimized inference for AI models."
  ],
  "answer": "NVIDIA NIMs offer containerized AI inference with optimized performance and scalability.",
  "ground_truths": ["NVIDIA NIMs are optimized containerized inference services for AI models."]
}
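
Example results, reporting the Ragas NVIDIA metric scores:
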
{
  "tasks": {
    "rag": {
      "metrics": {
        "rag_nv_accuracy": {
          "scores": {
            "nv_accuracy": {
              "value": 0.95,
              "stats": {}
            }
          }
        },
        "rag_nv_context_relevance": {
          "scores": {
            "nv_context_relevance": {
              "value": 0.92,
              "stats": {}
            }
          }
        },
        "rag_nv_response_groundedness": {
          "scores": {
            "nv_response_groundedness": {
              "value": 0.98,
              "stats": {}
            }
          }
        },
        "retriever_retriever.recall_5": {
          "scores": {
            "recall_5": {
              "value": 1.0,
              "stats": {}
            }
          }
        }
      }
    }
  }
}

Metrics#

The RAG pipeline evaluation includes two categories of metrics: document retrieval and answer generation.

Note

Answer generation metrics use the rag_ prefix and the ragas type unless otherwise specified; retrieval metrics use the retriever_ prefix with the pytrec_eval type.

Document Retrieval#

The following table summarizes the available document retrieval metrics for RAG evaluation:

| Metric Name | Description | How k is set | Notes |
| --- | --- | --- | --- |
| recall@k | Fraction of relevant documents retrieved in the top k results (higher is better) | User sets k (1 to top_k) | top_k is the Retriever’s configured value |
| ndcg@k / ndcg_cut_k | Normalized discounted cumulative gain (nDCG): ranking quality in information retrieval (higher is better) | User sets k (1 to top_k) | nDCG is normalized for comparability |

Answer Generation#

The following table summarizes the available answer generation metrics for RAG evaluation, including their requirements and dependencies:

| Metric Name | Description | Dataset Format(s) | Required Columns | Eval Config Model Endpoints |
| --- | --- | --- | --- | --- |
| rag_faithfulness | Factual consistency of the answer vs. context (0–1, higher is better) | beir, squad, ragas | question, answer, contexts | judge_llm |
| rag_answer_relevancy | Relevancy of answer to prompt; penalizes incomplete/redundant answers | beir, squad, ragas | question, answer | judge_llm, judge_embeddings |
| rag_answer_correctness | Accuracy vs. ground truth (0–1, higher is better) | ragas | question, answer, ground_truth | judge_llm, judge_embeddings |
| rag_answer_similarity | Semantic similarity to ground truth (0–1, higher is better) | ragas | ground_truth, answer | judge_llm, judge_embeddings |
| rag_context_precision | Precision of context ranking for ground-truth items (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |
| rag_context_recall | Recall: does context align with ground-truth answer? (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |
| rag_answer_accuracy | Agreement between model response and reference ground truth via dual LLM-as-a-judge evaluation (0, 1, 2; higher is better) | ragas | question, answer, ground_truth | judge_llm, judge_embeddings |
| rag_context_relevance | Evaluates whether retrieved contexts are pertinent to user input via dual LLM-as-a-judge assessment (0, 1, 2; higher is better) | ragas | question, contexts | judge_llm |
| rag_response_groundedness | Measures how well response claims are supported by retrieved contexts and can be found within them (0, 1, 2; higher is better) | ragas | question, answer, contexts | judge_llm |
| rag_context_entity_recall | Recall of entities in context compared to ground truth (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |
| rag_noise_sensitivity | Measures robustness to irrelevant context (0–1, lower is better) | ragas | question, answer, contexts | judge_llm |
| rag_response_relevancy | Overall relevancy of the response to the query (0–1, higher is better) | ragas | question, answer | judge_llm, judge_embeddings |

Legend:

  • judge_llm: Metric uses a large language model as a judge.

  • judge_embeddings: Metric uses embedding-based similarity.
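
As a sketch, a task for a ragas-format dataset that includes ground_truth values could combine the ground-truth-based metrics from the table above. The task name, dataset URL, and judge endpoints below are placeholders; note that rag_answer_correctness and rag_answer_similarity also rely on judge_embeddings, while rag_context_precision and rag_context_recall need only judge_llm:

"my-ragas-groundtruth-task": {
    "type": "ragas",
    "dataset": {
        "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>",
        "format": "ragas"
    },
    "params": {
        "judge_llm": {
            "api_endpoint": {
                "url": "<my-judge-llm-url>",
                "model_id": "<my-judge-llm-model>"
            }
        },
        "judge_embeddings": {
            "api_endpoint": {
                "url": "<my-judge-embedding-url>",
                "model_id": "<my-judge-embedding-model>"
            }
        }
    },
    "metrics": {
        "rag_answer_correctness": {"type": "ragas"},
        "rag_answer_similarity": {"type": "ragas"},
        "rag_context_precision": {"type": "ragas"},
        "rag_context_recall": {"type": "ragas"}
    }
}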

Custom Dataset Format#

BEIR#

The BEIR (Benchmarking Information Retrieval) framework supports various datasets for evaluating retrieval systems. Supported BEIR datasets include:

  • fiqa - Financial question answering dataset

  • nfcorpus - Natural language corpus for biomedical information retrieval

  • scidocs - Scientific document retrieval and citation recommendation

  • scifact - Scientific fact verification dataset

Note

For a complete list of available BEIR datasets, refer to the BEIR repository.

corpus.jsonl (BEIR)#

For BEIR, the corpus.jsonl file contains one JSON object per line (one per document), with the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| _id | string | Yes | Unique document identifier. |
| title | string | No | Document title (optional). |
| text | string | Yes | Document paragraph or passage. |

{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}

queries.jsonl (BEIR)#

The queries.jsonl file contains one JSON object per line (one per query), with the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| _id | string | Yes | Unique query identifier. |
| text | string | Yes | Query text. |

{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}

qrels.tsv (BEIR)#

The qrels.tsv file is a tab-separated file with three columns: query-id, corpus-id, and score. The first row should be a header.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query-id | string | Yes | Query identifier (matches _id in queries.jsonl). |
| corpus-id | string | Yes | Document identifier (matches _id in corpus.jsonl). |
| score | integer | Yes | Relevance score (typically 1 for relevant, 0 for not relevant). |

query-id	corpus-id	score
q1	doc1	1

SQuAD#

squad.json (SQuAD)#

For SQuAD, the squad.json file contains question-answer pairs with their corresponding context passages in a structured format. It follows the official SQuAD dataset structure with nested fields for data, paragraphs, and question-answer pairs.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| data | list of objects | Yes | List of data entries, each with paragraphs. |
| paragraphs | list of objects | Yes | List of paragraphs for each data entry. |
| context | string | Yes | Context passage for the questions. |
| document_id | string | Yes | Document identifier. |
| qas | list of objects | Yes | List of question-answer pairs. |
| question | string | Yes | The question being asked. |
| id | string | Yes | Unique identifier for the question. |
| answers | list of objects | Yes | List of answers, each with a text field. |
| text | string | Yes | The answer text (inside answers). |

{
   "data": [
      {
         "paragraphs": [
            {
               "context": "my context", 
               "document_id": "my id", 
               "qas": [
                  {
                     "question": "my question", 
                     "id": "my id", 
                     "answers": [
                        {"text": "my answer"}
                     ]
                  }
               ]
            }
         ]
      }
   ]
}

Ragas#

ragas.json (Ragas)#

For Ragas, the ragas.json file contains questions, contexts, answers, and ground truths for evaluating RAG systems. This format allows for comprehensive assessment of retrieval and generation quality.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| question | list of strings | Yes | List of questions. |
| contexts | list of list of strings | No | List of context passages for each question. |
| answer | list of strings | No | List of predicted answers for each question. |
| ground_truths | list of strings | No | List of ground truth answers for each question. |

{
    "question": ["question #1", "question #2"],
    "contexts": [["context #1 for question #1", "context #2 for question #1"], ["context #1 for question #2", "context #2 for question #2"]],
    "answer": ["predicted answer for question #1", "predicted answer for question #2"],
    "ground_truths": ["ground truth answer for question #1", "ground truth answer for question #2"]
}

The contexts field is optional and is used for answer generation and answer evaluation (for some specific RAG metrics). The answer and ground_truths fields are optional and are used for answer evaluation (for some specific RAG metrics).