RAG Evaluation Flow#
RAG (Retrieval-Augmented Generation) evaluations measure the effectiveness of pipelines that retrieve relevant documents and generate answers grounded in the retrieved content. Use this evaluation type to assess both retrieval quality and answer quality in RAG systems.
Prerequisites#
Before running RAG evaluations, ensure you have:
- Access to embedding models for retrieval and evaluation
- A judge LLM and embedding models for answer evaluation metrics
- Proper API endpoints configured for your pipeline components
Custom Datasets#
1. Upload your dataset to NeMo Data Store using the Hugging Face CLI or SDK.
2. Register your dataset in NeMo Entity Store using the Dataset APIs.
3. Format your data according to the RAG data format requirements (BEIR, SQuAD, or RAGAS).
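If you script these steps, the sketch below shows the general shape of the workflow: it uploads files through the Data Store's Hugging Face-compatible endpoint and registers the dataset with a plain HTTP call to the Entity Store. The base URLs, namespace, dataset name, and the exact Entity Store payload are placeholders and assumptions; consult the Dataset APIs for the authoritative request format.

import requests
from huggingface_hub import HfApi

# Placeholder endpoints -- substitute the Data Store and Entity Store URLs for your deployment.
DATASTORE_URL = "http://<nemo-data-store>/v1/hf"
ENTITY_STORE_URL = "http://<nemo-entity-store>"
NAMESPACE = "my-organization"
DATASET_NAME = "my-rag-dataset"
repo_id = f"{NAMESPACE}/{DATASET_NAME}"

# 1. Upload dataset files through the Data Store's Hugging Face-compatible API.
hf_api = HfApi(endpoint=DATASTORE_URL, token="")
hf_api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
hf_api.upload_file(
    path_or_fileobj="ragas.json",
    path_in_repo="ragas.json",
    repo_id=repo_id,
    repo_type="dataset",
)

# 2. Register the dataset with the Entity Store so evaluation configs can reference it.
resp = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NAMESPACE,
        "files_url": f"hf://datasets/{repo_id}",
    },
)
resp.raise_for_status()
print(resp.json())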
Tip
For a complete dataset creation walkthrough, see the dataset management tutorials or follow the end-to-end evaluation example.
Authentication for External Services#
Use API keys to authenticate to external providers in your pipeline components, such as embedding models, reranking services, LLMs, and judge models.
Tip
For comprehensive authentication configuration examples and security best practices, refer to API Key Authentication.
Add `api_key` to an `api_endpoint` configuration:
{
"api_endpoint": {
"url": "https://api.openai.com/v1/embeddings",
"model_id": "text-embedding-3-large",
"api_key": "sk-your-openai-key"
}
}
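To keep keys out of version control, you can read them from the environment when assembling the configuration. A minimal sketch (the environment variable name is an arbitrary choice):

import json
import os

# Read the provider key from the environment instead of hard-coding it in the config.
api_key = os.environ["OPENAI_API_KEY"]

endpoint_config = {
    "api_endpoint": {
        "url": "https://api.openai.com/v1/embeddings",
        "model_id": "text-embedding-3-large",
        "api_key": api_key,
    }
}
print(json.dumps(endpoint_config, indent=2))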
Retrieval + Answer Generation + Answer Evaluation (Standard Data)#
{
"type": "rag",
"name": "rag-standard",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "file://nfcorpus/"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-judge-embedding-url>",
"model_id": "<my-judge-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"retriever_recall_5": {"type": "pytrec_eval"},
"retriever_ndcg_cut_5": {"type": "pytrec_eval"},
"retriever_recall_10": {"type": "pytrec_eval"},
"retriever_ndcg_cut_10": {"type": "pytrec_eval"},
"rag_faithfulness": {"type": "ragas"},
"rag_answer_relevancy": {"type": "ragas"}
}
}
}
}
{
"query": "What is the capital of France?",
"retrieved_docs": [
{"title": "France", "text": "Paris is the capital of France."}
],
"reference": "Paris",
"output": "Paris"
}
{
"job": "eval-abc123def456",
"files_url": "hf://datasets/evaluation-results/eval-abc123def456",
"tasks": {
"my-beir-task": {
"metrics": {
"rag_answer_relevancy": {
"scores": {
"answer_relevancy": {
"value": 1.0,
"stats": {}
}
}
},
"rag_faithfulness": {
"scores": {
"faithfulness": {
"value": 1.0,
"stats": {}
}
}
},
"retriever_retriever.ndcg_cut_5": {
"scores": {
"ndcg_cut_5": {
"value": 0.9,
"stats": {}
}
}
},
"retriever_retriever.recall_5": {
"scores": {
"recall_5": {
"value": 1.0,
"stats": {}
}
}
}
}
}
},
"namespace": "my-organization",
"custom_fields": {}
}
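To run an evaluation like the one above programmatically, you typically create the config and then launch a job that pairs it with a RAG target. The sketch below assumes the Evaluator microservice exposes /v1/evaluation/configs and /v1/evaluation/jobs routes and that a target named my-organization/<my-rag-target> already exists; see the evaluation job APIs for the exact payloads.

import json
import requests

EVALUATOR_URL = "http://<nemo-evaluator>"  # placeholder base URL for the Evaluator service

# Load the "rag-standard" config shown above from disk.
with open("rag-standard-config.json") as f:
    config = json.load(f)

# Create the evaluation config.
resp = requests.post(f"{EVALUATOR_URL}/v1/evaluation/configs", json=config)
resp.raise_for_status()

# Launch a job that pairs the config with an existing RAG target.
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={
        "config": "my-organization/rag-standard",
        "target": "my-organization/<my-rag-target>",
    },
)
job.raise_for_status()
print(job.json())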
Retrieval + Answer Generation + Answer Evaluation (Custom Data)#
{
"type": "rag",
"name": "rag-custom",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-judge-embedding-url>",
"model_id": "<my-judge-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"retriever_recall_5": {"type": "pytrec_eval"},
"retriever_ndcg_cut_5": {"type": "pytrec_eval"},
"retriever_recall_10": {"type": "pytrec_eval"},
"retriever_ndcg_cut_10": {"type": "pytrec_eval"},
"rag_faithfulness": {"type": "ragas"},
"rag_answer_relevancy": {"type": "ragas"}
}
}
}
}
{
"query": "Who wrote Les Misérables?",
"retrieved_docs": [
{"title": "Les Misérables", "text": "Victor Hugo wrote Les Misérables."}
],
"reference": "Victor Hugo",
"output": "Victor Hugo"
}
{
"job": "eval-def789ghi012",
"files_url": "hf://datasets/evaluation-results/eval-def789ghi012",
"tasks": {
"my-beir-task": {
"metrics": {
"rag_answer_relevancy": {
"scores": {
"answer_relevancy": {
"value": 1.0,
"stats": {}
}
}
},
"rag_faithfulness": {
"scores": {
"faithfulness": {
"value": 1.0,
"stats": {}
}
}
},
"retriever_retriever.ndcg_cut_5": {
"scores": {
"ndcg_cut_5": {
"value": 0.95,
"stats": {}
}
}
},
"retriever_retriever.recall_5": {
"scores": {
"recall_5": {
"value": 1.0,
"stats": {}
}
}
}
}
}
},
"namespace": "my-organization",
"custom_fields": {}
}
Answer Evaluation (Pre-generated Answers)#
{
"type": "rag",
"name": "rag-ans-eval-pregen",
"namespace": "my-organization",
"tasks": {
"my-ragas-task": {
"type": "ragas",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-query-embedding-url>",
"model_id": "<my-query-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"retriever_recall_5": {"type": "pytrec_eval"},
"retriever_ndcg_cut_5": {"type": "pytrec_eval"},
"retriever_recall_10": {"type": "pytrec_eval"},
"retriever_ndcg_cut_10": {"type": "pytrec_eval"},
"rag_faithfulness": {"type": "ragas"}
}
}
}
}
{
"query": "What is the main theme of Les Misérables?",
"reference": "Redemption",
"output": "Redemption"
}
{
"job": "eval-ghi345jkl678",
"files_url": "hf://datasets/evaluation-results/eval-ghi345jkl678",
"tasks": {
"my-ragas-task": {
"metrics": {
"rag_faithfulness": {
"scores": {
"faithfulness": {
"value": 1.0,
"stats": {}
}
}
},
"retriever_retriever.ndcg_cut_5": {
"scores": {
"ndcg_cut_5": {
"value": 0.8,
"stats": {}
}
}
},
"retriever_retriever.recall_5": {
"scores": {
"recall_5": {
"value": 1.0,
"stats": {}
}
}
}
}
}
},
"namespace": "my-organization",
"custom_fields": {}
}
RAG (OpenAI-compatible Judge LLM)#
{
"type": "rag",
"name": "rag-openai-judge",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "file://nfcorpus/"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-nim-deployment-base-url>/v1/completions",
"model_id": "<my-model>",
"api_key": "<openai-api-key>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-query-embedding-url>",
"model_id": "<my-query-embedding-model>",
"api_key": "<openai-api-key>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"retriever_recall_5": {"type": "pytrec_eval"},
"retriever_ndcg_cut_5": {"type": "pytrec_eval"},
"retriever_recall_10": {"type": "pytrec_eval"},
"retriever_ndcg_cut_10": {"type": "pytrec_eval"},
"rag_faithfulness": {"type": "ragas"},
"rag_answer_relevancy": {"type": "ragas"}
}
}
}
}
{
"query": "What is the population of Paris?",
"retrieved_docs": [
{"title": "Paris", "text": "The population of Paris is over 2 million."}
],
"reference": "over 2 million",
"output": "over 2 million"
}
{
"job": "eval-jkl901mno234",
"files_url": "hf://datasets/evaluation-results/eval-jkl901mno234",
"tasks": {
"my-beir-task": {
"metrics": {
"rag_answer_relevancy": {
"scores": {
"answer_relevancy": {
"value": 1.0,
"stats": {}
}
}
},
"rag_faithfulness": {
"scores": {
"faithfulness": {
"value": 1.0,
"stats": {}
}
}
},
"retriever_retriever.ndcg_cut_5": {
"scores": {
"ndcg_cut_5": {
"value": 0.92,
"stats": {}
}
}
},
"retriever_retriever.recall_5": {
"scores": {
"recall_5": {
"value": 1.0,
"stats": {}
}
}
}
}
}
},
"namespace": "my-organization",
"custom_fields": {}
}
RAG (using RAGAS NVIDIA Metrics)#
{
"target": {
"type": "rag",
"rag": {
"pipeline": {
"retriever": {
"pipeline": {
"query_embedding_model": {
"api_endpoint": {
"url": "<my-query-embedding-url>",
"model_id": "<my-query-embedding-model>",
"format": "nim"
}
},
"index_embedding_model": {
"api_endpoint": {
"url": "<my-index-embedding-url>",
"model_id": "<my-index-embedding-model>",
"format": "nim"
}
},
"top_k": 1
}
},
"model": {
"api_endpoint": {
"url": "<my-model-url>",
"model_id": "<my-model-id>",
"format": "nim"
}
}
}
}
},
"config": {
"type": "rag",
"tasks": {
"rag": {
"type": "ragas",
"dataset": {
"files_url": "<my-dataset-url>",
"format": "ragas"
},
"metrics": {
"retriever_recall_5": {
"type": "pytrec_eval"
},
"rag_answer_accuracy": {
"type": "ragas"
},
"rag_context_relevance": {
"type": "ragas"
},
"rag_response_groundedness": {
"type": "ragas"
}
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-judge-embedding-url>",
"model_id": "<my-judge-embedding-model>"
}
},
"judge_timeout": 120,
"judge_max_retries": 5,
"judge_max_workers": 24,
"judge_max_token": 2048,
"judge_llm_top_p": 1.0
}
}
}
}
}
{
"question": "What are the key features of NVIDIA NIMs?",
"contexts": [
"NVIDIA NIMs are containerized microservices that provide optimized inference for AI models."
],
"answer": "NVIDIA NIMs offer containerized AI inference with optimized performance and scalability.",
"ground_truth": ["NVIDIA NIMs are optimized containerized inference services for AI models."]
}
{
"tasks": {
"rag": {
"metrics": {
"rag_nv_accuracy": {
"scores": {
"nv_accuracy": {
"value": 0.95,
"stats": {}
}
}
},
"rag_nv_context_relevance": {
"scores": {
"nv_context_relevance": {
"value": 0.92,
"stats": {}
}
}
},
"rag_nv_response_groundedness": {
"scores": {
"nv_response_groundedness": {
"value": 0.98,
"stats": {}
}
}
},
"retriever_retriever.recall_5": {
"scores": {
"recall_5": {
"value": 1.0,
"stats": {}
}
}
}
}
}
}
}
Metrics#
The RAG Pipeline evaluation includes two categories of metrics: document retrieval and answer generation.
Document Retrieval#
The following table summarizes the key document retrieval metrics available for RAG evaluation:
| Metric Name | Description | How k is set | Notes |
|---|---|---|---|
| `recall_k` | Fraction of relevant documents retrieved in the top k results | User sets k (1 to top_k) | top_k is the retriever's configured value |
| `ndcg_k` | Normalized Discounted Cumulative Gain at rank k (ranking quality up to k) | User sets k (1 to top_k) | Range: 0.0 - 1.0 |
| `ndcg_cut_k` | NDCG at rank k (cutoff variant, often equivalent to ndcg_k) | User sets k (1 to top_k) | Range: 0.0 - 1.0 |
| `P_k` | Precision at rank k (fraction of retrieved documents that are relevant) | User sets k (1 to top_k) | Range: 0.0 - 1.0 |
| `map_cut_k` | Mean Average Precision with cutoff at rank k | User sets k (1 to top_k) | Range: 0.0 - 1.0 |
Note
For an exhaustive list of supported retriever metrics, see Retriever Metrics. When using retrieval metrics in the RAG spec, use the `retriever_` prefix with the `pytrec_eval` type.
Retriever metrics are only computed when a retriever pipeline is specified in the RAG target configuration.
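For intuition about what these scores mean, the following standalone sketch computes recall@k and NDCG@k for a single query from a ranked list of document IDs and a set of relevance judgments. It illustrates the formulas only; the service itself computes these metrics with pytrec_eval.

import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k with graded relevance judgments (doc_id -> gain)."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["doc3", "doc1", "doc7", "doc2", "doc9"]   # retriever output for one query
qrels = {"doc1": 1, "doc2": 1}                      # relevance judgments for that query
print(recall_at_k(ranked, set(qrels), 5))           # 1.0 -- both relevant docs are in the top 5
print(ndcg_at_k(ranked, qrels, 5))                  # < 1.0 because the relevant docs are not ranked first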
Answer Generation#
The following table summarizes the available answer generation metrics for RAG evaluation, including their requirements and dependencies:
| Metric Name | Description | Dataset Format(s) | Required Columns | Eval Config Model Endpoints |
|---|---|---|---|---|
| `faithfulness` | Factual consistency of the answer vs. context (0–1, higher is better) | beir, squad, ragas | question, answer, contexts | judge_llm |
| `answer_relevancy` | Relevancy of answer to prompt; penalizes incomplete/redundant answers | beir, squad, ragas | question, answer | judge_llm, judge_embeddings |
| `answer_correctness` | Accuracy vs. ground truth (0–1, higher is better) | ragas | question, answer, ground_truth | judge_llm, judge_embeddings |
| `answer_similarity` | Semantic similarity to ground truth (0–1, higher is better) | ragas | ground_truth, answer | judge_llm, judge_embeddings |
| `context_precision` | Precision of context ranking for ground-truth items (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |
| `context_recall` | Recall: does context align with ground-truth answer? (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |
| `answer_accuracy` | Agreement between model response and reference ground truth via dual LLM-as-a-judge evaluation (0, 1, 2; higher is better) | ragas | question, answer, ground_truth | judge_llm, judge_embeddings |
| `context_relevance` | Evaluates whether retrieved contexts are pertinent to user input via dual LLM-as-a-judge assessment (0, 1, 2; higher is better) | ragas | question, contexts | judge_llm |
| `response_groundedness` | Measures how well response claims are supported by retrieved contexts and can be found within them (0, 1, 2; higher is better) | ragas | question, answer, contexts | judge_llm |
| `context_entity_recall` | Recall of entities in context compared to ground truth (0–1, higher is better) | ragas | question, contexts, ground_truth | judge_llm |
| `noise_sensitivity` | Measures robustness to irrelevant context (0–1, lower is better) | ragas | question, answer, contexts | judge_llm |
| `response_relevancy` | Overall relevancy of the response to the query (0–1, higher is better) | ragas | question, answer | judge_llm, judge_embeddings |
Legend:
- `judge_llm`: Metric uses a large language model as a judge.
- `judge_embeddings`: Metric uses embedding-based similarity.
RAG Generation and Judge Parameters#
The following table summarizes the available parameters for configuring RAG answer generation and judge models:
| Parameter Name | Type | Description | Default | Notes |
|---|---|---|---|---|
| **Document Processing Parameters** | | | | |
| `truncate_long_documents` | string | Truncation strategy for documents exceeding the Milvus 65k character limit | None | Options: "start", "end" |
| **Generation Parameters** | | | | |
| `generation_max_tokens` | int | Maximum number of tokens to generate for RAG responses | None | Controls response length |
| `generation_max_workers` | int | Maximum number of concurrent workers for answer generation | None | Controls generation parallelism |
| `generation_temperature` | float | Temperature for answer generation (0.0-1.0) | None | Higher values increase randomness |
| **Judge LLM Parameters** | | | | |
| `judge_llm` | object | Judge LLM model configuration with api_endpoint | Required | Contains model_id, url, api_key |
| `judge_llm_temperature` | float | Temperature for judge LLM (0.0-1.0) | None | Lower values for more consistent judging |
| `judge_llm_top_p` | float | Top-p sampling for judge LLM (0.0-1.0) | None | Controls token selection diversity |
| `judge_llm_max_tokens` | int | Maximum tokens for judge LLM responses | None | Limits judge response length |
| **Judge Embeddings Parameters** | | | | |
| `judge_embeddings` | object | Configuration for the embedding model used by the judge (e.g., for answer_similarity) | Required | For similarity-based metrics |
| **Request Management** | | | | |
| `judge_timeout` | int | Timeout for judge requests (seconds) | None | |
| `judge_max_retries` | int | Maximum retries for failed judge requests | None | Error handling for judge calls |
| `judge_max_workers` | int | Maximum concurrent judge workers | None | Controls judge evaluation parallelism |
Usage Example:
params:
judge_llm:
api_endpoint:
url: "https://integrate.api.nvidia.com/v1"
model_id: "nvdev/meta/llama-3.1-8b-instruct"
judge_embeddings:
api_endpoint:
url: "https://integrate.api.nvidia.com/v1"
model_id: "nvidia/nv-embedqa-e5-v5"
judge_timeout: 120
judge_max_retries: 5
judge_max_workers: 2
truncate_long_documents: "end"
generation_max_tokens: 512
generation_temperature: 0.8
judge_llm_temperature: 0.1
judge_llm_top_p: 0.9
judge_llm_max_tokens: 1024
Custom Dataset Format#
BEIR#
The BEIR (Benchmarking Information Retrieval) framework supports various datasets for evaluating retrieval systems. Supported BEIR datasets include:
- `fiqa` - Financial question answering dataset
- `nfcorpus` - Natural language corpus for biomedical information retrieval
- `scidocs` - Scientific document retrieval and citation recommendation
- `scifact` - Scientific fact verification dataset
Note
For a complete list of available BEIR datasets, refer to the BEIR repository.
corpus.jsonl (BEIR)#
For BEIR, the `corpus.jsonl` file contains a list of dictionaries with the following fields:

| Field | Type | Required | Description |
|---|---|---|---|
| `_id` | string | Yes | Unique document identifier. |
| `title` | string | No | Document title (optional). |
| `text` | string | Yes | Document paragraph or passage. |
{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
queries.jsonl (BEIR)#
The `queries.jsonl` file contains a list of dictionaries with the following fields:

| Field | Type | Required | Description |
|---|---|---|---|
| `_id` | string | Yes | Unique query identifier. |
| `text` | string | Yes | Query text. |
{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
qrels.tsv (BEIR)#
The `qrels.tsv` file is a tab-separated file with three columns: `query-id`, `corpus-id`, and `score`. The first row should be a header.

| Field | Type | Required | Description |
|---|---|---|---|
| `query-id` | string | Yes | Query identifier (matches `_id` in `queries.jsonl`). |
| `corpus-id` | string | Yes | Document identifier (matches `_id` in `corpus.jsonl`). |
| `score` | integer | Yes | Relevance score (typically 1 for relevant, 0 for not relevant). |
query-id corpus-id score
q1 doc1 1
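As a concrete illustration of the three files, the following sketch writes a minimal BEIR-formatted dataset directory. File names follow the descriptions above; adjust the directory layout to match how your dataset is packaged and referenced by `files_url`.

import csv
import json
from pathlib import Path

out_dir = Path("my-beir-dataset")
out_dir.mkdir(exist_ok=True)

corpus = [{"_id": "doc1", "title": "Albert Einstein",
           "text": "Albert Einstein was a German-born theoretical physicist."}]
queries = [{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}]
qrels = [("q1", "doc1", 1)]

# corpus.jsonl and queries.jsonl: one JSON object per line.
with (out_dir / "corpus.jsonl").open("w") as f:
    f.writelines(json.dumps(row) + "\n" for row in corpus)
with (out_dir / "queries.jsonl").open("w") as f:
    f.writelines(json.dumps(row) + "\n" for row in queries)

# qrels.tsv: tab-separated with a header row.
with (out_dir / "qrels.tsv").open("w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["query-id", "corpus-id", "score"])
    writer.writerows(qrels)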
SQuAD#
squad.json (SQuAD)#
For SQuAD, the `squad.json` file contains question-answer pairs with their corresponding context passages in a structured format. It follows the official SQuAD dataset structure with nested fields for data, paragraphs, and question-answer pairs.

| Field | Type | Required | Description |
|---|---|---|---|
| `data` | list of objects | Yes | List of data entries, each with paragraphs. |
| `paragraphs` | list of objects | Yes | List of paragraphs for each data entry. |
| `context` | string | Yes | Context passage for the questions. |
| `document_id` | string | Yes | Document identifier. |
| `qas` | list of objects | Yes | List of question-answer pairs. |
| `question` | string | Yes | The question being asked. |
| `id` | string | Yes | Unique identifier for the question. |
| `answers` | list of objects | Yes | List of answers, each with a `text` field. |
| `text` | string | Yes | The answer text (inside `answers`). |
{
"data": [
{
"paragraphs": [
{
"context": "my context",
"document_id": "my id",
"qas": [
{
"question": "my question",
"id": "my id",
"answers": [
{"text": "my answer"}
]
}
]
}
]
}
]
}
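A short sketch that assembles flat question/answer/context records into this nested structure and writes `squad.json` (the input records and their field names are hypothetical; only the output layout follows the format above):

import json

# Hypothetical flat records to convert into the nested SQuAD layout.
records = [
    {"document_id": "doc-einstein",
     "context": "Albert Einstein developed the theory of relativity.",
     "question": "Who developed the theory of relativity?",
     "answer": "Albert Einstein",
     "qid": "q1"},
]

squad = {"data": []}
for rec in records:
    squad["data"].append({
        "paragraphs": [{
            "context": rec["context"],
            "document_id": rec["document_id"],
            "qas": [{
                "question": rec["question"],
                "id": rec["qid"],
                "answers": [{"text": rec["answer"]}],
            }],
        }]
    })

with open("squad.json", "w") as f:
    json.dump(squad, f, indent=2)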
Ragas#
ragas.json (Ragas)#
For Ragas, the `ragas.json` file contains questions, contexts, answers, and ground truth for evaluating RAG systems. This format allows for comprehensive assessment of retrieval and generation quality.

| Field | Type | Required | Description |
|---|---|---|---|
| `question` | list of strings | Yes | List of questions. |
| `contexts` | list of lists of strings | No | List of context passages for each question. |
| `answer` | list of strings | No | List of predicted answers for each question. |
| `ground_truth` | list of strings | No | List of ground truth answers for each question. |
{
"question": ["question #1", "question #2"],
# Optional. Used for Answer Generation and Answer Evaluation (for some specific RAG metrics)
"contexts": [["context #1 for question #1", "context #2 for question #1"], ["context #1 for question #2", "context #2 for question #2"]],
# Optional. Used for Answer Evaluation (for some specific RAG metrics)
"answer": ["predicted answer for question #1", "predicted answer for question #2"],
# Optional. Used for Answer Evaluation (for some specific RAG metrics)
"ground_truth": ["ground truth answer for question #1", "ground truth answer for question #2"]
}
{
"question": [
"When did the 2024 Paris Olympics opening ceremony take place?",
"Where was the 2024 Paris Olympics opening ceremony held?",
"Who lit the Olympic cauldron during the 2024 Paris Olympics opening ceremony?"
],
"contexts": [
[
"The 2024 Paris Olympics officially began with the opening ceremony on July 26, 2024.",
"The ceremony was staged along the River Seine in Paris, marking the first time an Olympic opening was held outside a traditional stadium.",
"French swimmer Marie Wattel had the honor of lighting the Olympic cauldron."
],
[
"The 2024 Paris Olympics officially began with the opening ceremony on July 26, 2024.",
"The ceremony was staged along the River Seine in Paris, marking the first time an Olympic opening was held outside a traditional stadium.",
"French swimmer Marie Wattel had the honor of lighting the Olympic cauldron."
],
[
"The 2024 Paris Olympics officially began with the opening ceremony on July 26, 2024.",
"The ceremony was staged along the River Seine in Paris, marking the first time an Olympic opening was held outside a traditional stadium.",
"French swimmer Marie Wattel had the honor of lighting the Olympic cauldron."
]
],
"ground_truth": [
"July 26, 2024",
"Along the River Seine, Paris",
"French swimmer Marie Wattel"
]
}
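To produce a `ragas.json` from per-question records, you can pivot row-wise data into the column-wise lists shown above. A minimal sketch (the input records are hypothetical; only the output layout follows the format above):

import json

# Hypothetical per-question records to pivot into the column-wise ragas.json layout.
records = [
    {"question": "When did the 2024 Paris Olympics opening ceremony take place?",
     "contexts": ["The 2024 Paris Olympics officially began with the opening ceremony on July 26, 2024."],
     "ground_truth": "July 26, 2024"},
]

ragas_data = {
    "question": [r["question"] for r in records],
    "contexts": [r["contexts"] for r in records],
    "ground_truth": [r["ground_truth"] for r in records],
}

with open("ragas.json", "w") as f:
    json.dump(ragas_data, f, indent=2)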