Retrieval Evaluation Flow#

The retriever evaluation type measures the effectiveness of document retrieval pipelines on standard academic datasets (such as BEIR benchmarks) and on custom datasets. Use it to assess retrieval accuracy with metrics such as recall@k and NDCG@k.

Prerequisites#

Before running Retriever evaluations, ensure you have:

  • Access to embedding models for document indexing and query processing

  • Optional reranking service endpoints (for improved retrieval accuracy)

  • Properly configured retrieval pipeline components (a configuration sketch follows this list)
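
The sketch below shows one way these components might be wired together in a retriever pipeline configuration. The field names (query_embedding_model, index_embedding_model, reranker_model) are illustrative assumptions rather than a definitive schema; confirm them against your target configuration reference. Each api_endpoint block follows the pattern described under Authentication for External Services.

{
  "retriever": {
    "pipeline": {
      "query_embedding_model": {
        "api_endpoint": {
          "url": "<embedding-endpoint-url>",
          "model_id": "<embedding-model-id>",
          "api_key": "<your-embedding-key>"
        }
      },
      "index_embedding_model": {
        "api_endpoint": {
          "url": "<embedding-endpoint-url>",
          "model_id": "<embedding-model-id>",
          "api_key": "<your-embedding-key>"
        }
      },
      "reranker_model": {
        "api_endpoint": {
          "url": "<reranker-endpoint-url>",
          "model_id": "<reranker-model-id>",
          "api_key": "<your-reranker-key>"
        }
      }
    }
  }
}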

Custom Datasets

Tip

For a complete dataset creation walkthrough, see the dataset management tutorials or follow the end-to-end evaluation example.


Authentication for External Services#

Use API keys to authenticate to external embedding and reranking providers (such as OpenAI and Cohere). Keys are commonly needed for embedding queries, embedding documents during indexing, and reranking.

Add api_key to an api_endpoint configuration:

{
  "api_endpoint": {
    "url": "https://api.cohere.ai/v1/rerank",
    "model_id": "rerank-english-v2.0",
    "api_key": "your-cohere-key"
  }
}
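
An embedding endpoint uses the same fields. For example, with OpenAI’s embeddings API (illustrative values; substitute your provider’s endpoint URL and model ID):

{
  "api_endpoint": {
    "url": "https://api.openai.com/v1/embeddings",
    "model_id": "text-embedding-3-small",
    "api_key": "your-openai-key"
  }
}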

Refer to API Key Authentication for configuration examples and security best practices.


Embedding + Reranking (Standard Data)#

Example configuration:

{
    "type": "retriever",
    "name": "retriever-standard",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://fiqa/"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}

An example record, pairing a query with retrieved documents and a reference answer:

{
  "query": "What is the capital of France?",
  "retrieved_docs": [
    {"title": "France", "text": "Paris is the capital of France."},
    {"title": "Paris", "text": "Paris is a city in France."}
  ],
  "reference": "Paris"
}

Example results:

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.9},
            "recall_10": {"value": 1.0},
            "ndcg_cut_10": {"value": 0.85}
          }
        }
      }
    }
  }
}

Embedding + Reranking (Custom Data)#

Example configuration:

{
    "type": "retriever",
    "name": "retriever-custom",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}

An example record, pairing a query with retrieved documents and a reference answer:

{
  "query": "Who wrote Les Misérables?",
  "retrieved_docs": [
    {"title": "Les Misérables", "text": "Victor Hugo wrote Les Misérables."},
    {"title": "Victor Hugo", "text": "Victor Hugo was a French writer."}
  ],
  "reference": "Victor Hugo"
}

Example results:

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.95},
            "recall_10": {"value": 1.0},
            "ndcg_cut_10": {"value": 0.9}
          }
        }
      }
    }
  }
}

Metrics#

Supported Retriever Metrics#

| Metric | Description | Value Range | Example |
| --- | --- | --- | --- |
| recall_k | Fraction of relevant documents retrieved in the top k results | 0.0 – 1.0 | recall_5, recall_10 |
| ndcg_k | Normalized Discounted Cumulative Gain at rank k (ranking quality up to k) | 0.0 – 1.0 | ndcg_5, ndcg_10 |
| ndcg_cut_k | NDCG at rank k (cutoff variant, often equivalent to ndcg_k) | 0.0 – 1.0 | ndcg_cut_5, ndcg_cut_10 |
| P_k | Precision at rank k (fraction of retrieved documents that are relevant) | 0.0 – 1.0 | P_5, P_10 |
| map | Mean Average Precision (average of precision values at each relevant document) | 0.0 – 1.0 | map |
| recip_rank | Mean Reciprocal Rank (average of reciprocal ranks of first relevant documents) | 0.0 – 1.0 | recip_rank |
| bpref | Binary Preference (preference of relevant over non-relevant documents) | 0.0 – 1.0 | bpref |
| Rprec | R-Precision (precision at R, where R is the number of relevant documents) | 0.0 – 1.0 | Rprec |
| map_cut_k | Mean Average Precision with cutoff at rank k | 0.0 – 1.0 | map_cut_5, map_cut_10 |

Note

Additional metrics are also supported, including success_k, relative_P_k, ndcg_rel, gm_map, gm_bpref, infAP, Rprec_mult_X, iprec_at_recall_X, and various statistical metrics supported by the TREC evaluation library (num_q, num_ret, num_rel, num_rel_ret, etc.).
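
For orientation, recall@k and NDCG@k follow the standard definitions below; the exact discounting variant is determined by the underlying TREC evaluation library.

$$
\mathrm{recall@}k = \frac{\lvert \text{relevant documents} \cap \text{top-}k \text{ retrieved} \rvert}{\lvert \text{relevant documents} \rvert}
$$

$$
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \qquad \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
$$

Here rel_i is the relevance grade of the document at rank i, and IDCG@k is the DCG@k of the ideal ranking. For example, if a query has two relevant documents and only one of them appears in the top 5 results, recall@5 = 0.5.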

Custom Dataset Format#

Refer to RAG’s Custom Dataset Format documentation.
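
As a rough guide, beir tasks conventionally follow the public BEIR layout: a corpus.jsonl with documents, a queries.jsonl with queries, and a qrels TSV mapping query IDs to relevant document IDs. The sketch below uses that convention with values adapted from the example above; confirm the exact field names against the Custom Dataset Format documentation.

corpus.jsonl:

{"_id": "doc1", "title": "France", "text": "Paris is the capital of France."}
{"_id": "doc2", "title": "Paris", "text": "Paris is a city in France."}

queries.jsonl:

{"_id": "q1", "text": "What is the capital of France?"}

qrels/test.tsv:

query-id	corpus-id	score
q1	doc1	1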

Parameters#

truncate_long_documents (string, optional): Truncation strategy for documents that exceed the Milvus 65k-character limit. Options are “start” and “end”, which truncate from the beginning or the end of the document, respectively.

Usage Example:

{
    "type": "retriever",
    "tasks": {
        "my-retriever-task": {
            "type": "beir",
            "params": {
                "truncate_long_documents": "end"
            },
            "dataset": {
                "files_url": "file://fiqa/"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"}
            }
        }
    }
}