Retriever Evaluation Type#

The Retriever evaluation type measures how well a document retrieval pipeline performs on standard academic datasets and on custom datasets. Use it to assess retrieval accuracy with metrics such as recall@k and NDCG@k.

Prerequisites#

Before running Retriever evaluations, ensure you have:

For custom datasets:

Uploaded your dataset to NeMo Data Store using Hugging Face CLI or SDK
Registered your dataset in NeMo Entity Store using the Dataset APIs
Formatted your data according to the BEIR format requirements

For all Retriever evaluations:

Access to embedding models for document indexing and query processing
Reranking service endpoints (optional, for improved retrieval accuracy)
Properly configured retrieval pipeline components

Tip

For a complete dataset creation walkthrough, see the dataset management tutorials or follow the end-to-end evaluation example.

Authentication for External Services#

Retriever evaluations support API key authentication for external embedding and reranking services. This enables secure integration with third-party APIs, such as those from OpenAI, Cohere, and other providers.

Tip

For comprehensive authentication configuration examples and security best practices, refer to API Key Authentication.

Common Authentication Scenarios#

External query embedding models
External index embedding models
Third-party reranking services

To authenticate in any of these scenarios, add the api_key field to the corresponding api_endpoint configuration:

{
  "api_endpoint": {
    "url": "https://api.cohere.ai/v1/rerank",
    "model_id": "rerank-english-v2.0",
    "api_key": "your-cohere-key"
  }
}
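Hard-coding a key in a configuration file risks leaking it through version control. The following is a minimal Python sketch of one way to avoid that, assuming the key is kept in an environment variable. The helper `build_api_endpoint` and the variable name `COHERE_API_KEY` are hypothetical and are not part of the evaluation service; only the `url`, `model_id`, and `api_key` fields come from the example above.

```python
import json
import os
from typing import Mapping


def build_api_endpoint(
    url: str,
    model_id: str,
    key_env_var: str,
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Build an api_endpoint block, reading the API key from `env`.

    `env` defaults to the process environment; passing a plain dict
    makes the helper easy to exercise without touching real secrets.
    """
    api_key = env.get(key_env_var)
    if not api_key:
        raise RuntimeError(f"Environment variable {key_env_var} is not set")
    return {"url": url, "model_id": model_id, "api_key": api_key}


# Demo with a stand-in environment (hypothetical variable name and placeholder key).
demo_env = {"COHERE_API_KEY": "your-cohere-key"}
endpoint_config = {
    "api_endpoint": build_api_endpoint(
        "https://api.cohere.ai/v1/rerank",
        "rerank-english-v2.0",
        "COHERE_API_KEY",
        env=demo_env,
    )
}
print(json.dumps(endpoint_config, indent=2))
```

In real use, omit the `env` argument so the key is read from the process environment at submission time.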

Options#

Embedding + Reranking (Standard Data)#

Config

{
    "type": "retriever",
    "name": "retriever-standard",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://fiqa/"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}
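Because each metric entry repeats its own name as the `type`, the `metrics` block can be generated rather than written by hand. The sketch below assumes only the fields shown in the configuration above; the helper `build_retriever_config` and its `cutoffs` parameter are hypothetical conveniences, not part of any SDK.

```python
import json


def build_retriever_config(name, namespace, task_name, files_url, cutoffs=(5, 10)):
    """Assemble a retriever evaluation config with recall and NDCG at each cutoff."""
    metrics = {}
    for k in cutoffs:
        for prefix in ("recall", "ndcg_cut"):
            metric_name = f"{prefix}_{k}"
            # The metric key and its "type" are identical in the example config.
            metrics[metric_name] = {"type": metric_name}
    return {
        "type": "retriever",
        "name": name,
        "namespace": namespace,
        "tasks": {
            task_name: {
                "type": "beir",
                "dataset": {"files_url": files_url},
                "metrics": metrics,
            }
        },
    }


config = build_retriever_config(
    name="retriever-standard",
    namespace="my-organization",
    task_name="my-beir-task",
    files_url="file://fiqa/",
)
print(json.dumps(config, indent=4))
```

Passing a different `files_url`, such as the `hf://datasets/...` form used for custom data, yields the custom-data variant of the same configuration.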

Data Format

{
  "query": "What is the capital of France?",
  "retrieved_docs": [
    {"title": "France", "text": "Paris is the capital of France."},
    {"title": "Paris", "text": "Paris is a city in France."}
  ],
  "reference": "Paris"
}

Result

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.9},
            "recall_10": {"value": 1.0},
            "ndcg_cut_10": {"value": 0.85}
          }
        }
      }
    }
  }
}
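The scores sit several levels deep in the result payload. As a rough sketch, assuming the nesting shown above (`groups` → group name → `metrics` → metric group → `scores`), they can be flattened into a simple name-to-value mapping. The function `extract_scores` is a hypothetical helper, not an SDK call.

```python
import json

# Result payload copied from the example above.
result_json = """
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.9},
            "recall_10": {"value": 1.0},
            "ndcg_cut_10": {"value": 0.85}
          }
        }
      }
    }
  }
}
"""


def extract_scores(result: dict) -> dict:
    """Flatten every score in the result into {metric_name: value}."""
    scores = {}
    for group in result.get("groups", {}).values():
        for metric_group in group.get("metrics", {}).values():
            for metric_name, entry in metric_group.get("scores", {}).items():
                scores[metric_name] = entry["value"]
    return scores


scores = extract_scores(json.loads(result_json))
for metric_name, value in scores.items():
    print(f"{metric_name}: {value}")
```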

Embedding + Reranking (Custom Data)#

Config

{
    "type": "retriever",
    "name": "retriever-custom",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}

Data Format

{
  "query": "Who wrote Les Misérables?",
  "retrieved_docs": [
    {"title": "Les Misérables", "text": "Victor Hugo wrote Les Misérables."},
    {"title": "Victor Hugo", "text": "Victor Hugo was a French writer."}
  ],
  "reference": "Victor Hugo"
}

Result

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.95},
            "recall_10": {"value": 1.0},
            "ndcg_cut_10": {"value": 0.9}
          }
        }
      }
    }
  }
}

Metrics#

Supported Retriever Metrics#

| Metric | Description | Value Range | Example |
|---|---|---|---|
| `recall_k` | Fraction of relevant documents retrieved in the top k results | 0.0 – 1.0 | `recall_5`, `recall_10` |
| `ndcg_k` | Normalized Discounted Cumulative Gain at rank k (ranking quality up to k) | 0.0 – 1.0 | `ndcg_5`, `ndcg_10` |
| `ndcg_cut_k` | NDCG at rank k (cutoff variant, often equivalent to `ndcg_k`) | 0.0 – 1.0 | `ndcg_cut_5`, `ndcg_cut_10` |
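
To make the two metric families concrete, here is a small Python sketch that computes recall@k and NDCG@k for a single query. It assumes the common formulation with linear gain and a log2 rank discount; the evaluation service may differ in details such as tie handling, so treat this as an illustration rather than the service's implementation. The function names and the sample document IDs are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k with linear gain: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)  # rank is 0-based
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


# One query: the system's ranking and the judged relevance labels.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevance = {"d1": 1, "d2": 1}

recall_5 = recall_at_k(ranked, set(relevance), 5)
ndcg_5 = ndcg_at_k(ranked, relevance, 5)
print(f"recall@5 = {recall_5:.3f}, ndcg@5 = {ndcg_5:.3f}")
```

In this example both relevant documents are retrieved within the top 5, so recall@5 is perfect, while NDCG@5 is lower because the relevant documents sit at ranks 2 and 4 rather than 1 and 2. A reported score is typically the average of such per-query values.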

Custom Dataset Format#

For custom dataset formatting details, refer to the Custom Dataset Format documentation for the RAG evaluation type.
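
For orientation, the sketch below writes a tiny dataset in the general BEIR layout: a `corpus.jsonl` file, a `queries.jsonl` file, and a tab-separated relevance file under `qrels/`. The file names and columns follow the public BEIR convention; whether the service expects exactly this layout (for example, the `test.tsv` split name) is an assumption to verify against the referenced documentation. The helper `write_beir_dataset`, the directory name, and the IDs are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_beir_dataset(root, corpus, queries, qrels):
    """Write corpus.jsonl, queries.jsonl, and qrels/test.tsv under `root`."""
    root = Path(root)
    (root / "qrels").mkdir(parents=True, exist_ok=True)

    # One JSON object per line: document ID, title, and text.
    with open(root / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            record = {"_id": doc_id, "title": doc["title"], "text": doc["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # One JSON object per line: query ID and query text.
    with open(root / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}, ensure_ascii=False) + "\n")

    # Relevance judgments as tab-separated rows with a header line.
    with open(root / "qrels" / "test.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, judged in qrels.items():
            for doc_id, score in judged.items():
                writer.writerow([query_id, doc_id, score])


corpus = {
    "doc1": {"title": "Les Misérables", "text": "Victor Hugo wrote Les Misérables."},
    "doc2": {"title": "Victor Hugo", "text": "Victor Hugo was a French writer."},
}
queries = {"q1": "Who wrote Les Misérables?"}
qrels = {"q1": {"doc1": 1}}

dataset_dir = Path(tempfile.mkdtemp()) / "my-dataset"
write_beir_dataset(dataset_dir, corpus, queries, qrels)
print(sorted(p.relative_to(dataset_dir).as_posix() for p in dataset_dir.rglob("*") if p.is_file()))
```

The resulting directory is the kind of folder that would then be uploaded to NeMo Data Store and referenced through `files_url`.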