Retrieval Evaluation Flow#

The retriever evaluation type measures the effectiveness of document retrieval pipelines on standard academic datasets (such as BEIR benchmarks) and on custom datasets. Use it to assess retrieval accuracy with metrics such as recall@k and NDCG@k.

Prerequisites#

Before running Retriever evaluations, ensure you have:

  • Access to embedding models for document indexing and query processing

  • Optional reranking service endpoints (for improved retrieval accuracy)

  • Properly configured retrieval pipeline components (a configuration sketch follows this list)
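
The sketch below shows one way these components might be wired together in a retriever pipeline configuration. The field names (query_embedding_model, index_embedding_model, reranker_model) are illustrative assumptions rather than a definitive schema; confirm them against your target configuration reference. Each api_endpoint block follows the pattern described under Authentication for External Services.

{
  "retriever": {
    "pipeline": {
      "query_embedding_model": {
        "api_endpoint": {
          "url": "<embedding-endpoint-url>",
          "model_id": "<embedding-model-id>",
          "api_key": "<your-embedding-key>"
        }
      },
      "index_embedding_model": {
        "api_endpoint": {
          "url": "<embedding-endpoint-url>",
          "model_id": "<embedding-model-id>",
          "api_key": "<your-embedding-key>"
        }
      },
      "reranker_model": {
        "api_endpoint": {
          "url": "<reranker-endpoint-url>",
          "model_id": "<reranker-model-id>",
          "api_key": "<your-reranker-key>"
        }
      }
    }
  }
}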

Custom Datasets

Tip

For a complete dataset creation walkthrough, see the dataset management tutorials or follow the end-to-end evaluation example.


Authentication for External Services#

Use API keys to authenticate to external embedding and reranking providers (such as OpenAI and Cohere). Keys are commonly needed for embedding queries, embedding documents during indexing, and reranking.

Add api_key to an api_endpoint configuration:

{
  "api_endpoint": {
    "url": "https://api.cohere.ai/v1/rerank",
    "model_id": "rerank-english-v2.0",
    "api_key": "your-cohere-key"
  }
}
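
An embedding endpoint uses the same fields. For example, with OpenAI’s embeddings API (illustrative values; substitute your provider’s endpoint URL and model ID):

{
  "api_endpoint": {
    "url": "https://api.openai.com/v1/embeddings",
    "model_id": "text-embedding-3-small",
    "api_key": "your-openai-key"
  }
}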

Refer to API Key Authentication for configuration examples and security best practices.


Embedding + Reranking (Standard Data)#

Example configuration:

{
    "type": "retriever",
    "name": "retriever-standard",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://fiqa/"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}

An example record, pairing a query with retrieved documents and a reference answer:

{
  "query": "What is the capital of France?",
  "retrieved_docs": [
    {"title": "France", "text": "Paris is the capital of France."},
    {"title": "Paris", "text": "Paris is a city in France."}
  ],
  "reference": "Paris"
}

Example results:

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.9},
            "recall_10": {"value": 1.0},
            "ndcg_cut_10": {"value": 0.85}
          }
        }
      }
    }
  }
}

Embedding + Reranking (Custom Data)#

Example configuration:

{
    "type": "retriever",
    "name": "retriever-custom",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}

An example record, pairing a query with retrieved documents and a reference answer:

{
  "query": "Who wrote Les Misérables?",
  "retrieved_docs": [
    {"title": "Les Misérables", "text": "Victor Hugo wrote Les Misérables."},
    {"title": "Victor Hugo", "text": "Victor Hugo was a French writer."}
  ],
  "reference": "Victor Hugo"
}

Example results:

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "recall_5": {"value": 1.0},
            "ndcg_cut_5": {"value": 0.95},
            "recall_10": {"value": 1.0},
            "ndcg_cut_10": {"value": 0.9}
          }
        }
      }
    }
  }
}

Metrics#

Supported Retriever Metrics#

| Metric | Description | Value Range | Example |
| --- | --- | --- | --- |
| recall_k | Fraction of relevant documents retrieved in the top k results | 0.0 – 1.0 | recall_5, recall_10 |
| ndcg_k | Normalized Discounted Cumulative Gain at rank k (ranking quality up to k) | 0.0 – 1.0 | ndcg_5, ndcg_10 |
| ndcg_cut_k | NDCG at rank k (cutoff variant, often equivalent to ndcg_k) | 0.0 – 1.0 | ndcg_cut_5, ndcg_cut_10 |
| P_k | Precision at rank k (fraction of retrieved documents that are relevant) | 0.0 – 1.0 | P_5, P_10 |
| map | Mean Average Precision (average of precision values at each relevant document) | 0.0 – 1.0 | map |
| recip_rank | Mean Reciprocal Rank (average of reciprocal ranks of first relevant documents) | 0.0 – 1.0 | recip_rank |
| bpref | Binary Preference (preference of relevant over non-relevant documents) | 0.0 – 1.0 | bpref |
| Rprec | R-Precision (precision at R, where R is the number of relevant documents) | 0.0 – 1.0 | Rprec |
| map_cut_k | Mean Average Precision with cutoff at rank k | 0.0 – 1.0 | map_cut_5, map_cut_10 |

Note

Additional metrics are also supported, including success_k, relative_P_k, ndcg_rel, gm_map, gm_bpref, infAP, Rprec_mult_X, iprec_at_recall_X, and various statistical metrics supported by the TREC evaluation library (num_q, num_ret, num_rel, num_rel_ret, etc.).
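
For orientation, recall@k and NDCG@k follow the standard definitions below; the exact discounting variant is determined by the underlying TREC evaluation library.

$$
\mathrm{recall@}k = \frac{\lvert \text{relevant documents} \cap \text{top-}k \text{ retrieved} \rvert}{\lvert \text{relevant documents} \rvert}
$$

$$
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \qquad \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
$$

Here rel_i is the relevance grade of the document at rank i, and IDCG@k is the DCG@k of the ideal ranking. For example, if a query has two relevant documents and only one of them appears in the top 5 results, recall@5 = 0.5.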

Custom Dataset Format#

Refer to RAG’s Custom Dataset Format documentation.
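
As a rough guide, beir tasks conventionally follow the public BEIR layout: a corpus.jsonl with documents, a queries.jsonl with queries, and a qrels TSV mapping query IDs to relevant document IDs. The sketch below uses that convention with values adapted from the example above; confirm the exact field names against the Custom Dataset Format documentation.

corpus.jsonl:

{"_id": "doc1", "title": "France", "text": "Paris is the capital of France."}
{"_id": "doc2", "title": "Paris", "text": "Paris is a city in France."}

queries.jsonl:

{"_id": "q1", "text": "What is the capital of France?"}

qrels/test.tsv:

query-id	corpus-id	score
q1	doc1	1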

Parameters#

truncate_long_documents (string, optional): Truncation strategy for documents that exceed the Milvus 65k-character limit. Options are “start” and “end”, which truncate from the beginning or the end of the document, respectively.

Usage Example:

{
    "type": "retriever",
    "tasks": {
        "my-retriever-task": {
            "type": "beir",
            "params": {
                "truncate_long_documents": "end"
            },
            "dataset": {
                "files_url": "file://fiqa/"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"}
            }
        }
    }
}