Create and Manage Evaluation Configurations#
When you run an evaluation in NVIDIA NeMo Evaluator, you create the target and the configuration for the evaluation as separate resources.
Tip
Because NeMo Evaluator separates the target and the configuration, you can create a configuration once, and reuse it multiple times with different targets (for example, to compare models). To see what targets and configurations are supported together, refer to Combine Evaluation Targets and Configurations.
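For example, once a configuration exists, you can reference it by namespace/name in multiple evaluation jobs, each pointing at a different target. The following is a minimal sketch only; it assumes an evaluation jobs endpoint at /v1/evaluation/jobs and uses placeholder target and configuration names for resources that you have already created. Refer to the Evaluator API reference for the exact job fields.
import requests

EVALUATOR_HOSTNAME = "<your evaluator service endpoint>"
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"  # assumed jobs endpoint

# Reuse one configuration with two different targets, for example to compare models.
for target in ["my-organization/my-target-model-a", "my-organization/my-target-model-b"]:  # hypothetical targets
    job = {
        "namespace": "my-organization",
        "config": "my-organization/my-configuration-name",  # the reusable configuration (namespace/name)
        "target": target,
    }
    response = requests.post(endpoint, json=job).json()
    print(response)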
Evaluator API URL#
To create a configuration for an evaluation, send a POST request to the /v1/evaluation/configs API.
The URL of the evaluator API depends on where you deploy evaluator and how you configure it.
For more information, refer to NeMo Evaluator Deployment Guide.
The examples in this documentation specify {EVALUATOR_HOSTNAME} in the code. Store the evaluator hostname as follows so that you can use it in your code.
Important
Replace <your evaluator service endpoint> with your address, such as evaluator.internal.your-company.com, before you run this code.
export EVALUATOR_HOSTNAME="<your evaluator service endpoint>"
import requests
EVALUATOR_HOSTNAME = "<your evaluator service endpoint>"
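If you set the EVALUATOR_HOSTNAME environment variable with the export command above, you can read it in Python instead of hardcoding the address. A minimal sketch:
import os
import requests

# Read the hostname that was exported in the shell, for example "evaluator.internal.your-company.com".
EVALUATOR_HOSTNAME = os.environ["EVALUATOR_HOSTNAME"]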
Example Config#
The following is the partial structure of the code to create an evaluation configuration. Use the examples and reference in the rest of this documentation to create a configuration specific to your scenario.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "<evaluation-type>",
"name": "<my-configuration-name>",
"namespace": "<my-namespace>",
// More config details
}'
data = {
"type": "<evaluation-type>",
"name": "<my-configuration-name>",
"namespace": "<my-namespace>",
# More config details
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
To see a sample response, refer to Create Config Response.
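If you plan to reference the configuration later, for example when you create an evaluation job or delete the configuration, capture its identifiers from the create response. The following sketch assumes the response shape shown in Create Config Response.
# The create response includes a generated ID; configs are also addressable by namespace/name.
config_id = response["id"]                           # for example, "eval-config-MNOP1234QRST5678"
config_key = f"{data['namespace']}/{data['name']}"   # for example, "my-organization/my-configuration-name"
print(config_id, config_key)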
Configuration JSON Reference#
When you create a configuration for an evaluation, you send a JSON data structure that contains the information for your configuration.
Important
Each configuration is uniquely identified by a combination of namespace and name, for example my-organization/my-configuration.
The following table contains selected field reference for the JSON data. For the full API reference, refer to Evaluator API.
| Name | Description | Type | Valid Values or Child Objects |
|---|---|---|---|
| access_policies | The policies that control who can use the configuration. This field is for sharing configurations across organizations. | Object | — |
| api_endpoint | The endpoint for a model. | Object | — |
| api_key | The key to access an API endpoint. | String | — |
| created_by | The ID of the user that created the configuration. This field is for sharing configurations across organizations. | String | — |
| custom_fields | An optional object that you can use to store additional information. | Object | — |
| dataset | A dataset to use for the evaluation. | Object | — |
| description | A description of the configuration. | String | — |
| extra | Additional parameters for academic benchmarks. | Object | — |
| files_url | The URL for a file that contains pre-generated data. | String | — |
| format | The format of a data file. For format information, refer to Custom Data. | String | — |
| groups | A dictionary of evaluation tasks to run in a group. | Object | — |
| hf_token | A Hugging Face account token. Some benchmark datasets require a valid Hugging Face token to access the data. | String | — |
| id | The ID of the configuration. The ID is returned in the response when you create a configuration. | String | — |
| judge_llm | The model to use to judge the answer. | Object | — |
| judge_model | The model to use to judge the answer. | Object | — |
| limit_samples | The number of samples to evaluate. | Integer | — |
| max_tokens | The maximum number of tokens to generate during inference. | Integer | — |
| max_retries | The number of times an evaluation job retries a request to a model after a failure. | Integer | — |
| metrics | A dictionary of metric objects, keyed by metric name (for example, "accuracy": {"type": "accuracy"}). | Object | — |
| model_id | The ID of the NIM model, as specified in Models. | String | — |
| name | An arbitrary name to identify the configuration. If you don’t specify a name, the default is the ID associated with the configuration. | String | — |
| namespace | An arbitrary organization name, a vendor name, or any other text. If you don’t specify a namespace, a default namespace is used. | String | — |
| ownership | Information about the creator of the configuration, and who can use it. This field is for sharing configurations across organizations. | Object | — |
| parallelism | The parallelism of the job running the benchmark. Not all evaluation types support this parameter. | Integer | — |
| params | A set of parameters to apply to the evaluation. | Object | — |
| project | The ID of a project to associate with the configuration. | String | — |
| request_timeout | The time in milliseconds that the evaluation job waits for a response from the model before it fails. | Integer | — |
| stop | Up to 4 sequences where the API stops generating further tokens. | String or List | — |
| tasks | A dictionary of evaluation tasks to run. | Object | — |
| temperature | Adjusts the randomness of token selection. Higher values increase randomness and creativity; lower values promote deterministic and conservative output. | Number | — |
| top_p | A threshold that selects from the most probable tokens until the cumulative probability exceeds p. | Number | — |
| type | The type of evaluation that the configuration is for. | String | Examples on this page include humaneval, gsm8k, gpqa, similarity_metrics, mt_bench, retriever, and rag. |
| type (task) | The type of a task. | String | Examples on this page include default, beir, ragas, mt_bench, gpqa_diamond_generative_n_shot, and gsm8k_cot_llama. |
| url | The URL for a model endpoint. | String | — |
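As an illustration of how several of these fields fit together, the following params object is a sketch only: the values are illustrative, not defaults, and the exact set of supported fields depends on the evaluation type. Refer to the Evaluator API for details.
params = {
    "max_tokens": 512,             # maximum number of tokens to generate during inference
    "temperature": 0.7,            # randomness of token selection
    "top_p": 0.9,                  # nucleus sampling threshold
    "stop": ["<|endoftext|>"],     # up to 4 stop sequences
    "limit_samples": 100,          # number of samples to evaluate
    "parallelism": 4,              # parallelism of the job running the benchmark
    "max_retries": 5,              # retries after a failed request to the model
    "request_timeout": 30000,      # time to wait for a model response, in milliseconds
    "extra": {"top_k": 1},         # benchmark-specific parameters
}
You pass this object as the params field of the configuration, as the examples in the following sections show.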
LM Harness Extra Parameters#
You can set the following task-specific parameters in the params.extra
section of an LM Harness config.
| Name | Description | Type | Valid Values or Child Objects |
|---|---|---|---|
| apply_chat_template | Specifies whether to apply a chat template to the prompt. You can specify a Boolean or the name of a chat template to apply. | Boolean or String | — |
| batch_size | The batch size for the model. | Integer | — |
| bootstrap_iters | The number of iterations for bootstrap statistics when calculating stderrs. Specify 0 for no stderr calculations. | Integer | — |
| dataset_seed | A random seed for dataset shuffling. | Integer | — |
| fewshot_as_multiturn | Specifies whether to provide the few-shot examples as a multi-turn conversation. | Boolean | — |
| hf_token | A Hugging Face account token to access tokenizers that require authenticated or authorized access. | String | — |
| num_fewshot | The number of examples in the few-shot context. | Integer | — |
| seed | A random seed for Python’s random, numpy, and torch. Accepts a comma-separated list of three values for the random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. | — | — |
| tokenizer | A path to a custom tokenizer to use for the benchmark. | String | — |
| tokenizer_backend | The backend store to use for loading the tokenizer. | String | — |
BigCode Configurations#
BigCode Evaluation Harness is a framework for the evaluation of code generation models. For more information, refer to BigCode Evaluation Harness.
Use the following code to create a configuration for a BigCode evaluation.
For the type of evaluation, specify the BigCode task that you want to run. For the full list of BigCode tasks, refer to tasks.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "humaneval",
"name": "my-configuration-bigcode-humaneval-1",
"namespace": "my-organization",
"params": {
"parallelism": 1,
"limit_samples": 1,
"max_tokens": 512,
"temperature": 1.0,
"top_p": 0.0,
"extra": {
"batch_size": 1,
"top_k": 1
}
}
}'
data = {
"type": "humaneval",
"name": "my-configuration-bigcode-humaneval-1",
"namespace": "my-organization",
"params": {
"parallelism": 1,
"limit_samples": 1,
"max_tokens": 512,
"temperature": 1.0,
"top_p": 0.0,
"extra": {
"batch_size": 1,
"top_k": 1
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
LM Evaluation Harness Configurations#
LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and hellaswag. For more information, refer to LM Evaluation Harness.
Use the following code to create a configuration for an LM Evaluation Harness evaluation.
For the type of evaluation, specify the LM Evaluation Harness task that you want to run. For the full list of tasks, refer to tasks.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "gpqa",
"name": "my-configuration-lm-harness-gpqa-1",
"namespace": "my-organization",
"tasks": {
"gpqa_diamond_generative_n_shot": {
"type": "gpqa_diamond_generative_n_shot"
}
},
"params": {
"max_tokens": 1024,
"temperature": 1.0,
"top_p": 0.0,
"stop": [
"<|endoftext|>",
"<extra_id_1>"
],
"extra": {
"use_greedy": true,
"top_k": 1
}
}
}'
data = {
"type": "gpqa",
"name": "my-configuration-lm-harness-gpqa-1",
"namespace": "my-organization",
"tasks": {
"gpqa_diamond_generative_n_shot": {
"type": "gpqa_diamond_generative_n_shot"
}
},
"params": {
"max_tokens": 1024,
"temperature": 1.0,
"top_p": 0.0,
"stop": [
"<|endoftext|>",
"<extra_id_1>"
],
"extra": {
"use_greedy": True,
"top_k": 1
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
You can set task-specific parameters in the params.extra section of the config, as shown in the following example. For more information, refer to LM Harness Extra Parameters.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "gsm8k",
"name": "my-configuration-lm-harness-gsm8k-1",
"namespace": "my-organization",
"tasks": {
"gsm8k_cot_llama": {
"type": "gsm8k_cot_llama"
}
},
"params": {
"temperature": 0.00001,
"top_p": 0.00001,
"max_tokens": 256,
"stop": ["<|eot|>"],
"extra": {
"num_fewshot": 8,
"batch_size": 16,
"bootstrap_iters": 100000,
"dataset_seed": 42,
"use_greedy": true,
"top_k": 1,
"hf_token": "<my-token>",
"tokenizer_backend": "hf",
"tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
"apply_chat_template": true,
"fewshot_as_multiturn": true
}
}
}'
data = {
"type": "gsm8k",
"name": "my-configuration-lm-harness-gsm8k-1",
"namespace": "my-organization",
"tasks": {
"gsm8k_cot_llama": {
"type": "gsm8k_cot_llama"
}
},
"params": {
"temperature": 0.00001,
"top_p": 0.00001,
"max_tokens": 256,
"stop": ["<|eot|>"],
"extra": {
"num_fewshot": 8,
"batch_size": 16,
"bootstrap_iters": 100000,
"dataset_seed": 42,
"use_greedy": True,
"top_k": 1,
"hf_token": "<my-token>",
"tokenizer_backend": "hf",
"tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
"apply_chat_template": True,
"fewshot_as_multiturn": True
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
Similarity Metrics Configurations#
Similarity Metrics evaluation lets you evaluate a model on custom datasets by comparing the LLM-generated response with a ground-truth response. For more information, refer to Similarity Metrics.
Use the following code to create a configuration for a Similarity Metrics evaluation. For more information about custom data, refer to Use Custom Data with NVIDIA NeMo Evaluator.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "similarity_metrics",
"name": "my-configuration-similarity-1",
"namespace": "my-organization",
"params": {
"max_tokens": 200,
"temperature": 0.7,
"extra": {
"top_k": 20
}
},
"tasks": {
"my-similarity-metrics-task": {
"type": "default",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
},
"metrics": {
"accuracy": {"type": "accuracy"},
"bleu": {"type": "bleu"},
"rouge": {"type": "rouge"},
"em": {"type": "em"},
"f1": {"type": "f1"}
}
}
}
}'
data = {
"type": "similarity_metrics",
"name": "my-configuration-similarity-1",
"namespace": "my-organization",
"params": {
"max_tokens": 200,
"temperature": 0.7,
"extra": {
"top_k": 20
}
},
"tasks": {
"my-similarity-metrics-task": {
"type": "default",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
},
"metrics": {
"accuracy": {"type": "accuracy"},
"bleu": {"type": "bleu"},
"rouge": {"type": "rouge"},
"em": {"type": "em"},
"f1": {"type": "f1"}
}
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
LLM-as-a-Judge Configurations#
With LLM-as-a-Judge, you evaluate an LLM by using another LLM as a judge. For more information, refer to LLM-as-a-Judge.
Example Configuration for LLM-as-a-Judge (Standard MT-Bench Data)#
Use the following code to create a configuration for an LLM-as-a-Judge evaluation that uses the standard MT-Bench data.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "mt_bench",
"name": "my-configuration-judge-1",
"namespace": "my-organization",
"params": {
"max_tokens": 1024,
"temperature": 0.75,
"top_p": 0.9,
"stop": [],
"extra": {
"top_k": 40
}
},
"tasks": {
"my-mt-bench": {
"type": "mt_bench",
"params": {
"judge_model": {
"api_endpoint": {
"url": "<my-nim-deployment-base-url>/completions",
"model_id": "<my-model>"
}
},
"judge_inference_params": {
"max_tokens": 2048,
"temperature": 1.0e-05,
"top_p": 1.0e-05,
"stop": [],
"top_k": 1
}
}
}
}
}'
data = {
"type": "mt_bench",
"name": "my-configuration-judge-1",
"namespace": "my-organization",
"params": {
"max_tokens": 1024,
"temperature": 0.75,
"top_p": 0.9,
"stop": [],
"extra": {
"top_k": 40
}
},
"tasks": {
"my-mt-bench": {
"type": "mt_bench",
"params": {
"judge_model": {
"api_endpoint": {
"url": "<my-nim-deployment-base-url>/completions",
"model_id": "<my-model>"
}
},
"judge_inference_params": {
"max_tokens": 2048,
"temperature": 1.0e-05,
"top_p": 1.0e-05,
"stop": [],
"top_k": 1
}
}
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
Example Configuration for LLM-as-a-Judge (OpenAI-compatible Judge LLM)#
Use the following code to create a configuration for an LLM-as-a-Judge evaluation that uses an OpenAI-compatible Judge LLM.
To provide credentials for authenticating with an OpenAI-compatible Judge LLM, include an api_key in the judge_model.api_endpoint field.
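To avoid hardcoding the key in the request body, you can read it from an environment variable when you build the configuration in Python. This is a sketch only; the OPENAI_API_KEY variable name is an assumption, so use whichever variable holds your key.
import os

judge_api_key = os.environ["OPENAI_API_KEY"]  # assumed environment variable name
# Set this value as the api_key in judge_model.api_endpoint in the request body below.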
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "mt_bench",
"name": "my-configuration-judge-2",
"namespace": "my-organization",
"params": {
"max_tokens": 1024,
"temperature": 0.75,
"top_p": 0.9,
"stop": [],
"extra": {
"top_k": 40
}
},
"tasks": {
"mt_bench": {
"type": "mt_bench",
"params": {
"judge_model": {
"api_endpoint": {
"url": "<my-nim-deployment-base-url>/completions",
"model_id": "<my-model>",
"api_key": "<openai-api-key>"
}
},
"judge_inference_params": {
"max_tokens": 2048,
"temperature": 1.0e-05,
"top_p": 1.0e-05,
"stop": [],
"top_k": 1
}
}
}
}
}'
data = {
"type": "mt_bench",
"name": "my-configuration-judge-2",
"namespace": "my-organization",
"params": {
"max_tokens": 1024,
"temperature": 0.75,
"top_p": 0.9,
"stop": [],
"extra": {
"top_k": 40
}
},
"tasks": {
"mt_bench": {
"type": "mt_bench",
"params": {
"judge_model": {
"api_endpoint": {
"url": "<my-nim-deployment-base-url>/completions",
"model_id": "<my-model>",
"api_key": "<openai-api-key>"
}
},
"judge_inference_params": {
"max_tokens": 2048,
"temperature": 1.0e-05,
"top_p": 1.0e-05,
"stop": [],
"top_k": 1
}
}
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
Retriever Pipeline Configurations#
NeMo Evaluator supports evaluating retriever pipelines on standard academic datasets and custom datasets. For more information, refer to Retriever Pipelines.
Example Configuration for Embedding + Reranking (Standard Data)#
Use the following code to create a configuration for a retriever evaluation, with embedding + reranking, that uses standard data.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "retriever",
"name": "my-configuration-retriever-1",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "file://fiqa/"
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"}
}
}
}
}'
data = {
"type": "retriever",
"name": "my-configuration-retriever-1",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "file://fiqa/"
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"}
}
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
Example Configuration for Embedding + Reranking (Custom Data)#
Use the following code to create a configuration for a retriever evaluation, with embedding + reranking, that uses custom data.
This example specifies data that is in the BEIR format. You can also use data in the SQuAD format. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "retriever",
"name": "my-configuration-retriever-2",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"}
}
}
}
}'
data = {
"type": "retriever",
"name": "my-configuration-retriever-2",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"}
}
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
RAG Pipeline Configurations#
NeMo Evaluator supports evaluating RAG pipelines, which are built by combining NeMo Retriever with an LLM. For more information, refer to RAG Pipelines.
Example Configuration for Retrieval + Answer Generation + Answer Evaluation (Standard Data)#
Use the following code to create a configuration for a Retrieval + Answer Generation + Answer Evaluation pipeline with standard data.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "rag",
"name": "my-configuration-rag-1",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "file://nfcorpus/"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-judge-embedding-url>",
"model_id": "<my-judge-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"},
"faithfulness": {"type": "faithfulness"},
"answer_relevancy": {"type": "answer_relevancy"}
}
}
}
}'
data = {
"type": "rag",
"name": "my-configuration-rag-1",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "file://nfcorpus/"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-judge-embedding-url>",
"model_id": "<my-judge-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"},
"faithfulness": {"type": "faithfulness"},
"answer_relevancy": {"type": "answer_relevancy"}
}
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
Example Configuration for Retrieval + Answer Generation + Answer Evaluation (Custom Data)#
Use the following code to create a configuration for a Retrieval + Answer Generation + Answer Evaluation pipeline with custom data.
This example specifies data that is in the BEIR format. You can also use data in the SQuAD format or the Ragas format. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "rag",
"name": "my-configuration-rag-2",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-judge-embedding-url>",
"model_id": "<my-judge-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"},
"faithfulness": {"type": "faithfulness"},
"answer_relevancy": {"type": "answer_relevancy"}
}
}
}
}'
data = {
"type": "rag",
"name": "my-configuration-rag-2",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-judge-embedding-url>",
"model_id": "<my-judge-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"},
"faithfulness": {"type": "faithfulness"},
"answer_relevancy": {"type": "answer_relevancy"}
}
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
Example Configuration for Answer Evaluation (Pre-generated Answers)#
Use the following code to create a configuration for an answer evaluation with custom pre-generated answers.
This example specifies data that is in the Ragas format. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "rag",
"name": "my-configuration-rag-3",
"namespace": "my-organization",
"tasks": {
"my-ragas-task": {
"type": "ragas",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-query-embedding-url>",
"model_id": "<my-query-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"},
"faithfulness": {"type": "faithfulness"}
}
}
}
}'
data = {
"type": "rag",
"name": "my-configuration-rag-3",
"namespace": "my-organization",
"tasks": {
"my-ragas-task": {
"type": "ragas",
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-judge-llm-url>",
"model_id": "<my-judge-llm-model>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-query-embedding-url>",
"model_id": "<my-query-embedding-model>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"},
"faithfulness": {"type": "faithfulness"}
}
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
Example Configuration for RAG (OpenAI-compatible Judge LLM)#
Use the following code to create a configuration for a RAG pipeline evaluation that uses an OpenAI-compatible Judge LLM.
To provide credentials for authenticating with an OpenAI-compatible Judge LLM, include an api_key in the judge_llm.api_endpoint field.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"type": "rag",
"name": "my-configuration-rag-4",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "file://nfcorpus/"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-nim-deployment-base-url>/completions",
"model_id": "<my-model>",
"api_key": "<openai-api-key>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-query-embedding-url>",
"model_id": "<my-query-embedding-model>",
"api_key": "<openai-api-key>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"},
"faithfulness": {"type": "faithfulness"},
"answer_relevancy": {"type": "answer_relevancy"}
}
}
}
}'
data = {
"type": "rag",
"name": "my-configuration-rag-4",
"namespace": "my-organization",
"tasks": {
"my-beir-task": {
"type": "beir",
"dataset": {
"files_url": "file://nfcorpus/"
},
"params": {
"judge_llm": {
"api_endpoint": {
"url": "<my-nim-deployment-base-url>/completions",
"model_id": "<my-model>",
"api_key": "<openai-api-key>"
}
},
"judge_embeddings": {
"api_endpoint": {
"url": "<my-query-embedding-url>",
"model_id": "<my-query-embedding-model>",
"api_key": "<openai-api-key>"
}
},
"judge_timeout": 300,
"judge_max_retries": 5,
"judge_max_workers": 16
},
"metrics": {
"recall_5": {"type": "recall_5"},
"ndcg_cut_5": {"type": "ndcg_cut_5"},
"recall_10": {"type": "recall_10"},
"ndcg_cut_10": {"type": "ndcg_cut_10"},
"faithfulness": {"type": "faithfulness"},
"answer_relevancy": {"type": "answer_relevancy"}
}
}
}
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"
response = requests.post(endpoint, json=data).json()
Delete a Config#
To delete an evaluation configuration, send a DELETE request to the configs endpoint.
You must provide both the namespace and ID of the config as shown in the following code.
Caution
Before you delete a config, ensure that no jobs use it. If a job uses the config, you must delete the job first. To find all jobs that use a config, refer to Example: Filter Jobs by Config.
curl -X "DELETE" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs/<my-namespace>/<my-config-id>" \
-H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs/<my-namespace>/<my-config-id>"
response = requests.delete(endpoint).json()
response
When you delete a config, the response is similar to the following.
{
"message": "Resource deleted successfully.",
"id": "eval-config-MNOP1234QRST5678",
"deleted_at": null
}
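In Python, you can also confirm that the deletion succeeded by checking the HTTP status and the message field. A minimal sketch that assumes the response shape shown above:
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs/<my-namespace>/<my-config-id>"
resp = requests.delete(endpoint)
resp.raise_for_status()               # raises an exception for HTTP error responses
print(resp.json().get("message"))     # "Resource deleted successfully."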
Create Config Response#
When you create a configuration for an evaluation, the response is similar to the following.
For the full response reference, refer to Evaluator API.
{
"created_at": "2025-03-19T22:50:02.206136",
"updated_at": "2025-03-19T22:50:02.206138",
"id": "eval-config-MNOP1234QRST5678",
"name": "my-configuration-lm-harness-gsm8k-1",
"namespace": "my-organization",
"type": "gsm8k",
"params": {
"temperature": 0.00001,
"top_p": 0.00001,
"max_tokens": 256,
"stop": ["<|eot|>"],
"extra": {
"num_fewshot": 8,
"batch_size": 16,
"bootstrap_iters": 100000,
"dataset_seed": 42,
"use_greedy": true,
"top_k": 1,
"hf_token": "<my-token>",
"tokenizer_backend": "hf",
"tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
"apply_chat_template": true,
"fewshot_as_multiturn": true
}
},
"custom_fields": {}
}