Create and Manage Evaluation Configurations#

When you run an evaluation in NVIDIA NeMo Evaluator, you create a separate target and configuration for the evaluation.

Tip

Because NeMo Evaluator separates the target and the configuration, you can create a configuration once, and reuse it multiple times with different targets (for example, to compare models). To see what targets and configurations are supported together, refer to Combine Evaluation Targets and Configurations.

Evaluator API URL#

To create a configuration for an evaluation, send a POST request to the /evaluation/configs API. The URL of the evaluator API depends on where you deploy evaluator and how you configure it. For more information, refer to NeMo Evaluator Deployment Guide.

The examples in this documentation specify {EVALUATOR_HOSTNAME} in the code. Store the evaluator hostname as shown in the following code so that you can use it in later requests.

Important

Replace <your evaluator service endpoint> with your address, such as evaluator.internal.your-company.com, before you run this code.

export EVALUATOR_HOSTNAME="<your evaluator service endpoint>"
import requests

EVALUATOR_HOSTNAME = "<your evaluator service endpoint>" 
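
If you export the variable in your shell as shown above, you can also read it from the environment in Python instead of hard-coding the value. This is an optional pattern, not a requirement of the API.

import os

EVALUATOR_HOSTNAME = os.environ.get("EVALUATOR_HOSTNAME", "<your evaluator service endpoint>")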

Example Config#

The following code shows the partial structure of a request that creates an evaluation configuration. Use the examples and reference in the rest of this documentation to build a config for your specific scenario.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "<evaluation-type>",
        "name": "<my-configuration-name>",
        "namespace": "<my-namespace>",

        // More config details
    }'
data = {
    "type": "<evaluation-type>",
    "name": "<my-configuration-name>",
    "namespace": "<my-namespace>",

    // More config details
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

To see a sample response, refer to Create Config Response.
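
If you create several configurations, a small helper that wraps the request and raises on HTTP errors can be convenient. The following Python sketch is optional; create_config is an illustrative name, and the helper uses only the /v1/evaluation/configs endpoint shown above.

def create_config(data, hostname=EVALUATOR_HOSTNAME):
    """Post an evaluation configuration and return the parsed response."""
    endpoint = f"http://{hostname}/v1/evaluation/configs"
    response = requests.post(endpoint, json=data)
    response.raise_for_status()  # raise instead of silently returning an error body
    return response.json()

# Usage: the returned object includes the generated configuration ID.
# config = create_config(data)
# config_id = config["id"]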

Configuration JSON Reference#

When you create a configuration for an evaluation, you send a JSON data structure that contains the information for your configuration.

Important

Each configuration is uniquely identified by the combination of its namespace and name, for example my-organization/my-configuration.

The following reference describes selected fields of the JSON data, with each field's type and its valid values or child objects. For the full API reference, refer to Evaluator API.

- access_policies (Object): The policies that control who can use the configuration. This field is for sharing configurations across organizations.
- api_endpoint (Object): The endpoint for a model. Child objects: url, model_id, api_key.
- api_key (String): The key to access an API endpoint.
- created_by (String): The ID of the user that created the configuration. This field is for sharing configurations across organizations.
- custom_fields (Object): An optional object that you can use to store additional information.
- dataset (Object): A dataset to use for the evaluation. Child objects: files_url, format.
- description (String): A description of the configuration.
- extra (Object): Additional parameters for academic benchmarks.
- files_url (String): The URL of a file that contains pre-generated data. Use the hf:// prefix for files stored in NeMo Data Store, for example hf://datasets/<your-dataset-name>/<your-file-name>. The value can also be the name of a standard dataset, such as fiqa or nfcorpus; refer to those with the file:// prefix, for example file://fiqa/. For format information, refer to Custom Data.
- format (String): The format of a data file. Valid values: beir, squad, ragas. For format information, refer to Custom Data.
- groups (Object): A dictionary of evaluation tasks to run as a group. Child objects: type, params, dataset, metrics.
- hf_token (String): A Hugging Face account token. Some benchmark datasets require a valid Hugging Face token to access the data; for example, the gpqa task requires dataset access. Other benchmark tasks don't require a token, but setting one can help you avoid rate limiting by switching to account-based rate limits.
- id (String): The ID of the configuration. The ID is returned in the response when you create a configuration.
- judge_llm (Object): The model to use to judge the answer. Child objects: api_endpoint.
- judge_model (Object): The model to use to judge the answer. Child objects: api_endpoint.
- limit_samples (Integer): The number of samples to evaluate.
- max_tokens (Integer): The maximum number of tokens to generate during inference.
- max_retries (Integer): The number of times an evaluation job retries a request to a model after a failure.
- metrics (Object): A dictionary of objects in the form {"name": "<metric-name>"}, where <metric-name> is a metric to use for the evaluation. Depending on the metric, you might need to provide a ground truth in the file input.json. For format information, refer to Custom Data. Valid values: accuracy, answer_relevancy, bleu, em, f1, faithfulness, ndcg_cut_10, ndcg_cut_5, recall_10, recall_5, rouge.
- model_id (String): The ID of the NIM model, as specified in Models.
- name (String): An arbitrary name to identify the configuration. If you don't specify a name, the default is the ID associated with the configuration.
- namespace (String): An arbitrary organization name, a vendor name, or any other text. If you don't specify a namespace, the default is default.
- ownership (Object): Information about the creator of the configuration and who can use it. This field is for sharing configurations across organizations. Child objects: created_by, access_policies.
- parallelism (Integer): The parallelism of the job that runs the benchmark. Supported by the custom, humaneval, mbpp, transcode_cpp, and transcode_java evaluation config types.
- params (Object): A set of parameters to apply to the evaluation. Child objects: parallelism, request_timeout, max_retries, limit_samples, max_tokens, temperature, top_p, extra.
- project (String): The ID of a project to associate with the configuration.
- request_timeout (Integer): The time in milliseconds that the evaluation job waits for a response from the model before it fails.
- stop (String or List): Up to 4 sequences where the API stops generating further tokens.
- tasks (Object): A dictionary of evaluation tasks to run. Child objects: type, params, dataset, metrics.
- temperature (Number): Adjusts the randomness of token selection. Higher values increase randomness and creativity; lower values promote deterministic and conservative output.
- top_p (Number): A threshold that selects from the most probable tokens until the cumulative probability exceeds p.
- type (String): The type of evaluation that the configuration is for. For custom evaluations, set this to custom. Some examples include custom, beir, mt_bench, gsm8k, retriever, and rag. For more information, refer to Evaluation Types.
- type (task) (String): The type of a task. Some examples include custom and default. For more information, refer to Evaluation Types.
- url (String): The URL of a model endpoint.
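
As a quick orientation, the following Python fragment shows how the params fields from the reference above nest inside a configuration. The values are placeholders for illustration, not recommended settings.

config_fragment = {
    "params": {
        "parallelism": 1,        # parallelism of the job, for supported config types
        "request_timeout": 300,  # how long to wait for a model response before failing
        "max_retries": 5,        # retries after a failed request to the model
        "limit_samples": 100,    # number of samples to evaluate
        "max_tokens": 512,       # maximum number of tokens to generate
        "temperature": 0.7,      # randomness of token selection
        "top_p": 0.9,            # cumulative probability threshold for sampling
        "extra": {}              # additional benchmark-specific parameters
    }
}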

LM Harness Extra Parameters#

You can set the following task-specific parameters in the params.extra section of an LM Harness config.

- apply_chat_template (Boolean or String): Specifies whether to apply a chat template to the prompt. Specify true to apply the default chat template, false to apply no chat template, or the name of a chat template to apply that template. Valid values: true, false, or a chat template name.
- batch_size (Integer): The batch size for the model.
- bootstrap_iters (Integer): The number of iterations for bootstrap statistics when calculating standard errors. Specify 0 for no standard error calculations.
- dataset_seed (Integer): A random seed for dataset shuffling.
- fewshot_as_multiturn (Boolean): Specify true to provide the few-shot examples as a multi-turn conversation, or false to provide them in a single user turn. Valid values: true, false.
- hf_token (String): A Hugging Face account token to access tokenizers that require authenticated or authorized access.
- num_fewshot (Integer): The number of examples in the few-shot context.
- seed: A random seed for Python's random, numpy, and torch. Accepts a comma-separated list of three values that set the Python random, numpy, and torch seeds, respectively; each value is an integer, or None to leave that seed unset, for example 0,None,8. Specify a single integer, such as 18, to set the same seed for all three. For backward compatibility, the default is 0,1234,1234.
- tokenizer (String): A path to the custom tokenizer to use for the benchmark.
- tokenizer_backend (String): The backend to use for loading the tokenizer. Valid values: hf for tokenizers from Hugging Face, or tiktoken (default).
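
For example, the following Python fragment shows a params.extra block that uses a named chat template and the comma-separated seed format described above. The chat template name is a hypothetical placeholder, and the other values are illustrative.

lm_harness_extra = {
    "apply_chat_template": "my-chat-template",  # or True / False
    "fewshot_as_multiturn": True,
    "num_fewshot": 5,
    "seed": "0,None,8",  # Python random, numpy, and torch seeds; None leaves a seed unset
    "dataset_seed": 42,
    "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
    "tokenizer_backend": "hf",
    "hf_token": "<my-token>"
}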

BigCode Configurations#

BigCode Evaluation Harness is a framework for the evaluation of code generation models. For more information, refer to BigCode Evaluation Harness.

Use the following code to create a configuration for a BigCode evaluation.

For the type of evaluation, specify the BigCode task that you want to run. For the full list of BigCode tasks, refer to tasks.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "humaneval",
        "name": "my-configuration-bigcode-humaneval-1",
        "namespace": "my-organization",
        "params": {
            "parallelism": 1,
            "limit_samples": 1,
            "max_tokens": 512,
            "temperature": 1.0,
            "top_p": 0.0,
            "extra": {
                "batch_size": 1,
                "top_k": 1
            }
        }
    }'
data = {
    "type": "humaneval",
    "name": "my-configuration-bigcode-humaneval-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.0,
        "extra": {
            "batch_size": 1,
            "top_k": 1
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

LM Evaluation Harness Configurations#

LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and hellaswag. For more information, refer to LM Evaluation Harness.

Use the following code to create a configuration for an LM Evaluation Harness evaluation.

For the type of evaluation, specify the LM Evaluation Harness task that you want to run. For the full list of tasks, refer to tasks.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "gpqa",
        "name": "my-configuration-lm-harness-gpqa-1",
        "namespace": "my-organization",
        "tasks": {
            "gpqa_diamond_generative_n_shot": {
                "type": "gpqa_diamond_generative_n_shot"
            }
        },
        "params": {
            "max_tokens": 1024,
            "temperature": 1.0,
            "top_p": 0.0,
            "stop": [
                "<|endoftext|>",
                "<extra_id_1>"
            ],
            "extra": {
                "use_greedy": true,
                "top_k": 1
            }
        }
    }'
data = {
    "type": "gpqa",
    "name": "my-configuration-lm-harness-gpqa-1",
    "namespace": "my-organization",
    "tasks": {
        "gpqa_diamond_generative_n_shot": {
            "type": "gpqa_diamond_generative_n_shot"
        }
    },
    "params": {
        "max_tokens": 1024,
        "temperature": 1.0,
        "top_p": 0.0,
        "stop": [
            "<|endoftext|>",
            "<extra_id_1>"
        ],
        "extra": {
            "use_greedy": True,
            "top_k": 1
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

You can set task-specific parameters in the params.extra section of the config as shown in the following example. For more information, refer to LM Harness Extra Parameters.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "gsm8k",
        "name": "my-configuration-lm-harness-gsm8k-1",
        "namespace": "my-organization",
        "tasks": {
            "gsm8k_cot_llama": {
                "type": "gsm8k_cot_llama"
            }
        },
        "params": {
            "temperature": 0.00001,      
            "top_p": 0.00001,
            "max_tokens": 256,
            "stop": ["<|eot|>"],
            "extra": {
                "num_fewshot": 8,
                "batch_size": 16,
                "bootstrap_iters": 100000,
                "dataset_seed": 42,
                "use_greedy": true,
                "top_k": 1,
                "hf_token": "<my-token>",
                "tokenizer_backend": "hf",
                "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
                "apply_chat_template": true,
                "fewshot_as_multiturn": true
            }
        }
    }'
data = {
    "type": "gsm8k",
    "name": "my-configuration-lm-harness-gsm8k-1",
    "namespace": "my-organization",
    "tasks": {
        "gsm8k_cot_llama": {
            "type": "gsm8k_cot_llama"
        }
    },
    "params": {
        "temperature": 0.00001,      
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": True,
            "top_k": 1,
            "hf_token": "<my-token>",
            "tokenizer_backend": "hf",
            "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
            "apply_chat_template": True,
            "fewshot_as_multiturn": True
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Similarity Metrics Configurations#

A Similarity Metrics evaluation lets you evaluate a model on custom datasets by comparing the LLM-generated response with a ground-truth response. For more information, refer to Similarity Metrics.

Use the following code to create a configuration for a Similarity Metrics evaluation. For more information about custom data, refer to Use Custom Data with NVIDIA NeMo Evaluator.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "similarity_metrics",
        "name": "my-configuration-similarity-1",
        "namespace": "my-organization",
        "params": {
            "max_tokens": 200,
            "temperature": 0.7,
            "extra": {
                "top_k": 20
            }
        }, 
        "tasks": {
            "my-similarity-metrics-task": {
                "type": "default",
                "dataset": {
                    "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
                },
                "metrics": {
                    "accuracy": {"type": "accuracy"},
                    "bleu": {"type": "bleu"},
                    "rouge": {"type": "rouge"},
                    "em": {"type": "em"},
                    "f1": {"type": "f1"}
                }
            }
        }
    }'
data = {
    "type": "similarity_metrics",
    "name": "my-configuration-similarity-1",
    "namespace": "my-organization",
    "params": {
        "max_tokens": 200,
        "temperature": 0.7,
        "extra": {
            "top_k": 20
        }
    }, 
    "tasks": {
        "my-similarity-metrics-task": {
            "type": "default",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "metrics": {
                "accuracy": {"type": "accuracy"},
                "bleu": {"type": "bleu"},
                "rouge": {"type": "rouge"},
                "em": {"type": "em"},
                "f1": {"type": "f1"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

LLM-as-a-Judge Configurations#

With LLM-as-a-Judge, you evaluate one LLM by using another LLM as the judge. For more information, refer to LLM-as-a-Judge.

Example Configuration for LLM-as-a-Judge (Standard MT-Bench Data)#

Use the following code to create a configuration for an LLM-as-a-Judge evaluation that uses the standard MT-Bench data.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "mt_bench",
        "name": "my-configuration-judge-1",
        "namespace": "my-organization",
        "params": {
            "max_tokens": 1024,
            "temperature": 0.75,
            "top_p": 0.9,
            "stop": [],
            "extra": {
                "top_k": 40
            }
        },           
        "tasks": {
            "my-mt-bench": {        
                "type": "mt_bench",
                "params": {
                    "judge_model": {
                        "api_endpoint": {
                            "url": "<my-nim-deployment-base-url>/completions",
                            "model_id": "<my-model>"
                        }
                    },
                    "judge_inference_params": {
                        "max_tokens": 2048,
                        "temperature": 1.0e-05,
                        "top_p": 1.0e-05,
                        "stop": [],
                        "top_k": 1
                    }
                }
            }
        }
    }'
data = {
    "type": "mt_bench",
    "name": "my-configuration-judge-1",
    "namespace": "my-organization",
    "params": {
        "max_tokens": 1024,
        "temperature": 0.75,
        "top_p": 0.9,
        "stop": [],
        "extra": {
            "top_k": 40
        }
    },           
    "tasks": {
        "my-mt-bench": {        
            "type": "mt_bench",
            "params": {
                "judge_model": {
                    "api_endpoint": {
                        "url": "<my-nim-deployment-base-url>/completions",
                        "model_id": "<my-model>"
                    }
                },
                "judge_inference_params": {
                    "max_tokens": 2048,
                    "temperature": 1.0e-05,
                    "top_p": 1.0e-05,
                    "stop": [],
                    "top_k": 1
                }
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Example Configuration for LLM-as-a-Judge (OpenAI-compatible Judge LLM)#

Use the following code to create a configuration for an LLM-as-a-Judge evaluation that uses an OpenAI-compatible Judge LLM. To provide credentials for authenticating with an OpenAI-compatible Judge LLM, include an api_key in the judge_model.api_endpoint field.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "mt_bench",
        "name": "my-configuration-judge-2",
        "namespace": "my-organization",
        "params": {
            "max_tokens": 1024,
            "temperature": 0.75,
            "top_p": 0.9,
            "stop": [],
            "extra": {
                "top_k": 40
            }
        },        
        "tasks": {
            "mt_bench": {
                "type": "mt_bench",
                "params": {
                    "judge_model": {
                        "api_endpoint": {
                            "url": "<my-nim-deployment-base-url>/completions",
                            "model_id": "<my-model>",
                            "api_key": "<openai-api-key>"
                        }
                    },
                    "judge_inference_params": {
                        "max_tokens": 2048,
                        "temperature": 1.0e-05,
                        "top_p": 1.0e-05,
                        "stop": [],
                        "top_k": 1
                    }
                }
            }
        }
    }'
data = {
    "type": "mt_bench",
    "name": "my-configuration-judge-2",
    "namespace": "my-organization",
    "params": {
        "max_tokens": 1024,
        "temperature": 0.75,
        "top_p": 0.9,
        "stop": [],
        "extra": {
            "top_k": 40
        }
    },        
    "tasks": {
        "mt_bench": {
            "type": "mt_bench",
            "params": {
                "judge_model": {
                    "api_endpoint": {
                        "url": "<my-nim-deployment-base-url>/completions",
                        "model_id": "<my-model>",
                        "api_key": "<openai-api-key>"
                    }
                },
                "judge_inference_params": {
                    "max_tokens": 2048,
                    "temperature": 1.0e-05,
                    "top_p": 1.0e-05,
                    "stop": [],
                    "top_k": 1
                }
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()
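
Rather than storing the key in your source code, you can read it from an environment variable before you send the request. The variable name JUDGE_API_KEY is an example, not a requirement of the API.

import os

# data and endpoint are defined in the preceding example
data["tasks"]["mt_bench"]["params"]["judge_model"]["api_endpoint"]["api_key"] = os.environ["JUDGE_API_KEY"]

response = requests.post(endpoint, json=data).json()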

Retriever Pipeline Configurations#

NeMo Evaluator supports evaluating retriever pipelines on both standard academic datasets and custom datasets. For more information, refer to Retriever Pipelines.

Example Configuration for Embedding + Reranking (Standard Data)#

Use the following code to create a configuration for a retriever evaluation, with embedding + reranking, that uses standard data.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "retriever",
        "name": "my-configuration-retriever-1",
        "namespace": "my-organization",
        "tasks": {
            "my-beir-task": {
                "type": "beir",
                "dataset": {
                    "files_url": "file://fiqa/"
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"}
                }
            }
        }
    }'
data = {
    "type": "retriever",
    "name": "my-configuration-retriever-1",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://fiqa/"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Example Configuration for Embedding + Reranking (Custom Data)#

Use the following code to create a configuration for a retriever evaluation, with embedding + reranking, that uses custom data.

This example specifies data that is in the BEIR format. You can also use data in the SQuAD format. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "retriever",
        "name": "my-configuration-retriever-2",
        "namespace": "my-organization",
        "tasks": {
            "my-beir-task": {
                "type": "beir",
                "dataset": {
                    "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"}
                }
            }
        }
    }'
data = {
    "type": "retriever",
    "name": "my-configuration-retriever-2",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

RAG Pipeline Configurations#

NeMo Evaluator supports evaluating RAG pipelines, which are built by chaining NeMo Retriever and an LLM. For more information, refer to RAG Pipelines.

Example Configuration for Retrieval + Answer Generation + Answer Evaluation (Standard Data)#

Use the following code to create a configuration for a Retrieval + Answer Generation + Answer Evaluation pipeline with standard data.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "rag",
        "name": "my-configuration-rag-1",
        "namespace": "my-organization",
        "tasks": {
            "my-beir-task": {
                "type": "beir",
                "dataset": {
                    "files_url": "file://nfcorpus/"
                },
                "params": {
                    "judge_llm": {
                        "api_endpoint": {
                            "url": "<my-judge-llm-url>",
                            "model_id": "<my-judge-llm-model>"
                        }
                    },
                    "judge_embeddings": {
                        "api_endpoint": {
                            "url": "<my-judge-embedding-url>",
                            "model_id": "<my-judge-embedding-model>"
                        }
                    },
                    "judge_timeout": 300,
                    "judge_max_retries": 5,
                    "judge_max_workers": 16
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"},
                    "faithfulness": {"type": "faithfulness"},
                    "answer_relevancy": {"type": "answer_relevancy"}
                }
            }
        }
    }'
data = {
    "type": "rag",
    "name": "my-configuration-rag-1",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://nfcorpus/"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-judge-embedding-url>",
                        "model_id": "<my-judge-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Example Configuration for Retrieval + Answer Generation + Answer Evaluation (Custom Data)#

Use the following code to create a configuration for a Retrieval + Answer Generation + Answer Evaluation pipeline with custom data.

This example specifies data that is in the BEIR format. You can also use data in the SQuAD format or the Ragas format. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "rag",
        "name": "my-configuration-rag-2",
        "namespace": "my-organization",
        "tasks": {
            "my-beir-task": {
                "type": "beir",
                "dataset": {
                    "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
                },
                "params": {
                    "judge_llm": {
                        "api_endpoint": {
                            "url": "<my-judge-llm-url>",
                            "model_id": "<my-judge-llm-model>"
                        }
                    },
                    "judge_embeddings": {
                        "api_endpoint": {
                            "url": "<my-judge-embedding-url>",
                            "model_id": "<my-judge-embedding-model>"
                        }
                    },
                    "judge_timeout": 300,
                    "judge_max_retries": 5,
                    "judge_max_workers": 16
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"},
                    "faithfulness": {"type": "faithfulness"},
                    "answer_relevancy": {"type": "answer_relevancy"}
                }
            }
        }
    }'
data = {
    "type": "rag",
    "name": "my-configuration-rag-2",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-judge-embedding-url>",
                        "model_id": "<my-judge-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Example Configuration for Answer Evaluation (Pre-generated Answers)#

Use the following code to create a configuration for an answer evaluation with custom pre-generated answers.

This example specifies data that is in the Ragas format. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "rag",
        "name": "my-configuration-rag-3",
        "namespace": "my-organization",
        "tasks": {
            "my-ragas-task": {
                "type": "ragas",
                "dataset": {
                    "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
                },
                "params": {
                    "judge_llm": {
                        "api_endpoint": {
                            "url": "<my-judge-llm-url>",
                            "model_id": "<my-judge-llm-model>"
                        }
                    },
                    "judge_embeddings": {
                        "api_endpoint": {
                            "url": "<my-query-embedding-url>",
                            "model_id": "<my-query-embedding-model>"
                        }
                    },
                    "judge_timeout": 300,
                    "judge_max_retries": 5,
                    "judge_max_workers": 16
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"},
                    "faithfulness": {"type": "faithfulness"}
                }
            }
        }
    }'
data = {
    "type": "rag",
    "name": "my-configuration-rag-3",
    "namespace": "my-organization",
    "tasks": {
        "my-ragas-task": {
            "type": "ragas",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-query-embedding-url>",
                        "model_id": "<my-query-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Example Configuration for RAG (OpenAI-compatible Judge LLM)#

Use the following code to create a configuration for a RAG pipeline evaluation that uses an OpenAI-compatible Judge LLM. To provide credentials for authenticating with an OpenAI-compatible Judge LLM, include an api_key in the judge_llm field.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "rag",
        "name": "my-configuration-rag-4",
        "namespace": "my-organization",
        "tasks": {
            "my-beir-task": {
                "type": "beir",
                "dataset": {
                    "files_url": "file://nfcorpus/"
                },
                "params": {
                    "judge_llm": {
                        "api_endpoint": {
                            "url": "<my-nim-deployment-base-url>/completions",
                            "model_id": "<my-model>",
                            "api_key": "<openai-api-key>"
                        }
                    },
                    "judge_embeddings": {
                        "api_endpoint": {
                            "url": "<my-query-embedding-url>",
                            "model_id": "<my-query-embedding-model>",
                            "api_key": "<openai-api-key>"
                        }
                    },
                    "judge_timeout": 300,
                    "judge_max_retries": 5,
                    "judge_max_workers": 16
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"},
                    "faithfulness": {"type": "faithfulness"},
                    "answer_relevancy": {"type": "answer_relevancy"}
                }
            }
        }
    }'
data = {
    "type": "rag",
    "name": "my-configuration-rag-4",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://nfcorpus/"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-nim-deployment-base-url>/completions",
                        "model_id": "<my-model>",
                        "api_key": "<openai-api-key>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-query-embedding-url>",
                        "model_id": "<my-query-embedding-model>",
                        "api_key": "<openai-api-key>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Delete a Config#

To delete an evaluation configuration, send a DELETE request to the configs endpoint. You must provide both the namespace and ID of the config as shown in the following code.

Caution

Before you delete a config, ensure that no jobs use it. If a job uses the config, you must delete the job first. To find all jobs that use a config, refer to Example: Filter Jobs by Config.

curl -X "DELETE" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs/<my-namespace>/<my-config-id>" \
  -H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs/<my-namespace>/<my-config-id>"
response = requests.delete(endpoint).json()
response

When you delete a config, the response is similar to the following.

{
    "message": "Resource deleted successfully.",
    "id": "eval-config-MNOP1234QRST5678",
    "deleted_at": null
}
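
If you clean up configurations from scripts, a small wrapper that surfaces HTTP errors can help. The following delete_config helper is an illustrative sketch that uses only the DELETE endpoint shown above.

def delete_config(namespace, config_id, hostname=EVALUATOR_HOSTNAME):
    """Delete an evaluation configuration and return the parsed response."""
    endpoint = f"http://{hostname}/v1/evaluation/configs/{namespace}/{config_id}"
    response = requests.delete(endpoint)
    response.raise_for_status()  # raise if the request fails, for example when the config does not exist
    return response.json()

# Usage:
# delete_config("my-namespace", "eval-config-MNOP1234QRST5678")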

Create Config Response#

When you create a configuration for an evaluation, the response is similar to the following.

For the full response reference, refer to Evaluator API.

{
    "created_at": "2025-03-19T22:50:02.206136",
    "updated_at": "2025-03-19T22:50:02.206138",
    "id": "eval-config-MNOP1234QRST5678",
    "name": "my-configuration-lm-harness-gsm8k-1",
    "namespace": "my-organization",
    "type": "gsm8k",
    "params": {
        "temperature": 0.00001,      
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": true,
            "top_k": 1,
            "hf_token": "<my-token>",
            "tokenizer_backend": "hf",
            "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
            "apply_chat_template": true,
            "fewshot_as_multiturn": true
        }
    },
    "custom_fields": {}
}
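
After you create a configuration, the typical next step is to reference it from an evaluation job together with a target, which is how one configuration is reused across models as described in the tip at the top of this page. The following Python sketch assumes a jobs endpoint at /v1/evaluation/jobs that accepts config and target references as <namespace>/<name> strings, and the target names are hypothetical; refer to the evaluation jobs documentation for the exact request body.

# Reuse the configuration created above with two hypothetical targets.
config_ref = "my-organization/my-configuration-lm-harness-gsm8k-1"
jobs_endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"

for target_ref in ["my-organization/model-a-target", "my-organization/model-b-target"]:
    job = {
        "namespace": "my-organization",
        "config": config_ref,   # namespace/name of the configuration created above
        "target": target_ref    # hypothetical target name
    }
    job_response = requests.post(jobs_endpoint, json=job).json()
    print(target_ref, job_response.get("id"))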