Create and Manage Evaluation Configurations#

When you run an evaluation in NVIDIA NeMo Evaluator, you create a separate target and configuration for the evaluation.

Tip

Because NeMo Evaluator separates the target and the configuration, you can create a configuration once, and reuse it multiple times with different targets (for example, to compare models). To see what targets and configurations are supported together, refer to Combine Evaluation Targets and Configurations.

Evaluator API URL#

To create a configuration for an evaluation, send a POST request to the /evaluation/configs API. The URL of the evaluator API depends on where you deploy evaluator and how you configure it. For more information, refer to NeMo Evaluator Deployment Guide.

The examples in this documentation specify {EVALUATOR_HOSTNAME} in the code. Store the evaluator hostname as shown in the following code so that you can use it in later requests.

Important

Replace <your evaluator service endpoint> with your address, such as evaluator.internal.your-company.com, before you run this code.

export EVALUATOR_HOSTNAME="<your evaluator service endpoint>"
import requests

EVALUATOR_HOSTNAME = "<your evaluator service endpoint>" 
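
If you export the variable in your shell as shown above, you can also read it from the environment in Python instead of hard-coding the value. This is an optional pattern, not a requirement of the API.

import os

EVALUATOR_HOSTNAME = os.environ.get("EVALUATOR_HOSTNAME", "<your evaluator service endpoint>")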

Example Config#

The following code shows the partial structure of a request that creates an evaluation configuration. Use the examples and reference in the rest of this documentation to build a config for your specific scenario.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "<evaluation-type>",
        "name": "<my-configuration-name>",
        "namespace": "<my-namespace>",

        // More config details
    }'
data = {
    "type": "<evaluation-type>",
    "name": "<my-configuration-name>",
    "namespace": "<my-namespace>",

    // More config details
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

To see a sample response, refer to Create Config Response.
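
If you create several configurations, a small helper that wraps the request and raises on HTTP errors can be convenient. The following Python sketch is optional; create_config is an illustrative name, and the helper uses only the /v1/evaluation/configs endpoint shown above.

def create_config(data, hostname=EVALUATOR_HOSTNAME):
    """Post an evaluation configuration and return the parsed response."""
    endpoint = f"http://{hostname}/v1/evaluation/configs"
    response = requests.post(endpoint, json=data)
    response.raise_for_status()  # raise instead of silently returning an error body
    return response.json()

# Usage: the returned object includes the generated configuration ID.
# config = create_config(data)
# config_id = config["id"]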

Configuration JSON Reference#

When you create a configuration for an evaluation, you send a JSON data structure that contains the information for your configuration.

Important

Each configuration is uniquely identified by the combination of its namespace and name, for example my-organization/my-configuration.

The following reference describes selected fields of the JSON data, with each field's type and its valid values or child objects. For the full API reference, refer to Evaluator API.

- access_policies (Object): The policies that control who can use the configuration. This field is for sharing configurations across organizations.
- api_endpoint (Object): The endpoint for a model. Child objects: url, model_id, api_key.
- api_key (String): The key to access an API endpoint.
- created_by (String): The ID of the user that created the configuration. This field is for sharing configurations across organizations.
- custom_fields (Object): An optional object that you can use to store additional information.
- dataset (Object): A dataset to use for the evaluation. Child objects: files_url, format.
- description (String): A description of the configuration.
- extra (Object): Additional parameters for academic benchmarks.
- files_url (String): The URL of a file that contains pre-generated data. Use the hf:// prefix for files stored in NeMo Data Store, for example hf://datasets/<your-dataset-name>/<your-file-name>. The value can also be the name of a standard dataset, such as fiqa or nfcorpus; refer to those with the file:// prefix, for example file://fiqa/. For format information, refer to Custom Data.
- format (String): The format of a data file. Valid values: beir, squad, ragas. For format information, refer to Custom Data.
- groups (Object): A dictionary of evaluation tasks to run as a group. Child objects: type, params, dataset, metrics.
- hf_token (String): A Hugging Face account token. Some benchmark datasets require a valid Hugging Face token to access the data; for example, the gpqa task requires dataset access. Other benchmark tasks don't require a token, but setting one can help you avoid rate limiting by switching to account-based rate limits.
- id (String): The ID of the configuration. The ID is returned in the response when you create a configuration.
- judge_llm (Object): The model to use to judge the answer. Child objects: api_endpoint.
- judge_model (Object): The model to use to judge the answer. Child objects: api_endpoint.
- limit_samples (Integer): The number of samples to evaluate.
- max_tokens (Integer): The maximum number of tokens to generate during inference.
- max_retries (Integer): The number of times an evaluation job retries a request to a model after a failure.
- metrics (Object): A dictionary of objects in the form {"name": "<metric-name>"}, where <metric-name> is a metric to use for the evaluation. Depending on the metric, you might need to provide a ground truth in the file input.json. For format information, refer to Custom Data. Valid values: accuracy, answer_relevancy, bleu, em, f1, faithfulness, ndcg_cut_10, ndcg_cut_5, recall_10, recall_5, rouge.
- model_id (String): The ID of the NIM model, as specified in Models.
- name (String): An arbitrary name to identify the configuration. If you don't specify a name, the default is the ID associated with the configuration.
- namespace (String): An arbitrary organization name, a vendor name, or any other text. If you don't specify a namespace, the default is default.
- ownership (Object): Information about the creator of the configuration and who can use it. This field is for sharing configurations across organizations. Child objects: created_by, access_policies.
- parallelism (Integer): The parallelism of the job that runs the benchmark. Supported by the custom, humaneval, mbpp, transcode_cpp, and transcode_java evaluation config types.
- params (Object): A set of parameters to apply to the evaluation. Child objects: parallelism, request_timeout, max_retries, limit_samples, max_tokens, temperature, top_p, extra.
- project (String): The ID of a project to associate with the configuration.
- request_timeout (Integer): The time in milliseconds that the evaluation job waits for a response from the model before it fails.
- stop (String or List): Up to 4 sequences where the API stops generating further tokens.
- tasks (Object): A dictionary of evaluation tasks to run. Child objects: type, params, dataset, metrics.
- temperature (Number): Adjusts the randomness of token selection. Higher values increase randomness and creativity; lower values promote deterministic and conservative output.
- top_p (Number): A threshold that selects from the most probable tokens until the cumulative probability exceeds p.
- type (String): The type of evaluation that the configuration is for. For custom evaluations, set this to custom. Some examples include custom, beir, mt_bench, gsm8k, retriever, and rag. For more information, refer to Evaluation Types.
- type (task) (String): The type of a task. Some examples include custom and default. For more information, refer to Evaluation Types.
- url (String): The URL of a model endpoint.
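
As a quick orientation, the following Python fragment shows how the params fields from the reference above nest inside a configuration. The values are placeholders for illustration, not recommended settings.

config_fragment = {
    "params": {
        "parallelism": 1,        # parallelism of the job, for supported config types
        "request_timeout": 300,  # how long to wait for a model response before failing
        "max_retries": 5,        # retries after a failed request to the model
        "limit_samples": 100,    # number of samples to evaluate
        "max_tokens": 512,       # maximum number of tokens to generate
        "temperature": 0.7,      # randomness of token selection
        "top_p": 0.9,            # cumulative probability threshold for sampling
        "extra": {}              # additional benchmark-specific parameters
    }
}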

LM Harness Extra Parameters#

You can set the following task-specific parameters in the params.extra section of an LM Harness config.

- apply_chat_template (Boolean or String): Specifies whether to apply a chat template to the prompt. Specify true to apply the default chat template, false to apply no chat template, or the name of a chat template to apply that template. Valid values: true, false, or a chat template name.
- batch_size (Integer): The batch size for the model.
- bootstrap_iters (Integer): The number of iterations for bootstrap statistics when calculating standard errors. Specify 0 for no standard error calculations.
- dataset_seed (Integer): A random seed for dataset shuffling.
- fewshot_as_multiturn (Boolean): Specify true to provide the few-shot examples as a multi-turn conversation, or false to provide them in a single user turn. Valid values: true, false.
- hf_token (String): A Hugging Face account token to access tokenizers that require authenticated or authorized access.
- num_fewshot (Integer): The number of examples in the few-shot context.
- seed: A random seed for Python's random, numpy, and torch. Accepts a comma-separated list of three values that set the Python random, numpy, and torch seeds, respectively; each value is an integer, or None to leave that seed unset, for example 0,None,8. Specify a single integer, such as 18, to set the same seed for all three. For backward compatibility, the default is 0,1234,1234.
- tokenizer (String): A path to the custom tokenizer to use for the benchmark.
- tokenizer_backend (String): The backend to use for loading the tokenizer. Valid values: hf for tokenizers from Hugging Face, or tiktoken (default).
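
For example, the following Python fragment shows a params.extra block that uses a named chat template and the comma-separated seed format described above. The chat template name is a hypothetical placeholder, and the other values are illustrative.

lm_harness_extra = {
    "apply_chat_template": "my-chat-template",  # or True / False
    "fewshot_as_multiturn": True,
    "num_fewshot": 5,
    "seed": "0,None,8",  # Python random, numpy, and torch seeds; None leaves a seed unset
    "dataset_seed": 42,
    "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
    "tokenizer_backend": "hf",
    "hf_token": "<my-token>"
}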

BigCode Configurations#

BigCode Evaluation Harness is a framework for the evaluation of code generation models. For more information, refer to BigCode Evaluation Harness.

Use the following code to create a configuration for a BigCode evaluation.

For the type of evaluation, specify the BigCode task that you want to run. For the full list of BigCode tasks, refer to tasks.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "humaneval",
        "name": "my-configuration-bigcode-humaneval-1",
        "namespace": "my-organization",
        "params": {
            "parallelism": 1,
            "limit_samples": 1,
            "max_tokens": 512,
            "temperature": 1.0,
            "top_p": 0.0,
            "extra": {
                "batch_size": 1,
                "top_k": 1
            }
        }
    }'
data = {
    "type": "humaneval",
    "name": "my-configuration-bigcode-humaneval-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.0,
        "extra": {
            "batch_size": 1,
            "top_k": 1
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

LM Evaluation Harness Configurations#

LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and hellaswag. For more information, refer to LM Evaluation Harness.

Use the following code to create a configuration for an LM Evaluation Harness evaluation.

For the type of evaluation, specify the LM Evaluation Harness task that you want to run. For the full list of tasks, refer to tasks.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "gpqa",
        "name": "my-configuration-lm-harness-gpqa-1",
        "namespace": "my-organization",
        "tasks": {
            "gpqa_diamond_generative_n_shot": {
                "type": "gpqa_diamond_generative_n_shot"
            }
        },
        "params": {
            "max_tokens": 1024,
            "temperature": 1.0,
            "top_p": 0.0,
            "stop": [
                "<|endoftext|>",
                "<extra_id_1>"
            ],
            "extra": {
                "use_greedy": true,
                "top_k": 1
            }
        }
    }'
data = {
    "type": "gpqa",
    "name": "my-configuration-lm-harness-gpqa-1",
    "namespace": "my-organization",
    "tasks": {
        "gpqa_diamond_generative_n_shot": {
            "type": "gpqa_diamond_generative_n_shot"
        }
    },
    "params": {
        "max_tokens": 1024,
        "temperature": 1.0,
        "top_p": 0.0,
        "stop": [
            "<|endoftext|>",
            "<extra_id_1>"
        ],
        "extra": {
            "use_greedy": True,
            "top_k": 1
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

You can set task-specific parameters in the params.extra section of the config as shown in the following example. For more information, refer to LM Harness Extra Parameters.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "gsm8k",
        "name": "my-configuration-lm-harness-gsm8k-1",
        "namespace": "my-organization",
        "tasks": {
            "gsm8k_cot_llama": {
                "type": "gsm8k_cot_llama"
            }
        },
        "params": {
            "temperature": 0.00001,      
            "top_p": 0.00001,
            "max_tokens": 256,
            "stop": ["<|eot|>"],
            "extra": {
                "num_fewshot": 8,
                "batch_size": 16,
                "bootstrap_iters": 100000,
                "dataset_seed": 42,
                "use_greedy": true,
                "top_k": 1,
                "hf_token": "<my-token>",
                "tokenizer_backend": "hf",
                "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
                "apply_chat_template": true,
                "fewshot_as_multiturn": true
            }
        }
    }'
data = {
    "type": "gsm8k",
    "name": "my-configuration-lm-harness-gsm8k-1",
    "namespace": "my-organization",
    "tasks": {
        "gsm8k_cot_llama": {
            "type": "gsm8k_cot_llama"
        }
    },
    "params": {
        "temperature": 0.00001,      
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": True,
            "top_k": 1,
            "hf_token": "<my-token>",
            "tokenizer_backend": "hf",
            "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
            "apply_chat_template": True,
            "fewshot_as_multiturn": True
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Similarity Metrics Configurations#

A Similarity Metrics evaluation lets you evaluate a model on custom datasets by comparing the LLM-generated response with a ground-truth response. For more information, refer to Similarity Metrics.

Use the following code to create a configuration for a Similarity Metrics evaluation. For more information about custom data, refer to Use Custom Data with NVIDIA NeMo Evaluator.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "similarity_metrics",
        "name": "my-configuration-similarity-1",
        "namespace": "my-organization",
        "params": {
            "max_tokens": 200,
            "temperature": 0.7,
            "extra": {
                "top_k": 20
            }
        }, 
        "tasks": {
            "my-similarity-metrics-task": {
                "type": "default",
                "dataset": {
                    "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
                },
                "metrics": {
                    "accuracy": {"type": "accuracy"},
                    "bleu": {"type": "bleu"},
                    "rouge": {"type": "rouge"},
                    "em": {"type": "em"},
                    "f1": {"type": "f1"}
                }
            }
        }
    }'
data = {
    "type": "similarity_metrics",
    "name": "my-configuration-similarity-1",
    "namespace": "my-organization",
    "params": {
        "max_tokens": 200,
        "temperature": 0.7,
        "extra": {
            "top_k": 20
        }
    }, 
    "tasks": {
        "my-similarity-metrics-task": {
            "type": "default",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "metrics": {
                "accuracy": {"type": "accuracy"},
                "bleu": {"type": "bleu"},
                "rouge": {"type": "rouge"},
                "em": {"type": "em"},
                "f1": {"type": "f1"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

LLM-as-a-Judge Configurations#

With LLM-as-a-Judge, you evaluate one LLM by using another LLM as the judge. For more information, refer to LLM-as-a-Judge.

Example Configuration for LLM-as-a-Judge (Standard MT-Bench Data)#

Use the following code to create a configuration for an LLM-as-a-Judge evaluation that uses the standard MT-Bench data.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "mt_bench",
        "name": "my-configuration-judge-1",
        "namespace": "my-organization",
        "params": {
            "max_tokens": 1024,
            "temperature": 0.75,
            "top_p": 0.9,
            "stop": [],
            "extra": {
                "top_k": 40
            }
        },           
        "tasks": {
            "my-mt-bench": {        
                "type": "mt_bench",
                "params": {
                    "judge_model": {
                        "api_endpoint": {
                            "url": "<my-nim-deployment-base-url>/completions",
                            "model_id": "<my-model>"
                        }
                    },
                    "judge_inference_params": {
                        "max_tokens": 2048,
                        "temperature": 1.0e-05,
                        "top_p": 1.0e-05,
                        "stop": [],
                        "top_k": 1
                    }
                }
            }
        }
    }'
data = {
    "type": "mt_bench",
    "name": "my-configuration-judge-1",
    "namespace": "my-organization",
    "params": {
        "max_tokens": 1024,
        "temperature": 0.75,
        "top_p": 0.9,
        "stop": [],
        "extra": {
            "top_k": 40
        }
    },           
    "tasks": {
        "my-mt-bench": {        
            "type": "mt_bench",
            "params": {
                "judge_model": {
                    "api_endpoint": {
                        "url": "<my-nim-deployment-base-url>/completions",
                        "model_id": "<my-model>"
                    }
                },
                "judge_inference_params": {
                    "max_tokens": 2048,
                    "temperature": 1.0e-05,
                    "top_p": 1.0e-05,
                    "stop": [],
                    "top_k": 1
                }
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Example Configuration for LLM-as-a-Judge (OpenAI-compatible Judge LLM)#

Use the following code to create a configuration for an LLM-as-a-Judge evaluation that uses an OpenAI-compatible Judge LLM. To provide credentials for authenticating with an OpenAI-compatible Judge LLM, include an api_key in the judge_model.api_endpoint field.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "mt_bench",
        "name": "my-configuration-judge-2",
        "namespace": "my-organization",
        "params": {
            "max_tokens": 1024,
            "temperature": 0.75,
            "top_p": 0.9,
            "stop": [],
            "extra": {
                "top_k": 40
            }
        },        
        "tasks": {
            "mt_bench": {
                "type": "mt_bench",
                "params": {
                    "judge_model": {
                        "api_endpoint": {
                            "url": "<my-nim-deployment-base-url>/completions",
                            "model_id": "<my-model>",
                            "api_key": "<openai-api-key>"
                        }
                    },
                    "judge_inference_params": {
                        "max_tokens": 2048,
                        "temperature": 1.0e-05,
                        "top_p": 1.0e-05,
                        "stop": [],
                        "top_k": 1
                    }
                }
            }
        }
    }'
data = {
    "type": "mt_bench",
    "name": "my-configuration-judge-2",
    "namespace": "my-organization",
    "params": {
        "max_tokens": 1024,
        "temperature": 0.75,
        "top_p": 0.9,
        "stop": [],
        "extra": {
            "top_k": 40
        }
    },        
    "tasks": {
        "mt_bench": {
            "type": "mt_bench",
            "params": {
                "judge_model": {
                    "api_endpoint": {
                        "url": "<my-nim-deployment-base-url>/completions",
                        "model_id": "<my-model>",
                        "api_key": "<openai-api-key>"
                    }
                },
                "judge_inference_params": {
                    "max_tokens": 2048,
                    "temperature": 1.0e-05,
                    "top_p": 1.0e-05,
                    "stop": [],
                    "top_k": 1
                }
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()
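
Rather than storing the key in your source code, you can read it from an environment variable before you send the request. The variable name JUDGE_API_KEY is an example, not a requirement of the API.

import os

# data and endpoint are defined in the preceding example
data["tasks"]["mt_bench"]["params"]["judge_model"]["api_endpoint"]["api_key"] = os.environ["JUDGE_API_KEY"]

response = requests.post(endpoint, json=data).json()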

Retriever Pipeline Configurations#

NeMo Evaluator supports evaluating retriever pipelines on both standard academic datasets and custom datasets. For more information, refer to Retriever Pipelines.

Example Configuration for Embedding + Reranking (Standard Data)#

Use the following code to create a configuration for a retriever evaluation, with embedding + reranking, that uses standard data.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "retriever",
        "name": "my-configuration-retriever-1",
        "namespace": "my-organization",
        "tasks": {
            "my-beir-task": {
                "type": "beir",
                "dataset": {
                    "files_url": "file://fiqa/"
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"}
                }
            }
        }
    }'
data = {
    "type": "retriever",
    "name": "my-configuration-retriever-1",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://fiqa/"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Example Configuration for Embedding + Reranking (Custom Data)#

Use the following code to create a configuration for a retriever evaluation, with embedding + reranking, that uses custom data.

This example specifies data that is in the BEIR format. You can also use data in the SQuAD format. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "retriever",
        "name": "my-configuration-retriever-2",
        "namespace": "my-organization",
        "tasks": {
            "my-beir-task": {
                "type": "beir",
                "dataset": {
                    "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"}
                }
            }
        }
    }'
data = {
    "type": "retriever",
    "name": "my-configuration-retriever-2",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

RAG Pipeline Configurations#

NeMo Evaluator supports evaluating RAG pipelines, which are built by chaining NeMo Retriever and an LLM. For more information, refer to RAG Pipelines.

Example Configuration for Retrieval + Answer Generation + Answer Evaluation (Standard Data)#

Use the following code to create a configuration for a Retrieval + Answer Generation + Answer Evaluation pipeline with standard data.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "rag",
        "name": "my-configuration-rag-1",
        "namespace": "my-organization",
        "tasks": {
            "my-beir-task": {
                "type": "beir",
                "dataset": {
                    "files_url": "file://nfcorpus/"
                },
                "params": {
                    "judge_llm": {
                        "api_endpoint": {
                            "url": "<my-judge-llm-url>",
                            "model_id": "<my-judge-llm-model>"
                        }
                    },
                    "judge_embeddings": {
                        "api_endpoint": {
                            "url": "<my-judge-embedding-url>",
                            "model_id": "<my-judge-embedding-model>"
                        }
                    },
                    "judge_timeout": 300,
                    "judge_max_retries": 5,
                    "judge_max_workers": 16
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"},
                    "faithfulness": {"type": "faithfulness"},
                    "answer_relevancy": {"type": "answer_relevancy"}
                }
            }
        }
    }'
data = {
    "type": "rag",
    "name": "my-configuration-rag-1",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://nfcorpus/"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-judge-embedding-url>",
                        "model_id": "<my-judge-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Example Configuration for Retrieval + Answer Generation + Answer Evaluation (Custom Data)#

Use the following code to create a configuration for a Retrieval + Answer Generation + Answer Evaluation pipeline with custom data.

This example specifies data that is in the BEIR format. You can also use data in the SQuAD format or the Ragas format. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "rag",
        "name": "my-configuration-rag-2",
        "namespace": "my-organization",
        "tasks": {
            "my-beir-task": {
                "type": "beir",
                "dataset": {
                    "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
                },
                "params": {
                    "judge_llm": {
                        "api_endpoint": {
                            "url": "<my-judge-llm-url>",
                            "model_id": "<my-judge-llm-model>"
                        }
                    },
                    "judge_embeddings": {
                        "api_endpoint": {
                            "url": "<my-judge-embedding-url>",
                            "model_id": "<my-judge-embedding-model>"
                        }
                    },
                    "judge_timeout": 300,
                    "judge_max_retries": 5,
                    "judge_max_workers": 16
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"},
                    "faithfulness": {"type": "faithfulness"},
                    "answer_relevancy": {"type": "answer_relevancy"}
                }
            }
        }
    }'
data = {
    "type": "rag",
    "name": "my-configuration-rag-2",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-judge-embedding-url>",
                        "model_id": "<my-judge-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Example Configuration for Answer Evaluation (Pre-generated Answers)#

Use the following code to create a configuration for an answer evaluation with custom pre-generated answers.

This example specifies data that is in the Ragas format. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "rag",
        "name": "my-configuration-rag-3",
        "namespace": "my-organization",
        "tasks": {
            "my-ragas-task": {
                "type": "ragas",
                "dataset": {
                    "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
                },
                "params": {
                    "judge_llm": {
                        "api_endpoint": {
                            "url": "<my-judge-llm-url>",
                            "model_id": "<my-judge-llm-model>"
                        }
                    },
                    "judge_embeddings": {
                        "api_endpoint": {
                            "url": "<my-query-embedding-url>",
                            "model_id": "<my-query-embedding-model>"
                        }
                    },
                    "judge_timeout": 300,
                    "judge_max_retries": 5,
                    "judge_max_workers": 16
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"},
                    "faithfulness": {"type": "faithfulness"}
                }
            }
        }
    }'
data = {
    "type": "rag",
    "name": "my-configuration-rag-3",
    "namespace": "my-organization",
    "tasks": {
        "my-ragas-task": {
            "type": "ragas",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-judge-llm-url>",
                        "model_id": "<my-judge-llm-model>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-query-embedding-url>",
                        "model_id": "<my-query-embedding-model>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Example Configuration for RAG (OpenAI-compatible Judge LLM)#

Use the following code to create a configuration for a RAG pipeline evaluation that uses an OpenAI-compatible Judge LLM. To provide credentials for authenticating with an OpenAI-compatible Judge LLM, include an api_key in the judge_llm field.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '
    {
        "type": "rag",
        "name": "my-configuration-rag-4",
        "namespace": "my-organization",
        "tasks": {
            "my-beir-task": {
                "type": "beir",
                "dataset": {
                    "files_url": "file://nfcorpus/"
                },
                "params": {
                    "judge_llm": {
                        "api_endpoint": {
                            "url": "<my-nim-deployment-base-url>/completions",
                            "model_id": "<my-model>",
                            "api_key": "<openai-api-key>"
                        }
                    },
                    "judge_embeddings": {
                        "api_endpoint": {
                            "url": "<my-query-embedding-url>",
                            "model_id": "<my-query-embedding-model>",
                            "api_key": "<openai-api-key>"
                        }
                    },
                    "judge_timeout": 300,
                    "judge_max_retries": 5,
                    "judge_max_workers": 16
                },
                "metrics": {
                    "recall_5": {"type": "recall_5"},
                    "ndcg_cut_5": {"type": "ndcg_cut_5"},
                    "recall_10": {"type": "recall_10"},
                    "ndcg_cut_10": {"type": "ndcg_cut_10"},
                    "faithfulness": {"type": "faithfulness"},
                    "answer_relevancy": {"type": "answer_relevancy"}
                }
            }
        }
    }'
data = {
    "type": "rag",
    "name": "my-configuration-rag-4",
    "namespace": "my-organization",
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://nfcorpus/"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "<my-nim-deployment-base-url>/completions",
                        "model_id": "<my-model>",
                        "api_key": "<openai-api-key>"
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "<my-query-embedding-url>",
                        "model_id": "<my-query-embedding-model>",
                        "api_key": "<openai-api-key>"
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs"

response = requests.post(endpoint, json=data).json()

Delete a Config#

To delete an evaluation configuration, send a DELETE request to the configs endpoint. You must provide both the namespace and ID of the config as shown in the following code.

Caution

Before you delete a config, ensure that no jobs use it. If a job uses the config, you must delete the job first. To find all jobs that use a config, refer to Example: Filter Jobs by Config.

curl -X "DELETE" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/configs/<my-namespace>/<my-config-id>" \
  -H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/configs/<my-namespace>/<my-config-id>"
response = requests.delete(endpoint).json()
response

When you delete a config, the response is similar to the following.

{
    "message": "Resource deleted successfully.",
    "id": "eval-config-MNOP1234QRST5678",
    "deleted_at": null
}
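
If you clean up configurations from scripts, a small wrapper that surfaces HTTP errors can help. The following delete_config helper is an illustrative sketch that uses only the DELETE endpoint shown above.

def delete_config(namespace, config_id, hostname=EVALUATOR_HOSTNAME):
    """Delete an evaluation configuration and return the parsed response."""
    endpoint = f"http://{hostname}/v1/evaluation/configs/{namespace}/{config_id}"
    response = requests.delete(endpoint)
    response.raise_for_status()  # raise if the request fails, for example when the config does not exist
    return response.json()

# Usage:
# delete_config("my-namespace", "eval-config-MNOP1234QRST5678")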

Create Config Response#

When you create a configuration for an evaluation, the response is similar to the following.

For the full response reference, refer to Evaluator API.

{
    "created_at": "2025-03-19T22:50:02.206136",
    "updated_at": "2025-03-19T22:50:02.206138",
    "id": "eval-config-MNOP1234QRST5678",
    "name": "my-configuration-lm-harness-gsm8k-1",
    "namespace": "my-organization",
    "type": "gsm8k",
    "params": {
        "temperature": 0.00001,      
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": true,
            "top_k": 1,
            "hf_token": "<my-token>",
            "tokenizer_backend": "hf",
            "tokenizer": "meta-llama/Llama-3.1-8B-Instruct",
            "apply_chat_template": true,
            "fewshot_as_multiturn": true
        }
    },
    "custom_fields": {}
}
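
After you create a configuration, the typical next step is to reference it from an evaluation job together with a target, which is how one configuration is reused across models as described in the tip at the top of this page. The following Python sketch assumes a jobs endpoint at /v1/evaluation/jobs that accepts config and target references as <namespace>/<name> strings, and the target names are hypothetical; refer to the evaluation jobs documentation for the exact request body.

# Reuse the configuration created above with two hypothetical targets.
config_ref = "my-organization/my-configuration-lm-harness-gsm8k-1"
jobs_endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"

for target_ref in ["my-organization/model-a-target", "my-organization/model-b-target"]:
    job = {
        "namespace": "my-organization",
        "config": config_ref,   # namespace/name of the configuration created above
        "target": target_ref    # hypothetical target name
    }
    job_response = requests.post(jobs_endpoint, json=job).json()
    print(target_ref, job_response.get("id"))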