# Evaluation Config Schema

When you create a configuration for an evaluation, you send a JSON data structure that contains the information for your configuration.

> **Important:** Each configuration is uniquely identified by a combination of `namespace` and `name`, for example `my-organization/my-configuration`.

The following table provides a selected field reference for the JSON data. For the full API reference, refer to Job JSON Schema Reference.

| Name | Description | Type | Valid Values or Child Objects |
|------|-------------|------|-------------------------------|
| `access_policies` | The policies that control who can use the configuration. This field is for sharing configurations across organizations. | Object | |
| `api_endpoint` | The endpoint for a model. | Object | `url`, `model_id`, `api_key` |
| `api_key` | The key to access an API endpoint. | String | |
| `created_by` | The ID of the user that created the configuration. This field is for sharing configurations across organizations. | String | |
| `custom_fields` | An optional object that you can use to store additional information. | Object | |
| `dataset` | A dataset to use for the evaluation. | Object | `files_url`, `format` |
| `description` | A description of the configuration. | String | |
| `extra` | Additional parameters for academic benchmarks. | Object | |
| `files_url` | The URL of a file that contains pre-generated data. Use the `hf://` prefix for files stored in NeMo Data Store, for example `hf://datasets/<your-dataset-name>/<your-file-name>`. The value can also be the name of a standard dataset, such as `fiqa` or `nfcorpus`; refer to those with the `file://` prefix, for example `file://fiqa/`. For format information, refer to Custom Data. | String | |
| `format` | The format of a data file. For format information, refer to Custom Data. | String | `beir`, `squad`, `ragas` |
| `groups` | A dictionary of evaluation tasks to run as a group. | Object | `type`, `params`, `dataset`, `metrics` |
| `hf_token` | A Hugging Face account token. Some benchmark datasets require a valid Hugging Face token to access the data; for example, the `gpqa` task requires access to its dataset. Other benchmark tasks don't require a token, but setting one can help avoid rate limiting by switching to account-based rate limits. | String | |
| `id` | The ID of the configuration. The ID is returned in the response when you create a configuration. | String | |
| `judge_llm` | The model to use to judge the answer. | Object | `api_endpoint` |
| `limit_samples` | The number of samples to evaluate. | Integer | |
| `max_tokens` | The maximum number of tokens to generate during inference. | Integer | |
| `max_retries` | The number of times an evaluation job retries a request to a model after a failure. | Integer | |
| `metrics` | A dictionary of objects in the form `{"name": "<metric-name>"}`, where `<metric-name>` is a metric to use for the evaluation. Depending on the metric, you might need to provide a ground truth in the `input.json` file. For format information, refer to Custom Data. | Object | `accuracy`, `answer_relevancy`, `bleu`, `em`, `f1`, `faithfulness`, `ndcg_cut_10`, `ndcg_cut_5`, `recall_10`, `recall_5`, `rouge` |
| `model_id` | The ID of the NIM model, as specified in Models. | String | |
| `name` | An arbitrary name to identify the configuration. If you don't specify a name, the default is the ID associated with the configuration. | String | |
| `namespace` | An arbitrary organization name, a vendor name, or any other text. If you don't specify a namespace, the default is `default`. | String | |
| `ownership` | Information about the creator of the configuration and who can use it. This field is for sharing configurations across organizations. | Object | `created_by`, `access_policies` |
| `parallelism` | The parallelism of the job that runs the benchmark. Supported by the `custom`, `humaneval`, `mbpp`, `transcode_cpp`, and `transcode_java` evaluation config types. | Integer | |
| `params` | A set of parameters to apply to the evaluation. | Object | `parallelism`, `request_timeout`, `max_retries`, `limit_samples`, `max_tokens`, `temperature`, `top_p`, `extra` |
| `project` | The ID of a project to associate with the configuration. | String | |
| `request_timeout` | The time in milliseconds that the evaluation job waits for a response from the model before it fails. | Integer | |
| `stop` | Up to 4 sequences where the API stops generating further tokens. | String or List | |
| `tasks` | A dictionary of evaluation tasks to run. | Object | `type`, `params`, `dataset`, `metrics` |
| `temperature` | Adjusts the randomness of token selection. Higher values increase randomness and creativity; lower values promote deterministic and conservative output. | Number | |
| `top_p` | A threshold that selects from the most probable tokens until the cumulative probability exceeds `p`. | Number | |
| `type` | The type of evaluation that the configuration is for. For custom evaluations, set this to `custom`. | String | Examples include `custom`, `beir`, `mt_bench`, `gsm8k`, `retriever`, and `rag`. For more information, refer to Evaluation Types. |
| `type` (task) | The type of a task. | String | Examples include `custom` and `default`. For more information, refer to Evaluation Types. |
| `url` | The URL of a model endpoint. | String | |
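
To show how these fields fit together, the following is a minimal sketch of a custom evaluation configuration assembled from the fields in the table. The task key `my-task`, the dataset path, and the specific parameter values are placeholders, and the exact nesting that your deployment accepts may differ; confirm against Job JSON Schema Reference.

```json
{
  "namespace": "my-organization",
  "name": "my-configuration",
  "type": "custom",
  "params": {
    "limit_samples": 50,
    "max_tokens": 512,
    "temperature": 0.2,
    "top_p": 0.9,
    "parallelism": 4,
    "max_retries": 3,
    "request_timeout": 60000
  },
  "tasks": {
    "my-task": {
      "type": "custom",
      "dataset": {
        "files_url": "hf://datasets/<your-dataset-name>/<your-file-name>",
        "format": "ragas"
      },
      "metrics": {
        "faithfulness": { "name": "faithfulness" },
        "answer_relevancy": { "name": "answer_relevancy" }
      }
    }
  }
}
```

In this sketch, `request_timeout` is expressed in milliseconds and `limit_samples` caps the number of evaluated samples, as described in the table. Metrics such as `faithfulness` and `answer_relevancy` are scored by a model, so a `judge_llm` object (with its `api_endpoint`) may also be required; it is omitted here for brevity.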
