# Evaluation Config Schema

When you create a configuration for an evaluation, you send a JSON data structure that contains the information for your configuration.

> **Important:** Each configuration is uniquely identified by a combination of `namespace` and `name`, for example `my-organization/my-configuration`.

The following table provides a selected field reference for the JSON data. For the full API reference, refer to Job JSON Schema Reference.

| Name | Description | Type | Valid Values or Child Objects |
|------|-------------|------|-------------------------------|
| `access_policies` | The policies that control who can use the configuration. This field is for sharing configurations across organizations. | Object | |
| `api_endpoint` | The endpoint for a model. | Object | `url`, `model_id`, `api_key` |
| `api_key` | The key to access an API endpoint. | String | |
| `created_by` | The ID of the user that created the configuration. This field is for sharing configurations across organizations. | String | |
| `custom_fields` | An optional object that you can use to store additional information. | Object | |
| `dataset` | A dataset to use for the evaluation. | Object | `files_url`, `format` |
| `description` | A description of the configuration. | String | |
| `extra` | Additional parameters for academic benchmarks. | Object | |
| `files_url` | The URL of a file that contains pre-generated data. Use the `hf://` prefix for files stored in NeMo Data Store, for example `hf://datasets/<your-dataset-name>/<your-file-name>`. The value can also be the name of a standard dataset, such as `fiqa` or `nfcorpus`; refer to those with the `file://` prefix, for example `file://fiqa/`. For format information, refer to Custom Data. | String | |
| `format` | The format of a data file. For format information, refer to Custom Data. | String | `beir`, `squad`, `ragas` |
| `groups` | A dictionary of evaluation tasks to run as a group. | Object | `type`, `params`, `dataset`, `metrics` |
| `hf_token` | A Hugging Face account token. Some benchmark datasets require a valid Hugging Face token to access the data; for example, the `gpqa` task requires access to its dataset. Other benchmark tasks don't require a token, but setting one can help avoid rate limiting by switching to account-based rate limits. | String | |
| `id` | The ID of the configuration. The ID is returned in the response when you create a configuration. | String | |
| `judge_llm` | The model to use to judge the answer. | Object | `api_endpoint` |
| `limit_samples` | The number of samples to evaluate. | Integer | |
| `max_tokens` | The maximum number of tokens to generate during inference. | Integer | |
| `max_retries` | The number of times an evaluation job retries a request to a model after a failure. | Integer | |
| `metrics` | A dictionary of objects in the form `{"name": "<metric-name>"}`, where `<metric-name>` is a metric to use for the evaluation. Depending on the metric, you might need to provide a ground truth in the `input.json` file. For format information, refer to Custom Data. | Object | `accuracy`, `answer_relevancy`, `bleu`, `em`, `f1`, `faithfulness`, `ndcg_cut_10`, `ndcg_cut_5`, `recall_10`, `recall_5`, `rouge` |
| `model_id` | The ID of the NIM model, as specified in Models. | String | |
| `name` | An arbitrary name to identify the configuration. If you don't specify a name, the default is the ID associated with the configuration. | String | |
| `namespace` | An arbitrary organization name, a vendor name, or any other text. If you don't specify a namespace, the default is `default`. | String | |
| `ownership` | Information about the creator of the configuration and who can use it. This field is for sharing configurations across organizations. | Object | `created_by`, `access_policies` |
| `parallelism` | The parallelism of the job that runs the benchmark. Supported by the `custom`, `humaneval`, `mbpp`, `transcode_cpp`, and `transcode_java` evaluation config types. | Integer | |
| `params` | A set of parameters to apply to the evaluation. | Object | `parallelism`, `request_timeout`, `max_retries`, `limit_samples`, `max_tokens`, `temperature`, `top_p`, `extra` |
| `project` | The ID of a project to associate with the configuration. | String | |
| `request_timeout` | The time in milliseconds that the evaluation job waits for a response from the model before it fails. | Integer | |
| `stop` | Up to 4 sequences where the API stops generating further tokens. | String or List | |
| `tasks` | A dictionary of evaluation tasks to run. | Object | `type`, `params`, `dataset`, `metrics` |
| `temperature` | Adjusts the randomness of token selection. Higher values increase randomness and creativity; lower values promote deterministic and conservative output. | Number | |
| `top_p` | A threshold that selects from the most probable tokens until the cumulative probability exceeds `p`. | Number | |
| `type` | The type of evaluation that the configuration is for. For custom evaluations, set this to `custom`. | String | Examples include `custom`, `beir`, `mt_bench`, `gsm8k`, `retriever`, and `rag`. For more information, refer to Evaluation Types. |
| `type` (task) | The type of a task. | String | Examples include `custom` and `default`. For more information, refer to Evaluation Types. |
| `url` | The URL of a model endpoint. | String | |
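
To show how these fields fit together, the following is a minimal sketch of a custom evaluation configuration assembled from the fields in the table. The task key `my-task`, the dataset path, and the specific parameter values are placeholders, and the exact nesting that your deployment accepts may differ; confirm against Job JSON Schema Reference.

```json
{
  "namespace": "my-organization",
  "name": "my-configuration",
  "type": "custom",
  "params": {
    "limit_samples": 50,
    "max_tokens": 512,
    "temperature": 0.2,
    "top_p": 0.9,
    "parallelism": 4,
    "max_retries": 3,
    "request_timeout": 60000
  },
  "tasks": {
    "my-task": {
      "type": "custom",
      "dataset": {
        "files_url": "hf://datasets/<your-dataset-name>/<your-file-name>",
        "format": "ragas"
      },
      "metrics": {
        "faithfulness": { "name": "faithfulness" },
        "answer_relevancy": { "name": "answer_relevancy" }
      }
    }
  }
}
```

In this sketch, `request_timeout` is expressed in milliseconds and `limit_samples` caps the number of evaluated samples, as described in the table. Metrics such as `faithfulness` and `answer_relevancy` are scored by a model, so a `judge_llm` object (with its `api_endpoint`) may also be required; it is omitted here for brevity.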
