Evaluation Configuration Parameters#

Comprehensive reference for configuring evaluation tasks in NeMo Evaluator, covering universal parameters, framework-specific settings, and optimization patterns.

Quick Navigation

Looking for available benchmarks?

Need help getting started?

Overview#

All evaluation tasks in NeMo Evaluator use the ConfigParams class for configuration. This provides a consistent interface across different evaluation harnesses while allowing framework-specific customization through the extra parameter. Default configuration (including which parameters a task uses) is defined in the Framework Definition File (FDF) for each framework; see Framework Definition File (FDF) for details.

from nemo_evaluator.api.api_dataclasses import ConfigParams

# Basic configuration
params = ConfigParams(
    temperature=0,
    top_p=1.0,
    max_new_tokens=256,
    limit_samples=100
)

# With framework-specific parameters (extra)
params = ConfigParams(
    temperature=0,
    parallelism=8,
    extra={
        "num_fewshot": 5,
        "tokenizer": "/path/to/tokenizer",
        "custom_prompt": "Answer the question:"
    }
)

How to see possible parameters for a given task

Python API (core) — Get default params and which params a task uses. Use framework_name.task_name to avoid ambiguity when the same task name exists in multiple harnesses:

from nemo_evaluator.core.input import get_available_evaluations

# Returns (framework_evals_mapping, framework_defaults, all_eval_name_mapping)
framework_evals, _, _ = get_available_evaluations()

# Use framework_name.task_name (e.g. simple_evals.mmlu_pro) for a single task
framework_name, task_name = "simple_evals", "mmlu_pro"
eval_obj = framework_evals[framework_name][task_name]

# Default params for this task (ConfigParams / dict-like)
print(eval_obj.config.params)

# Command template shows which {{ config.params.* }} the task uses
print(eval_obj.command)

CLI (core) — List tasks, then show merged config (including params) for a task:

# List available tasks
nemo-evaluator ls

# Show full rendered config (including config.params) for a task without running
# Use framework_name.task_name (e.g. simple_evals.mmlu_pro) to avoid ambiguity
nemo-evaluator run_eval --eval_type simple_evals.mmlu_pro --model_id x --model_url https://example.com/v1/chat/completions --model_type chat --output_dir ./out --dry_run

The --dry_run output prints the merged configuration (YAML) and the rendered command, so you can see which parameters apply to that task.

Launcher — If you use the launcher, nemo-evaluator-launcher ls task <task_name> (or harness.task_name) prints task details including Defaults with config.params and config.params.extra. List all tasks with nemo-evaluator-launcher ls tasks.

Universal Parameters#

These parameters are standardized across all frameworks and share the same names and semantics. That does not mean every framework supports every parameter: each task’s command template only uses a subset. If you pass a parameter that the task does not use, you will see a warning like: “Configuration contains parameter(s) that are not used in the command template” (see validate_params_in_command in nemo_evaluator.core.utils).

Category

Parameter

Type

Description

Example Values

Notes

Sampling

temperature

float

Sampling randomness

0 (deterministic), 0.7 (creative)

Use 0 for reproducible results

Sampling

top_p

float

Nucleus sampling threshold

1.0 (disabled), 0.9 (selective)

Controls diversity of generated text

Sampling

max_new_tokens

int

Maximum response length

256, 512, 1024

Limits generation length

Evaluation control

limit_samples

int/float

Evaluation subset size

100 (count), 0.1 (10% of dataset)

Use for quick testing or resource limits

Evaluation control

task

str

Task-specific identifier

"custom_task"

Used by some harnesses for task routing

Performance

parallelism

int

Concurrent request threads

1, 8, 16

Balance against server capacity

Performance

max_retries

int

Retry attempts for failed requests

3, 5, 10

Increases robustness for network issues

Performance

request_timeout

int

Request timeout (seconds)

60, 120, 300

Adjust for model response time

Framework-Specific Parameters#

Framework-specific parameters are passed through the extra dictionary within ConfigParams.

LM-Evaluation-Harness Parameters

Parameter

Type

Description

Example Values

Use Cases

num_fewshot

int

Few-shot examples count

0, 5, 25

Academic benchmarks

tokenizer

str

Tokenizer path

"/path/to/tokenizer"

Log-probability tasks

tokenizer_backend

str

Tokenizer implementation

"huggingface", "sentencepiece"

Custom tokenizer setups

trust_remote_code

bool

Allow remote code execution

True, False

For custom tokenizers

add_bos_token

bool

Add beginning-of-sequence token

True, False

Model-specific formatting

add_eos_token

bool

Add end-of-sequence token

True, False

Model-specific formatting

fewshot_delimiter

str

Separator between examples

"\\n\\n", "\\n---\\n"

Custom prompt formatting

fewshot_seed

int

Reproducible example selection

42, 1337

Ensures consistent few-shot examples

description

str

Custom prompt prefix

"Answer the question:"

Task-specific instructions

bootstrap_iters

int

Statistical bootstrap iterations

1000, 10000

For confidence intervals

Simple-Evals Parameters

Parameter

Type

Description

Example Values

Use Cases

pass_at_k

list[int]

Code evaluation metrics

[1, 5, 10]

Code generation tasks

timeout

int

Code execution timeout

5, 10, 30

Code generation tasks

max_workers

int

Parallel execution workers

4, 8, 16

Code execution parallelism

languages

list[str]

Target programming languages

["python", "java", "cpp"]

Multi-language evaluation

BigCode-Evaluation-Harness Parameters

Parameter

Type

Description

Example Values

Use Cases

num_workers

int

Parallel execution workers

4, 8, 16

Code execution parallelism

eval_metric

str

Evaluation metric

"pass_at_k", "bleu"

Different scoring methods

languages

list[str]

Programming languages

["python", "javascript"]

Language-specific evaluation

Safety and Specialized Harnesses

Parameter

Type

Description

Example Values

Use Cases

probes

str

Garak security probes

"ansiescape.AnsiEscaped"

Security evaluation

detectors

str

Garak security detectors

"base.TriggerListDetector"

Security evaluation

generations

int

Number of generations per prompt

1, 5, 10

Safety evaluation

Parameter Selection Guidelines#

  • Configure parallelism and request_timeout based on server capacity.

  • Use limit_samples for subset evaluation (e.g. for debugging or quick validation).

Common Configuration Errors#

Tokenizer Issues#

Problem

Missing tokenizer for log-probability tasks

# Incorrect - missing tokenizer
params = ConfigParams(extra={})

Solution

Always specify tokenizer for log-probability tasks

# Correct
params = ConfigParams(
    extra={
        "tokenizer_backend": "huggingface",
        "tokenizer": "/path/to/nemo_tokenizer"
    }
)

Performance Issues#

Problem

Excessive parallelism overwhelming server

# Incorrect - too many concurrent requests
params = ConfigParams(parallelism=100)

Solution

Start conservative and scale up

# Correct - reasonable concurrency
params = ConfigParams(parallelism=8, max_retries=3)

Parameter Conflicts#

Problem

Mixing generation and log-probability parameters

# Incorrect - generation params unused for log-probability
params = ConfigParams(
    temperature=0.7,  # Ignored for log-probability tasks
    extra={"tokenizer": "/path"}
)

Solution

Use appropriate parameters for task type

# Correct - only relevant parameters
params = ConfigParams(
    limit_samples=100,  # Relevant for all tasks
    extra={"tokenizer": "/path"}  # Required for log-probability
)

Next Steps#