# Evaluation Configuration Parameters
Comprehensive reference for configuring evaluation tasks in NeMo Evaluator, covering universal parameters, framework-specific settings, and optimization patterns.
## Quick Navigation

**Looking for available benchmarks?**

- About Selecting Benchmarks: browse available benchmarks by category

**Need help getting started?**

- About Evaluation: overview of evaluation workflows
- Evaluation Techniques: step-by-step evaluation guides
## Overview
All evaluation tasks in NeMo Evaluator use the `ConfigParams` class for configuration. This provides a consistent interface across different evaluation harnesses while allowing framework-specific customization through the `extra` parameter. The default configuration (including which parameters a task uses) is defined in the Framework Definition File (FDF) for each framework; see Framework Definition File (FDF) for details.
```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

# Basic configuration
params = ConfigParams(
    temperature=0,
    top_p=1.0,
    max_new_tokens=256,
    limit_samples=100,
)

# With framework-specific parameters (extra)
params = ConfigParams(
    temperature=0,
    parallelism=8,
    extra={
        "num_fewshot": 5,
        "tokenizer": "/path/to/tokenizer",
        "custom_prompt": "Answer the question:",
    },
)
```
## How to see possible parameters for a given task
**Python API (core):** Get the default params and see which params a task uses. Use `framework_name.task_name` to avoid ambiguity when the same task name exists in multiple harnesses:
```python
from nemo_evaluator.core.input import get_available_evaluations

# Returns (framework_evals_mapping, framework_defaults, all_eval_name_mapping)
framework_evals, _, _ = get_available_evaluations()

# Use framework_name.task_name (e.g. simple_evals.mmlu_pro) for a single task
framework_name, task_name = "simple_evals", "mmlu_pro"
eval_obj = framework_evals[framework_name][task_name]

# Default params for this task (ConfigParams / dict-like)
print(eval_obj.config.params)

# Command template shows which {{ config.params.* }} the task uses
print(eval_obj.command)
```
**CLI (core):** List tasks, then show the merged config (including params) for a task:
```shell
# List available tasks
nemo-evaluator ls

# Show the full rendered config (including config.params) for a task without running it.
# Use framework_name.task_name (e.g. simple_evals.mmlu_pro) to avoid ambiguity.
nemo-evaluator run_eval \
  --eval_type simple_evals.mmlu_pro \
  --model_id x \
  --model_url https://example.com/v1/chat/completions \
  --model_type chat \
  --output_dir ./out \
  --dry_run
```
The `--dry_run` output prints the merged configuration (YAML) and the rendered command, so you can see which parameters apply to that task.
**Launcher:** If you use the launcher, `nemo-evaluator-launcher ls task <task_name>` (or `harness.task_name`) prints task details, including defaults for `config.params` and `config.params.extra`. List all tasks with `nemo-evaluator-launcher ls tasks`.
## Universal Parameters
These parameters are standardized across all frameworks and share the same names and semantics. That does not mean every framework supports every parameter: each task's command template only uses a subset. If you pass a parameter that the task does not use, you will see a warning like "Configuration contains parameter(s) that are not used in the command template" (see `validate_params_in_command` in `nemo_evaluator.core.utils`).
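The check behind that warning boils down to comparing your parameter names against the `{{ config.params.* }}` placeholders the command template actually references. A minimal sketch of the idea (not the actual `validate_params_in_command` implementation, and the template string below is illustrative):

```python
import re


def unused_params(command_template: str, params: dict) -> set:
    """Return param names with no {{ config.params.<name> }} placeholder.

    Simplified sketch; the real check lives in nemo_evaluator.core.utils.
    """
    used = set(re.findall(r"\{\{\s*config\.params\.(\w+)", command_template))
    return set(params) - used


template = (
    "run --temperature {{ config.params.temperature }} "
    "--limit {{ config.params.limit_samples }}"
)

# top_p is not referenced by this template, so it would trigger the warning
print(unused_params(template, {"temperature": 0, "limit_samples": 100, "top_p": 1.0}))
```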
| Category | Parameter | Type | Description | Example Values | Notes |
|---|---|---|---|---|---|
| Sampling | `temperature` | float | Sampling randomness | `0`, `0.7` | Use `0` for deterministic output |
| Sampling | `top_p` | float | Nucleus sampling threshold | `1.0`, `0.95` | Controls diversity of generated text |
| Sampling | `max_new_tokens` | int | Maximum response length | `256`, `1024` | Limits generation length |
| Evaluation control | `limit_samples` | int | Evaluation subset size | `100` | Use for quick testing or resource limits |
| Evaluation control | `task` | str | Task-specific identifier | `"mmlu_pro"` | Used by some harnesses for task routing |
| Performance | `parallelism` | int | Concurrent request threads | `8`, `16` | Balance against server capacity |
| Performance | `max_retries` | int | Retry attempts for failed requests | `3`, `5` | Increases robustness for network issues |
| Performance | `request_timeout` | int | Request timeout (seconds) | `300` | Adjust for model response time |
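The same universal parameters appear under `config.params` in the merged run configuration that `--dry_run` prints. A minimal sketch of that fragment, using the example values above:

```yaml
config:
  params:
    temperature: 0
    top_p: 1.0
    max_new_tokens: 256
    limit_samples: 100
    parallelism: 8
    max_retries: 3
    request_timeout: 300
```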
## Framework-Specific Parameters
Framework-specific parameters are passed through the `extra` dictionary within `ConfigParams`. Use the discovery methods above to confirm the exact `extra` keys each task accepts.
### LM-Evaluation-Harness Parameters
| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| `num_fewshot` | int | Few-shot examples count | `5` | Academic benchmarks |
| `tokenizer` | str | Tokenizer path | `"/path/to/tokenizer"` | Log-probability tasks |
| `tokenizer_backend` | str | Tokenizer implementation | `"huggingface"` | Custom tokenizer setups |
| `trust_remote_code` | bool | Allow remote code execution | `True` | For custom tokenizers |
| `add_bos_token` | bool | Add beginning-of-sequence token | `True` | Model-specific formatting |
|  | bool | Add end-of-sequence token |  | Model-specific formatting |
| `fewshot_delimiter` | str | Separator between examples | `"\n\n"` | Custom prompt formatting |
| `seed` | int | Reproducible example selection | `1234` | Ensures consistent few-shot examples |
| `custom_prompt` | str | Custom prompt prefix | `"Answer the question:"` | Task-specific instructions |
| `bootstrap_iters` | int | Statistical bootstrap iterations | `100000` | For confidence intervals |
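As an illustration, these extras sit under `config.params.extra` in a run config. A minimal sketch for a log-probability task (the path and values are illustrative):

```yaml
config:
  params:
    limit_samples: 200
    extra:
      num_fewshot: 5
      tokenizer: /path/to/tokenizer
      tokenizer_backend: huggingface
```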
### Simple-Evals Parameters
| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
|  |  | Code evaluation metrics |  | Code generation tasks |
|  |  | Code execution timeout |  | Code generation tasks |
|  |  | Parallel execution workers |  | Code execution parallelism |
|  |  | Target programming languages |  | Multi-language evaluation |
### BigCode-Evaluation-Harness Parameters
| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
|  |  | Parallel execution workers |  | Code execution parallelism |
|  |  | Evaluation metric |  | Different scoring methods |
|  |  | Programming languages |  | Language-specific evaluation |
### Safety and Specialized Harnesses
| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| `probes` | str | Garak security probes | `"dan"` | Security evaluation |
| `detectors` | str | Garak security detectors |  | Security evaluation |
| `generations` | int | Number of generations per prompt | `5`, `10` | Safety evaluation |
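As a sketch, assuming the `probes` and `generations` keys above, a garak run could be narrowed to one probe family with fewer generations for a quick pass (the probe name is illustrative; list real probes with garak itself):

```yaml
config:
  params:
    extra:
      probes: dan       # illustrative probe family
      generations: 5
```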
## Parameter Selection Guidelines
- Configure `parallelism` and `request_timeout` based on server capacity.
- Use `limit_samples` for subset evaluation (e.g. for debugging or quick validation).
## Common Configuration Errors
### Tokenizer Issues
**Problem:** Missing tokenizer for a log-probability task.

```python
# Incorrect - missing tokenizer
params = ConfigParams(extra={})
```

**Solution:** Always specify a tokenizer for log-probability tasks.

```python
# Correct
params = ConfigParams(
    extra={
        "tokenizer_backend": "huggingface",
        "tokenizer": "/path/to/nemo_tokenizer",
    }
)
```
### Performance Issues
**Problem:** Excessive parallelism overwhelming the server.

```python
# Incorrect - too many concurrent requests
params = ConfigParams(parallelism=100)
```

**Solution:** Start conservative and scale up.

```python
# Correct - reasonable concurrency
params = ConfigParams(parallelism=8, max_retries=3)
```
### Parameter Conflicts
**Problem:** Mixing generation and log-probability parameters.

```python
# Incorrect - generation params unused for log-probability
params = ConfigParams(
    temperature=0.7,  # Ignored for log-probability tasks
    extra={"tokenizer": "/path"},
)
```

**Solution:** Use parameters appropriate to the task type.

```python
# Correct - only relevant parameters
params = ConfigParams(
    limit_samples=100,  # Relevant for all tasks
    extra={"tokenizer": "/path"},  # Required for log-probability
)
```
## Next Steps
- **Basic Usage:** See Text Generation Evaluation to get started.
- **Custom Tasks:** Learn about Tasks Not Explicitly Defined by Framework Definition File for specialized evaluations.
- **Troubleshooting:** Refer to Troubleshooting for common issues.
- **Benchmarks:** Browse About Selecting Benchmarks for task-specific recommendations.