Evaluation Configuration Parameters#

Comprehensive reference for configuring evaluation tasks in NeMo Evaluator, covering universal parameters, framework-specific settings, and optimization patterns.

Quick Navigation

  • Looking for task-specific guides?

  • Looking for available benchmarks?

  • Need help getting started?

Overview#

All evaluation tasks in NeMo Evaluator use the ConfigParams class for configuration. This provides a consistent interface across different evaluation harnesses while allowing framework-specific customization through the extra parameter.

from nemo_evaluator.api.api_dataclasses import ConfigParams

# Basic configuration
params = ConfigParams(
    temperature=0,
    top_p=1.0,
    max_new_tokens=256,
    limit_samples=100
)

# Advanced configuration with framework-specific parameters
params = ConfigParams(
    temperature=0,
    parallelism=8,
    extra={
        "num_fewshot": 5,
        "tokenizer": "/path/to/tokenizer",
        "custom_prompt": "Answer the question:"
    }
)
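
In a full run, ConfigParams is typically embedded in a broader evaluation configuration. The sketch below shows that wiring under the assumption that the EvaluationConfig, EvaluationTarget, and ApiEndpoint dataclasses and the evaluate() entry point are available from the NeMo Evaluator API; field and argument names may differ across versions, so treat this as illustrative rather than a definitive call signature.

# Illustrative sketch: wiring ConfigParams into an evaluation run.
# EvaluationConfig, EvaluationTarget, ApiEndpoint, and evaluate() are assumed
# here; check your installed NeMo Evaluator version for the exact fields.
from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EvaluationConfig,
    EvaluationTarget,
)

params = ConfigParams(temperature=0, top_p=1.0, max_new_tokens=256, limit_samples=100)

eval_config = EvaluationConfig(
    type="mmlu",             # benchmark/task identifier
    params=params,
    output_dir="./results",
)

target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://localhost:8080/v1/completions",  # your model endpoint
        model_id="my-model",
    )
)

evaluate(eval_cfg=eval_config, target_cfg=target)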

Universal Parameters#

These parameters are available for all evaluation tasks regardless of the underlying harness or benchmark.

Core Generation Parameters#

| Parameter | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| temperature | float | Sampling randomness | 0 (deterministic), 0.7 (creative) | Use 0 for reproducible results; some endpoints require a small positive value such as 0.01 |
| top_p | float | Nucleus sampling threshold | 1.0 (disabled), 0.9 (selective) | Controls diversity of generated text |
| max_new_tokens | int | Maximum response length | 256, 512, 1024 | Limits generation length |

Evaluation Control Parameters#

| Parameter | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| limit_samples | int/float | Evaluation subset size | 100 (count), 0.1 (10% of dataset) | Use for quick testing or resource limits |
| task | str | Task-specific identifier | "custom_task" | Used by some harnesses for task routing |
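
Because limit_samples accepts either an integer count or a float fraction, the same parameter covers both quick smoke tests and proportional subsets:

# Absolute count: evaluate exactly 100 examples
smoke_test = ConfigParams(temperature=0, limit_samples=100)

# Fraction: evaluate 10% of the dataset
subset_run = ConfigParams(temperature=0, limit_samples=0.1)

# Full dataset: leave limit_samples unset
full_run = ConfigParams(temperature=0)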

Performance Parameters#

| Parameter | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| parallelism | int | Concurrent request threads | 1, 8, 16 | Balance against server capacity |
| max_retries | int | Retry attempts for failed requests | 3, 5, 10 | Increases robustness for network issues |
| request_timeout | int | Request timeout (seconds) | 60, 120, 300 | Adjust for model response time |
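
A reasonable starting point for a shared or capacity-limited endpoint combines modest concurrency with a generous retry and timeout budget; scale parallelism up only after confirming the server keeps pace:

# Conservative performance settings; increase parallelism once the endpoint is stable
steady_params = ConfigParams(
    parallelism=4,        # concurrent request threads
    max_retries=5,        # retry transient failures
    request_timeout=300,  # seconds to wait for each response
)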

Framework-Specific Parameters#

Framework-specific parameters are passed through the extra dictionary within ConfigParams.

LM-Evaluation-Harness Parameters

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| num_fewshot | int | Few-shot examples count | 0, 5, 25 | Academic benchmarks |
| tokenizer | str | Tokenizer path | "/path/to/tokenizer" | Log-probability tasks |
| tokenizer_backend | str | Tokenizer implementation | "huggingface", "sentencepiece" | Custom tokenizer setups |
| trust_remote_code | bool | Allow remote code execution | True, False | For custom tokenizers |
| add_bos_token | bool | Add beginning-of-sequence token | True, False | Model-specific formatting |
| add_eos_token | bool | Add end-of-sequence token | True, False | Model-specific formatting |
| fewshot_delimiter | str | Separator between examples | "\n\n", "\n---\n" | Custom prompt formatting |
| fewshot_seed | int | Reproducible example selection | 42, 1337 | Ensures consistent few-shot examples |
| description | str | Custom prompt prefix | "Answer the question:" | Task-specific instructions |
| bootstrap_iters | int | Statistical bootstrap iterations | 1000, 10000 | For confidence intervals |
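
These options are combined through extra. The sketch below is illustrative; the specific values (shot count, seed, prompt prefix) are placeholders rather than recommendations for any particular benchmark:

# Illustrative lm-evaluation-harness settings passed through `extra`
lm_eval_params = ConfigParams(
    temperature=0.01,      # near-deterministic generation
    max_new_tokens=256,
    extra={
        "num_fewshot": 5,                       # 5-shot prompting
        "fewshot_seed": 42,                     # reproducible example selection
        "fewshot_delimiter": "\n\n",            # separator between examples
        "description": "Answer the question:",  # custom prompt prefix
        "bootstrap_iters": 1000,                 # bootstrap iterations for confidence intervals
    },
)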

Simple-Evals Parameters

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| pass_at_k | list[int] | Code evaluation metrics | [1, 5, 10] | Code generation tasks |
| timeout | int | Code execution timeout | 5, 10, 30 | Code generation tasks |
| max_workers | int | Parallel execution workers | 4, 8, 16 | Code execution parallelism |
| languages | list[str] | Target programming languages | ["python", "java", "cpp"] | Multi-language evaluation |
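
A typical code-generation setup passes the execution and scoring options through extra; the values shown here are placeholders:

# Illustrative simple-evals code-generation settings
simple_evals_params = ConfigParams(
    temperature=0.2,
    max_new_tokens=1024,
    extra={
        "pass_at_k": [1, 5, 10],  # report pass@1, pass@5, and pass@10
        "timeout": 10,            # per-sample code execution timeout (seconds)
        "max_workers": 8,         # parallel execution workers
        "languages": ["python"],  # target language(s)
    },
)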

BigCode-Evaluation-Harness Parameters

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| num_workers | int | Parallel execution workers | 4, 8, 16 | Code execution parallelism |
| eval_metric | str | Evaluation metric | "pass_at_k", "bleu" | Different scoring methods |
| languages | list[str] | Programming languages | ["python", "javascript"] | Language-specific evaluation |
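
The equivalent BigCode-Evaluation-Harness configuration uses its own parameter names (num_workers, eval_metric) inside extra; the values are again placeholders:

# Illustrative BigCode-Evaluation-Harness settings
bigcode_params = ConfigParams(
    temperature=0.2,
    max_new_tokens=1024,
    extra={
        "num_workers": 8,                       # parallel code-execution workers
        "eval_metric": "pass_at_k",             # scoring method
        "languages": ["python", "javascript"],  # languages to evaluate
    },
)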

Safety and Specialized Harnesses

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| probes | str | Garak security probes | "ansiescape.AnsiEscaped" | Security evaluation |
| detectors | str | Garak security detectors | "base.TriggerListDetector" | Security evaluation |
| generations | int | Number of generations per prompt | 1, 5, 10 | Safety evaluation |
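
For a Garak-style security run, the probe, detector, and generation count all travel through extra; the probe and detector names below come from the examples above and should be swapped for the ones relevant to your evaluation:

# Illustrative Garak security-evaluation settings
safety_params = ConfigParams(
    parallelism=4,
    extra={
        "probes": "ansiescape.AnsiEscaped",       # security probe to run
        "detectors": "base.TriggerListDetector",  # detector used to score outputs
        "generations": 5,                         # generations per prompt
    },
)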

Configuration Patterns#

Academic Benchmarks (Deterministic)
academic_params = ConfigParams(
    temperature=0.01,      # Near-deterministic generation (0.0 not supported by all endpoints)
    top_p=1.0,             # No nucleus sampling
    max_new_tokens=256,    # Moderate response length
    limit_samples=None,    # Full dataset evaluation
    parallelism=4,         # Conservative parallelism
    extra={
        "num_fewshot": 5,  # Standard few-shot count
        "fewshot_seed": 42 # Reproducible examples
    }
)
Creative Tasks (Controlled Randomness)
creative_params = ConfigParams(
    temperature=0.7,       # Moderate creativity
    top_p=0.9,            # Nucleus sampling
    max_new_tokens=512,   # Longer responses
    extra={
        "repetition_penalty": 1.1,  # Reduce repetition
        "do_sample": True          # Enable sampling
    }
)
Code Generation (Balanced)
code_params = ConfigParams(
    temperature=0.2,       # Slight randomness for diversity
    top_p=0.95,           # Selective sampling
    max_new_tokens=1024,  # Sufficient for code solutions
    extra={
        "pass_at_k": [1, 5, 10],      # Multiple success metrics
        "timeout": 10,                # Code execution timeout
        "stop_sequences": ["```", "\\n\\n"]  # Code block terminators
    }
)
Log-Probability Tasks
logprob_params = ConfigParams(
    # No generation parameters needed for log-probability tasks
    limit_samples=100,    # Quick testing
    extra={
        "tokenizer_backend": "huggingface",
        "tokenizer": "/path/to/nemo_tokenizer",
        "trust_remote_code": True
    }
)
High-Throughput Evaluation
performance_params = ConfigParams(
    temperature=0.01,      # Near-deterministic for speed
    parallelism=16,       # High concurrency
    max_retries=5,        # Robust retry policy
    request_timeout=120,  # Generous timeout
    limit_samples=0.1,    # 10% sample for testing
    extra={
        "batch_size": 8,          # Batch requests if supported
        "cache_requests": True    # Enable caching
    }
)

Parameter Selection Guidelines#

By Evaluation Type#

Text Generation Tasks:

  • Use temperature=0.01 for near-deterministic, reproducible results (most endpoints don’t support exactly 0.0)

  • Set appropriate max_new_tokens based on expected response length

  • Configure parallelism based on server capacity

Log-Probability Tasks:

  • Always specify tokenizer and tokenizer_backend in extra

  • Generation parameters (temperature, top_p) are not used

  • Focus on tokenizer configuration accuracy

Code Generation Tasks:

  • Use moderate temperature (0.1-0.3) for diversity without excessive randomness

  • Set higher max_new_tokens (1024+) for complete solutions

  • Configure timeout and pass_at_k in extra

Safety Evaluation:

  • Use appropriate probes and detectors in extra

  • Consider multiple generations per prompt

  • Use chat endpoints for instruction-following safety tests

By Resource Constraints#

Limited Compute:

  • Reduce parallelism to 1-4

  • Use limit_samples for subset evaluation

  • Increase request_timeout for slower responses

High-Performance Clusters:

  • Increase parallelism to 16-32

  • Enable request batching in extra if supported

  • Use full dataset evaluation (limit_samples=None)

Development/Testing:

  • Use limit_samples=10-100 for quick validation

  • Set temperature=0.01 for consistent results

  • Enable verbose logging in extra if available

Common Configuration Errors#

Tokenizer Issues#

Problem: Missing tokenizer for log-probability tasks.

# Incorrect - missing tokenizer
params = ConfigParams(extra={})

Solution: Always specify the tokenizer for log-probability tasks.

# Correct
params = ConfigParams(
    extra={
        "tokenizer_backend": "huggingface",
        "tokenizer": "/path/to/nemo_tokenizer"
    }
)

Performance Issues#

Problem: Excessive parallelism overwhelming the server.

# Incorrect - too many concurrent requests
params = ConfigParams(parallelism=100)

Solution: Start conservative and scale up.

# Correct - reasonable concurrency
params = ConfigParams(parallelism=8, max_retries=3)

Parameter Conflicts#

Problem: Mixing generation and log-probability parameters.

# Incorrect - generation params unused for log-probability
params = ConfigParams(
    temperature=0.7,  # Ignored for log-probability tasks
    extra={"tokenizer": "/path"}
)

Solution: Use appropriate parameters for the task type.

# Correct - only relevant parameters
params = ConfigParams(
    limit_samples=100,  # Relevant for all tasks
    extra={"tokenizer": "/path"}  # Required for log-probability
)

Best Practices#

Development Workflow#

  1. Start Small: Use limit_samples=10 for initial validation

  2. Test Configuration: Verify parameters work before full evaluation

  3. Monitor Resources: Check memory and compute usage during evaluation

  4. Document Settings: Record successful configurations for reproducibility

Production Evaluation#

  1. Deterministic Settings: Use temperature=0.01 for consistent results

  2. Full Datasets: Remove limit_samples for complete evaluation

  3. Robust Configuration: Set appropriate retries and timeouts

  4. Resource Planning: Scale parallelism based on available infrastructure

Parameter Tuning#

  1. Task-Appropriate: Match parameters to evaluation methodology

  2. Incremental Changes: Adjust one parameter at a time

  3. Baseline Comparison: Compare against known good configurations

  4. Performance Monitoring: Track evaluation speed and resource usage

Next Steps#