Evaluation Configuration Parameters#
Comprehensive reference for configuring evaluation tasks in NeMo Evaluator, covering universal parameters, framework-specific settings, and optimization patterns.
Quick Navigation
Looking for task-specific guides?

- Text Generation Evaluation
- Log-Probability Evaluation
- Code Generation Evaluation
- Safety and Security Evaluation

Looking for available benchmarks?

- Benchmark Catalog - Browse available benchmarks by category

Need help getting started?

- About Evaluation - Overview of evaluation workflows
- Run Evaluations - Step-by-step evaluation guides
Overview#
All evaluation tasks in NeMo Evaluator use the `ConfigParams` class for configuration. This provides a consistent interface across different evaluation harnesses while allowing framework-specific customization through the `extra` parameter.

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

# Basic configuration
params = ConfigParams(
    temperature=0,
    top_p=1.0,
    max_new_tokens=256,
    limit_samples=100
)

# Advanced configuration with framework-specific parameters
params = ConfigParams(
    temperature=0,
    parallelism=8,
    extra={
        "num_fewshot": 5,
        "tokenizer": "/path/to/tokenizer",
        "custom_prompt": "Answer the question:"
    }
)
```
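
The `params` object is then attached to an evaluation configuration and target when launching a run. The sketch below shows how the pieces typically fit together; apart from `ConfigParams`, the names used here (`evaluate`, `EvaluationConfig`, `EvaluationTarget`, `ApiEndpoint`, `EndpointType`, and their fields) are assumptions based on common NeMo Evaluator usage — verify them against your installed version.

```python
# Minimal sketch of where ConfigParams fits into a full evaluation run.
# NOTE: everything except ConfigParams is assumed API surface -- check the
# import paths and field names for your NeMo Evaluator version.
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)
from nemo_evaluator.core.evaluate import evaluate  # assumed entry point

params = ConfigParams(temperature=0.01, max_new_tokens=256, parallelism=4)

config = EvaluationConfig(
    type="mmlu",             # benchmark/task identifier (illustrative)
    output_dir="./results",  # where results are written
    params=params,
)
target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://localhost:8080/v1/completions",  # your serving endpoint
        model_id="my-model",                          # illustrative model id
        type=EndpointType.COMPLETIONS,
    )
)

evaluate(eval_cfg=config, target_cfg=target)
```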
Universal Parameters#
These parameters are available for all evaluation tasks regardless of the underlying harness or benchmark.
Core Generation Parameters#

| Parameter | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| `temperature` | float | Sampling randomness | `0.01`, `0.7` | Use `0.01` for near-deterministic, reproducible results |
| `top_p` | float | Nucleus sampling threshold | `0.9`, `1.0` | Controls diversity of generated text |
| `max_new_tokens` | int | Maximum response length | `256`, `1024` | Limits generation length |
Evaluation Control Parameters#

| Parameter | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| `limit_samples` | int or float | Evaluation subset size | `100`, `0.1` | Use for quick testing or resource limits |
| `task` | str | Task-specific identifier | | Used by some harnesses for task routing |
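
As used in the configuration patterns later on this page, `limit_samples` accepts either an absolute sample count or a fraction of the dataset; leaving it as `None` evaluates the full dataset:

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

quick_check = ConfigParams(limit_samples=100)   # first 100 samples only
partial_run = ConfigParams(limit_samples=0.1)   # roughly 10% of the dataset
full_run = ConfigParams(limit_samples=None)     # complete dataset
```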
Performance Parameters#

| Parameter | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| `parallelism` | int | Concurrent request threads | `4`, `8`, `16` | Balance against server capacity |
| `max_retries` | int | Retry attempts for failed requests | `3`, `5` | Increases robustness for network issues |
| `request_timeout` | int | Request timeout (seconds) | `120` | Adjust for model response time |
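
For example, a configuration that keeps concurrency moderate while adding headroom for slow or flaky endpoints (values taken from the patterns later on this page):

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

resilient_params = ConfigParams(
    parallelism=8,        # concurrent request threads
    max_retries=5,        # retry transiently failing requests
    request_timeout=120,  # seconds to wait for each response
)
```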
Framework-Specific Parameters#
Framework-specific parameters are passed through the `extra` dictionary within `ConfigParams`.
LM-Evaluation-Harness Parameters

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| `num_fewshot` | int | Few-shot examples count | `5` | Academic benchmarks |
| `tokenizer` | str | Tokenizer path | `"/path/to/tokenizer"` | Log-probability tasks |
| `tokenizer_backend` | str | Tokenizer implementation | `"huggingface"` | Custom tokenizer setups |
| `trust_remote_code` | bool | Allow remote code execution | `True` | For custom tokenizers |
| `add_bos_token` | bool | Add beginning-of-sequence token | `True` | Model-specific formatting |
| `add_eos_token` | bool | Add end-of-sequence token | `False` | Model-specific formatting |
| `fewshot_delimiter` | str | Separator between examples | `"\n\n"` | Custom prompt formatting |
| `fewshot_seed` | int | Reproducible example selection | `42` | Ensures consistent few-shot examples |
| `custom_prompt` | str | Custom prompt prefix | `"Answer the question:"` | Task-specific instructions |
| `bootstrap_iters` | int | Statistical bootstrap iterations | | For confidence intervals |
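
For example, a few-shot academic benchmark run typically passes the harness keys above through `extra`; the `bootstrap_iters` value below is illustrative:

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

lm_eval_params = ConfigParams(
    temperature=0.01,
    max_new_tokens=256,
    extra={
        "num_fewshot": 5,         # few-shot examples per prompt
        "fewshot_seed": 42,       # reproducible few-shot selection
        "bootstrap_iters": 1000,  # bootstrap iterations (illustrative value)
    },
)
```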
Simple-Evals Parameters

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| `pass_at_k` | list | Code evaluation metrics | `[1, 5, 10]` | Code generation tasks |
| `timeout` | int | Code execution timeout | `10` | Code generation tasks |
| | int | Parallel execution workers | | Code execution parallelism |
| | list | Target programming languages | | Multi-language evaluation |
BigCode-Evaluation-Harness Parameters

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| | int | Parallel execution workers | | Code execution parallelism |
| | str | Evaluation metric | | Different scoring methods |
| | list | Programming languages | | Language-specific evaluation |
Safety and Specialized Harnesses

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| `probes` | str | Garak security probes | | Security evaluation |
| `detectors` | str | Garak security detectors | | Security evaluation |
| `generations` | int | Number of generations per prompt | | Safety evaluation |
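
A hedged sketch of a Garak-style safety configuration follows; the probe and detector names are placeholders, so substitute the probes and detectors your Garak version provides:

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

# Safety evaluation sketch: probe/detector names are illustrative placeholders
safety_params = ConfigParams(
    parallelism=4,
    extra={
        "probes": "<garak_probe_name>",        # placeholder probe name
        "detectors": "<garak_detector_name>",  # placeholder detector name
        "generations": 5,                      # responses generated per prompt
    },
)
```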
Configuration Patterns#
Academic Benchmarks (Deterministic)

```python
academic_params = ConfigParams(
    temperature=0.01,    # Near-deterministic generation (0.0 not supported by all endpoints)
    top_p=1.0,           # No nucleus sampling
    max_new_tokens=256,  # Moderate response length
    limit_samples=None,  # Full dataset evaluation
    parallelism=4,       # Conservative parallelism
    extra={
        "num_fewshot": 5,    # Standard few-shot count
        "fewshot_seed": 42   # Reproducible examples
    }
)
```
Creative Tasks (Controlled Randomness)

```python
creative_params = ConfigParams(
    temperature=0.7,     # Moderate creativity
    top_p=0.9,           # Nucleus sampling
    max_new_tokens=512,  # Longer responses
    extra={
        "repetition_penalty": 1.1,  # Reduce repetition
        "do_sample": True           # Enable sampling
    }
)
```
Code Generation (Balanced)

```python
code_params = ConfigParams(
    temperature=0.2,      # Slight randomness for diversity
    top_p=0.95,           # Selective sampling
    max_new_tokens=1024,  # Sufficient for code solutions
    extra={
        "pass_at_k": [1, 5, 10],           # Multiple success metrics
        "timeout": 10,                     # Code execution timeout (seconds)
        "stop_sequences": ["```", "\n\n"]  # Code block terminators
    }
)
```
Log-Probability Tasks

```python
logprob_params = ConfigParams(
    # No generation parameters needed for log-probability tasks
    limit_samples=100,  # Quick testing
    extra={
        "tokenizer_backend": "huggingface",
        "tokenizer": "/path/to/nemo_tokenizer",
        "trust_remote_code": True
    }
)
```
High-Throughput Evaluation

```python
performance_params = ConfigParams(
    temperature=0.01,     # Near-deterministic for speed
    parallelism=16,       # High concurrency
    max_retries=5,        # Robust retry policy
    request_timeout=120,  # Generous timeout
    limit_samples=0.1,    # 10% sample for testing
    extra={
        "batch_size": 8,        # Batch requests if supported
        "cache_requests": True  # Enable caching
    }
)
```
Parameter Selection Guidelines#
By Evaluation Type#
Text Generation Tasks:
- Use `temperature=0.01` for near-deterministic, reproducible results (most endpoints don't support exactly 0.0)
- Set an appropriate `max_new_tokens` based on expected response length
- Configure `parallelism` based on server capacity
Log-Probability Tasks:
- Always specify `tokenizer` and `tokenizer_backend` in `extra`
- Generation parameters (temperature, top_p) are not used
- Focus on tokenizer configuration accuracy
Code Generation Tasks:
- Use a moderate `temperature` (0.1-0.3) for diversity without excessive randomness
- Set a higher `max_new_tokens` (1024+) for complete solutions
- Configure `timeout` and `pass_at_k` in `extra`
Safety Evaluation:
- Use appropriate `probes` and `detectors` in `extra`
- Consider multiple `generations` per prompt
- Use chat endpoints for instruction-following safety tests
By Resource Constraints#
Limited Compute:
- Reduce `parallelism` to 1-4
- Use `limit_samples` for subset evaluation
- Increase `request_timeout` for slower responses
High-Performance Clusters:
- Increase `parallelism` to 16-32
- Enable request batching in `extra` if supported
- Use full dataset evaluation (`limit_samples=None`)
Development/Testing:
- Use a small `limit_samples` (10-100) for quick validation
- Set `temperature=0.01` for consistent results
- Enable verbose logging in `extra` if available
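
Put together, a minimal development-time configuration following these points might look like:

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

# Quick validation run: small subset, near-deterministic output, low concurrency
dev_params = ConfigParams(
    temperature=0.01,  # consistent, comparable results
    limit_samples=10,  # tiny subset for fast iteration
    parallelism=2,     # stay well under server capacity
)
```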
Common Configuration Errors#
Tokenizer Issues#
**Problem**: Missing tokenizer for log-probability tasks

```python
# Incorrect - missing tokenizer
params = ConfigParams(extra={})
```

**Solution**: Always specify a tokenizer for log-probability tasks

```python
# Correct
params = ConfigParams(
    extra={
        "tokenizer_backend": "huggingface",
        "tokenizer": "/path/to/nemo_tokenizer"
    }
)
```
Performance Issues#
**Problem**: Excessive parallelism overwhelming the server

```python
# Incorrect - too many concurrent requests
params = ConfigParams(parallelism=100)
```

**Solution**: Start conservative and scale up

```python
# Correct - reasonable concurrency
params = ConfigParams(parallelism=8, max_retries=3)
```
Parameter Conflicts#
**Problem**: Mixing generation and log-probability parameters

```python
# Incorrect - generation params unused for log-probability
params = ConfigParams(
    temperature=0.7,  # Ignored for log-probability tasks
    extra={"tokenizer": "/path"}
)
```

**Solution**: Use parameters appropriate to the task type

```python
# Correct - only relevant parameters
params = ConfigParams(
    limit_samples=100,            # Relevant for all tasks
    extra={"tokenizer": "/path"}  # Required for log-probability
)
```
Best Practices#
Development Workflow#
1. **Start Small**: Use `limit_samples=10` for initial validation
2. **Test Configuration**: Verify parameters work before running a full evaluation
3. **Monitor Resources**: Check memory and compute usage during evaluation
4. **Document Settings**: Record successful configurations for reproducibility
Production Evaluation#
- **Deterministic Settings**: Use `temperature=0.01` for consistent results
- **Full Datasets**: Remove `limit_samples` for complete evaluation
- **Robust Configuration**: Set appropriate retries and timeouts
- **Resource Planning**: Scale `parallelism` based on available infrastructure
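
For instance, a production-style configuration combining these recommendations might look like:

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

prod_params = ConfigParams(
    temperature=0.01,    # deterministic, comparable results
    limit_samples=None,  # evaluate the complete dataset
    parallelism=16,      # scale to available serving capacity
    max_retries=5,
    request_timeout=120,
)
```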
Parameter Tuning#
- **Task-Appropriate**: Match parameters to the evaluation methodology
- **Incremental Changes**: Adjust one parameter at a time
- **Baseline Comparison**: Compare against known good configurations
- **Performance Monitoring**: Track evaluation speed and resource usage
Next Steps#
- **Basic Usage**: See Text Generation Evaluation for getting started
- **Custom Tasks**: Learn Custom Task Evaluation for specialized evaluations
- **Troubleshooting**: Refer to Troubleshooting for common issues
- **Benchmarks**: Browse the Benchmark Catalog for task-specific recommendations