Evaluation Configuration Parameters#

Comprehensive reference for configuring evaluation tasks in NeMo Evaluator, covering universal parameters, framework-specific settings, and optimization patterns.

Quick Navigation

  • Looking for task-specific guides?

  • Looking for available benchmarks?

  • Need help getting started?

Overview#

All evaluation tasks in NeMo Evaluator use the ConfigParams class for configuration. This provides a consistent interface across different evaluation harnesses while allowing framework-specific customization through the extra parameter.

from nemo_evaluator.api.api_dataclasses import ConfigParams

# Basic configuration
params = ConfigParams(
    temperature=0,
    top_p=1.0,
    max_new_tokens=256,
    limit_samples=100
)

# Advanced configuration with framework-specific parameters
params = ConfigParams(
    temperature=0,
    parallelism=8,
    extra={
        "num_fewshot": 5,
        "tokenizer": "/path/to/tokenizer",
        "custom_prompt": "Answer the question:"
    }
)
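
In a full run, ConfigParams is typically embedded in a broader evaluation configuration. The sketch below shows that wiring under the assumption that the EvaluationConfig, EvaluationTarget, and ApiEndpoint dataclasses and the evaluate() entry point are available from the NeMo Evaluator API; field and argument names may differ across versions, so treat this as illustrative rather than a definitive call signature.

# Illustrative sketch: wiring ConfigParams into an evaluation run.
# EvaluationConfig, EvaluationTarget, ApiEndpoint, and evaluate() are assumed
# here; check your installed NeMo Evaluator version for the exact fields.
from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EvaluationConfig,
    EvaluationTarget,
)

params = ConfigParams(temperature=0, top_p=1.0, max_new_tokens=256, limit_samples=100)

eval_config = EvaluationConfig(
    type="mmlu",             # benchmark/task identifier
    params=params,
    output_dir="./results",
)

target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://localhost:8080/v1/completions",  # your model endpoint
        model_id="my-model",
    )
)

evaluate(eval_cfg=eval_config, target_cfg=target)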

Universal Parameters#

These parameters are available for all evaluation tasks regardless of the underlying harness or benchmark.

Core Generation Parameters#

| Parameter | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| temperature | float | Sampling randomness | 0 (deterministic), 0.7 (creative) | Use 0 for reproducible results; some endpoints require a small positive value such as 0.01 |
| top_p | float | Nucleus sampling threshold | 1.0 (disabled), 0.9 (selective) | Controls diversity of generated text |
| max_new_tokens | int | Maximum response length | 256, 512, 1024 | Limits generation length |

Evaluation Control Parameters#

| Parameter | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| limit_samples | int/float | Evaluation subset size | 100 (count), 0.1 (10% of dataset) | Use for quick testing or resource limits |
| task | str | Task-specific identifier | "custom_task" | Used by some harnesses for task routing |
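
Because limit_samples accepts either an integer count or a float fraction, the same parameter covers both quick smoke tests and proportional subsets:

# Absolute count: evaluate exactly 100 examples
smoke_test = ConfigParams(temperature=0, limit_samples=100)

# Fraction: evaluate 10% of the dataset
subset_run = ConfigParams(temperature=0, limit_samples=0.1)

# Full dataset: leave limit_samples unset
full_run = ConfigParams(temperature=0)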

Performance Parameters#

| Parameter | Type | Description | Example Values | Notes |
|---|---|---|---|---|
| parallelism | int | Concurrent request threads | 1, 8, 16 | Balance against server capacity |
| max_retries | int | Retry attempts for failed requests | 3, 5, 10 | Increases robustness for network issues |
| request_timeout | int | Request timeout (seconds) | 60, 120, 300 | Adjust for model response time |
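
A reasonable starting point for a shared or capacity-limited endpoint combines modest concurrency with a generous retry and timeout budget; scale parallelism up only after confirming the server keeps pace:

# Conservative performance settings; increase parallelism once the endpoint is stable
steady_params = ConfigParams(
    parallelism=4,        # concurrent request threads
    max_retries=5,        # retry transient failures
    request_timeout=300,  # seconds to wait for each response
)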

Framework-Specific Parameters#

Framework-specific parameters are passed through the extra dictionary within ConfigParams.

LM-Evaluation-Harness Parameters

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| num_fewshot | int | Few-shot examples count | 0, 5, 25 | Academic benchmarks |
| tokenizer | str | Tokenizer path | "/path/to/tokenizer" | Log-probability tasks |
| tokenizer_backend | str | Tokenizer implementation | "huggingface", "sentencepiece" | Custom tokenizer setups |
| trust_remote_code | bool | Allow remote code execution | True, False | For custom tokenizers |
| add_bos_token | bool | Add beginning-of-sequence token | True, False | Model-specific formatting |
| add_eos_token | bool | Add end-of-sequence token | True, False | Model-specific formatting |
| fewshot_delimiter | str | Separator between examples | "\n\n", "\n---\n" | Custom prompt formatting |
| fewshot_seed | int | Reproducible example selection | 42, 1337 | Ensures consistent few-shot examples |
| description | str | Custom prompt prefix | "Answer the question:" | Task-specific instructions |
| bootstrap_iters | int | Statistical bootstrap iterations | 1000, 10000 | For confidence intervals |
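
These options are combined through extra. The sketch below is illustrative; the specific values (shot count, seed, prompt prefix) are placeholders rather than recommendations for any particular benchmark:

# Illustrative lm-evaluation-harness settings passed through `extra`
lm_eval_params = ConfigParams(
    temperature=0.01,      # near-deterministic generation
    max_new_tokens=256,
    extra={
        "num_fewshot": 5,                       # 5-shot prompting
        "fewshot_seed": 42,                     # reproducible example selection
        "fewshot_delimiter": "\n\n",            # separator between examples
        "description": "Answer the question:",  # custom prompt prefix
        "bootstrap_iters": 1000,                 # bootstrap iterations for confidence intervals
    },
)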

Simple-Evals Parameters

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| pass_at_k | list[int] | Code evaluation metrics | [1, 5, 10] | Code generation tasks |
| timeout | int | Code execution timeout | 5, 10, 30 | Code generation tasks |
| max_workers | int | Parallel execution workers | 4, 8, 16 | Code execution parallelism |
| languages | list[str] | Target programming languages | ["python", "java", "cpp"] | Multi-language evaluation |
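
A typical code-generation setup passes the execution and scoring options through extra; the values shown here are placeholders:

# Illustrative simple-evals code-generation settings
simple_evals_params = ConfigParams(
    temperature=0.2,
    max_new_tokens=1024,
    extra={
        "pass_at_k": [1, 5, 10],  # report pass@1, pass@5, and pass@10
        "timeout": 10,            # per-sample code execution timeout (seconds)
        "max_workers": 8,         # parallel execution workers
        "languages": ["python"],  # target language(s)
    },
)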

BigCode-Evaluation-Harness Parameters

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| num_workers | int | Parallel execution workers | 4, 8, 16 | Code execution parallelism |
| eval_metric | str | Evaluation metric | "pass_at_k", "bleu" | Different scoring methods |
| languages | list[str] | Programming languages | ["python", "javascript"] | Language-specific evaluation |
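
The equivalent BigCode-Evaluation-Harness configuration uses its own parameter names (num_workers, eval_metric) inside extra; the values are again placeholders:

# Illustrative BigCode-Evaluation-Harness settings
bigcode_params = ConfigParams(
    temperature=0.2,
    max_new_tokens=1024,
    extra={
        "num_workers": 8,                       # parallel code-execution workers
        "eval_metric": "pass_at_k",             # scoring method
        "languages": ["python", "javascript"],  # languages to evaluate
    },
)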

Safety and Specialized Harnesses

| Parameter | Type | Description | Example Values | Use Cases |
|---|---|---|---|---|
| probes | str | Garak security probes | "ansiescape.AnsiEscaped" | Security evaluation |
| detectors | str | Garak security detectors | "base.TriggerListDetector" | Security evaluation |
| generations | int | Number of generations per prompt | 1, 5, 10 | Safety evaluation |
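
For a Garak-style security run, the probe, detector, and generation count all travel through extra; the probe and detector names below come from the examples above and should be swapped for the ones relevant to your evaluation:

# Illustrative Garak security-evaluation settings
safety_params = ConfigParams(
    parallelism=4,
    extra={
        "probes": "ansiescape.AnsiEscaped",       # security probe to run
        "detectors": "base.TriggerListDetector",  # detector used to score outputs
        "generations": 5,                         # generations per prompt
    },
)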

Configuration Patterns#

Academic Benchmarks (Deterministic)
academic_params = ConfigParams(
    temperature=0.01,      # Near-deterministic generation (0.0 not supported by all endpoints)
    top_p=1.0,             # No nucleus sampling
    max_new_tokens=256,    # Moderate response length
    limit_samples=None,    # Full dataset evaluation
    parallelism=4,         # Conservative parallelism
    extra={
        "num_fewshot": 5,  # Standard few-shot count
        "fewshot_seed": 42 # Reproducible examples
    }
)
Creative Tasks (Controlled Randomness)
creative_params = ConfigParams(
    temperature=0.7,       # Moderate creativity
    top_p=0.9,            # Nucleus sampling
    max_new_tokens=512,   # Longer responses
    extra={
        "repetition_penalty": 1.1,  # Reduce repetition
        "do_sample": True          # Enable sampling
    }
)
Code Generation (Balanced)
code_params = ConfigParams(
    temperature=0.2,       # Slight randomness for diversity
    top_p=0.95,           # Selective sampling
    max_new_tokens=1024,  # Sufficient for code solutions
    extra={
        "pass_at_k": [1, 5, 10],      # Multiple success metrics
        "timeout": 10,                # Code execution timeout
        "stop_sequences": ["```", "\\n\\n"]  # Code block terminators
    }
)
Log-Probability Tasks
logprob_params = ConfigParams(
    # No generation parameters needed for log-probability tasks
    limit_samples=100,    # Quick testing
    extra={
        "tokenizer_backend": "huggingface",
        "tokenizer": "/path/to/nemo_tokenizer",
        "trust_remote_code": True
    }
)
High-Throughput Evaluation
performance_params = ConfigParams(
    temperature=0.01,      # Near-deterministic for speed
    parallelism=16,       # High concurrency
    max_retries=5,        # Robust retry policy
    request_timeout=120,  # Generous timeout
    limit_samples=0.1,    # 10% sample for testing
    extra={
        "batch_size": 8,          # Batch requests if supported
        "cache_requests": True    # Enable caching
    }
)

Parameter Selection Guidelines#

By Evaluation Type#

Text Generation Tasks:

  • Use temperature=0.01 for near-deterministic, reproducible results (most endpoints don’t support exactly 0.0)

  • Set appropriate max_new_tokens based on expected response length

  • Configure parallelism based on server capacity

Log-Probability Tasks:

  • Always specify tokenizer and tokenizer_backend in extra

  • Generation parameters (temperature, top_p) are not used

  • Focus on tokenizer configuration accuracy

Code Generation Tasks:

  • Use moderate temperature (0.1-0.3) for diversity without excessive randomness

  • Set higher max_new_tokens (1024+) for complete solutions

  • Configure timeout and pass_at_k in extra

Safety Evaluation:

  • Use appropriate probes and detectors in extra

  • Consider multiple generations per prompt

  • Use chat endpoints for instruction-following safety tests

By Resource Constraints#

Limited Compute:

  • Reduce parallelism to 1-4

  • Use limit_samples for subset evaluation

  • Increase request_timeout for slower responses

High-Performance Clusters:

  • Increase parallelism to 16-32

  • Enable request batching in extra if supported

  • Use full dataset evaluation (limit_samples=None)

Development/Testing:

  • Use limit_samples=10-100 for quick validation

  • Set temperature=0.01 for consistent results

  • Enable verbose logging in extra if available

Common Configuration Errors#

Tokenizer Issues#

Problem: Missing tokenizer for log-probability tasks.

# Incorrect - missing tokenizer
params = ConfigParams(extra={})

Solution: Always specify the tokenizer for log-probability tasks.

# Correct
params = ConfigParams(
    extra={
        "tokenizer_backend": "huggingface",
        "tokenizer": "/path/to/nemo_tokenizer"
    }
)

Performance Issues#

Problem: Excessive parallelism overwhelming the server.

# Incorrect - too many concurrent requests
params = ConfigParams(parallelism=100)

Solution: Start conservative and scale up.

# Correct - reasonable concurrency
params = ConfigParams(parallelism=8, max_retries=3)

Parameter Conflicts#

Problem: Mixing generation and log-probability parameters.

# Incorrect - generation params unused for log-probability
params = ConfigParams(
    temperature=0.7,  # Ignored for log-probability tasks
    extra={"tokenizer": "/path"}
)

Solution: Use appropriate parameters for the task type.

# Correct - only relevant parameters
params = ConfigParams(
    limit_samples=100,  # Relevant for all tasks
    extra={"tokenizer": "/path"}  # Required for log-probability
)

Best Practices#

Development Workflow#

  1. Start Small: Use limit_samples=10 for initial validation

  2. Test Configuration: Verify parameters work before full evaluation

  3. Monitor Resources: Check memory and compute usage during evaluation

  4. Document Settings: Record successful configurations for reproducibility

Production Evaluation#

  1. Deterministic Settings: Use temperature=0.01 for consistent results

  2. Full Datasets: Remove limit_samples for complete evaluation

  3. Robust Configuration: Set appropriate retries and timeouts

  4. Resource Planning: Scale parallelism based on available infrastructure

Parameter Tuning#

  1. Task-Appropriate: Match parameters to evaluation methodology

  2. Incremental Changes: Adjust one parameter at a time

  3. Baseline Comparison: Compare against known good configurations

  4. Performance Monitoring: Track evaluation speed and resource usage

Next Steps#