Custom Task Evaluation#

Advanced guide to evaluating models on tasks that lack pre-defined configurations, using custom benchmark definitions and configuration patterns.

Overview#

While NeMo Evaluator provides pre-configured tasks for common benchmarks, you may need to evaluate models on:

  • Research Benchmarks: Newly released datasets not yet integrated

  • Custom Datasets: Proprietary or domain-specific evaluation data

  • Task Variants: Modified versions of existing benchmarks with different settings

  • Specialized Configurations: Tasks requiring specific parameters or tokenizers

This guide demonstrates how to configure custom evaluations across multiple harnesses and optimization patterns.

When to Use Custom Tasks#

Choose Custom Tasks When:

  • Your target benchmark lacks a pre-defined configuration

  • You need specific few-shot settings different from defaults

  • Research requires non-standard evaluation parameters

  • You are evaluating proprietary or modified datasets

Use Pre-Defined Tasks When:

  • Standard benchmarks with optimal settings (refer to Text Generation Evaluation)

  • Quick prototyping and baseline comparisons

  • Following established evaluation protocols

Task Specification Format#

Custom tasks require explicit harness specification using the format:

"<harness_name>.<task_name>"

Examples:

  • "lm-evaluation-harness.lambada_openai" - LM-Eval harness task

  • "simple-evals.humaneval" - Simple-Evals harness task

  • "bigcode-evaluation-harness.humaneval" - BigCode harness task

Note

These examples demonstrate accessing tasks from upstream evaluation harnesses. Pre-configured tasks with optimized settings are available through the launcher CLI (nemo-evaluator-launcher ls tasks). Custom task configuration is useful when you need non-standard parameters or when evaluating tasks not yet integrated into the pre-configured catalog.
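To check whether a pre-configured task already covers your benchmark, list the catalog with the launcher CLI mentioned above:

# List pre-configured tasks available through the launcher
nemo-evaluator-launcher ls tasks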

lambada_openai (Log-Probability Task)#

The lambada_openai task evaluates reading comprehension using log-probabilities.

Install the LM Evaluation Harness package:

pip install nvidia-lm-eval

  1. Deploy your model:

    python deploy.py
    
  2. Configure and run the evaluation:
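A minimal sketch of this step, following the same pattern as the completions example later in this guide; the endpoint URL, model_id, and tokenizer path are placeholders to adapt to your own deployment and checkpoint:

from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, ConfigParams, EndpointType, EvaluationConfig, EvaluationTarget
)
from nemo_evaluator.core.evaluate import evaluate

# Point the evaluation at the deployed completions endpoint (placeholder URL/model_id)
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",
        type=EndpointType.COMPLETIONS,
        model_id="megatron_model"
    )
)

# lambada_openai scores log-probabilities, so a tokenizer must be configured
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.lambada_openai",
    params=ConfigParams(
        limit_samples=10,  # quick test; remove for the full evaluation
        extra={
            "tokenizer_backend": "huggingface",
            "tokenizer": "/path/to/nemo_tokenizer"  # placeholder tokenizer path
        }
    ),
    output_dir="./lambada-openai-results"
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)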

Key Configuration Notes:

  • Uses log-probabilities for evaluation (refer to Log-Probability Evaluation)

  • Requires tokenizer configuration for proper probability calculation

  • limit_samples=10 used for quick testing (remove for full evaluation)

Additional LM-Eval Tasks#

You can access additional tasks from the LM Evaluation Harness that may not have pre-defined configurations. For example, to evaluate perplexity or other log-probability tasks:

from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, ConfigParams, EndpointType, EvaluationConfig, EvaluationTarget
)
from nemo_evaluator.core.evaluate import evaluate

# Configure evaluation for any lm-evaluation-harness task
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/", 
        type=EndpointType.COMPLETIONS, 
        model_id="megatron_model"
    )
)

# Example: Using a custom task from lm-evaluation-harness
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.<task_name>",
    params=ConfigParams(
        extra={
            "tokenizer_backend": "huggingface",
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer"
        }
    ),
    output_dir="./custom-task-results"
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)

Note

Replace <task_name> with any task available in the upstream LM Evaluation Harness. Not all upstream tasks have been tested or pre-configured. For pre-configured tasks, refer to Log-Probability Evaluation and Text Generation Evaluation.

HumanEval (Code Generation)#

Evaluate code generation capabilities using HumanEval:

Install the Simple-Evals framework:

pip install nvidia-simple-evals

Then configure the evaluation (reusing EvaluationConfig and ConfigParams imported earlier):

# Configure HumanEval evaluation
eval_config = EvaluationConfig(
    type="simple-evals.humaneval",
    params=ConfigParams(
        temperature=0.2,  # Slight randomness for code diversity
        max_new_tokens=512,  # Sufficient for code solutions
        limit_samples=20,  # Test subset
        extra={
            "pass_at_k": [1, 5, 10],  # Evaluate pass@1, pass@5, pass@10
            "timeout": 10  # Code execution timeout
        }
    ),
    output_dir="./humaneval-results"
)

Key Configuration Notes:

  • Uses chat endpoint for instruction-tuned models

  • Requires code execution environment

  • pass_at_k metrics report the fraction of problems solved within k generated samples

For additional code generation tasks, refer to Code Generation Evaluation.
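The notes above assume a chat endpoint for instruction-tuned models; a hedged sketch of wiring the HumanEval configuration to one follows. The chat URL and EndpointType.CHAT value mirror the completions example earlier and are assumptions to adapt to your deployment:

from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget
from nemo_evaluator.core.evaluate import evaluate

# Target a chat endpoint for instruction-tuned models (placeholder URL/model_id)
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/chat/completions/",
        type=EndpointType.CHAT,  # assumed chat endpoint type
        model_id="megatron_model"
    )
)

# Run HumanEval with the eval_config defined above
results = evaluate(target_cfg=target_config, eval_cfg=eval_config)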


Advanced Configuration Patterns#

Custom Few-Shot Configuration

# Configure custom few-shot settings
params = ConfigParams(
    limit_samples=100,
    extra={
        "num_fewshot": 5,  # Number of examples in prompt
        "fewshot_delimiter": "\\n\\n",  # Separator between examples
        "fewshot_seed": 42,  # Reproducible example selection
        "description": "Answer the following question:",  # Custom prompt prefix
    }
)

Performance Optimization

# Optimize for high-throughput evaluation
params = ConfigParams(
    parallelism=16,  # Concurrent request threads
    max_retries=5,   # Retry failed requests
    request_timeout=120,  # Timeout per request (seconds)
    temperature=0,   # Deterministic for reproducibility
    extra={
        "batch_size": 8,  # Requests per batch (if supported)
        "cache_requests": True  # Enable request caching
    }
)

Custom Tokenizer Configuration

# Configure task-specific tokenizers
params = ConfigParams(
    extra={
        # Hugging Face tokenizer
        "tokenizer_backend": "huggingface",
        "tokenizer": "/path/to/nemo_tokenizer",
        
        # Alternative: Direct tokenizer specification
        "tokenizer_name": "meta-llama/Llama-2-7b-hf",
        "add_bos_token": True,
        "add_eos_token": False,
        
        # Trust remote code for custom tokenizers
        "trust_remote_code": True
    }
)

Task-Specific Generation Settings

# Configure generation for different task types

# Academic benchmarks (deterministic)
academic_params = ConfigParams(
    temperature=0,
    top_p=1.0,
    max_new_tokens=256,
    extra={"do_sample": False}
)

# Creative tasks (controlled randomness)  
creative_params = ConfigParams(
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=512,
    extra={"repetition_penalty": 1.1}
)

# Code generation (balanced)
code_params = ConfigParams(
    temperature=0.2,
    top_p=0.95,
    max_new_tokens=1024,
    extra={"stop_sequences": ["```", "\\n\\n"]}
)
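
Any of these presets can then be attached to a custom task in the usual way; a minimal sketch reusing the target configuration and evaluate call from earlier (the task name and output directory are placeholders):

# Attach a parameter preset to a custom task and run it
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.<task_name>",  # replace with your task
    params=academic_params,  # or creative_params / code_params
    output_dir="./custom-task-results"
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)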

Configuration Reference#

For comprehensive parameter documentation including universal settings, framework-specific options, and optimization patterns, refer to Evaluation Configuration Parameters.

Key Custom Task Considerations#

When configuring custom tasks, pay special attention to:

  • Tokenizer Requirements: Log-probability tasks require tokenizer and tokenizer_backend in extra

  • Framework-Specific Parameters: Each harness supports different parameters in the extra dictionary

  • Performance Tuning: Adjust parallelism and timeout settings based on task complexity

  • Reproducibility: Use temperature=0 and set fewshot_seed for consistent results
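
A hedged sketch that pulls these considerations together for a hypothetical log-probability task; the task name, tokenizer path, and tuning values are illustrative only:

# Reproducible configuration for a log-probability task (illustrative values)
params = ConfigParams(
    temperature=0,        # deterministic outputs
    parallelism=8,        # tune to your endpoint's capacity
    request_timeout=120,  # seconds per request
    extra={
        "tokenizer_backend": "huggingface",      # required for log-probability tasks
        "tokenizer": "/path/to/nemo_tokenizer",  # placeholder path
        "fewshot_seed": 42                       # reproducible few-shot selection
    }
)

eval_config = EvaluationConfig(
    type="lm-evaluation-harness.<task_name>",  # replace with your custom task
    params=params,
    output_dir="./custom-task-results"
)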