Custom Task Evaluation#
Advanced guide for evaluating models on tasks without pre-defined configurations using custom benchmark definitions and configuration patterns.
Overview#
While NeMo Evaluator provides pre-configured tasks for common benchmarks, you may need to evaluate models on:
Research Benchmarks: Newly released datasets not yet integrated
Custom Datasets: Proprietary or domain-specific evaluation data
Task Variants: Modified versions of existing benchmarks with different settings
Specialized Configurations: Tasks requiring specific parameters or tokenizers
This guide demonstrates how to configure custom evaluations across multiple harnesses, along with common configuration and optimization patterns.
When to Use Custom Tasks#
Choose Custom Tasks When:
Your target benchmark lacks a pre-defined configuration
You need specific few-shot settings different from defaults
Research requires non-standard evaluation parameters
You are evaluating on proprietary or modified datasets
Use Pre-Defined Tasks When:
Standard benchmarks with optimal settings (refer to Text Generation Evaluation)
Quick prototyping and baseline comparisons
Following established evaluation protocols
Task Specification Format#
Custom tasks require explicit harness specification using the format:
"<harness_name>.<task_name>"
Examples:
"lm-evaluation-harness.lambada_openai"
- LM-Eval harness task"simple-evals.humaneval"
- Simple-Evals harness task"bigcode-evaluation-harness.humaneval"
- BigCode harness task
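As a minimal illustration, the harness-qualified string is passed directly as the evaluation type; the output directory below is an arbitrary placeholder:
from nemo_evaluator.api.api_dataclasses import EvaluationConfig

# The harness-qualified task name becomes the evaluation type
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.lambada_openai",
    output_dir="./results"  # placeholder output location
)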
Note
These examples demonstrate accessing tasks from upstream evaluation harnesses. Pre-configured tasks with optimized settings are available through the launcher CLI (nemo-evaluator-launcher ls tasks). Custom task configuration is useful when you need non-standard parameters or when evaluating tasks not yet integrated into the pre-configured catalog.
lambada_openai (Log-Probability Task)#
The lambada_openai task evaluates reading comprehension using log-probabilities.
Install the LM-Eval harness:
pip install nvidia-lm-eval
Deploy your model:
python deploy.py
Configure and run the evaluation:
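A minimal configuration sketch follows, mirroring the client API used later in this guide; the endpoint URL, model ID, tokenizer path, and output directory are placeholders to adapt to your deployment:
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, ConfigParams, EndpointType, EvaluationConfig, EvaluationTarget
)
from nemo_evaluator.core.evaluate import evaluate

# Target the deployed completions endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",
        type=EndpointType.COMPLETIONS,
        model_id="megatron_model"
    )
)

# Configure the lambada_openai log-probability task
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.lambada_openai",
    params=ConfigParams(
        limit_samples=10,  # Quick test; remove for full evaluation
        extra={
            "tokenizer_backend": "huggingface",
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer"
        }
    ),
    output_dir="./lambada-results"  # placeholder output location
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)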
Key Configuration Notes:
Uses log-probabilities for evaluation (refer to Log-Probability Evaluation)
Requires tokenizer configuration for proper probability calculation
limit_samples=10 used for quick testing (remove for full evaluation)
Additional LM-Eval Tasks#
You can access additional tasks from the LM Evaluation Harness that may not have pre-defined configurations. For example, to evaluate perplexity or other log-probability tasks:
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, ConfigParams, EndpointType, EvaluationConfig, EvaluationTarget
)
from nemo_evaluator.core.evaluate import evaluate

# Configure evaluation for any lm-evaluation-harness task
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",
        type=EndpointType.COMPLETIONS,
        model_id="megatron_model"
    )
)

# Example: Using a custom task from lm-evaluation-harness
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.<task_name>",
    params=ConfigParams(
        extra={
            "tokenizer_backend": "huggingface",
            "tokenizer": "/checkpoints/llama-3_2-1b-instruct_v2.0/context/nemo_tokenizer"
        }
    ),
    output_dir="./custom-task-results"
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)
Note
Replace <task_name> with any task available in the upstream LM Evaluation Harness. Not all upstream tasks have been tested or pre-configured. For pre-configured tasks, refer to Log-Probability Evaluation and Text Generation Evaluation.
HumanEval (Code Generation)#
Evaluate code generation capabilities using HumanEval:
# Install simple-evals framework
pip install nvidia-simple-evals
# Configure HumanEval evaluation
eval_config = EvaluationConfig(
    type="simple-evals.humaneval",
    params=ConfigParams(
        temperature=0.2,      # Slight randomness for code diversity
        max_new_tokens=512,   # Sufficient for code solutions
        limit_samples=20,     # Test subset
        extra={
            "pass_at_k": [1, 5, 10],  # Evaluate pass@1, pass@5, pass@10
            "timeout": 10             # Code execution timeout
        }
    ),
    output_dir="./humaneval-results"
)
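The snippet above configures only the task. A possible sketch of the accompanying target and run call follows, assuming your instruction-tuned model is served behind an OpenAI-compatible chat endpoint; the chat URL and the CHAT endpoint type here are assumptions made by analogy with the completions example earlier, so adjust them to your deployment:
# Assumed chat-style target for an instruction-tuned model; adjust URL and model_id
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/chat/completions/",  # assumed chat endpoint URL
        type=EndpointType.CHAT,                          # assumption: chat endpoint type
        model_id="megatron_model"
    )
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)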
Key Configuration Notes:
Uses chat endpoint for instruction-tuned models
Requires code execution environment
pass_at_k metrics measure success rates
For additional code generation tasks, refer to Code Generation Evaluation.
Advanced Configuration Patterns#
Custom Few-Shot Configuration
# Configure custom few-shot settings
params = ConfigParams(
    limit_samples=100,
    extra={
        "num_fewshot": 5,             # Number of examples in prompt
        "fewshot_delimiter": "\n\n",  # Separator between examples
        "fewshot_seed": 42,           # Reproducible example selection
        "description": "Answer the following question:",  # Custom prompt prefix
    }
)
Performance Optimization
# Optimize for high-throughput evaluation
params = ConfigParams(
    parallelism=16,       # Concurrent request threads
    max_retries=5,        # Retry failed requests
    request_timeout=120,  # Timeout per request (seconds)
    temperature=0,        # Deterministic for reproducibility
    extra={
        "batch_size": 8,        # Requests per batch (if supported)
        "cache_requests": True  # Enable request caching
    }
)
Custom Tokenizer Configuration
# Configure task-specific tokenizers
params = ConfigParams(
    extra={
        # Hugging Face tokenizer
        "tokenizer_backend": "huggingface",
        "tokenizer": "/path/to/nemo_tokenizer",
        # Alternative: Direct tokenizer specification
        "tokenizer_name": "meta-llama/Llama-2-7b-hf",
        "add_bos_token": True,
        "add_eos_token": False,
        # Trust remote code for custom tokenizers
        "trust_remote_code": True
    }
)
Task-Specific Generation Settings
# Configure generation for different task types
# Academic benchmarks (deterministic)
academic_params = ConfigParams(
    temperature=0,
    top_p=1.0,
    max_new_tokens=256,
    extra={"do_sample": False}
)

# Creative tasks (controlled randomness)
creative_params = ConfigParams(
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=512,
    extra={"repetition_penalty": 1.1}
)

# Code generation (balanced)
code_params = ConfigParams(
    temperature=0.2,
    top_p=0.95,
    max_new_tokens=1024,
    extra={"stop_sequences": ["```", "\n\n"]}
)
Configuration Reference#
For comprehensive parameter documentation including universal settings, framework-specific options, and optimization patterns, refer to Evaluation Configuration Parameters.
Key Custom Task Considerations#
When configuring custom tasks, pay special attention to:
Tokenizer Requirements: Log-probability tasks require tokenizer and tokenizer_backend in extra
Framework-Specific Parameters: Each harness supports different parameters in the extra dictionary
Performance Tuning: Adjust parallelism and timeout settings based on task complexity
Reproducibility: Use temperature=0 and set fewshot_seed for consistent results
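Putting these considerations together, one possible configuration for a reproducible log-probability run might look like the following sketch; the tokenizer path and parameter values are placeholders, not recommended defaults:
# Reproducible log-probability configuration combining the points above
params = ConfigParams(
    temperature=0,   # Deterministic outputs
    parallelism=8,   # Tune to task complexity and endpoint capacity
    extra={
        "tokenizer_backend": "huggingface",      # Required for log-probability tasks
        "tokenizer": "/path/to/nemo_tokenizer",  # Placeholder tokenizer path
        "num_fewshot": 5,
        "fewshot_seed": 42                       # Consistent few-shot selection
    }
)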