Evaluation Configuration#
Evaluation configuration defines which benchmarks to run and how to run them. It is common to all executors and can be reused between them to launch the exact same set of tasks.
Important: Each task has its own default values that you can override. For comprehensive override options, see Parameter Overrides.
Configuration Structure#
```yaml
evaluation:
  overrides:                 # Global overrides for all tasks
    config.params.request_timeout: 3600
  tasks:
    - name: task_name        # Use default benchmark configuration
    - name: another_task
      overrides:             # Task-specific overrides
        config.params.temperature: 0.6
        config.params.top_p: 0.95
      env_vars:              # Task-specific environment variables
        HF_TOKEN: MY_HF_TOKEN
```
Key Components#
Global Overrides#
- `overrides`: Parameter overrides that apply to all tasks
- `env_vars`: Environment variables that apply to all tasks
Task Configuration#
- `tasks`: List of evaluation tasks to run
- `name`: Name of the benchmark task
- `overrides`: Task-specific parameter overrides
- `env_vars`: Task-specific environment variables
For a comprehensive list of available tasks, their descriptions, and task-specific parameters, see NeMo Evaluator Containers.
Advanced Task Configuration#
Parameter Overrides#
The overrides system is crucial for leveraging the full flexibility of the common endpoint interceptors and task configuration layer. This is where nemo-evaluator intersects with nemo-evaluator-launcher, providing a unified configuration interface.
Global Overrides#
Settings applied to all tasks listed in the config.
```yaml
evaluation:
  overrides:
    config.params.request_timeout: 3600
    config.params.temperature: 0.7
```
Task-Specific Overrides#
Parameters passed to a job for a single task. They take precedence over global evaluation settings.
```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      overrides:
        config.params.temperature: 0.6
        config.params.top_p: 0.95
        config.params.max_new_tokens: 8192
        config.params.parallelism: 32
    - name: mbpp
      overrides:
        config.params.temperature: 0.2
        config.params.top_p: 0.95
        config.params.max_new_tokens: 2048
        config.params.extra.n_samples: 5
```
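When a parameter appears at both levels, the task-specific value wins. In this sketch (task names reused from the example above), `gpqa_diamond` runs with temperature 0.6 while every other task inherits 0.7 from the global overrides:

```yaml
evaluation:
  overrides:
    config.params.temperature: 0.7      # default for all tasks
  tasks:
    - name: gpqa_diamond
      overrides:
        config.params.temperature: 0.6  # overrides the global 0.7 for this task only
    - name: mbpp                        # inherits temperature 0.7
```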
Environment Variables#
Task-specific environment variables. These variables are set for a single job and don’t affect other tasks:
```yaml
evaluation:
  tasks:
    - name: task_name1
      # HF_TOKEN and CUSTOM_VAR are available for task_name1
      env_vars:
        HF_TOKEN: MY_HF_TOKEN
        CUSTOM_VAR: CUSTOM_VALUE
    - name: task_name2   # HF_TOKEN and CUSTOM_VAR are not set for task_name2
```
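As noted under Key Components, `env_vars` can also be declared globally so that every task receives the same variables. A minimal sketch, assuming global `env_vars` sit under `evaluation:` alongside `overrides` (the token name is a placeholder):

```yaml
evaluation:
  env_vars:
    HF_TOKEN: MY_HF_TOKEN   # available to all tasks below
  tasks:
    - name: task_name1
    - name: task_name2
```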
When to Use#
Use evaluation configuration when you want to:
- **Change Default Sampling Parameters**: Adjust temperature, top_p, max_new_tokens for different tasks
- **Change Default Task Values**: Override benchmark-specific default configurations
- **Configure Task-Specific Parameters**: Set custom parameters for individual benchmarks (e.g., n_samples for code generation tasks)
- **Debug and Test**: Launch with limited samples for validation
- **Adjust Endpoint Capabilities**: Configure request timeouts, max retries, and parallel request limits
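For example, a quick debug run might cap the number of samples per task and shorten the request timeout. This sketch assumes a `limit_samples` parameter; check the Parameter Overrides guide for the exact names available in your version:

```yaml
evaluation:
  overrides:
    config.params.limit_samples: 10    # assumed parameter name; run only a few samples per task
    config.params.request_timeout: 600
  tasks:
    - name: gpqa_diamond
```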
Tip
For overriding long strings, use YAML multiline syntax with `>-`:

```yaml
config.params.extra.custom_field: >-
  This is a long string that spans multiple lines
  and will be passed as a single value with spaces
  replacing the newlines.
```
This preserves formatting and allows for complex multi-line configurations.
Reference#
- **Parameter Overrides**: Parameter Overrides - Complete guide to available parameters and override syntax
- **Adapter Configuration**: For advanced request/response modification (system prompts, payload modification, reasoning handling), see Interceptors
- **Task Configuration**: NeMo Evaluator - Complete nemo-evaluator documentation
- **Available Tasks**: NeMo Evaluator Containers - Browse all available evaluation tasks and benchmarks