Evaluation Configuration#
Evaluation configuration defines which benchmarks to run and how each is configured. It is shared by all executors and can be reused between them to launch the exact same tasks.
Important: Each task has its own default values that you can override. For comprehensive override options, see Parameter Overrides.
Configuration Structure#
evaluation:
  nemo_evaluator_config:  # Global overrides for all tasks
    config:
      params:
        request_timeout: 3600
  tasks:
    - name: task_name  # Use default benchmark configuration
    - name: another_task
      nemo_evaluator_config:  # Task-specific overrides
        config:
          params:
            temperature: 0.6
            top_p: 0.95
      env_vars:  # Task-specific environment variables
        HF_TOKEN: $host:HF_TOKEN
Key Components#
Global Overrides#
nemo_evaluator_config: Parameter overrides that apply to all tasks
env_vars: Environment variables that apply to all tasks
Task Configuration#
tasks: List of evaluation tasks to run
name: Name of the benchmark task
nemo_evaluator_config: Task-specific parameter overrides
env_vars: Task-specific environment variables
For a comprehensive list of available tasks, their descriptions, and task-specific parameters, see NeMo Evaluator Containers.
Advanced Task Configuration#
Parameter Overrides#
The overrides system exposes the full flexibility of the common endpoint interceptors and the task configuration layer. This is where nemo-evaluator intersects with nemo-evaluator-launcher, providing a unified configuration interface.
Global Overrides#
Settings applied to all tasks listed in the config.
evaluation:
  nemo_evaluator_config:
    config:
      params:
        request_timeout: 3600
        temperature: 0.7
Task-Specific Overrides#
Parameters passed to a job for a single task. They take precedence over global evaluation settings.
evaluation:
  tasks:
    - name: gpqa_diamond
      nemo_evaluator_config:
        config:
          params:
            temperature: 0.6
            top_p: 0.95
            max_new_tokens: 8192
            parallelism: 32
    - name: mbpp
      nemo_evaluator_config:
        config:
          params:
            temperature: 0.2
            top_p: 0.95
            max_new_tokens: 2048
            extra:
              n_samples: 5
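A task-level parameter always wins over the global value for the same key. As a minimal sketch using only parameters already shown above, gpqa_diamond runs at temperature 0.6 while mbpp inherits the global 0.7:

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.7  # global default for every task
  tasks:
    - name: gpqa_diamond
      nemo_evaluator_config:
        config:
          params:
            temperature: 0.6  # task-level value takes precedence
    - name: mbpp  # inherits temperature: 0.7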
Environment Variables#
Environment variables can be declared at multiple levels. Values at more specific levels override broader ones (last wins):
# 1. Top-level — applies to ALL jobs (deployment + evaluation)
env_vars:
  HF_TOKEN: $host:HF_TOKEN
  CACHE_DIR: $lit:/cache/huggingface

# 2. Evaluation-level — applies to all evaluation tasks
evaluation:
  env_vars:
    CUSTOM_VAR: $lit:some_value

  # 3. Task-level — applies to a single task only
  tasks:
    - name: task_name1
      env_vars:
        HF_TOKEN: $host:HF_TOKEN_FOR_GPQA_DIAMOND  # overrides top-level
    - name: task_name2  # inherits top-level HF_TOKEN
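Given those layers, the environment each task actually receives resolves as follows (illustrative, derived from the config above):

# task_name1:
#   HF_TOKEN   = host value of HF_TOKEN_FOR_GPQA_DIAMOND (task level wins)
#   CACHE_DIR  = /cache/huggingface (top-level literal)
#   CUSTOM_VAR = some_value (evaluation-level literal)
# task_name2:
#   HF_TOKEN   = host value of HF_TOKEN (inherited from top level)
#   CACHE_DIR  = /cache/huggingface
#   CUSTOM_VAR = some_value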
Value Prefixes#
Every value must use one of three explicit prefixes:
| Prefix | Meaning | Example |
|---|---|---|
| $host: | Resolved from the host environment (or a .env file). | HF_TOKEN: $host:HF_TOKEN |
| $lit: | Literal value, written as-is. Use for paths, URLs, flags. | CACHE_DIR: $lit:/cache/huggingface |
| | Late-bound — resolved by the execution environment at runtime (e.g., a variable set by SLURM or the deployment container). | |
Bare (unprefixed) values still work for backward compatibility but emit deprecation warnings.
Tip
Use the migration script to automatically add prefixes to existing configs:
python scripts/migrate_config.py your_config.yaml # preview
python scripts/migrate_config.py your_config.yaml --write # overwrite
API Key (api_key_name)#
The target.api_endpoint.api_key_name field specifies which host environment variable holds the API key for the model endpoint. The launcher automatically includes it in the evaluation environment — you do not need to add it to env_vars manually:
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1/chat/completions
    model_id: meta/llama-3.1-8b-instruct
    api_key_name: NGC_API_KEY  # resolved from $NGC_API_KEY on the host
If you need to override the API key for a specific task, declare it explicitly in that task’s env_vars.
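For example, assuming the endpoint above reads its key from NGC_API_KEY, a task can point that variable at a different host secret. The alternate variable name here is hypothetical:

evaluation:
  tasks:
    - name: task_name
      env_vars:
        NGC_API_KEY: $host:ALTERNATE_NGC_API_KEY  # hypothetical alternate key on the host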
Loading from .env Files#
The launcher can load environment variables from a .env file before resolving $host: references. This is useful for keeping secrets out of your shell history:
# Loads $PWD/.env by default (if it exists)
nemo-evaluator-launcher run --config config.yaml
# Or specify a path explicitly
nemo-evaluator-launcher run --config config.yaml --env-file /path/to/.env
Variables already set in the shell environment take precedence over .env file values.
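A minimal .env file uses the standard KEY=value format; the values below are placeholders:

# .env (placeholder values)
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
NGC_API_KEY=nvapi-xxxxxxxxxxxx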
Secrets Handling#
Secrets ($host: values) are never written into generated scripts (run.sh, run.sub). Instead, they are stored in a separate .secrets.env file alongside the script and sourced at runtime. This prevents accidental exposure in logs, artifacts, and dry-run output.
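Conceptually, a generated script therefore references the secrets file instead of embedding values. This is an illustrative sketch, not the literal generated content:

# run.sh (illustrative)
source ./.secrets.env  # secret values live here, not in the script
# ... launch commands follow ...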
Dataset Directory Mounting#
Some evaluation tasks require access to local datasets that must be mounted into the evaluation container. Tasks that require dataset mounting will have NEMO_EVALUATOR_DATASET_DIR in their required_env_vars.
When using such tasks, you must specify:
dataset_dir: Path to the dataset on the host machine
dataset_mount_path (optional): Path where the dataset should be mounted inside the container (defaults to /datasets)
evaluation:
  tasks:
    - name: mteb.techqa
      dataset_dir: /path/to/your/techqa/dataset
      # dataset_mount_path: /datasets  # Optional, defaults to /datasets
The system will:
1. Mount the host path (dataset_dir) to the container path (dataset_mount_path)
2. Automatically set the NEMO_EVALUATOR_DATASET_DIR environment variable to point to the mounted path inside the container
3. Validate that the required environment variable is properly configured
Example with custom mount path:
evaluation:
  tasks:
    - name: mteb.techqa
      dataset_dir: /mnt/data/techqa
      dataset_mount_path: /data/techqa  # Custom container path
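Conceptually, the mounting behavior is equivalent to the following container invocation; this is an illustrative sketch, and the executor constructs the actual command:

docker run \
  -v /mnt/data/techqa:/data/techqa \
  -e NEMO_EVALUATOR_DATASET_DIR=/data/techqa \
  ...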
When to Use#
Use evaluation configuration when you want to:
Change Default Sampling Parameters: Adjust temperature, top_p, max_new_tokens for different tasks
Change Default Task Values: Override benchmark-specific default configurations
Configure Task-Specific Parameters: Set custom parameters for individual benchmarks (e.g., n_samples for code generation tasks)
Debug and Test: Launch with limited samples for validation (see the sketch after this list)
Adjust Endpoint Capabilities: Configure request timeouts, max retries, and parallel request limits
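As a minimal debug sketch for the last two cases, the global block below caps every task at a handful of samples and tightens endpoint behavior. request_timeout and parallelism appear in the examples above; limit_samples and max_retries are assumed parameter names here, so check your task's parameter list before relying on them:

evaluation:
  nemo_evaluator_config:
    config:
      params:
        limit_samples: 10  # assumed parameter: evaluate only a few samples
        request_timeout: 600
        max_retries: 3     # assumed parameter
        parallelism: 4
  tasks:
    - name: gpqa_diamond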
Tip
For overriding long strings, use YAML multiline syntax with >-:
config.params.extra.custom_field: >-
  This is a long string that spans multiple lines
  and will be passed as a single value with spaces
  replacing the newlines.
Because >- folds newlines into spaces, long values stay readable in the config while being passed to the task as a single string.
Reference#
Parameter Overrides: Parameter Overrides - Complete guide to available parameters and override syntax
Adapter Configuration: For advanced request/response modification (system prompts, payload modification, reasoning handling), see Interceptors
Task Configuration: NeMo Evaluator - Complete nemo-evaluator documentation
Available Tasks: NeMo Evaluator Containers - Browse all available evaluation tasks and benchmarks