Evaluation Output#
This page describes the structure and content of evaluation output files generated by NVIDIA NeMo Evaluator. The evaluation output provides comprehensive information about the evaluation run, including configuration details, results, and metadata.
Input Configuration#
The input configuration comes from the command described in the Launcher Quickstart Guide, namely:

```bash
# Run a quick test evaluation with limited samples
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_llama_3_1_8b_instruct_limit_samples \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o target.api_endpoint.api_key_name=NGC_API_KEY \
    -o execution.output_dir=./results
```
After running it, you can copy the artifacts folder using

```bash
nemo-evaluator-launcher debug <invocation_id> --copy-artifacts <DIR>
```

and find the `results.yaml` file described below under the `./artifacts` subfolder.
Note
There are several ways to retrieve the artifacts. The approach above works across executors, including e.g. Slurm, by downloading the artifacts to your machine.
For reference, we cite here the launcher config used in the command:
```yaml
# specify default configs for execution and deployment
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: llama_3_1_8b_instruct_results
  # mode: sequential # enables sequential execution

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com

# specify the benchmarks to evaluate
evaluation:
  nemo_evaluator_config: # global config settings that apply to all tasks
    config:
      params:
        request_timeout: 3600 # timeout for API requests in seconds
        parallelism: 1 # number of parallel requests
        limit_samples: 10 # limit number of samples for quick testing
    target:
      api_endpoint:
        adapter_config:
          use_reasoning: false # if true, strips reasoning tokens and collects reasoning stats
          use_system_prompt: true # enables custom system prompt
          custom_system_prompt: >-
            "Think step by step."
  tasks:
    - name: ifeval # use the default benchmark configuration
    - name: gpqa_diamond
      nemo_evaluator_config: # task-specific configuration for gpqa_diamond
        config:
          params:
            temperature: 0.6 # sampling temperature
            top_p: 0.95 # nucleus sampling parameter
            max_new_tokens: 8192 # maximum tokens to generate
      env_vars:
        HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
    - name: mbpp
      nemo_evaluator_config: # task-specific configuration for mbpp
        config:
          params:
            temperature: 0.2 # sampling temperature
            top_p: 0.95 # nucleus sampling parameter
            max_new_tokens: 2048 # maximum tokens to generate
            extra:
              n_samples: 5 # sample 5 predictions per prompt
        target:
          api_endpoint:
            adapter_config:
              custom_system_prompt: >-
                "You must only provide the code implementation"
```
Output Structure#
The evaluation output is stored in a `results.yaml` file. Below is the output of the command from the Launcher Quickstart section for the GPQA-Diamond task.
```yaml
# This is an exemplar result generated from the command described in the quickstart tutorial,
# with limited samples for faster execution.
command:
  "export API_KEY=$API_KEY && simple_evals --model meta/llama-3.1-8b-instruct
  --eval_name gpqa_diamond --url https://integrate.api.nvidia.com/v1/chat/completions
  --temperature 0.6 --top_p 0.95 --max_tokens 8192 --out_dir /results/gpqa_diamond
  --cache_dir /results/gpqa_diamond/cache --num_threads 1 --max_retries 5 --timeout
  3600 --first_n 5 --judge_backend openai --judge_request_timeout 600 --judge_max_retries
  16 --judge_temperature 0.0 --judge_top_p 0.0001 --judge_max_tokens 1024 "
config:
  output_dir: /results
  params:
    extra: {}
    limit_samples: 5
    max_new_tokens: 8192
    max_retries: null
    parallelism: 1
    request_timeout: 3600
    task: null
    temperature: 0.6
    top_p: 0.95
  supported_endpoint_types: null
  type: gpqa_diamond
git_hash: ""
results:
  groups:
    gpqa_diamond:
      metrics:
        score:
          scores:
            micro:
              stats:
                stddev: 0.4000000000000001
                stderr: 0.2
              value: 0.2
  tasks:
    gpqa_diamond:
      metrics:
        score:
          scores:
            micro:
              stats:
                stddev: 0.4000000000000001
                stderr: 0.2
              value: 0.2
target:
  api_endpoint:
    adapter_config:
      caching_dir: null
      discovery:
        dirs: []
        modules: []
      endpoint_type: chat
      generate_html_report: true
      html_report_size: 5
      interceptors:
      - config:
          system_message: Think step by step.
        enabled: true
        name: system_message
      - config:
          cache_dir: /results/cache
          max_saved_requests: 5
          max_saved_responses: 5
          reuse_cached_responses: true
          save_requests: true
          save_responses: true
        enabled: true
        name: caching
      - config: {}
        enabled: true
        name: endpoint
      - config:
          cache_dir: /results/response_stats_cache
          logging_aggregated_stats_interval: 100
        enabled: true
        name: response_stats
      log_failed_requests: false
      post_eval_hooks:
      - config:
          html_report_size: 5
          report_types:
          - html
          - json
        enabled: true
        name: post_eval_report
      tracking_requests_stats: true
    api_key: API_KEY
    model_id: meta/llama-3.1-8b-instruct
    stream: null
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions
```
Note
It is instructive to compare the launcher configuration cited above with the resulting configuration recorded in `results.yaml`.
The evaluation output contains the following general sections:
| Section | Description |
|---|---|
| `command` | The exact command executed to run the evaluation |
| `config` | Evaluation configuration, including parameters and settings |
| `results` | Evaluation metrics and scores, organized by groups and tasks |
| `target` | Model and API endpoint configuration details |
| `git_hash` | Git commit hash (if available) |
Key Metrics#
| Metric | Description |
|---|---|
| Score Value | Primary performance metric |
| Standard Deviation | Measure of score variability |
| Standard Error | Statistical error measure |

In the example output above, the metric used is the micro-average across the samples (hence the `micro` key under `scores`).

The types of metrics available in the results differ across evaluation harnesses and tasks, but they are always presented using the same structure shown above.
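The score and its statistics can also be read programmatically from `results.yaml`. Below is a minimal sketch, assuming PyYAML is available and the file follows the structure shown above; the file path is a placeholder for wherever you copied the artifacts.

```python
import yaml  # PyYAML, assumed to be installed

# Load the evaluation output; adjust the path to your copied artifacts directory.
with open("results.yaml") as f:
    results = yaml.safe_load(f)

# Navigate to the micro-averaged score for the gpqa_diamond task.
micro = results["results"]["tasks"]["gpqa_diamond"]["metrics"]["score"]["scores"]["micro"]
print("score :", micro["value"])            # 0.2 in the example above
print("stderr:", micro["stats"]["stderr"])  # 0.2
print("stddev:", micro["stats"]["stddev"])  # ~0.4
```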
Additional Output Files#
| File Type | Description |
|---|---|
| HTML Report | Human-readable evaluation report |
| JSON Report | Machine-readable evaluation data |
| Cache Files | Cached requests and responses |
| Log Files | Detailed execution logs |
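Exact file names vary by harness and adapter configuration, so the following is only a rough sketch of how you might inventory a copied artifacts directory; the `./artifacts` path is a placeholder.

```python
from collections import defaultdict
from pathlib import Path

# Placeholder path: the directory you copied with `--copy-artifacts <DIR>`.
artifacts = Path("./artifacts")

# Group every file by extension to get a quick overview of what the run produced.
by_suffix = defaultdict(list)
for path in artifacts.rglob("*"):
    if path.is_file():
        by_suffix[path.suffix or "<no extension>"].append(path)

for suffix, files in sorted(by_suffix.items()):
    print(f"{suffix}: {len(files)} file(s)")
```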
Exporting the Results#
Once the evaluation has finished and the `results.yaml` file has been produced, the scores can be exported.
In this example, we show how the local export works. For information on other exporters, see Exporters.
The results can be exported using the following command:

```bash
nemo-evaluator-launcher export <invocation_id> --dest local --format json
```

This command extracts the scores from `results.yaml` and creates a `processed_results.json` file with the following content:
```json
{
  "export_timestamp": "2025-10-21T07:09:12.726157",
  "benchmarks": {
    "ifeval": {
      "models": {
        "meta/llama-3.1-8b-instruct": [
          {
            "invocation_id": "dde213e0891bb95b",
            "job_id": "dde213e0891bb95b.0",
            "harness": "lm-evaluation-harness",
            "container": "nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.08.1",
            "scores": {
              "ifeval_inst_level_loose_acc_inst_level_loose_acc": 0.875,
              "ifeval_inst_level_strict_acc_inst_level_strict_acc": 0.875,
              "ifeval_prompt_level_loose_acc_prompt_level_loose_acc": 0.8,
              "ifeval_prompt_level_strict_acc_prompt_level_strict_acc": 0.8
            },
            "timestamp": "2025-10-21T07:09:12.680762",
            "executor": "local"
          }
        ]
      }
    },
    "gpqa_diamond": {
      "models": {
        "meta/llama-3.1-8b-instruct": [
          {
            "invocation_id": "dde213e0891bb95b",
            "job_id": "dde213e0891bb95b.1",
            "harness": "simple_evals",
            "container": "nvcr.io/nvidia/eval-factory/simple-evals:25.08.1",
            "scores": {
              "gpqa_diamond_score_micro": 0.2
            },
            "timestamp": "2025-10-21T07:09:12.704944",
            "executor": "local"
          }
        ]
      }
    },
    "mbpp": {
      "models": {
        "meta/llama-3.1-8b-instruct": [
          {
            "invocation_id": "dde213e0891bb95b",
            "job_id": "dde213e0891bb95b.2",
            "harness": "bigcode-evaluation-harness",
            "container": "nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1",
            "scores": {
              "mbpp_pass@1_pass@1": 0.0
            },
            "timestamp": "2025-10-21T07:09:12.726141",
            "executor": "local"
          }
        ]
      }
    }
  }
}
```
The `nemo-evaluator-launcher export` command can accept multiple invocation IDs and gather results across different invocations, regardless of whether they were run locally or using remote executors (see Executors). For example:

```bash
nemo-evaluator-launcher export <local-job-id> <slurm-job-id> --dest local --format json --output_dir combined-results
```

will create `combined-results/processed_results.json` with the same structure as in the example above.
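Once `processed_results.json` exists, it is straightforward to consume from other tooling. The following is a minimal sketch, assuming the JSON layout shown above, that flattens the exported scores into one row per benchmark, model, and metric; the file path is a placeholder.

```python
import json

# Placeholder path: e.g. combined-results/processed_results.json from the command above.
with open("processed_results.json") as f:
    data = json.load(f)

# Walk benchmarks -> models -> runs -> scores and print one row per metric.
for benchmark, bench in data["benchmarks"].items():
    for model, runs in bench["models"].items():
        for run in runs:
            for metric, value in run["scores"].items():
                print(f"{benchmark}\t{model}\t{metric}\t{value}")
```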