Evaluation Output#
This page describes the structure and content of evaluation output files generated by NVIDIA NeMo Evaluator. The evaluation output provides comprehensive information about the evaluation run, including configuration details, results, and metadata.
Input Configuration#
The input configuration comes from the command described in the Launcher Quickstart Guide, namely:
# Run a quick test evaluation with limited samples
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
-o target.api_endpoint.api_key_name=NGC_API_KEY \
-o execution.output_dir=./results
Note
For local execution, all artifacts are already present on your machine.
When working with remote executors such as Slurm, you can download the artifacts with the following command:
nemo-evaluator-launcher info <invocation_id> --copy-artifacts <DIR>
For reference, here is the launcher config used in the command:
defaults:
- execution: local
- deployment: none
- _self_
execution:
output_dir: nel-results
target:
api_endpoint:
# see https://build.nvidia.com/meta/llama-3_1-8b-instruct for endpoint details
model_id: meta/llama-3.2-3b-instruct
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com
# specify the benchmarks to evaluate
evaluation:
# global config settings that apply to all tasks, unless overridden by task-specific config
nemo_evaluator_config:
config:
params:
request_timeout: 3600 # timeout for API request in seconds
parallelism: 1 # 1 parallel request to avoid overloading the server
limit_samples: 10 # TEST ONLY: Limits all benchmarks to 10 samples for quick testing
tasks:
- name: lm-evaluation-harness.ifeval
- name: simple_evals.gpqa_diamond
env_vars:
HF_TOKEN: host:HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
- name: bigcode-evaluation-harness.mbpp-chat
Output Structure#
After running an evaluation, NeMo Evaluator creates a structured output directory containing various artifacts.
If you run the command provided above, it will create the following directory structure inside execution.output_dir (./results in our case):
./results/
├── <timestamp>-<invocation id>
│ ├── gpqa_diamond
│ │ ├── artifacts
│ │ ├── logs
│ │ └── run.sh
│ ├── ifeval
│ │ ├── artifacts
│ │ ├── logs
│ │ └── run.sh
│ ├── mbpp
│ │ ├── artifacts
│ │ ├── logs
│ │ └── run.sh
│ └── run_all.sequential.sh
Each artifacts directory contains the output produced by the evaluation job.
Such a directory is also created if you use nemo-evaluator or direct container access (see Quickstart to compare the different ways of using the NeMo Evaluator SDK).
Regardless of the chosen path, the generated artifacts directory has the following content:
<artifacts_dir>/
│ ├── run_config.yml # Task-specific configuration used during execution
│ ├── eval_factory_metrics.json # Evaluation metrics and performance statistics
│ ├── results.yml # Detailed results in YAML format
│ ├── report.html # Human-readable HTML report
│ ├── report.json # JSON format report
│ └── <Task specific artifacts>/ # Task-specific artifacts
These files are standardized and always follow the same structure regardless of the underlying evaluation harness:
| File Name | Content | Usage |
|---|---|---|
| results.yml | Evaluation results in YAML format | The main results file for programmatic analysis and integration with downstream tools. |
| run_config.yml | Complete evaluation configuration (all parameters, overrides, and settings) used for the run. | Enables full reproducibility of evaluations and configuration auditing. |
| eval_factory_metrics.json | Detailed metrics and performance statistics for the evaluation execution. | Performance analysis and failure pattern identification. |
| report.html / report.json | Example request-response pairs collected during benchmark execution. | For sharing, quick review, analysis, and debugging. |
| Task-specific artifacts | Artifacts produced by the underlying benchmark (e.g., caches, raw outputs, error logs). | Advanced troubleshooting, debugging, or domain-specific analysis. |
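Because these filenames are standardized, a post-run script can verify that the expected artifacts exist before any downstream processing. The following is a minimal sketch; the `missing_artifacts` helper and the temporary-directory demo are illustrative, not part of the SDK:

```python
import tempfile
from pathlib import Path

# Standardized files every artifacts directory should contain
# (task-specific artifacts vary and are not checked here).
STANDARD_ARTIFACTS = [
    "run_config.yml",
    "eval_factory_metrics.json",
    "results.yml",
    "report.html",
    "report.json",
]

def missing_artifacts(artifacts_dir: str) -> list:
    """Return the standardized files absent from artifacts_dir."""
    root = Path(artifacts_dir)
    return [name for name in STANDARD_ARTIFACTS if not (root / name).is_file()]

# Demo on an empty directory: every standardized file is reported missing.
with tempfile.TemporaryDirectory() as d:
    print(missing_artifacts(d))
```

Pointing the helper at a real `<artifacts_dir>` from a finished run should return an empty list.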
Results file#
The primary evaluation output is stored in results.yml.
It is standardized across all evaluation benchmarks and follows the API dataclasses specification.
Below is the output of the command from the Launcher Quickstart Section for the GPQA-Diamond task.
# This is an exemplar result generated from the command described in the quickstart tutorial,
# with limited samples for faster execution.
command: 'export API_KEY=$API_KEY && simple_evals --model meta/llama-3.2-3b-instruct
--eval_name gpqa_diamond --url https://integrate.api.nvidia.com/v1/chat/completions
--temperature 0.0 --top_p 1e-05 --max_tokens 4096 --out_dir /results/gpqa_diamond
--cache_dir /results/gpqa_diamond/cache --num_threads 1 --max_retries 5 --timeout
3600 --first_n 10 --judge_backend openai --judge_request_timeout 600 --judge_max_retries
16 --judge_temperature 0.0 --judge_top_p 0.0001 --judge_max_tokens 1024 '
config:
output_dir: /results
params:
extra: {}
limit_samples: 10
max_new_tokens: null
max_retries: null
parallelism: 1
request_timeout: 3600
task: null
temperature: null
top_p: null
supported_endpoint_types: null
type: gpqa_diamond
git_hash: ''
results:
groups:
gpqa_diamond:
metrics:
score:
scores:
micro:
stats:
stddev: 0.4898979485566356
stderr: 0.16329931618554522
value: 0.4
tasks:
gpqa_diamond:
metrics:
score:
scores:
micro:
stats:
stddev: 0.4898979485566356
stderr: 0.16329931618554522
value: 0.4
target:
api_endpoint:
adapter_config:
caching_dir: null
discovery:
dirs: []
modules: []
endpoint_type: chat
generate_html_report: true
html_report_size: 5
interceptors:
- config:
cache_dir: /results/cache
max_saved_requests: 5
max_saved_responses: 5
reuse_cached_responses: true
save_requests: true
save_responses: true
enabled: true
name: caching
- config: {}
enabled: true
name: endpoint
- config:
cache_dir: /results/response_stats_cache
logging_aggregated_stats_interval: 100
enabled: true
name: response_stats
log_failed_requests: false
post_eval_hooks:
- config:
html_report_size: 5
report_types:
- html
- json
enabled: true
name: post_eval_report
tracking_requests_stats: true
api_key: API_KEY
model_id: meta/llama-3.2-3b-instruct
stream: null
type: chat
url: https://integrate.api.nvidia.com/v1/chat/completions
Note
It is instructive to compare the launcher configuration cited above with the resulting run configuration.
The evaluation output contains the following general sections:
| Section | Description |
|---|---|
| command | The exact command executed to run the evaluation |
| config | Evaluation configuration including parameters and settings |
| results | Evaluation metrics and scores organized by groups and tasks |
| target | Model and API endpoint configuration details |
| git_hash | Git commit hash (if available) |
The evaluation metrics are available under the results key and are stored in the following structure:
metrics:
metric_name:
scores:
score_name:
stats: # optional set of statistics, e.g.:
count: 10 # number of values used for computing the score
min: 0 # minimum of all values used for computing the score
max: 1 # maximum of all values used for computing the score
stderr: 0.13 # standard error
value: 0.42 # score value
In the example output above, the metric is the micro-average across samples (hence the micro key), and the standard deviation (stddev) and standard error (stderr) statistics are reported.
The metrics available in the results differ between evaluation harnesses and tasks, but they are always presented using the same structure shown above.
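Because this structure is uniform across harnesses, scores can be extracted generically. The sketch below (the `collect_scores` helper is hypothetical, not an SDK function) walks the groups/tasks hierarchy and flattens each reported score into a row:

```python
def collect_scores(results: dict) -> list:
    """Flatten the groups/tasks -> metrics -> scores hierarchy into
    (section, entry, metric, score, value, stderr) rows."""
    rows = []
    for section in ("groups", "tasks"):
        for name, entry in results.get(section, {}).items():
            for metric_name, metric in entry["metrics"].items():
                for score_name, score in metric["scores"].items():
                    stats = score.get("stats", {})  # the stats block is optional
                    rows.append((section, name, metric_name, score_name,
                                 score["value"], stats.get("stderr")))
    return rows

# Minimal input mirroring the GPQA-Diamond results above.
results = {
    "groups": {
        "gpqa_diamond": {
            "metrics": {
                "score": {
                    "scores": {
                        "micro": {
                            "stats": {"stddev": 0.49, "stderr": 0.163},
                            "value": 0.4,
                        }
                    }
                }
            }
        }
    }
}
print(collect_scores(results))
# [('groups', 'gpqa_diamond', 'score', 'micro', 0.4, 0.163)]
```

The same walk works on the `results` section of any results.yml once it has been loaded with a YAML parser.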
Exporting the Results#
Once the evaluation has finished and the results.yml file has been produced, the scores can be exported.
In this example we show how the local export works. For information on other exporters, see Exporters.
The results can be exported using the following command:
nemo-evaluator-launcher export <invocation_id> --dest local --format json
This command extracts the scores from results.yml and creates a processed_results.json file with the following content:
{
"export_timestamp": "2025-12-02T11:51:28.366382",
"benchmarks": {
"ifeval": {
"models": {
"meta/llama-3.2-3b-instruct": [
{
"invocation_id": "30500ae952e1eeab",
"job_id": "30500ae952e1eeab.0",
"harness": "lm-evaluation-harness",
"container": "nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.10",
"scores": {
"ifeval_inst_level_loose_acc": 0.7777777777777778,
"ifeval_inst_level_strict_acc": 0.7777777777777778,
"ifeval_prompt_level_loose_acc": 0.7,
"ifeval_prompt_level_strict_acc": 0.7
},
"timestamp": "2025-12-02T11:51:28.292580",
"executor": "local"
}
]
}
},
"gpqa_diamond": {
"models": {
"meta/llama-3.2-3b-instruct": [
{
"invocation_id": "30500ae952e1eeab",
"job_id": "30500ae952e1eeab.1",
"harness": "simple_evals",
"container": "nvcr.io/nvidia/eval-factory/simple-evals:25.10",
"scores": {
"gpqa_diamond_score_micro": 0.4
},
"timestamp": "2025-12-02T11:51:28.358658",
"executor": "local"
}
]
}
},
"mbpp": {
"models": {
"meta/llama-3.2-3b-instruct": [
{
"invocation_id": "30500ae952e1eeab",
"job_id": "30500ae952e1eeab.2",
"harness": "bigcode-evaluation-harness",
"container": "nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10",
"scores": {
"mbpp_pass@1": 0.0,
"mbpp_pass@10": 0.0
},
"timestamp": "2025-12-02T11:51:28.366378",
"executor": "local"
}
]
}
}
}
}
The nemo-evaluator-launcher export command can accept multiple invocation IDs and gather results across different invocations, regardless of whether they were run locally or via remote executors (see Executors), e.g.:
nemo-evaluator-launcher export <local-job-id> <slurm-job-id> --dest local --format json --output_dir combined-results
will create combined-results/processed_results.json with the same structure as in the example above.
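Since processed_results.json always follows the benchmarks → models → runs layout shown above, it can be flattened generically for spreadsheets or dashboards. A minimal sketch (the `flatten_export` helper is illustrative, not part of the launcher):

```python
def flatten_export(processed: dict) -> list:
    """Turn the exporter's nested benchmarks -> models -> runs layout
    into flat (benchmark, model, score_name, value) rows."""
    rows = []
    for benchmark, bdata in processed["benchmarks"].items():
        for model, runs in bdata["models"].items():
            for run in runs:
                for score_name, value in run["scores"].items():
                    rows.append((benchmark, model, score_name, value))
    return rows

# Minimal input mirroring the exported JSON above.
export = {
    "benchmarks": {
        "gpqa_diamond": {
            "models": {
                "meta/llama-3.2-3b-instruct": [
                    {"scores": {"gpqa_diamond_score_micro": 0.4}}
                ]
            }
        }
    }
}
print(flatten_export(export))
# [('gpqa_diamond', 'meta/llama-3.2-3b-instruct', 'gpqa_diamond_score_micro', 0.4)]
```

For a real export, load the file first, e.g. `flatten_export(json.load(open("combined-results/processed_results.json")))`.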