Evaluation Output#

This page describes the structure and content of evaluation output files generated by NVIDIA NeMo Evaluator. The evaluation output provides comprehensive information about the evaluation run, including configuration details, results, and metadata.

Input Configuration#

The input configuration comes from the command described in the Launcher Quickstart Guide, namely

# Run a quick test evaluation with limited samples
nemo-evaluator-launcher run \
    --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o target.api_endpoint.api_key_name=NGC_API_KEY \
    -o execution.output_dir=./results

Note

For local execution, all artifacts are already present on your machine. When working with remote executors such as Slurm, you can download the artifacts with the following command:

nemo-evaluator-launcher info <invocation_id> --copy-artifacts <DIR>

For reference, here is the launcher config used in the command:

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: nel-results

target:
  api_endpoint:
    # see https://build.nvidia.com/meta/llama-3_1-8b-instruct for endpoint details
    model_id: meta/llama-3.2-3b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com

# specify the benchmarks to evaluate
evaluation:
  # global config settings that apply to all tasks, unless overridden by task-specific config
  nemo_evaluator_config:
    config:
      params:
        request_timeout: 3600  # timeout for API request in seconds
        parallelism: 1  # 1 parallel request to avoid overloading the server
        limit_samples: 10 # TEST ONLY: Limits all benchmarks to 10 samples for quick testing
  tasks:
    - name: lm-evaluation-harness.ifeval
    - name: simple_evals.gpqa_diamond
      env_vars:
        HF_TOKEN: host:HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
    - name: bigcode-evaluation-harness.mbpp-chat

Output Structure#

After running an evaluation, NeMo Evaluator creates a structured output directory containing various artifacts. Running the command above creates the following directory structure inside execution.output_dir (./results in our case):

./results/
├── <timestamp>-<invocation id>
│   ├── gpqa_diamond
│   │   ├── artifacts
│   │   ├── logs
│   │   └── run.sh
│   ├── ifeval
│   │   ├── artifacts
│   │   ├── logs
│   │   └── run.sh
│   ├── mbpp
│   │   ├── artifacts
│   │   ├── logs
│   │   └── run.sh
│   └── run_all.sequential.sh

Each artifacts directory contains the output produced by the evaluation job. Such a directory is also created if you use nemo-evaluator or direct container access (see Quickstart for a comparison of the different ways to use the NeMo Evaluator SDK).

Regardless of the chosen path, the generated artifacts directory will have the following content:

<artifacts_dir>/
├── run_config.yml               # Task-specific configuration used during execution
├── eval_factory_metrics.json    # Evaluation metrics and performance statistics
├── results.yml                  # Detailed results in YAML format
├── report.html                  # Human-readable HTML report
├── report.json                  # JSON format report
└── <task-specific artifacts>/   # Outputs specific to the underlying harness and task
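As a quick sanity check, a listing like the one above can be validated programmatically. The sketch below is a hypothetical helper (not part of the NeMo Evaluator API) that reports which of the standardized files are missing from an artifacts directory:

```python
from pathlib import Path

# Standardized files expected in every <artifacts_dir> (see the tree above).
EXPECTED_FILES = [
    "run_config.yml",
    "eval_factory_metrics.json",
    "results.yml",
    "report.html",
    "report.json",
]

def missing_artifacts(artifacts_dir: str) -> list:
    """Return the names of standardized artifact files absent from the directory."""
    root = Path(artifacts_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty return value means the standardized artifacts are all present; task-specific artifacts vary by harness and are deliberately not checked.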

These files are standardized and always follow the same structure regardless of the underlying evaluation harness:

results.yml
  Description: Evaluation results in YAML format.
  Content: final evaluation scores and metrics; evaluation configuration used.
  Usage: the main results file for programmatic analysis and integration with downstream tools.

run_config.yml
  Description: complete evaluation configuration (all parameters, overrides, and settings) used for the run.
  Content: task and model settings; endpoint configuration; interceptor config; evaluation-specific overrides.
  Usage: enables full reproducibility of evaluations and configuration auditing.

eval_factory_metrics.json
  Description: detailed metrics and performance statistics for the evaluation execution.
  Content: request/response timings; token usage; error rates; resource utilization.
  Usage: performance analysis and failure pattern identification.

report.html and report.json
  Description: example request-response pairs collected during benchmark execution.
  Content: human-readable HTML report; machine-readable JSON version with the same content.
  Usage: sharing, quick review, analysis, and debugging.

Task-specific artifacts
  Description: artifacts produced by the underlying benchmark (e.g., caches, raw outputs, error logs).
  Content: cached queries and responses; source/context data; special task outputs or logs.
  Usage: advanced troubleshooting, debugging, or domain-specific analysis.

Results file#

The primary evaluation output is stored in results.yml. It is standardized across all evaluation benchmarks and follows the API dataclasses specification.

Below, we show the output for the GPQA-Diamond task produced by the command from the Launcher Quickstart section.

# This is an exemplar result generated from the command described in the quickstart tutorial,
# with limited samples for faster execution.

command: 'export API_KEY=$API_KEY &&   simple_evals --model meta/llama-3.2-3b-instruct
  --eval_name gpqa_diamond --url https://integrate.api.nvidia.com/v1/chat/completions
  --temperature 0.0 --top_p 1e-05 --max_tokens 4096 --out_dir /results/gpqa_diamond
  --cache_dir /results/gpqa_diamond/cache --num_threads 1 --max_retries 5 --timeout
  3600   --first_n 10        --judge_backend openai  --judge_request_timeout 600  --judge_max_retries
  16  --judge_temperature 0.0  --judge_top_p 0.0001  --judge_max_tokens 1024  '
config:
  output_dir: /results
  params:
    extra: {}
    limit_samples: 10
    max_new_tokens: null
    max_retries: null
    parallelism: 1
    request_timeout: 3600
    task: null
    temperature: null
    top_p: null
  supported_endpoint_types: null
  type: gpqa_diamond
git_hash: ''
results:
  groups:
    gpqa_diamond:
      metrics:
        score:
          scores:
            micro:
              stats:
                stddev: 0.4898979485566356
                stderr: 0.16329931618554522
              value: 0.4
  tasks:
    gpqa_diamond:
      metrics:
        score:
          scores:
            micro:
              stats:
                stddev: 0.4898979485566356
                stderr: 0.16329931618554522
              value: 0.4
target:
  api_endpoint:
    adapter_config:
      caching_dir: null
      discovery:
        dirs: []
        modules: []
      endpoint_type: chat
      generate_html_report: true
      html_report_size: 5
      interceptors:
      - config:
          cache_dir: /results/cache
          max_saved_requests: 5
          max_saved_responses: 5
          reuse_cached_responses: true
          save_requests: true
          save_responses: true
        enabled: true
        name: caching
      - config: {}
        enabled: true
        name: endpoint
      - config:
          cache_dir: /results/response_stats_cache
          logging_aggregated_stats_interval: 100
        enabled: true
        name: response_stats
      log_failed_requests: false
      post_eval_hooks:
      - config:
          html_report_size: 5
          report_types:
          - html
          - json
        enabled: true
        name: post_eval_report
      tracking_requests_stats: true
    api_key: API_KEY
    model_id: meta/llama-3.2-3b-instruct
    stream: null
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions

Note

It is instructive to compare the launcher configuration cited above with the resulting run configuration.

The evaluation output contains the following general sections:

command
  The exact command executed to run the evaluation.

config
  Evaluation configuration, including parameters and settings.

results
  Evaluation metrics and scores, organized by groups and tasks.

target
  Model and API endpoint configuration details.

git_hash
  Git commit hash (if available).

The evaluation metrics are available under the results key and are stored in the following structure:

      metrics:
        metric_name:
          scores:
            score_name:
              stats:            # optional set of statistics, e.g.:
                count: 10       # number of values used for computing the score
                min: 0          # minimum of all values used for computing the score
                max: 1          # maximum of all values used for computing the score
                stderr: 0.13    # standard error
              value: 0.42   # score value

In the example output above, the metric is the micro-average across the samples (hence the micro key in the structure), and the standard deviation (stddev) and standard error (stderr) statistics are reported. The types of metrics available in the results differ across evaluation harnesses and tasks, but they are always presented using the same structure shown above.
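Because the structure is uniform, scores can be extracted generically. The following sketch assumes the results section of results.yml has already been loaded into a dict (e.g., via yaml.safe_load); flatten_scores is a hypothetical helper, not part of the NeMo Evaluator API:

```python
def flatten_scores(results: dict) -> dict:
    """Flatten results['tasks'] into '<task>.<metric>.<score>' -> value pairs."""
    flat = {}
    for task_name, task in results.get("tasks", {}).items():
        for metric_name, metric in task.get("metrics", {}).items():
            for score_name, score in metric.get("scores", {}).items():
                flat[f"{task_name}.{metric_name}.{score_name}"] = score["value"]
    return flat
```

Applied to the GPQA-Diamond example above, this yields {"gpqa_diamond.score.micro": 0.4}.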

Exporting the Results#

Once the evaluation has finished and the results.yml file has been produced, the scores can be exported. In this example, we show how the local exporter works. For information on other exporters, see Exporters.

The results can be exported using the following command:

nemo-evaluator-launcher export <invocation_id> --dest local --format json

This command extracts the scores from each job's results.yml and creates a processed_results.json file with the following content:

{
  "export_timestamp": "2025-12-02T11:51:28.366382",
  "benchmarks": {
    "ifeval": {
      "models": {
        "meta/llama-3.2-3b-instruct": [
          {
            "invocation_id": "30500ae952e1eeab",
            "job_id": "30500ae952e1eeab.0",
            "harness": "lm-evaluation-harness",
            "container": "nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.10",
            "scores": {
              "ifeval_inst_level_loose_acc": 0.7777777777777778,
              "ifeval_inst_level_strict_acc": 0.7777777777777778,
              "ifeval_prompt_level_loose_acc": 0.7,
              "ifeval_prompt_level_strict_acc": 0.7
            },
            "timestamp": "2025-12-02T11:51:28.292580",
            "executor": "local"
          }
        ]
      }
    },
    "gpqa_diamond": {
      "models": {
        "meta/llama-3.2-3b-instruct": [
          {
            "invocation_id": "30500ae952e1eeab",
            "job_id": "30500ae952e1eeab.1",
            "harness": "simple_evals",
            "container": "nvcr.io/nvidia/eval-factory/simple-evals:25.10",
            "scores": {
              "gpqa_diamond_score_micro": 0.4
            },
            "timestamp": "2025-12-02T11:51:28.358658",
            "executor": "local"
          }
        ]
      }
    },
    "mbpp": {
      "models": {
        "meta/llama-3.2-3b-instruct": [
          {
            "invocation_id": "30500ae952e1eeab",
            "job_id": "30500ae952e1eeab.2",
            "harness": "bigcode-evaluation-harness",
            "container": "nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10",
            "scores": {
              "mbpp_pass@1": 0.0,
              "mbpp_pass@10": 0.0
            },
            "timestamp": "2025-12-02T11:51:28.366378",
            "executor": "local"
          }
        ]
      }
    }
  }
}
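The exported file is plain JSON, so it can be consumed directly by downstream tooling. A minimal sketch using only the standard library (collect_scores is a hypothetical helper, not part of the launcher):

```python
import json

def collect_scores(path: str) -> dict:
    """Map each benchmark in processed_results.json to the merged scores of its runs."""
    with open(path) as f:
        data = json.load(f)
    scores = {}
    for benchmark, entry in data["benchmarks"].items():
        for runs in entry["models"].values():
            for run in runs:
                scores.setdefault(benchmark, {}).update(run["scores"])
    return scores
```

For the example export above, this would return, among others, {"gpqa_diamond_score_micro": 0.4} under the gpqa_diamond key.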

The nemo-evaluator-launcher export command can accept multiple invocation IDs and gather results across different invocations, regardless of whether they were run locally or using remote executors (see Executors), e.g.:

nemo-evaluator-launcher export <local-job-id> <slurm-job-id> --dest local --format json --output_dir combined-results

will create combined-results/processed_results.json with the same structure as in the example above.