Evaluation Output#

This page describes the structure and content of evaluation output files generated by NVIDIA NeMo Evaluator. The evaluation output provides comprehensive information about the evaluation run, including configuration details, results, and metadata.

Input Configuration#

The input configuration comes from the command described in the Launcher Quickstart Guide, namely:

# Run a quick test evaluation with limited samples
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_llama_3_1_8b_instruct_limit_samples \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o target.api_endpoint.api_key_name=NGC_API_KEY \
    -o execution.output_dir=./results

After running it, you can copy the artifacts folder using

nemo-evaluator-launcher debug <invocation_id> --copy-artifacts <DIR>

and find the evaluation output files under the ./artifacts subfolder.

Note

There are several ways to retrieve the artifacts. The command above works across all executors, including Slurm, by downloading the artifacts to a local directory.

For reference, this is the launcher config used in the command:

# specify default configs for execution and deployment
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: llama_3_1_8b_instruct_results
  # mode: sequential  # enables sequential execution

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com

# specify the benchmarks to evaluate
evaluation:
  nemo_evaluator_config:  # global config settings that apply to all tasks
    config:
      params:
        request_timeout: 3600  # timeout for API requests in seconds
        parallelism: 1  # number of parallel requests
        limit_samples: 10  # limit number of samples for quick testing
      target:
        api_endpoint:
          adapter_config:
            use_reasoning: false  # if true, strips reasoning tokens and collects reasoning stats
            use_system_prompt: true  # enables custom system prompt
            custom_system_prompt: >-
              "Think step by step."
  tasks:
    - name: ifeval  # use the default benchmark configuration
    - name: gpqa_diamond
      nemo_evaluator_config:  # task-specific configuration for gpqa_diamond
        config:
          params:
            temperature: 0.6  # sampling temperature
            top_p: 0.95  # nucleus sampling parameter
            max_new_tokens: 8192  # maximum tokens to generate
      env_vars:
        HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
    - name: mbpp
      nemo_evaluator_config:  # task-specific configuration for mbpp
        config:
          params:
            temperature: 0.2  # sampling temperature
            top_p: 0.95  # nucleus sampling parameter
            max_new_tokens: 2048  # maximum tokens to generate
            extra:
              n_samples: 5  # sample 5 predictions per prompt
          target:
            api_endpoint:
              adapter_config:
                custom_system_prompt: >-
                  "You must only provide the code implementation"

Output Structure#

The evaluation output is stored in a results.yaml file. Below is the output produced by the command from the Launcher Quickstart Guide for the GPQA-Diamond task.

# This is an exemplar result generated from the command described in the quickstart tutorial,
# with limited samples for faster execution.

command:
  "export API_KEY=$API_KEY &&   simple_evals --model meta/llama-3.1-8b-instruct
  --eval_name gpqa_diamond --url https://integrate.api.nvidia.com/v1/chat/completions
  --temperature 0.6 --top_p 0.95 --max_tokens 8192 --out_dir /results/gpqa_diamond
  --cache_dir /results/gpqa_diamond/cache --num_threads 1 --max_retries 5 --timeout
  3600   --first_n 5        --judge_backend openai  --judge_request_timeout 600  --judge_max_retries
  16  --judge_temperature 0.0  --judge_top_p 0.0001  --judge_max_tokens 1024  "
config:
  output_dir: /results
  params:
    extra: {}
    limit_samples: 5
    max_new_tokens: 8192
    max_retries: null
    parallelism: 1
    request_timeout: 3600
    task: null
    temperature: 0.6
    top_p: 0.95
  supported_endpoint_types: null
  type: gpqa_diamond
git_hash: ""
results:
  groups:
    gpqa_diamond:
      metrics:
        score:
          scores:
            micro:
              stats:
                stddev: 0.4000000000000001
                stderr: 0.2
              value: 0.2
  tasks:
    gpqa_diamond:
      metrics:
        score:
          scores:
            micro:
              stats:
                stddev: 0.4000000000000001
                stderr: 0.2
              value: 0.2
target:
  api_endpoint:
    adapter_config:
      caching_dir: null
      discovery:
        dirs: []
        modules: []
      endpoint_type: chat
      generate_html_report: true
      html_report_size: 5
      interceptors:
        - config:
            system_message: Think step by step.
          enabled: true
          name: system_message
        - config:
            cache_dir: /results/cache
            max_saved_requests: 5
            max_saved_responses: 5
            reuse_cached_responses: true
            save_requests: true
            save_responses: true
          enabled: true
          name: caching
        - config: {}
          enabled: true
          name: endpoint
        - config:
            cache_dir: /results/response_stats_cache
            logging_aggregated_stats_interval: 100
          enabled: true
          name: response_stats
      log_failed_requests: false
      post_eval_hooks:
        - config:
            html_report_size: 5
            report_types:
              - html
              - json
          enabled: true
          name: post_eval_report
      tracking_requests_stats: true
    api_key: API_KEY
    model_id: meta/llama-3.1-8b-instruct
    stream: null
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions

Note

It is instructive to compare the launcher configuration cited above with the resulting configuration recorded in results.yaml.
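
To see how the global and task-level settings end up in a single config block, it can help to picture the task-specific nemo_evaluator_config being layered on top of the global one. The sketch below illustrates this with a simple recursive dictionary merge; merge_params is a hypothetical helper used only for illustration and is not part of the launcher API:

# Illustrative sketch only: how global and task-level params could be combined.
# merge_params is a hypothetical helper, not part of nemo-evaluator-launcher.
def merge_params(global_cfg: dict, task_cfg: dict) -> dict:
    merged = dict(global_cfg)
    for key, value in task_cfg.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_params(merged[key], value)
        else:
            merged[key] = value
    return merged

# Global params from the launcher config, layered under the gpqa_diamond overrides.
global_params = {"request_timeout": 3600, "parallelism": 1, "limit_samples": 10}
gpqa_params = {"temperature": 0.6, "top_p": 0.95, "max_new_tokens": 8192}

# Task-specific values take precedence over the global ones.
print(merge_params(global_params, gpqa_params))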

The evaluation output contains the following general sections:

| Section | Description |
|---|---|
| command | The exact command executed to run the evaluation |
| config | Evaluation configuration including parameters and settings |
| results | Evaluation metrics and scores organized by groups and tasks |
| target | Model and API endpoint configuration details |
| git_hash | Git commit hash (if available) |

Key Metrics#

| Metric | Description |
|---|---|
| Score Value (value) | Primary performance metric |
| Standard Deviation (stddev) | Measure of score variability |
| Standard Error (stderr) | Statistical error measure |

In the example output above, the metric is the micro-average across the samples (hence the micro key in the structure).

The metrics available in the results differ across evaluation harnesses and tasks, but they are always presented using the same structure shown above.
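
Because this structure is stable, the scores can also be read programmatically. Below is a minimal sketch, assuming PyYAML is installed and that results.yaml has been copied to the current directory (the path is an assumption; adjust it to your artifacts location):

# Minimal sketch: read the micro-averaged GPQA-Diamond score from results.yaml.
# The file path is an assumption; point it at your copied artifacts.
import yaml

with open("results.yaml") as f:
    results = yaml.safe_load(f)

score = results["results"]["tasks"]["gpqa_diamond"]["metrics"]["score"]
micro = score["scores"]["micro"]
print(f"gpqa_diamond micro score: {micro['value']} (stderr: {micro['stats']['stderr']})")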

Additional Output Files#

| File Type | Description |
|---|---|
| HTML Report | Human-readable evaluation report |
| JSON Report | Machine-readable evaluation data |
| Cache Files | Cached requests and responses |
| Log Files | Detailed execution logs |

Exporting the Results#

Once the evaluation has finished and the results.yaml file has been produced, the scores can be exported. This example shows how the local exporter works. For information on other exporters, see Exporters.

The results can be exported using the following command:

nemo-evaluator-launcher export <invocation_id> --dest local --format json

This command extracts the scores from the results.yaml and creates a processed_results.json file with the following content:

{
  "export_timestamp": "2025-10-21T07:09:12.726157",
  "benchmarks": {
    "ifeval": {
      "models": {
        "meta/llama-3.1-8b-instruct": [
          {
            "invocation_id": "dde213e0891bb95b",
            "job_id": "dde213e0891bb95b.0",
            "harness": "lm-evaluation-harness",
            "container": "nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.08.1",
            "scores": {
              "ifeval_inst_level_loose_acc_inst_level_loose_acc": 0.875,
              "ifeval_inst_level_strict_acc_inst_level_strict_acc": 0.875,
              "ifeval_prompt_level_loose_acc_prompt_level_loose_acc": 0.8,
              "ifeval_prompt_level_strict_acc_prompt_level_strict_acc": 0.8
            },
            "timestamp": "2025-10-21T07:09:12.680762",
            "executor": "local"
          }
        ]
      }
    },
    "gpqa_diamond": {
      "models": {
        "meta/llama-3.1-8b-instruct": [
          {
            "invocation_id": "dde213e0891bb95b",
            "job_id": "dde213e0891bb95b.1",
            "harness": "simple_evals",
            "container": "nvcr.io/nvidia/eval-factory/simple-evals:25.08.1",
            "scores": {
              "gpqa_diamond_score_micro": 0.2
            },
            "timestamp": "2025-10-21T07:09:12.704944",
            "executor": "local"
          }
        ]
      }
    },
    "mbpp": {
      "models": {
        "meta/llama-3.1-8b-instruct": [
          {
            "invocation_id": "dde213e0891bb95b",
            "job_id": "dde213e0891bb95b.2",
            "harness": "bigcode-evaluation-harness",
            "container": "nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1",
            "scores": {
              "mbpp_pass@1_pass@1": 0.0
            },
            "timestamp": "2025-10-21T07:09:12.726141",
            "executor": "local"
          }
        ]
      }
    }
  }
}

The nemo-evaluator-launcher export command can accept multiple invocation IDs and gather results across different invocations, regardless of whether they were run locally or using remote executors (see Executors), e.g.:

nemo-evaluator-launcher export <local-job-id> <slurm-job-id> --dest local --format json --output_dir combined-results

will create combined-results/processed_results.json with the same structure as in the example above.
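
The exported file can then be consumed with standard tooling. Below is a minimal sketch that prints every score from a processed_results.json; the combined-results path follows the example above and should be adjusted to your output directory:

# Minimal sketch: print all scores from an exported processed_results.json.
import json

with open("combined-results/processed_results.json") as f:
    exported = json.load(f)

for benchmark, entry in exported["benchmarks"].items():
    for model, runs in entry["models"].items():
        for run in runs:
            for metric, value in run["scores"].items():
                print(f"{benchmark} | {model} | {metric}: {value}")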