Use the Results of Your Job#

After your NVIDIA NeMo Evaluator job completes, you can retrieve, parse, and visualize its results. Results are returned in a consistent JSON format across all evaluation types, with metrics specific to each evaluation method.

Working with Results#

Obtain Results#

You can obtain your results as a JSON response or a downloaded ZIP file.

  • Get Job Results: View evaluation results as a JSON response. See Get Evaluation Results.

  • Download Detailed Results: Download complete evaluation results as a ZIP file. See Download Evaluation Results.
Parse Results#

You can load and analyze evaluation results using Python. Here is an example of how to access task-level and group-level metrics:

import json

with open('results.json') as f:
    results = json.load(f)

# Access a task-level metric (if present)
if 'tasks' in results:
    for task_name, task_data in results['tasks'].items():
        for metric_name, metric_data in task_data['metrics'].items():
            for score_name, score in metric_data['scores'].items():
                print(f"Task: {task_name}, Metric: {metric_name}, Score: {score_name}, Value: {score['value']}")

# Access a group-level metric (if present)
if 'groups' in results:
    for group_name, group_data in results['groups'].items():
        for metric_name, metric_data in group_data['metrics'].items():
            for score_name, score in metric_data['scores'].items():
                print(f"Group: {group_name}, Metric: {metric_name}, Score: {score_name}, Value: {score['value']}")

Always check for the presence of tasks or groups before accessing them, as not all result files will contain both.
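
If you want to compare runs or load scores into a spreadsheet, you can flatten the same nested structure into rows. The following is a minimal sketch that assumes the results.json layout shown above; the output file name flat_scores.csv is arbitrary.

import csv
import json

def flatten_scores(results):
    """Yield one row per score from the 'tasks' and 'groups' sections."""
    for section in ('tasks', 'groups'):
        for name, data in results.get(section, {}).items():
            for metric_name, metric_data in data.get('metrics', {}).items():
                for score_name, score in metric_data.get('scores', {}).items():
                    yield {
                        'section': section,
                        'name': name,
                        'metric': metric_name,
                        'score': score_name,
                        'value': score.get('value'),
                    }

with open('results.json') as f:
    results = json.load(f)

# Write the flattened scores to a CSV file for comparison across runs.
with open('flat_scores.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['section', 'name', 'metric', 'score', 'value'])
    writer.writeheader()
    writer.writerows(flatten_scores(results))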

Visualize Results#

You can visualize evaluation results using Weights and Biases or MLflow. To get started:

  1. Download the evaluation results.

  2. Download and extract the visualization scripts:

    ngc registry resource download-version "nvidia/nemo-microservices/evaluator_results_scripts:0.1.0"
    cd evaluator_results_scripts_v0.1.0
    unzip integrations.zip
    
  3. Determine which data visualization tool you want to use:

    • Weights and Biases: Requires a Weights and Biases API key

    • MLflow: Requires the URI of your MLflow tracking server

  4. After downloading and unzipping the scripts package, you’ll find the tool documentation in:

    • Weights and Biases: evaluator_results_scripts_v0.1.0/integrations/w_and_b/ReadME.md

    • MLflow: evaluator_results_scripts_v0.1.0/integrations/MLflow/ReadME.md

    Follow the instructions in these README files to set the required environment variables and install the script dependencies.

  5. Run the appropriate script for your chosen visualization tool:

Weights and Biases:

python3 w_and_b_eval_integration.py \
    --results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" \
    --experiment_name="<EXPERIMENT_NAME>"

MLflow:

python3 mlflow_eval_integration.py \
    --results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" \
    --mlflow_uri "<MLFLOW_URI>:<MLFLOW_PORT>" \
    --experiment_name="<EXPERIMENT_NAME>"
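
If you prefer not to use the packaged scripts, you can also log parsed scores to an MLflow tracking server directly with the MLflow Python client. This is a minimal sketch, not part of the Evaluator tooling; the tracking URI, experiment name, run name, and results.json path are placeholders you supply.

import json
import mlflow

# Point the client at your MLflow tracking server (placeholder URI).
mlflow.set_tracking_uri("<MLFLOW_URI>:<MLFLOW_PORT>")
mlflow.set_experiment("<EXPERIMENT_NAME>")

with open('results.json') as f:
    results = json.load(f)

with mlflow.start_run(run_name="nemo-evaluator-results"):
    for section in ('tasks', 'groups'):
        for name, data in results.get(section, {}).items():
            for metric_name, metric_data in data.get('metrics', {}).items():
                for score_name, score in metric_data.get('scores', {}).items():
                    value = score.get('value')
                    # MLflow metrics must be numeric; skip any non-numeric scores.
                    if isinstance(value, (int, float)):
                        mlflow.log_metric(f"{section}.{name}.{metric_name}.{score_name}", value)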

Result Formats#

Evaluation results can contain two types of components:

  1. Tasks: Results specific to individual evaluation tasks, each containing its own set of metrics and scores.

  2. Groups: Optional aggregated results that may combine metrics across multiple tasks or contain specialized group-level computations.
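
As an illustration only, a results file that contains both components has roughly the following shape; the task, group, metric, and score names are placeholders, and the structure matches the parsing example above.

{
  "tasks": {
    "<task_name>": {
      "metrics": {
        "<metric_name>": {
          "scores": {
            "<score_name>": { "value": 0.95 }
          }
        }
      }
    }
  },
  "groups": {
    "<group_name>": {
      "metrics": {
        "<metric_name>": {
          "scores": {
            "<score_name>": { "value": 0.92 }
          }
        }
      }
    }
  }
}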

Understanding the scores Object#

Each metric in the results contains a scores object, which maps score names to score entries. Each score entry has the following structure:

  • value: The main metric value (e.g., accuracy, BLEU, recall, etc.)

  • stats (optional): An object with additional statistics, such as:

    • count: Number of samples used

    • sum: Sum of all values

    • mean: Average value

    • min/max: Minimum/maximum value

Scores Example
"scores": {
  "accuracy": {
    "value": 0.95,
    "stats": {
      "count": 100,
      "sum": 95,
      "mean": 0.95,
      "min": 0.8,
      "max": 1.0
    }
  }
}

Not all metrics will include the stats field; it is present when additional statistics are computed.
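
Because stats is optional, access it defensively when post-processing scores. A minimal sketch, assuming score is one entry from a metric's scores mapping as in the example above:

# 'score' is one entry from a metric's 'scores' mapping.
stats = score.get('stats')
if stats is not None:
    print(f"mean={stats.get('mean')}, count={stats.get('count')}")
else:
    print(f"value={score['value']} (no additional statistics)")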

The specific metrics and scores available depend on the evaluation type, as shown in the following result formats.