Use the Results of Your Job#

After your NVIDIA NeMo Evaluator job completes, you can use the results.

Evaluator API URL#

To get the results of an evaluation job, send a GET request to the evaluation/jobs/<job_id>/results or evaluation/jobs/<job_id>/download-results API. The URL of the evaluator API depends on where you deploy evaluator and how you configure it. For more information, refer to NeMo Evaluator Deployment Guide.

The examples in this documentation specify {EVALUATOR_HOSTNAME} in the code. Store the evaluator hostname as follows so that you can use it in your code.

Important

Replace <your evaluator service endpoint> with your address, such as evaluator.internal.your-company.com, before you run this code.

export EVALUATOR_HOSTNAME="<your evaluator service endpoint>"
import requests

EVALUATOR_HOSTNAME = "<your evaluator service endpoint>" 

Get Evaluation Results#

To get evaluation results as a JSON response, send a GET request to the evaluation/jobs/<job_id>/results endpoint. You must provide the ID of the job as shown in the following code.

curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/results" \
  -H 'accept: application/json'
import requests

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/results"
response = requests.get(endpoint).json()
response

The following is an example response. Refer to the rest of this documentation for examples and reference information specific to your scenario.

{
    "created_at": "2025-03-19T22:53:43.619932",
    "updated_at": "2025-03-19T22:53:43.619934",
    "id": "evaluation_result-1234ABCD5678EFGH",
    "job": "eval-UVW123XYZ456",
    "tasks": {
        "exact_match": {
            "metrics": {
                "exact_match": {
                    "scores": {
                        "gsm8k-metric_ranking-1": {
                            "value": 0.0
                        },
                        "gsm8k-metric_ranking-3": {
                            "value": 0.8
                        }
                    }
                }
            }
        },
        "exact_match_stderr": {
            "metrics": {
                "exact_match_stderr": {
                    "scores": {
                        "gsm8k-metric_ranking-2": {
                            "value": 0.0
                        },
                        "gsm8k-metric_ranking-4": {
                            "value": 0.19999999999999998
                        }
                    }
                }
            }
        }
    },
    "groups": {
        "evaluation": {
            "metrics": {
                "evaluation": {
                    "scores": {
                        "exact_match": {
                            "value": 0.4
                        },
                        "exact_match_stderr": {
                            "value": 0.09999999999999999
                        }
                    }
                }
            }
        }
    },
    "namespace": "default",
    "custom_fields": {}
}
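The tasks and groups sections share the same nesting: metrics that contain named scores, each with a single value. The following is a minimal sketch that flattens the task-level scores of the response above, assuming the response variable from the Python example:

for task_name, task in response.get("tasks", {}).items():
    for metric_name, metric in task.get("metrics", {}).items():
        for score_name, score in metric.get("scores", {}).items():
            # Each score entry holds a single numeric "value".
            print(task_name, metric_name, score_name, score["value"])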

Download Evaluation Results#

To download the results of an evaluation job, send a GET request to the evaluation/jobs/<job_id>/download-results API. This downloads a zip archive that contains the configuration files, logs, and evaluation results for a specific evaluation job.

curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job_id>/download-results" \
-H 'accept: application/json' \
-o result.zip
import requests

url = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job_id>/download-results"

response = requests.get(url, headers={'accept': 'application/json'}, stream=True)

with open('result.zip', 'wb') as file:
    for chunk in response.iter_content():
        file.write(chunk)

print("Download completed.")

After the download completes, the results are available in the result.zip file. To unzip the result.zip file on Ubuntu, macOS, or Linux, run the following code.

unzip result.zip -d result

You can find the result files in the results/ folder. For example, if you run an lm-harness evaluation, the results are in automatic/lm_eval_harness/results.

The directory structure will look like this:

.
├── automatic
│   └── lm_eval_harness
│       ├── model_config_meta-llama-3_1-8b-instruct.yaml
│       ├── model_config_meta-llama-3_1-8b-instruct_inference_params.yaml
│       └── results
│           ├── README.md
│           ├── lm-harness-mmlu_str.json
│           ├── lm-harness.json
│           ├── lmharness_meta-llama-3_1-8b-instruct_aggregateresults-run.log
│           └── lmharness_meta-llama-3_1-8b-instruct_mmlu_str-run.log
└── metadata.json
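To inspect the aggregated results programmatically, you can load one of the JSON files from the unzipped directory. The following is a minimal sketch that assumes the archive was unzipped to result as shown above; the exact file names depend on your evaluation configuration.

import json

# Load the aggregated lm-harness results from the unzipped directory.
with open("result/automatic/lm_eval_harness/results/lm-harness.json") as f:
    lm_harness_results = json.load(f)

print(json.dumps(lm_harness_results, indent=2))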

Big Code Evaluation Results#

Results are returned at the evaluation and task level. pass@k is a popular metric for evaluating functional correctness.
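For reference, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval. The following is a minimal sketch in case you want to sanity-check a score from raw sample counts; the sample numbers are illustrative, and the Evaluator computes this for you.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n generated samples with c correct samples."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=32, k=1))  # 0.16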

Evaluation results are returned in the following format.

{
  "created_at": "2025-03-21T16:12:16.938210",
  "updated_at": "2025-03-21T16:12:16.938211",
  "id": "evaluation_result-1234ABCD5678EFGH",
  "job": "eval-UVW123XYZ456",
  "tasks": {
    "pass@1": {
      "metrics": {
        "pass@1": {
          "scores": {
            "humaneval": {
              "value": 0.159756097560976
            }
          }
        }
      }
    },
    "pass@1_stderr": {
      "metrics": {
        "pass@1_stderr": {
          "scores": {
            "humaneval-metric_ranking-1": {
              "value": 0.0128023429085295
            }
          }
        }
      }
    }
  },
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "pass@1": {
              "value": 0.159756097560976
            },
            "pass@1_stderr": {
              "value": 0.0128023429085295
            }
          }
        }
      }
    }
  },
  "namespace": "default",
  "custom_fields": {

  }
}

LM Evaluation Harness Evaluation Results#

Results are returned at the evaluation and task level. Evaluation results are returned in the following format.

{
  "created_at": "2025-03-19T21:12:58.789224",
  "updated_at": "2025-03-19T21:12:58.789226",
  "id": "evaluation_result-1234ABCD5678EFGH",
  "job": "eval-UVW123XYZ456",
  "tasks": {
    "exact_match": {
      "metrics": {
        "exact_match": {
          "scores": {
            "gsm8k_cot_llama-metric_ranking-1": {
              "value": 0.309325246398787
            },
            "gsm8k_cot_llama-metric_ranking-3": {
              "value": 0.374526156178923
            }
          }
        }
      }
    },
    "exact_match_stderr": {
      "metrics": {
        "exact_match_stderr": {
          "scores": {
            "gsm8k_cot_llama-metric_ranking-2": {
              "value": 0.0127317109250781
            },
            "gsm8k_cot_llama-metric_ranking-4": {
              "value": 0.0133317741584914
            }
          }
        }
      }
    }
  },
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "exact_match": {
              "value": 0.341925701288855
            },
            "exact_match_stderr": {
              "value": 0.0130317425417848
            }
          }
        }
      }
    }
  },
  "namespace": "default",
  "custom_fields": {

  }
}

Similarity Metrics Evaluation Results#

The NeMo Evaluator job returns aggregated evaluation results for each of the scorers (metrics) that you specified in the configuration.

Evaluation results are returned in the following format for each scorer.

{
  "created_at": "2025-03-05T17:03:01.643861",
  "updated_at": "2025-03-05T17:03:01.643862",
  "id": "evaluation_result-1234ABCD5678EFGH",
  "job": "eval-UVW123XYZ456",
  "tasks": {

  },
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "accuracy": {
              "value": 0.0444444444444444
            },
            "bleu_score": {
              "value": 0.0813085085759745
            },
            "rouge_1_score": {
              "value": 0.277633859731954
            },
            "rouge_2_score-metric_ranking-1": {
              "value": 0.139289906138245
            },
            "rouge_3_score-metric_ranking-2": {
              "value": 0.0591258646114323
            },
            "rouge_L_score-metric_ranking-3": {
              "value": 0.272577935264265
            }
          }
        }
      }
    }
  },
  "namespace": "default",
  "custom_fields": {

  }
}

LLM-as-a-Judge Evaluation Results#

The following Python script can be used to download the generated results:

import huggingface_hub as hh

url = "<NeMo Data Store URL>"
token = "mock"
repo_name = "<evaluation id>"
download_path = "<Path where results will be downloaded>"

repo_name = f'nvidia/{repo_name}'

# Download the evaluation results repository from the NeMo Data Store.
api = hh.HfApi(endpoint=url, token=token)
repo_type = 'dataset'
api.snapshot_download(repo_id=repo_name, repo_type=repo_type, local_dir=download_path, local_dir_use_symlinks=False)

The downloaded results directory will have the following structure:

|-- mt_bench
|   |-- model_answer
|   |   |-- <llm_name>.jsonl
|   |-- model_judgement
|   |   |-- <llm_name>.jsonl
|   |-- reference_answer
|   |   |-- <reference>.jsonl
|   |-- question.jsonl
|   |-- judge_prompts.jsonl
|-- results
|   |-- <llm_name>.csv

  • User LLM answers: mt_bench/model_answer/<llm_name>.jsonl file with User LLM responses for each prompt in the evaluation dataset

  • Judge LLM responses: mt_bench/model_judgment/<llm_name>.jsonl file containing the Judge LLM ratings, with explanation, for each User LLM answer

  • Aggregated scores: Aggregated scores are returned as a .csv file with the following structure (see the sketch after this list for loading it programmatically):

    | Category   | Score out of 10 |
    |------------|-----------------|
    | total      | 1.57            |
    | humanities | 2.4             |
    | reasoning  | 1.0             |
    | writing    | 1.3             |
    | coding     | 1.3             |
    | stem       | 2.1             |
    | roleplay   | 1.73            |
    | math       | 1.0             |
    | extraction | 2.0             |
    | turn 1     | 1.64            |
    | turn 2     | 1.09            |

  • Each row in the .csv describes a score from 1 to 10 for a given evaluation category, where 1 signifies the weakest evaluation and 10 the strongest

  • The total refers to the average score across all categories

  • The turn 1 and turn 2 scores are average scores for the respective turns

  • For custom evaluations, the categories will follow what’s provided in the custom dataset

  • The Judge LLM must provide ratings in the specific format [[rating]]. A warning appears in the .csv file if the Judge LLM failed to generate a rating in the required format for one or more questions. In this case, adjust the inference parameters or use a different Judge LLM.
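The following is a minimal sketch for loading the aggregated scores into a dictionary, assuming the .csv file was downloaded to results/<llm_name>.csv and has the two columns shown above.

import csv

# Replace <llm_name> with the name of your evaluated model.
with open("results/<llm_name>.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    scores = {category: float(score) for category, score in reader}

print(scores.get("total"))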

Custom prompt for custom dataset#

When a custom dataset is provided for judgment, users often want to give the judge more guidance, for example, context about the question so that the judge can make a better decision.

Because the reference entry is inserted into each prompt, reference.jsonl can be used for more than a reference answer.

Here are two use cases for judgment-only evaluation.

Use case 1: Provide background knowledge to the judge#

In this case, users want to provide background knowledge that the judge uses when evaluating the model answer.

In judge_prompt.jsonl, we need to modify the prompt_template accordingly:

{"name": "single-ref-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. You will be provided some background knowledge in the context section and should check if the answer match the background knowledge. After providing your explanation, you must rate the response on a float scale of 0 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[3]]\".\n\n[Question]\n{question}\n\n[The Start of Context]\n{ref_answer_1}\n[The End of Context]\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": ["general"], "output_format": "[[rating]]"}

The same approach applies to the multi-turn prompt_template as well.

In reference.jsonl, we can provide background knowledge for each question-answer pair. Example:

{"question_id": "100", "choices": [{"index": 0, "turns": ["The user has been very sick recently"]}]}

This background knowledge will be placed in {ref_answer_1}.

The same approach applies to multi-turn references as well.

Use case 2: Provide multiple custom prompts to the judge#

We can expand use case 1 to create a highly customized prompt for each question by utilizing reference.jsonl. Suppose the judge needs three types of custom prompt:

  1. ground truth: the expert answer to the question. The judge compares the model answer with the ground truth to make a better judgment.

  2. context: the background knowledge that the judge needs when judging the model answer (see use case 1).

  3. assertion: assertions for the judge model to verify whether the model answer covers certain aspects. For example, users might want the model answer to use respectful language or to explain with an example.

For each question-answer pair, we want the judge to consider all three types of context above, so we create a custom prompt for each model answer. This is feasible with minor tweaks.

Suppose we have stored the information in a csv file:

question_id, ground_truth, context, assertions
100, "First, you should go to see the doctor and have a thorough medical exam. Next, based on the doctor's suggestion, take medicines and rest.", "This user has been sick recently.", "Does the answer suggest go to see the doctor?"

After this data transformation, we have the following reference.jsonl:

{"question_id": "100", "choices": [{"index": 0, "turns": ["Assertions [model answer should cover the following key points]: Does the answer suggest go to see the doctor?\n [An expert's answer as a reference]: First, you should go to see the doctor and have a thorough medical exam. Next, based on the doctor's suggestion, take medicines and rest. \n [Context about the user]: This user has been sick recently.", "null"]}]}

We should also update judge_prompt.jsonl slightly:

{"name": "single-ref-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. You will be provided three types of context in the context section: ground truth, context and assertion. After analyzing all the context, you must rate the response on a float scale of 0 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[3]]\".\n\n[Question]\n{question}\n\n[The Start of Context]\n{ref_answer_1}\n[The End of Context]\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": ["general"], "output_format": "[[rating]]"}

In this way, we can customize the prompt with as much context and other information as the judge needs.

Retriever Pipeline Evaluation Results#

Evaluation results are returned in the following format.

  1. recall@k: Recall at k is calculated as the fraction of the relevant documents that are successfully retrieved within the top k extracted documents. Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the retriever model.

  2. ndcg@k/ndcg_cut_k: Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval. It is often normalized so that it is comparable across queries, giving normalized DCG (nDCG or NDCG). Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the retriever model. A reference sketch for computing both metrics offline follows the example response below.

{
    "created_at": "2025-03-29T07:16:54.298605",
    "updated_at": "2025-03-29T07:16:54.298607",
    "id": "evaluation_result-1234ABCD5678EFGH",
    "job": "eval-UVW123XYZ456",
    "tasks": {},
    "groups": {
        "evaluation": {
            "metrics": {
                "evaluation": {
                    "scores": {
                        "recall_10": {
                            "value": 0.5219448247226026
                        },
                        "ndcg_cut_5": {
                            "value": 0.43118519470524036
                        },
                        "ndcg_cut_10": {
                            "value": 0.4548908807830673
                        },
                        "recall_5": {
                            "value": 0.4455075402992067
                        }
                    }
                }
            }
        }
    },
    "namespace": "default",
    "custom_fields": {}
}
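If you want to sanity-check these numbers offline, recall@k and a binary-relevance nDCG@k can be computed from a ranked list of retrieved document IDs and the set of relevant IDs. The following is a reference sketch; the Evaluator computes these for you, and the job reports values aggregated over queries rather than per-query scores.

import math

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents found in the top k retrieved documents."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(retrieved[:k]) if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0.0 else 0.0

print(recall_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3))  # 0.5
print(ndcg_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3))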

RAG Pipeline Evaluation Results#

Evaluation results are returned in the following format.

{
    "created_at": "2025-03-29T07:16:54.298605",
    "updated_at": "2025-03-29T07:16:54.298607",
    "id": "evaluation_result-1234ABCD5678EFGH",
    "job": "eval-UVW123XYZ456",
    "tasks": {},
    "groups": {
        "evaluation": {
            "metrics": {
                "evaluation": {
                    "scores": {
                        "recall_10": {
                            "value": 0.5219448247226026
                        },
                        "ndcg_cut_5": {
                            "value": 0.43118519470524036
                        },
                        "ndcg_cut_10": {
                            "value": 0.4548908807830673
                        },
                        "recall_5": {
                            "value": 0.4455075402992067
                        },
                        "faithfulness": {
                            "value": 0.7811975946247776
                        }
                    }
                }
            }
        }
    },
    "namespace": "default",
    "custom_fields": {}
}

Document retrieval:

  • recall@k: Recall at k is calculated as the fraction of the relevant documents that are successfully retrieved within the top k extracted documents. Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the Retriever.

  • ndcg@k/ndcg_cut_k: Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval. It is often normalized so that it is comparable across queries, giving normalized DCG (nDCG or NDCG). Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the Retriever.

Answer generation:

  • faithfulness: Measures the factual consistency of the generated answer against the provided context. The score ranges from 0 to 1, with higher values indicating greater accuracy. This metric uses a judge_llm. This metric can be used when dataset_format is set to beir or squad, or when dataset_format is set to ragas and the columns question, answer, and contexts are present in the data.

  • answer_relevancy: Measures how relevant the generated answer is to the given prompt, evaluated using an LLM and an embedding model. Incomplete answers or answers with redundant information receive lower scores; higher scores indicate better relevancy. This metric uses a judge_llm and a judge_embeddings. This metric can be used when dataset_format is set to beir or squad, or when dataset_format is set to ragas and the columns question and answer are present in the data.

  • answer_correctness: Accuracy of the generated answer when compared to the ground truth. The score ranges from 0 to 1, with higher values indicating better correctness. This metric uses a judge_llm and a judge_embeddings. This metric cannot be used when dataset_format is set to beir or squad; it can be used when dataset_format is set to ragas and the columns question, answer, and ground_truth are present in the data.

  • answer_similarity: Semantic similarity between the generated answer and the ground truth. The score ranges from 0 to 1, with higher values indicating better alignment. This metric uses a judge_llm and a judge_embeddings. This metric cannot be used when dataset_format is set to beir or squad; it can be used when dataset_format is set to ragas and the columns ground_truth and answer are present in the data.

  • context_precision: Measures whether ground-truth relevant items are ranked higher in the retrieved context. The score ranges from 0 to 1, with higher values indicating better precision. This metric uses a judge_llm. This metric cannot be used when dataset_format is set to beir or squad; it can be used when dataset_format is set to ragas and the columns question, contexts, and ground_truth are present in the data.

  • context_recall: Measures whether the retrieved context aligns with the ground_truth answer. The score ranges from 0 to 1, with higher values indicating better performance. This metric uses a judge_llm. This metric cannot be used when dataset_format is set to beir or squad; it can be used when dataset_format is set to ragas and the columns question, contexts, and ground_truth are present in the data (see the example record after this list).
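For illustration, the following is a hypothetical ragas-format record that contains all four columns and therefore supports every answer-generation metric listed above; the field values are made up.

# Hypothetical ragas-format record. With question, contexts, answer, and ground_truth
# present, all of the answer-generation metrics above can be computed.
record = {
    "question": "What does the RAG pipeline evaluation report?",
    "contexts": ["The evaluation reports aggregated retrieval and answer generation scores."],
    "answer": "It reports aggregated retrieval and answer generation scores.",
    "ground_truth": "Aggregated retrieval and answer generation scores.",
}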

Visualize Evaluation Results with Weights and Biases or MLFlow#

You can use the Weights and Biases and MLFlow Python scripts located in NVIDIA NGC to upload evaluation results to supported visualization tools. Use the following procedure to get the scripts.

  1. Use the following NGC CLI code to download the zip file that contains the scripts.

    ngc registry resource download-version "nvidia/nemo-microservices/evaluator_results_scripts:0.1.0"
    
  2. Unzip the script files.

    cd evaluator_results_scripts_v0.1.0
    unzip integrations.zip
    

Use the following procedure to upload evaluation results.

  1. Download the evaluation job results by using the download-results Evaluator API endpoint. For details, see Download Evaluation Results.

  2. Determine which data visualization tool you want to use, MLFlow or Weights and Biases, and verify that you have the MLFlow URI key or Weights and Biases API key.

  3. Follow the documentation for the visualization tool found in the Weights and Biases README (./integrations/w_and_b/ReadME.md) or MLFlow README (./integrations/MLFlow/ReadME.md) to prepare environment variables and dependencies for the scripts.

  4. Run the script by following the downloaded README. Ensure that the path to the results that you downloaded earlier, as specified in the script, ends at the folder that contains the JSON or CSV output files.

    • Example command for Weights and Biases: python3 w_and_b_eval_integration.py --results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" --experiment_name="<EXPERIMENT_NAME>"

    • Example command for MLFlow: python3 mlflow_eval_integration.py --results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" --mlflow_uri "<MLFLOW_URI>:<MLFLOW_PORT>" --experiment_name="<EXPERIMENT_NAME>"