Use the Results of Your Job
After your NVIDIA NeMo Evaluator job completes, you can use the results. Results are returned in a consistent JSON format across all evaluation types, with metrics specific to each evaluation method.
Working with Results
Obtain Results
You can obtain your results as a JSON response or a downloaded ZIP file.
View evaluation results as JSON response
Download complete evaluation results as a ZIP file
Parse Results
You can load and analyze evaluation results using Python. Here is an example of how to access task-level and group-level metrics:
import json
with open('results.json') as f:
    results = json.load(f)
# Access a task-level metric (if present)
if 'tasks' in results:
    for task_name, task_data in results['tasks'].items():
        for metric_name, metric_data in task_data['metrics'].items():
            for score_name, score in metric_data['scores'].items():
                print(f"Task: {task_name}, Metric: {metric_name}, Score: {score_name}, Value: {score['value']}")
# Access a group-level metric (if present)
if 'groups' in results:
    for group_name, group_data in results['groups'].items():
        for metric_name, metric_data in group_data['metrics'].items():
            for score_name, score in metric_data['scores'].items():
                print(f"Group: {group_name}, Metric: {metric_name}, Score: {score_name}, Value: {score['value']}")
Always check for the presence of tasks or groups before accessing them, as not all result files will contain both.
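When the scores need to go into a table or a CSV file, it can help to flatten the nested structure first. The following is a minimal sketch, assuming the same tasks/groups, metrics, scores nesting used in the example above. The helper name iter_scores and the sample data are hypothetical and are not part of NeMo Evaluator.

```python
from typing import Any, Dict, Iterator, Tuple

Row = Tuple[str, str, str, str, Any]


def iter_scores(results: Dict[str, Any]) -> Iterator[Row]:
    """Yield (component, name, metric, score, value) for every score found.

    Assumes the nesting shown in the parsing example:
    results[component][name]["metrics"][metric]["scores"][score]["value"].
    Missing or empty "tasks"/"groups" sections are skipped.
    """
    for component in ("tasks", "groups"):
        for name, entry in (results.get(component) or {}).items():
            for metric_name, metric in (entry.get("metrics") or {}).items():
                for score_name, score in (metric.get("scores") or {}).items():
                    yield component, name, metric_name, score_name, score.get("value")


# Hypothetical sample data for illustration only; real files come from your job.
sample = {
    "tasks": {
        "example_task": {
            "metrics": {
                "accuracy": {"scores": {"accuracy": {"value": 0.95}}}
            }
        }
    }
}

rows = list(iter_scores(sample))
for row in rows:
    print(row)
```

Because the helper skips absent sections instead of indexing into them, the same call works for result files that contain only tasks, only groups, or both.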
Visualize Results
You can visualize evaluation results using Weights and Biases or MLflow. To get started:
- Install the visualization tools:
    ngc registry resource download-version "nvidia/nemo-microservices/evaluator_results_scripts:0.1.0"
    cd evaluator_results_scripts_v0.1.0
    unzip integrations.zip
- Determine which data visualization tool you want to use:
  - Weights and Biases: Requires an API key
  - MLflow: Requires a URI key
- After downloading and unzipping the scripts package, you’ll find the tool documentation in:
  - Weights and Biases: evaluator_results_scripts_v0.1.0/integrations/w_and_b/ReadME.md
  - MLflow: evaluator_results_scripts_v0.1.0/integrations/MLflow/ReadME.md
  Follow these README files to prepare environment variables and dependencies for the scripts.
- Run the appropriate script for your chosen visualization tool: 
# Weights and Biases
python3 w_and_b_eval_integration.py \
    --results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" \
    --experiment_name="<EXPERIMENT_NAME>"

# MLflow
python3 mlflow_eval_integration.py \
    --results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" \
    --mlflow_uri "<MLFLOW_URI>:<MLFLOW_PORT>" \
    --experiment_name="<EXPERIMENT_NAME>"
Result Formats
Evaluation results can contain two types of components:
- Tasks: Results specific to individual evaluation tasks, each containing its own set of metrics and scores. 
- Groups: Optional aggregated results that may combine metrics across multiple tasks or contain specialized group-level computations. 
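To read a single value without walking the whole file, a small lookup helper can treat a missing component as absent rather than as an error. This is a sketch only, assuming the nesting used in the Parse Results example (component, name, "metrics", metric, "scores", score, "value"). The function get_score_value and the sample names are hypothetical.

```python
from typing import Any, Dict, Optional


def get_score_value(
    results: Dict[str, Any],
    component: str,  # "tasks" or "groups"
    name: str,
    metric: str,
    score: str,
) -> Optional[Any]:
    """Return one score value, or None if any level of the path is missing."""
    node: Any = results
    for key in (component, name, "metrics", metric, "scores", score):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node.get("value") if isinstance(node, dict) else None


# Hypothetical results containing tasks but no groups.
sample = {
    "tasks": {
        "example_task": {
            "metrics": {"accuracy": {"scores": {"accuracy": {"value": 0.95}}}}
        }
    }
}

task_value = get_score_value(sample, "tasks", "example_task", "accuracy", "accuracy")
group_value = get_score_value(sample, "groups", "example_group", "accuracy", "accuracy")
print(task_value, group_value)
```

Returning None for a missing path keeps calling code simple when groups are not present, at the cost of not distinguishing a missing group from a misspelled name.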
Understanding the scores Object
Each metric in the results contains a scores object, which maps each score name to an object with the following fields:
- value: The main metric value (for example, accuracy, BLEU, or recall).
- stats (optional): An object with additional statistics, such as:
  - count: Number of samples used
  - sum: Sum of all values
  - mean: Average value
  - min / max: Minimum / maximum value
Scores Example
"scores": {
  "accuracy": {
    "value": 0.95,
    "stats": {
      "count": 100,
      "sum": 95,
      "mean": 0.95,
      "min": 0.8,
      "max": 1.0
    }
  }
}
Not all metrics will include the stats field; it is present when additional statistics are computed.
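Because stats is optional, code that reads it should fall back gracefully when it is absent. The sketch below shows one way to do that, using the field names listed above. The helper names describe_score and mean_matches are hypothetical. The check that mean equals sum divided by count is an assumption based on the example values (95 / 100 = 0.95), not a documented guarantee.

```python
import math
from typing import Any, Dict


def describe_score(name: str, score: Dict[str, Any]) -> str:
    """Return a one-line summary of a score entry, adding stats when present."""
    line = f"{name}: {score['value']}"
    stats = score.get("stats")
    if stats:
        line += (
            f" (n={stats.get('count')}, "
            f"min={stats.get('min')}, max={stats.get('max')})"
        )
    return line


def mean_matches(stats: Dict[str, Any], tol: float = 1e-9) -> bool:
    """Check mean == sum / count (assumed relationship); True if not checkable."""
    count, total, mean = stats.get("count"), stats.get("sum"), stats.get("mean")
    if not count or total is None or mean is None:
        return True
    return math.isclose(total / count, mean, abs_tol=tol)


# The scores example from this section, as a Python dict.
scores = {
    "accuracy": {
        "value": 0.95,
        "stats": {"count": 100, "sum": 95, "mean": 0.95, "min": 0.8, "max": 1.0},
    }
}

summary = describe_score("accuracy", scores["accuracy"])
print(summary)  # accuracy: 0.95 (n=100, min=0.8, max=1.0)
print(mean_matches(scores["accuracy"]["stats"]))  # True

# A score without stats still produces a summary.
print(describe_score("bleu", {"value": 0.42}))  # bleu: 0.42
```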
The specific metrics and scores available depend on the evaluation type, as shown in the following result formats.