Use the Results of Your Job#
After your NVIDIA NeMo Evaluator job completes, you can use the results. Results are returned in a consistent JSON format across all evaluation types, with metrics specific to each evaluation method.
Working with Results#
Obtain Results#
You can obtain your results as a JSON response or a downloaded ZIP file.
View evaluation results as a JSON response
Download the complete evaluation results as a ZIP file
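For example, here is a minimal sketch of retrieving both forms over HTTP with Python. The base URL, job ID, and endpoint paths below are placeholders and assumptions; confirm the exact paths against your deployment's API reference.
import requests

EVALUATOR_BASE_URL = "http://<EVALUATOR_HOSTNAME>:<PORT>"  # placeholder
JOB_ID = "<JOB_ID>"  # placeholder

# Fetch the evaluation results as JSON (endpoint path is an assumption; verify it in your API reference)
response = requests.get(f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{JOB_ID}/results")
response.raise_for_status()
results = response.json()

# Download the complete results as a ZIP file (endpoint path is an assumption; verify it in your API reference)
download = requests.get(f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{JOB_ID}/download-results")
download.raise_for_status()
with open("results.zip", "wb") as f:
    f.write(download.content)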
Parse Results#
You can load and analyze evaluation results using Python. Here is an example of how to access task-level and group-level metrics:
import json

with open('results.json') as f:
    results = json.load(f)

# Access task-level metrics (if present)
if 'tasks' in results:
    for task_name, task_data in results['tasks'].items():
        for metric_name, metric_data in task_data['metrics'].items():
            for score_name, score in metric_data['scores'].items():
                print(f"Task: {task_name}, Metric: {metric_name}, Score: {score_name}, Value: {score['value']}")

# Access group-level metrics (if present)
if 'groups' in results:
    for group_name, group_data in results['groups'].items():
        for metric_name, metric_data in group_data['metrics'].items():
            for score_name, score in metric_data['scores'].items():
                print(f"Group: {group_name}, Metric: {metric_name}, Score: {score_name}, Value: {score['value']}")
Always check for the presence of tasks or groups before accessing them, as not all result files will contain both.
Visualize Results#
You can visualize evaluation results using Weights and Biases or MLflow. To get started:
Install the visualization tools:
ngc registry resource download-version "nvidia/nemo-microservices/evaluator_results_scripts:0.1.0"
cd evaluator_results_scripts_v0.1.0
unzip integrations.zip
Determine which data visualization tool you want to use:
Weights and Biases: Requires an API key
MLflow: Requires a tracking URI
After downloading and unzipping the scripts package, you’ll find the tool documentation in:
Weights and Biases:
evaluator_results_scripts_v0.1.0/integrations/w_and_b/ReadME.md
MLflow:
evaluator_results_scripts_v0.1.0/integrations/MLflow/ReadME.md
Follow these README files to prepare environment variables and dependencies for the scripts.
Run the appropriate script for your chosen visualization tool:
python3 w_and_b_eval_integration.py \
--results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" \
--experiment_name="<EXPERIMENT_NAME>"
python3 mlflow_eval_integration.py \
--results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" \
--mlflow_uri "<MLFLOW_URI>:<MLFLOW_PORT>" \
--experiment_name="<EXPERIMENT_NAME>"
Result Formats#
Evaluation results can contain two types of components:
Tasks: Results specific to individual evaluation tasks, each containing its own set of metrics and scores.
Groups: Optional aggregated results that may combine metrics across multiple tasks or contain specialized group-level computations.
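The following skeleton is illustrative only; the task, group, and metric names are placeholders, and a given result file may contain only one of the two sections:
{
  "tasks": {
    "<task_name>": {
      "metrics": {
        "<metric_name>": {
          "scores": { ... }
        }
      }
    }
  },
  "groups": {
    "<group_name>": {
      "metrics": {
        "<metric_name>": {
          "scores": { ... }
        }
      }
    }
  }
}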
Understanding the scores Object#
Each metric in the results contains a scores object, which is a mapping of score names to their computed values. The structure is:
value: The main metric value (for example, accuracy, BLEU, or recall)
stats (optional): An object with additional statistics, such as:
    count: Number of samples used
    sum: Sum of all values
    mean: Average value
    min/max: Minimum/maximum values
Scores Example
"scores": {
"accuracy": {
"value": 0.95,
"stats": {
"count": 100,
"sum": 95,
"mean": 0.95,
"min": 0.8,
"max": 1.0
}
}
}
Not all metrics will include the stats field; it is present when additional statistics are computed.
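For example, here is a short sketch of reading a score defensively, assuming results was loaded as in the parsing example above (the task, metric, and score names are placeholders):
score = results['tasks']['<task_name>']['metrics']['<metric_name>']['scores']['<score_name>']
print(score['value'])        # value is always present
stats = score.get('stats')   # stats is optional
if stats is not None:
    print(stats['count'], stats['mean'])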
The specific metrics and scores available depend on the evaluation type, as shown in the following result formats.