Use the Results of Your Job#
After your NVIDIA NeMo Evaluator job completes, you can use the results.
Evaluator API URL#
To get the results of an evaluation job, send a GET
request to the evaluation/jobs/<job_id>/results
or evaluation/jobs/<job_id>/download-results
API.
The URL of the evaluator API depends on where you deploy evaluator and how you configure it.
For more information, refer to NeMo Evaluator Deployment Guide.
The examples in this documentation specify {EVALUATOR_HOSTNAME}
in the code.
Do the following to store the evaluator hostname so that you can use it in your code.
Important
Replace <your evaluator service endpoint> with your address, such as evaluator.internal.your-company.com, before you run this code.
export EVALUATOR_HOSTNAME="<your evaluator service endpoint>"
import requests
EVALUATOR_HOSTNAME = "<your evaluator service endpoint>"
Get Evaluation Results#
To get evaluation results as a JSON response, send a GET
request to the evaluation/jobs/<job_id>/results
endpoint.
You must provide the ID of the job as shown in the following code.
curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/results" \
-H 'accept: application/json'
import requests
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/results"
response = requests.get(endpoint).json()
response
The following is an example response. Refer to the rest of this documentation for examples and reference information for the results specific to your scenario.
{
"created_at": "2025-03-19T22:53:43.619932",
"updated_at": "2025-03-19T22:53:43.619934",
"id": "evaluation_result-1234ABCD5678EFGH",
"job": "eval-UVW123XYZ456",
"tasks": {
"exact_match": {
"metrics": {
"exact_match": {
"scores": {
"gsm8k-metric_ranking-1": {
"value": 0.0
},
"gsm8k-metric_ranking-3": {
"value": 0.8
}
}
}
}
},
"exact_match_stderr": {
"metrics": {
"exact_match_stderr": {
"scores": {
"gsm8k-metric_ranking-2": {
"value": 0.0
},
"gsm8k-metric_ranking-4": {
"value": 0.19999999999999998
}
}
}
}
}
},
"groups": {
"evaluation": {
"metrics": {
"evaluation": {
"scores": {
"exact_match": {
"value": 0.4
},
"exact_match_stderr": {
"value": 0.09999999999999999
}
}
}
}
}
},
"namespace": "default",
"custom_fields": {}
}
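To work with the scores programmatically, you can flatten the nested structure. The following is a minimal sketch that assumes the response layout shown above and the response object from the earlier Python example; the task, metric, and score names vary by evaluation type.
# Flatten group-level and task-level scores from the results response
group_scores = {
    name: score["value"]
    for metric in response["groups"]["evaluation"]["metrics"].values()
    for name, score in metric["scores"].items()
}
task_scores = {
    f"{task_name}/{score_name}": score["value"]
    for task_name, task in response["tasks"].items()
    for metric in task["metrics"].values()
    for score_name, score in metric["scores"].items()
}
print(group_scores)  # for the example above: {'exact_match': 0.4, 'exact_match_stderr': 0.09999999999999999}
print(task_scores)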
Download Evaluation Results#
To download the results of an evaluation job, send a GET
request to the evaluation/jobs/<job_id>/download-results
API.
This downloads a directory that contains the configuration files, logs, and evaluation results for a specific evaluation job.
curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job_id>/download-results" \
-H 'accept: application/json' \
-o result.zip
import requests

url = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job_id>/download-results"
response = requests.get(url, headers={'accept': 'application/json'}, stream=True)

with open('result.zip', 'wb') as file:
    # Stream the archive to disk in chunks instead of loading it into memory
    for chunk in response.iter_content(chunk_size=8192):
        file.write(chunk)

print("Download completed.")
After the download completes, the results are available in the result.zip file. To unzip the result.zip file on Ubuntu, macOS, or Linux, run the following code.
unzip result.zip -d result
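If you prefer to extract the archive from Python instead, the standard library zipfile module works as well.
import zipfile

# Extract the downloaded archive into a "result" directory
with zipfile.ZipFile("result.zip") as archive:
    archive.extractall("result")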
You can find the result files in the results/ folder. For example, if you run an lm-harness evaluation, the results are in automatic/lm_eval_harness/results.
The directory structure will look like this:
.
├── automatic
│   └── lm_eval_harness
│       ├── model_config_meta-llama-3_1-8b-instruct.yaml
│       ├── model_config_meta-llama-3_1-8b-instruct_inference_params.yaml
│       └── results
│           ├── README.md
│           ├── lm-harness-mmlu_str.json
│           ├── lm-harness.json
│           ├── lmharness_meta-llama-3_1-8b-instruct_aggregateresults-run.log
│           └── lmharness_meta-llama-3_1-8b-instruct_mmlu_str-run.log
└── metadata.json
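To inspect one of the result files from Python, you can load it with the standard library. The following is a minimal sketch that assumes you extracted the archive to result/ as shown above; the internal structure of each JSON file depends on the evaluation type.
import json
from pathlib import Path

# Path to the extracted lm-harness results shown in the tree above
results_dir = Path("result/automatic/lm_eval_harness/results")
with open(results_dir / "lm-harness.json") as f:
    results = json.load(f)

# Print the top-level keys to see what the file contains
print(list(results.keys()))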
Big Code Evaluation Results#
Results are returned at the evaluation and task level.
pass@k
is a popular metric for evaluating functional correctness.
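For intuition, pass@k estimates the probability that at least one of k sampled completions passes the unit tests. The following sketch shows the commonly used unbiased estimator computed from n generated samples of which c pass; it is for illustration only and is not necessarily the exact code that the evaluation harness runs.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples with c passing."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 3 of which pass the tests
print(pass_at_k(n=20, c=3, k=1))  # 0.15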
Evaluation results are returned in the following format.
{
"created_at": "2025-03-21T16:12:16.938210",
"updated_at": "2025-03-21T16:12:16.938211",
"id": "evaluation_result-1234ABCD5678EFGH",
"job": "eval-UVW123XYZ456",
"tasks": {
"pass@1": {
"metrics": {
"pass@1": {
"scores": {
"humaneval": {
"value": 0.159756097560976
}
}
}
}
},
"pass@1_stderr": {
"metrics": {
"pass@1_stderr": {
"scores": {
"humaneval-metric_ranking-1": {
"value": 0.0128023429085295
}
}
}
}
}
},
"groups": {
"evaluation": {
"metrics": {
"evaluation": {
"scores": {
"pass@1": {
"value": 0.159756097560976
},
"pass@1_stderr": {
"value": 0.0128023429085295
}
}
}
}
}
},
"namespace": "default",
"custom_fields": {
}
}
LM Evaluation Harness Evaluation Results#
Results are returned at the evaluation and task level. Evaluation results are returned in the following format.
{
"created_at": "2025-03-19T21:12:58.789224",
"updated_at": "2025-03-19T21:12:58.789226",
"id": "evaluation_result-1234ABCD5678EFGH",
"job": "eval-UVW123XYZ456",
"tasks": {
"exact_match": {
"metrics": {
"exact_match": {
"scores": {
"gsm8k_cot_llama-metric_ranking-1": {
"value": 0.309325246398787
},
"gsm8k_cot_llama-metric_ranking-3": {
"value": 0.374526156178923
}
}
}
}
},
"exact_match_stderr": {
"metrics": {
"exact_match_stderr": {
"scores": {
"gsm8k_cot_llama-metric_ranking-2": {
"value": 0.0127317109250781
},
"gsm8k_cot_llama-metric_ranking-4": {
"value": 0.0133317741584914
}
}
}
}
}
},
"groups": {
"evaluation": {
"metrics": {
"evaluation": {
"scores": {
"exact_match": {
"value": 0.341925701288855
},
"exact_match_stderr": {
"value": 0.0130317425417848
}
}
}
}
}
},
"namespace": "default",
"custom_fields": {
}
}
Similarity Metrics Evaluation Results#
The NeMo Evaluator job returns aggregated evaluation results for each of the scorers (metrics) that you specified in the configuration.
Evaluation results are returned in the following format for each scorer.
{
"created_at": "2025-03-05T17:03:01.643861",
"updated_at": "2025-03-05T17:03:01.643862",
"id": "evaluation_result-1234ABCD5678EFGH",
"job": "eval-UVW123XYZ456",
"tasks": {
},
"groups": {
"evaluation": {
"metrics": {
"evaluation": {
"scores": {
"accuracy": {
"value": 0.0444444444444444
},
"bleu_score": {
"value": 0.0813085085759745
},
"rouge_1_score": {
"value": 0.277633859731954
},
"rouge_2_score-metric_ranking-1": {
"value": 0.139289906138245
},
"rouge_3_score-metric_ranking-2": {
"value": 0.0591258646114323
},
"rouge_L_score-metric_ranking-3": {
"value": 0.272577935264265
}
}
}
}
}
},
"namespace": "default",
"custom_fields": {
}
}
LLM-as-a-Judge Evaluation Results#
You can use the following Python script to download the generated results:
import huggingface_hub as hh

# NeMo Data Store connection details and the evaluation job whose results you want
url = "<NeMo Data Store URL>"
token = "mock"
repo_name = "<evaluation id>"
download_path = "<Path where results will be downloaded>"

# Evaluation results are stored as a dataset repository under the "nvidia" namespace
repo_name = f'nvidia/{repo_name}'
api = hh.HfApi(endpoint=url, token=token)
repo_type = 'dataset'

# Download a full snapshot of the results repository to the local directory
api.snapshot_download(repo_id=repo_name, repo_type=repo_type, local_dir=download_path, local_dir_use_symlinks=False)
The downloaded results directory has the following structure:
|-- mt_bench
|   |-- model_answer
|   |   |-- <llm_name>.jsonl
|   |-- model_judgment
|   |   |-- <llm_name>.jsonl
|   |-- reference_answer
|   |   |-- <reference>.jsonl
|   |-- question.jsonl
|   |-- judge_prompts.jsonl
|-- results
|   |-- <llm_name>.csv
- User LLM answers: the mt_bench/model_answer/<llm_name>.jsonl file contains the User LLM responses for each prompt in the evaluation dataset.
- Judge LLM responses: the mt_bench/model_judgment/<llm_name>.jsonl file contains the Judge LLM ratings, with explanations, for each User LLM answer.
- Aggregated scores: aggregated scores are returned as a .csv file with the following structure:

Category | Score out of 10
total | 1.57
humanities | 2.4
reasoning | 1.0
writing | 1.3
coding | 1.3
stem | 2.1
roleplay | 1.73
math | 1.0
extraction | 2.0
turn 1 | 1.64
turn 2 | 1.09

- Each row in the CSV describes a score from 1 to 10 for a given evaluation category, where 1 signifies the weakest evaluation and 10 the strongest.
- The total row is the average score across all categories.
- The turn 1 and turn 2 rows are the average scores for the respective turns.
- For custom evaluations, the categories follow what is provided in the custom dataset.
- The Judge LLM must provide ratings in the format [[rating]]. A warning appears in the .csv file if the Judge LLM fails to generate a rating in the required format for one or more questions. In that case, adjust the inference parameters or use a different Judge LLM.
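You can load the aggregated scores programmatically. The following is a minimal sketch that assumes the CSV has the two columns shown in the table above; adjust the path and column handling to match your file.
import csv

# Read the aggregated judge scores; <llm_name> is the model under evaluation
with open("results/<llm_name>.csv") as f:
    rows = list(csv.reader(f))

header, data = rows[0], rows[1:]
for category, score in data:
    print(f"{category}: {float(score):.2f}")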
Custom Prompt for a Custom Dataset#
When you provide a custom dataset for judgment, you often want to give the judge additional guidance, for example, context about the question so that the judge can make a better decision.
The reference.jsonl file can be used for more than a reference answer, because the reference entry is inserted into each prompt.
Here are two use cases for judgment-only evaluation.
Use Case 1: Provide Background Knowledge to the Judge#
In this case, you provide background knowledge that the judge uses when judging the model answer.
In judge_prompt.jsonl, modify the prompt_template accordingly:
{"name": "single-ref-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. You will be provided some background knowledge in the context section and should check if the answer match the background knowledge. After providing your explanation, you must rate the response on a float scale of 0 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[3]]\".\n\n[Question]\n{question}\n\n[The Start of Context]\n{ref_answer_1}\n[The End of Context]\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": ["general"], "output_format": "[[rating]]"}
You can apply the same change to the multi-turn prompt_template as well.
In reference.jsonl, provide the background knowledge for each question-answer pair. For example:
{"question_id": "100", "choices": [{"index": 0, "turns": ["The user has been very sick recently"]}]}
This background knowledge is placed in {ref_answer_1}.
You can apply the same approach to the multi-turn reference as well.
Use Case 2: Provide Multiple Custom Prompts to the Judge#
You can expand use case 1 to create a highly customized prompt for each question by using reference.jsonl. Suppose the judge needs three types of custom prompt:
- Ground truth: the expert answer to the question. The judge compares the model answer with the ground truth to make a better judgment.
- Context: the background knowledge the judge needs for judging the model answer. See use case 1.
- Assertion: assertions for the judge model to verify whether the model answer covers certain aspects. For example, you might want the model answer to use respectful language or to explain with an example.
For each question-and-answer pair, you want the judge to consider all three types of context above, which requires a custom prompt for each model answer. This is feasible with minor tweaks.
Suppose the information is stored in a CSV file:
question_id, ground_truth, context, assertions
100, "First, you should go to see the doctor and have a thorough medical exam. Next, based on the doctor's suggestion, take medicines and rest.", "This user has been sick recently.", "Does the answer suggest go to see the doctor?"
After the data transformation, reference.jsonl looks like the following:
{"question_id": "100", "choices": [{"index": 0, "turns": ["Assertions [model answer should cover the following key points]: Does the answer suggest go to see the doctor?\n [An expert's answer as a reference]: First, you should go to see the doctor and have a thorough medical exam. Next, based on the doctor's suggestion, take medicines and rest. \n [Context about the user]: This user has been sick recently.", "null"]}]}
Update judge_prompt.jsonl as well:
{"name": "single-ref-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. You will be provided three types of context in the context section: ground truth, context and assertion. After analyzing all the context, you must rate the response on a float scale of 0 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[3]]\".\n\n[Question]\n{question}\n\n[The Start of Context]\n{ref_answer_1}\n[The End of Context]\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": ["general"], "output_format": "[[rating]]"}
In this way, you can customize the prompt with as much context or other information as the judge needs to know.
Retriever Pipeline Evaluation Results#
Retriever pipeline evaluations report the following metrics. An example response follows.
- recall@k: Recall at k is the fraction of the relevant documents that are successfully retrieved within the top k extracted documents. Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the retriever model.
- ndcg@k/ndcg_cut_k: Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval. It is often normalized so that it is comparable across queries, giving normalized DCG (nDCG or NDCG). Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the retriever model.
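For intuition, the following sketch shows one common way to compute these metrics; it is for illustration only and is not the evaluator's implementation.
from math import log2

def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of the relevant documents found in the top-k retrieved results."""
    return len(relevant & set(retrieved[:k])) / len(relevant)

def ndcg_at_k(relevance: list, k: int) -> float:
    """Normalized DCG over the graded relevance scores of a retrieved ranking."""
    dcg = sum(rel / log2(i + 2) for i, rel in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Example: 3 relevant documents, 5 retrieved documents, binary relevance per rank
print(recall_at_k({"d1", "d2", "d3"}, ["d2", "d9", "d1", "d7", "d5"], k=5))  # about 0.67
print(ndcg_at_k([1, 0, 1, 0, 0], k=5))  # about 0.92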
{
"created_at": "2025-03-29T07:16:54.298605",
"updated_at": "2025-03-29T07:16:54.298607",
"id": "evaluation_result-1234ABCD5678EFGH",
"job": "eval-UVW123XYZ456",
"tasks": {},
"groups": {
"evaluation": {
"metrics": {
"evaluation": {
"scores": {
"recall_10": {
"value": 0.5219448247226026
},
"ndcg_cut_5": {
"value": 0.43118519470524036
},
"ndcg_cut_10": {
"value": 0.4548908807830673
},
"recall_5": {
"value": 0.4455075402992067
}
}
}
}
}
},
"namespace": "default",
"custom_fields": {}
}
RAG Pipeline Evaluation Results#
Evaluation results are returned in the following format.
{
"created_at": "2025-03-29T07:16:54.298605",
"updated_at": "2025-03-29T07:16:54.298607",
"id": "evaluation_result-1234ABCD5678EFGH",
"job": "eval-UVW123XYZ456",
"tasks": {},
"groups": {
"evaluation": {
"metrics": {
"evaluation": {
"scores": {
"recall_10": {
"value": 0.5219448247226026
},
"ndcg_cut_5": {
"value": 0.43118519470524036
},
"ndcg_cut_10": {
"value": 0.4548908807830673
},
"recall_5": {
"value": 0.4455075402992067
},
"faithfulness": {
"value": 0.7811975946247776
}
}
}
}
}
},
"namespace": "default",
"custom_fields": {}
}
Document retrieval:
- recall@k: Recall at k is the fraction of the relevant documents that are successfully retrieved within the top k extracted documents. Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the Retriever.
- ndcg@k/ndcg_cut_k: Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval. It is often normalized so that it is comparable across queries, giving normalized DCG (nDCG or NDCG). Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the Retriever.

Answer generation:
- faithfulness: Measures the factual consistency of the generated answer against the provided context. The score ranges from 0 to 1, with higher values indicating greater accuracy. This metric uses a judge_llm. It can be used when dataset_format is set to beir or squad, or when dataset_format is set to ragas and the columns question, answer, and contexts are present in the data.
- answer_relevancy: Measures how relevant the generated answer is to the given prompt, evaluated using an LLM and an embedding model. Incomplete answers and answers that contain redundant information receive lower scores; higher scores indicate better relevancy. This metric uses a judge_llm and a judge_embeddings. It can be used when dataset_format is set to beir or squad, or when dataset_format is set to ragas and the columns question and answer are present in the data.
- answer_correctness: Accuracy of the generated answer when compared to the ground truth. The score ranges from 0 to 1, with higher values indicating better correctness. This metric uses a judge_llm and a judge_embeddings. It cannot be used when dataset_format is set to beir or squad. It can be used when dataset_format is set to ragas and the columns question, answer, and ground_truth are present in the data.
- answer_similarity: Semantic similarity between the generated answer and the ground truth. The score ranges from 0 to 1, with higher values indicating better alignment. This metric uses a judge_llm and a judge_embeddings. It cannot be used when dataset_format is set to beir or squad. It can be used when dataset_format is set to ragas and the columns ground_truth and answer are present in the data.
- context_precision: Measures whether ground-truth relevant items are ranked higher. The score ranges from 0 to 1, with higher values indicating better precision. This metric uses a judge_llm. It cannot be used when dataset_format is set to beir or squad. It can be used when dataset_format is set to ragas and the columns question, contexts, and ground_truth are present in the data.
- context_recall: Measures whether the retrieved context aligns with the ground_truth answer. The score ranges from 0 to 1, with higher values indicating better performance. This metric uses a judge_llm. It cannot be used when dataset_format is set to beir or squad. It can be used when dataset_format is set to ragas and the columns question, contexts, and ground_truth are present in the data.
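As a quick reference, the column requirements for the ragas dataset_format can be captured in a small helper. The following is a sketch for illustration only; the metric names and required columns mirror the list above.
# Required columns per metric when dataset_format is set to "ragas"
RAGAS_REQUIRED_COLUMNS = {
    "faithfulness": {"question", "answer", "contexts"},
    "answer_relevancy": {"question", "answer"},
    "answer_correctness": {"question", "answer", "ground_truth"},
    "answer_similarity": {"ground_truth", "answer"},
    "context_precision": {"question", "contexts", "ground_truth"},
    "context_recall": {"question", "contexts", "ground_truth"},
}

def usable_metrics(columns: set) -> list:
    """Return the metrics whose required columns are all present in the data."""
    return [name for name, required in RAGAS_REQUIRED_COLUMNS.items() if required <= columns]

print(usable_metrics({"question", "answer", "contexts"}))  # ['faithfulness', 'answer_relevancy']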
Visualize Evaluation Results with Weights and Biases or MLFlow#
You can use the Weights and Biases and MLFlow Python scripts available in NVIDIA NGC to upload evaluation results to supported visualization tools. Use the following procedure to get the scripts.
Use the following NGC CLI code to download the zip file that contains the scripts.
ngc registry resource download-version "nvidia/nemo-microservices/evaluator_results_scripts:0.1.0"
Unzip the script files.
cd evaluator_results_scripts_v0.1.0
unzip integrations.zip
Use the following procedure to upload evaluation results.
Download the evaluation job results by using the download-results Evaluator API endpoint. For details, see Download Evaluation Results.
Determine which data visualization tool you want to use, MLFlow or Weights and Biases, and verify that you have the MLFlow URI key or Weights and Biases API key.
Follow the documentation for the visualization tool found in the Weights and Biases README (./integrations/w_and_b/ReadME.md) or MLFlow README (./integrations/MLFlow/ReadME.md) to prepare environment variables and dependencies for the scripts.
Run the script by following the downloaded README. Ensure that the results path that you downloaded earlier and specify in the script ends at the folder that contains the JSON or CSV output file.
Example command for Weights and Biases:
python3 w_and_b_eval_integration.py --results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" --experiment_name="<EXPERIMENT_NAME>"
Example command for MLFlow:
python3 mlflow_eval_integration.py --results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" --mlflow_uri "<MLFLOW_URI>:<MLFLOW_PORT>" --experiment_name="<EXPERIMENT_NAME>"
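The NGC scripts handle the upload for you. For context, logging metrics to MLFlow from your own code looks roughly like the following sketch; it is not the NGC script, and the score values shown are the flattened group-level scores from the lm-harness results JSON earlier in this documentation.
import mlflow

# Point MLflow at your tracking server and choose an experiment
mlflow.set_tracking_uri("<MLFLOW_URI>:<MLFLOW_PORT>")
mlflow.set_experiment("<EXPERIMENT_NAME>")

# Log the flattened evaluation scores from the results JSON
scores = {"exact_match": 0.341925701288855, "exact_match_stderr": 0.0130317425417848}
with mlflow.start_run(run_name="nemo-evaluator-results"):
    mlflow.log_metrics(scores)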