Use the Results of Your Job#

After your NVIDIA NeMo Evaluator job completes, you can use the results.

Evaluator API URL#

To get the results of an evaluation job, send a GET request to the evaluation/jobs/<job_id>/results or evaluation/jobs/<job_id>/download-results API. The URL of the evaluator API depends on where you deploy evaluator and how you configure it. For more information, refer to NeMo Evaluator Deployment Guide.

The examples in this documentation specify {EVALUATOR_HOSTNAME} in the code. Store the evaluator hostname as follows so that you can use it in your code.

Important

Replace <your evaluator service endpoint> with your address, such as evaluator.internal.your-company.com, before you run this code.

export EVALUATOR_HOSTNAME="<your evaluator service endpoint>"
import requests

EVALUATOR_HOSTNAME = "<your evaluator service endpoint>" 

Get Evaluation Results#

To get evaluation results as a JSON response, send a GET request to the evaluation/jobs/<job_id>/results endpoint. You must provide the ID of the job as shown in the following code.

curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/results" \
  -H 'accept: application/json'
import requests

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/results"
response = requests.get(endpoint).json()
response

The following is an example response. Refer to the rest of this documentation for examples and reference information specific to your scenario.

{
    "created_at": "2025-03-19T22:53:43.619932",
    "updated_at": "2025-03-19T22:53:43.619934",
    "id": "evaluation_result-1234ABCD5678EFGH",
    "job": "eval-UVW123XYZ456",
    "tasks": {
        "exact_match": {
            "metrics": {
                "exact_match": {
                    "scores": {
                        "gsm8k-metric_ranking-1": {
                            "value": 0.0
                        },
                        "gsm8k-metric_ranking-3": {
                            "value": 0.8
                        }
                    }
                }
            }
        },
        "exact_match_stderr": {
            "metrics": {
                "exact_match_stderr": {
                    "scores": {
                        "gsm8k-metric_ranking-2": {
                            "value": 0.0
                        },
                        "gsm8k-metric_ranking-4": {
                            "value": 0.19999999999999998
                        }
                    }
                }
            }
        }
    },
    "groups": {
        "evaluation": {
            "metrics": {
                "evaluation": {
                    "scores": {
                        "exact_match": {
                            "value": 0.4
                        },
                        "exact_match_stderr": {
                            "value": 0.09999999999999999
                        }
                    }
                }
            }
        }
    },
    "namespace": "default",
    "custom_fields": {}
}
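The tasks and groups sections share the same nesting: metrics that contain named scores, each with a single value. The following is a minimal sketch that flattens the task-level scores of the response above, assuming the response variable from the Python example:

for task_name, task in response.get("tasks", {}).items():
    for metric_name, metric in task.get("metrics", {}).items():
        for score_name, score in metric.get("scores", {}).items():
            # Each score entry holds a single numeric "value".
            print(task_name, metric_name, score_name, score["value"])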

Download Evaluation Results#

To download the results of an evaluation job, send a GET request to the evaluation/jobs/<job_id>/download-results API. This downloads a zip archive that contains the configuration files, logs, and evaluation results for a specific evaluation job.

curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job_id>/download-results" \
-H 'accept: application/json' \
-o result.zip
import requests

url = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job_id>/download-results"

response = requests.get(url, headers={'accept': 'application/json'}, stream=True)

with open('result.zip', 'wb') as file:
    for chunk in response.iter_content():
        file.write(chunk)

print("Download completed.")

After the download completes, the results are available in the result.zip file. To unzip the result.zip file on Ubuntu, macOS, or Linux, run the following code.

unzip result.zip -d result

You can find the result files in the results/ folder. For example, if you run an lm-harness evaluation, the results are in automatic/lm_eval_harness/results.

The directory structure will look like this:

.
├── automatic
│   └── lm_eval_harness
│       ├── model_config_meta-llama-3_1-8b-instruct.yaml
│       ├── model_config_meta-llama-3_1-8b-instruct_inference_params.yaml
│       └── results
│           ├── README.md
│           ├── lm-harness-mmlu_str.json
│           ├── lm-harness.json
│           ├── lmharness_meta-llama-3_1-8b-instruct_aggregateresults-run.log
│           └── lmharness_meta-llama-3_1-8b-instruct_mmlu_str-run.log
└── metadata.json
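To inspect the aggregated results programmatically, you can load one of the JSON files from the unzipped directory. The following is a minimal sketch that assumes the archive was unzipped to result as shown above; the exact file names depend on your evaluation configuration.

import json

# Load the aggregated lm-harness results from the unzipped directory.
with open("result/automatic/lm_eval_harness/results/lm-harness.json") as f:
    lm_harness_results = json.load(f)

print(json.dumps(lm_harness_results, indent=2))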

Big Code Evaluation Results#

Results are returned at the evaluation and task level. pass@k is a popular metric for evaluating functional correctness.
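For reference, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval. The following is a minimal sketch in case you want to sanity-check a score from raw sample counts; the sample numbers are illustrative, and the Evaluator computes this for you.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n generated samples with c correct samples."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=32, k=1))  # 0.16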

Evaluation results are returned in the following format.

{
  "created_at": "2025-03-21T16:12:16.938210",
  "updated_at": "2025-03-21T16:12:16.938211",
  "id": "evaluation_result-1234ABCD5678EFGH",
  "job": "eval-UVW123XYZ456",
  "tasks": {
    "pass@1": {
      "metrics": {
        "pass@1": {
          "scores": {
            "humaneval": {
              "value": 0.159756097560976
            }
          }
        }
      }
    },
    "pass@1_stderr": {
      "metrics": {
        "pass@1_stderr": {
          "scores": {
            "humaneval-metric_ranking-1": {
              "value": 0.0128023429085295
            }
          }
        }
      }
    }
  },
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "pass@1": {
              "value": 0.159756097560976
            },
            "pass@1_stderr": {
              "value": 0.0128023429085295
            }
          }
        }
      }
    }
  },
  "namespace": "default",
  "custom_fields": {

  }
}

LM Evaluation Harness Evaluation Results#

Results are returned at the evaluation and task level. Evaluation results are returned in the following format.

{
  "created_at": "2025-03-19T21:12:58.789224",
  "updated_at": "2025-03-19T21:12:58.789226",
  "id": "evaluation_result-1234ABCD5678EFGH",
  "job": "eval-UVW123XYZ456",
  "tasks": {
    "exact_match": {
      "metrics": {
        "exact_match": {
          "scores": {
            "gsm8k_cot_llama-metric_ranking-1": {
              "value": 0.309325246398787
            },
            "gsm8k_cot_llama-metric_ranking-3": {
              "value": 0.374526156178923
            }
          }
        }
      }
    },
    "exact_match_stderr": {
      "metrics": {
        "exact_match_stderr": {
          "scores": {
            "gsm8k_cot_llama-metric_ranking-2": {
              "value": 0.0127317109250781
            },
            "gsm8k_cot_llama-metric_ranking-4": {
              "value": 0.0133317741584914
            }
          }
        }
      }
    }
  },
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "exact_match": {
              "value": 0.341925701288855
            },
            "exact_match_stderr": {
              "value": 0.0130317425417848
            }
          }
        }
      }
    }
  },
  "namespace": "default",
  "custom_fields": {

  }
}

Similarity Metrics Evaluation Results#

The NeMo Evaluator job returns aggregated evaluation results for each of the scorers (metrics) that you specified in the configuration.

Evaluation results are returned in the following format for each scorer.

{
  "created_at": "2025-03-05T17:03:01.643861",
  "updated_at": "2025-03-05T17:03:01.643862",
  "id": "evaluation_result-1234ABCD5678EFGH",
  "job": "eval-UVW123XYZ456",
  "tasks": {

  },
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "accuracy": {
              "value": 0.0444444444444444
            },
            "bleu_score": {
              "value": 0.0813085085759745
            },
            "rouge_1_score": {
              "value": 0.277633859731954
            },
            "rouge_2_score-metric_ranking-1": {
              "value": 0.139289906138245
            },
            "rouge_3_score-metric_ranking-2": {
              "value": 0.0591258646114323
            },
            "rouge_L_score-metric_ranking-3": {
              "value": 0.272577935264265
            }
          }
        }
      }
    }
  },
  "namespace": "default",
  "custom_fields": {

  }
}

LLM-as-a-Judge Evaluation Results#

The following Python script can be used to download the generated results:

import huggingface_hub as hh

url = "<NeMo Data Store URL>"
token = "mock"
repo_name = "<evaluation id>"
download_path = "<Path where results will be downloaded>"

repo_name = f'nvidia/{repo_name}'

# Download the evaluation results repository from the NeMo Data Store.
api = hh.HfApi(endpoint=url, token=token)
repo_type = 'dataset'
api.snapshot_download(repo_id=repo_name, repo_type=repo_type, local_dir=download_path, local_dir_use_symlinks=False)

The downloaded results directory will have the following structure:

|-- mt_bench
|   |-- model_answer
|   |   |-- <llm_name>.jsonl
|   |-- model_judgement
|   |   |-- <llm_name>.jsonl
|   |-- reference_answer
|   |   |-- <reference>.jsonl
|   |-- question.jsonl
|   |-- judge_prompts.jsonl
|-- results
|   |-- <llm_name>.csv

  • User LLM answers: mt_bench/model_answer/<llm_name>.jsonl file with User LLM responses for each prompt in the evaluation dataset

  • Judge LLM responses: mt_bench/model_judgment/<llm_name>.jsonl file containing the Judge LLM ratings, with explanation, for each User LLM answer

  • Aggregated scores: Aggregated scores are returned as a .csv file with the following structure (see the sketch after this list for loading it programmatically):

    | Category   | Score out of 10 |
    |------------|-----------------|
    | total      | 1.57            |
    | humanities | 2.4             |
    | reasoning  | 1.0             |
    | writing    | 1.3             |
    | coding     | 1.3             |
    | stem       | 2.1             |
    | roleplay   | 1.73            |
    | math       | 1.0             |
    | extraction | 2.0             |
    | turn 1     | 1.64            |
    | turn 2     | 1.09            |

  • Each row in the .csv describes a score from 1 to 10 for a given evaluation category, where 1 signifies the weakest evaluation and 10 the strongest

  • The total refers to the average score across all categories

  • The turn 1 and turn 2 scores are average scores for the respective turns

  • For custom evaluations, the categories will follow what’s provided in the custom dataset

  • The Judge LLM must provide ratings in the specific format [[rating]]. A warning appears in the .csv file if the Judge LLM failed to generate a rating in the required format for one or more questions. In this case, adjust the inference parameters or use a different Judge LLM.
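The following is a minimal sketch for loading the aggregated scores into a dictionary, assuming the .csv file was downloaded to results/<llm_name>.csv and has the two columns shown above.

import csv

# Replace <llm_name> with the name of your evaluated model.
with open("results/<llm_name>.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    scores = {category: float(score) for category, score in reader}

print(scores.get("total"))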

Custom prompt for custom dataset#

When a custom dataset is provided for judgment, users often want to give the judge more guidance, for example, context about the question so that the judge can make a better decision.

Because the reference entry is inserted into each prompt, reference.jsonl can be used for more than a reference answer.

Here are two use cases for judgment-only evaluation.

Use case 1: Provide background knowledge to the judge#

In this case, users want to provide background knowledge that the judge uses when evaluating the model answer.

In judge_prompt.jsonl, we need to modify the prompt_template accordingly:

{"name": "single-ref-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. You will be provided some background knowledge in the context section and should check if the answer match the background knowledge. After providing your explanation, you must rate the response on a float scale of 0 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[3]]\".\n\n[Question]\n{question}\n\n[The Start of Context]\n{ref_answer_1}\n[The End of Context]\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": ["general"], "output_format": "[[rating]]"}

The same approach applies to the multi-turn prompt_template as well.

In reference.jsonl, we can provide background knowledge for each question-answer pair. Example:

{"question_id": "100", "choices": [{"index": 0, "turns": ["The user has been very sick recently"]}]}

This background knowledge will be placed in {ref_answer_1}.

The same approach applies to multi-turn references as well.

Use case 2: Provide multiple custom prompts to the judge#

We can expand use case 1 to create a highly customized prompt for each question by utilizing reference.jsonl. Suppose the judge needs three types of custom prompt:

  1. ground truth: the expert answer to the question. The judge compares the model answer with the ground truth to make a better judgment.

  2. context: the background knowledge that the judge needs when judging the model answer (see use case 1).

  3. assertion: assertions for the judge model to verify whether the model answer covers certain aspects. For example, users might want the model answer to use respectful language or to explain with an example.

For each question-answer pair, we want the judge to consider all three types of context above, so we create a custom prompt for each model answer. This is feasible with minor tweaks.

Suppose we have stored the information in a csv file:

question_id, ground_truth, context, assertions
100, "First, you should go to see the doctor and have a thorough medical exam. Next, based on the doctor's suggestion, take medicines and rest.", "This user has been sick recently.", "Does the answer suggest go to see the doctor?"

After this data transformation, we have the following reference.jsonl:

{"question_id": "100", "choices": [{"index": 0, "turns": ["Assertions [model answer should cover the following key points]: Does the answer suggest go to see the doctor?\n [An expert's answer as a reference]: First, you should go to see the doctor and have a thorough medical exam. Next, based on the doctor's suggestion, take medicines and rest. \n [Context about the user]: This user has been sick recently.", "null"]}]}

We should also update judge_prompt.jsonl slightly:

{"name": "single-ref-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. You will be provided three types of context in the context section: ground truth, context and assertion. After analyzing all the context, you must rate the response on a float scale of 0 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[3]]\".\n\n[Question]\n{question}\n\n[The Start of Context]\n{ref_answer_1}\n[The End of Context]\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": ["general"], "output_format": "[[rating]]"}

In this way, we can customize the prompt with as much context and other information as the judge needs.

Retriever Pipeline Evaluation Results#

Evaluation results are returned in the following format.

  1. recall@k: Recall at k is calculated as the fraction of the relevant documents that are successfully retrieved within the top k extracted documents. Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the retriever model.

  2. ndcg@k/ndcg_cut_k: Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval. It is often normalized so that it is comparable across queries, giving normalized DCG (nDCG or NDCG). Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the retriever model. A reference sketch for computing both metrics offline follows the example response below.

{
    "created_at": "2025-03-29T07:16:54.298605",
    "updated_at": "2025-03-29T07:16:54.298607",
    "id": "evaluation_result-1234ABCD5678EFGH",
    "job": "eval-UVW123XYZ456",
    "tasks": {},
    "groups": {
        "evaluation": {
            "metrics": {
                "evaluation": {
                    "scores": {
                        "recall_10": {
                            "value": 0.5219448247226026
                        },
                        "ndcg_cut_5": {
                            "value": 0.43118519470524036
                        },
                        "ndcg_cut_10": {
                            "value": 0.4548908807830673
                        },
                        "recall_5": {
                            "value": 0.4455075402992067
                        }
                    }
                }
            }
        }
    },
    "namespace": "default",
    "custom_fields": {}
}
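If you want to sanity-check these numbers offline, recall@k and a binary-relevance nDCG@k can be computed from a ranked list of retrieved document IDs and the set of relevant IDs. The following is a reference sketch; the Evaluator computes these for you, and the job reports values aggregated over queries rather than per-query scores.

import math

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents found in the top k retrieved documents."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(retrieved[:k]) if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0.0 else 0.0

print(recall_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3))  # 0.5
print(ndcg_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3))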

RAG Pipeline Evaluation Results#

Evaluation results are returned in the following format.

{
    "created_at": "2025-03-29T07:16:54.298605",
    "updated_at": "2025-03-29T07:16:54.298607",
    "id": "evaluation_result-1234ABCD5678EFGH",
    "job": "eval-UVW123XYZ456",
    "tasks": {},
    "groups": {
        "evaluation": {
            "metrics": {
                "evaluation": {
                    "scores": {
                        "recall_10": {
                            "value": 0.5219448247226026
                        },
                        "ndcg_cut_5": {
                            "value": 0.43118519470524036
                        },
                        "ndcg_cut_10": {
                            "value": 0.4548908807830673
                        },
                        "recall_5": {
                            "value": 0.4455075402992067
                        },
                        "faithfulness": {
                            "value": 0.7811975946247776
                        }
                    }
                }
            }
        }
    },
    "namespace": "default",
    "custom_fields": {}
}

Document retrieval:

  • recall@k: Recall at k is calculated as the fraction of the relevant documents that are successfully retrieved within the top k extracted documents. Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the Retriever.

  • ndcg@k/ndcg_cut_k: Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval. It is often normalized so that it is comparable across queries, giving normalized DCG (nDCG or NDCG). Higher values indicate better performance. k is set by the user, with acceptable values ranging from 1 to the top_k value of the Retriever.

Answer generation:

  • faithfulness: Measures the factual consistency of the generated answer against the provided context. The score ranges from 0 to 1, with higher values indicating greater accuracy. This metric uses a judge_llm. This metric can be used when dataset_format is set to beir or squad, or when dataset_format is set to ragas and the columns question, answer, and contexts are present in the data.

  • answer_relevancy: Measures how relevant the generated answer is to the given prompt, evaluated using an LLM and an embedding model. Incomplete answers or answers with redundant information receive lower scores; higher scores indicate better relevancy. This metric uses a judge_llm and a judge_embeddings. This metric can be used when dataset_format is set to beir or squad, or when dataset_format is set to ragas and the columns question and answer are present in the data.

  • answer_correctness: Accuracy of the generated answer when compared to the ground truth. The score ranges from 0 to 1, with higher values indicating better correctness. This metric uses a judge_llm and a judge_embeddings. This metric cannot be used when dataset_format is set to beir or squad; it can be used when dataset_format is set to ragas and the columns question, answer, and ground_truth are present in the data.

  • answer_similarity: Semantic similarity between the generated answer and the ground truth. The score ranges from 0 to 1, with higher values indicating better alignment. This metric uses a judge_llm and a judge_embeddings. This metric cannot be used when dataset_format is set to beir or squad; it can be used when dataset_format is set to ragas and the columns ground_truth and answer are present in the data.

  • context_precision: Measures whether ground-truth relevant items are ranked higher in the retrieved context. The score ranges from 0 to 1, with higher values indicating better precision. This metric uses a judge_llm. This metric cannot be used when dataset_format is set to beir or squad; it can be used when dataset_format is set to ragas and the columns question, contexts, and ground_truth are present in the data.

  • context_recall: Measures whether the retrieved context aligns with the ground_truth answer. The score ranges from 0 to 1, with higher values indicating better performance. This metric uses a judge_llm. This metric cannot be used when dataset_format is set to beir or squad; it can be used when dataset_format is set to ragas and the columns question, contexts, and ground_truth are present in the data (see the example record after this list).
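For illustration, the following is a hypothetical ragas-format record that contains all four columns and therefore supports every answer-generation metric listed above; the field values are made up.

# Hypothetical ragas-format record. With question, contexts, answer, and ground_truth
# present, all of the answer-generation metrics above can be computed.
record = {
    "question": "What does the RAG pipeline evaluation report?",
    "contexts": ["The evaluation reports aggregated retrieval and answer generation scores."],
    "answer": "It reports aggregated retrieval and answer generation scores.",
    "ground_truth": "Aggregated retrieval and answer generation scores.",
}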

Visualize Evaluation Results with Weights and Biases or MLFlow#

You can use the Weights and Biases and MLFlow Python scripts located in NVIDIA NGC to upload evaluation results to supported visualization tools. Use the following procedure to get the scripts.

  1. Use the following NGC CLI code to download the zip file that contains the scripts.

    ngc registry resource download-version "nvidia/nemo-microservices/evaluator_results_scripts:0.1.0"
    
  2. Unzip the script files.

    cd evaluator_results_scripts_v0.1.0
    unzip integrations.zip
    

Use the following procedure to upload evaluation results.

  1. Download the evaluation job results by using the download-results Evaluator API endpoint. For details, see Download Evaluation Results.

  2. Determine which data visualization tool you want to use, MLFlow or Weights and Biases, and verify that you have the MLFlow URI key or Weights and Biases API key.

  3. Follow the documentation for the visualization tool found in the Weights and Biases README (./integrations/w_and_b/ReadME.md) or MLFlow README (./integrations/MLFlow/ReadME.md) to prepare environment variables and dependencies for the scripts.

  4. Run the script by following the downloaded README. Ensure that the path to the results that you downloaded earlier, as specified in the script, ends at the folder that contains the JSON or CSV output files.

    • Example command for Weights and Biases: python3 w_and_b_eval_integration.py --results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" --experiment_name="<EXPERIMENT_NAME>"

    • Example command for MLFlow: python3 mlflow_eval_integration.py --results_abs_dir "<ABSOLUTE_PATH_TO_DOWNLOADED_RESULTS>/bigcode_latest/automatic/bigcode_latest/results/" --mlflow_uri "<MLFLOW_URI>:<MLFLOW_PORT>" --experiment_name="<EXPERIMENT_NAME>"