Run and Manage Evaluation Jobs#
After you create an evaluation target and an evaluation configuration, you are ready to run an evaluation job in NVIDIA NeMo Evaluator.
Evaluation Job Workflow#
Use the following procedure to run an evaluation in NeMo Evaluator.
1. (Optional) If you are using a custom dataset for evaluation, upload it to NeMo Data Store. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.
2. Prepare a target for your evaluation. You can create a new target or reuse one that you prepared previously. Record the ID of the target that you want to use for your evaluation. For more information, refer to Create and Manage Evaluation Targets.
3. Prepare a configuration for your evaluation. You can create a new configuration or reuse one that you prepared previously. Record the ID of the configuration that you want to use for your evaluation. For more information, refer to Create and Manage Evaluation Configurations.
4. Run your evaluation by creating a job that includes the target ID and configuration ID from the previous steps. For more information, refer to Example Evaluation Job.
5. Get your results. For more information, refer to Use the Results of Your Job.
Tip
To see what targets and configurations are supported together, refer to Combine Evaluation Targets and Configurations.
Evaluator API URL#
To create an evaluation job, send a POST request to the evaluation/jobs API.
The URL of the Evaluator API depends on where you deploy Evaluator and how you configure it. For more information, refer to NeMo Evaluator Deployment Guide.
The examples in this documentation specify {EVALUATOR_HOSTNAME} in the code. Do the following to store the evaluator hostname so that you can use it in your code.
Important
Replace <your evaluator service endpoint> with your address, such as evaluator.internal.your-company.com, before you run this code.
export EVALUATOR_HOSTNAME="<your evaluator service endpoint>"
import requests
EVALUATOR_HOSTNAME = "<your evaluator service endpoint>"
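To confirm that the hostname is set correctly, you can send a request to the jobs endpoint and check the HTTP status code. The following is a minimal sketch that reuses the EVALUATOR_HOSTNAME variable from the setup above and assumes the service is reachable over plain HTTP, as in the other examples in this documentation.
import requests
# Quick connectivity check: request the evaluation jobs endpoint and print the HTTP status code.
response = requests.get(f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs")
print(f"Evaluator API responded with HTTP {response.status_code}")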
Example Evaluation Job#
Use the following code to create an evaluation job.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"namespace": "my-organization",
"target": "<my-target-namespace/my-target-name>",
"config": "<my-config-namespace/my-config-name>"
}'
data = {
"namespace": "my-organization",
"target": "<my-target-namespace/my-target-name>",
"config": "<my-config-namespace/my-config-name>"
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"
# Make the API call
response = requests.post(endpoint, json=data).json()
# Get the job_id so we can refer to it later
job_id = response['id']
print(f"Job ID: {job_id}")
# Get the status. You should see `CREATED`, `PENDING`, or `RUNNING`.
job_status = response['status']
print(f"Job status: {job_status}")
To see a sample response, refer to Create Job Response.
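If you want to wait for the job to finish before you continue, you can poll the job until it reaches a terminal status. The following is a minimal sketch, not part of the API itself: it reuses the EVALUATOR_HOSTNAME and job_id variables from the previous example, uses the job details endpoint described in Get the Details of an Evaluation Job, and treats the terminal status values listed in that section as the stop condition. The sample responses in this documentation show lowercase status values, so the check lowercases the value before comparing, and the polling interval is an arbitrary choice.
import time
# Poll the job details until the job reaches a terminal state.
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/{job_id}"
while True:
    job = requests.get(endpoint).json()
    status = job["status"].lower()
    print(f"Job status: {status}")
    if status in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)  # wait before polling again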
Job JSON Reference#
When you create a job for an evaluation, you send a JSON data structure that contains the information for your configuration.
Important
Each job is uniquely identified by a job_id that the Evaluator service creates, for example eval-1234ABCD5678EFGH.
The following table contains selected field reference for the JSON data. For the full API reference, refer to Evaluator API.
| Name | Description | Type |
|---|---|---|
| config | The unique namespace/name of the evaluation configuration to use for the job. | String |
| id | The ID of the job. The ID is returned in the response when you create a job. | String |
| namespace | An arbitrary organization name, a vendor name, or any other text. If you don't specify a namespace, the default namespace is used. | String |
| target | The unique namespace/name of the evaluation target to use for the job. | String |
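For reference, the following sketch shows the same request body used in Example Evaluation Job, with a comment for each field described in the table above.
# Request body for creating an evaluation job (POST /v1/evaluation/jobs).
data = {
    "namespace": "my-organization",                    # arbitrary grouping text; a default is used if omitted
    "target": "<my-target-namespace/my-target-name>",  # the evaluation target to run against
    "config": "<my-config-namespace/my-config-name>",  # the evaluation configuration to apply
}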
Combine Evaluation Targets and Configurations#
Because NeMo Evaluator separates the target and the configuration, you can create a configuration once, and reuse it multiple times with different targets. You can also create a target once, and reuse it multiple times with different evaluations. The following table gives examples of how to combine targets and configurations for an evaluation job.
| Evaluation Type | Example Configurations | Example Targets | Custom Data Options |
|---|---|---|---|
| BigCode Evaluation Harness | Config | — | |
| LM Evaluation Harness | Config | — | |
| Similarity Metrics | Config | | |
| LLM As A Judge | — | | |
| Retriever Pipeline | | | |
| RAG Pipeline | Config (Standard Data) | Target (Embedding only) | |
Caution
Before you can run a retriever evaluation, or a RAG evaluation that has a retriever step, you must set up a Milvus document store. For more information, refer to Configure Milvus.
Expected Evaluation Duration#
The time that an evaluation job takes can vary from a few minutes to many hours, depending on the target model, the configuration, and other factors. The following table contains some example evaluation durations.
Important
These are only a few examples; actual durations depend on your models, hardware, and dataset sizes.
| Example Evaluation | Example Models | Example Hardware | Example Dataset | Example Expected Time for Evaluation |
|---|---|---|---|---|
| LM Eval Harness (gsm8k task) | Inference: meta/llama-3.1-8b-instruct | 1 A100 | (academic) gsm8k dataset | 5 - 10 Hours |
| Bigcode | Inference: meta/llama-3.1-8b-instruct | 1 A100 | (academic) humaneval dataset | 1 - 5 Hours |
| Similarity metrics | Offline generated answers | — | 20 answers | Minutes |
| Similarity metrics | Inference: meta/llama-3.1-8b-instruct | 1 A100 | 113 questions / prompts | Minutes - 1 Hour |
| LLM-as-a-Judge | Inference: meta/llama-3.1-8b-instruct and judge: meta/llama-3.1-70b-instruct | 5 A100s | (academic) mtbench dataset | 1 - 5 Hours |
| LLM-as-a-Judge (judgement only with custom dataset) | Judge: meta/llama-3.1-70b-instruct | 4 A100s | 2 answers | Minutes |
| Retriever Evaluation (embedding only) | Embedding: nvidia/nv-embedqa-e5-v5 | 1 A100 | (academic) fiqa dataset | Minutes |
| Retriever Evaluation (embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5 and reranking: nvidia/nv-rerankqa-mistral-4b-v3 | 2 A100s | (academic) fiqa dataset | Minutes |
| RAG (with pre-retrieved contexts) | Inference: meta/llama-3.1-70b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 9 A100s | 3 questions | Minutes |
| RAG (with pre-generated answers) | Judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 5 A100s | 322 questions | Minutes - 1 Hour |
| RAG (retriever with embedding only) | Embedding: nvidia/nv-embedqa-e5-v5, inference: meta/llama-3.1-8b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 7 A100s | (academic) fiqa dataset | 1 - 5 Hours |
| RAG (retriever with embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5, reranking: nvidia/nv-rerankqa-mistral-4b-v3, inference: meta/llama-3.1-70b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 11 A100s | (academic) fiqa dataset | 1 - 5 Hours |
Common Job Tasks#
The following sections demonstrate common tasks you might want to do with your Evaluation jobs.
Tip
These APIs support filtering, sorting, and pagination. For details, refer to Filter and Sort Responses from the NVIDIA NeMo Evaluator API.
List all Evaluation Jobs#
To list all evaluation jobs, send a GET request to the jobs endpoint, as shown in the following code.
After you submit the request, a list of all the evaluation jobs is returned, with details for each job.
To filter the jobs that the API returns, refer to Filter Jobs.
curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs" \
-H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"
response = requests.get(endpoint).json()
response
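To work with the list programmatically, you can iterate over the returned jobs. The following minimal sketch assumes the list response wraps the jobs in a data field; adjust the key if your response is structured differently.
# Print the ID and status of each job in the list.
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"
jobs = requests.get(endpoint).json()
for job in jobs.get("data", []):  # assumption: jobs are returned under a "data" key
    print(job["id"], job["status"])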
Get the Status of an Evaluation Job#
To get the status of an evaluation job, send a GET request to the status endpoint for the job, as shown in the following code.
curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/status" \
-H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/status"
response = requests.get(endpoint).json()
# Get the status.
job_status = response['status']
print(f"Job status: {job_status}")
The response is similar to the following.
{
"message": "Job completed successfully",
"task_status": {},
"progress": null
}
The message is one of the following:
- Job is pending
- Job is running
- Job completed successfully
- Unable to launch the evaluation job because there is a problem in the target, config, or environment
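If you automate status checks, you may want to detect launch failures early. The following minimal sketch inspects the message field returned by the status endpoint shown above, using the message values listed in this section, and prints task_status for troubleshooting; replace <job-id> with your job ID.
# Inspect the status response for launch problems.
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/status"
status_info = requests.get(endpoint).json()
message = status_info.get("message") or ""
if message.startswith("Unable to launch"):
    # The target, config, or environment has a problem; task_status may contain details.
    print(f"Launch failed: {message}")
    print(status_info.get("task_status"))
else:
    print(f"Status message: {message}")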
Get the Details of an Evaluation Job#
To get the full details of an evaluation job, send a GET request to the jobs endpoint, as shown in the following code.
curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>" \
-H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>"
response = requests.get(endpoint).json()
# Get the status.
job_status = response['status']
print(f"Job status: {job_status}")
The response is similar to the following.
{
"created_at": "2025-03-19T22:50:15.684382",
"updated_at": "2025-03-19T22:50:15.684385",
"id": "eval-UVW123XYZ456",
"namespace": "my-organization",
"description": null,
"target": {
//target details
},
"config": {
// config details
},
"result": "evaluation_result-1234ABCD5678EFGH",
"output_files_url": "hf://datasets/evaluation-results/eval-UVW123XYZ456",
"status_details": {
"message": "Job completed successfully",
"task_status": {},
"progress": null
},
"status": "completed",
"project": null,
"custom_fields": {},
"ownership": null
}
The status field is one of the following:
- CREATED — The job is created, but not yet scheduled.
- PENDING — The job is waiting for resource allocation.
- RUNNING — The job is currently running.
- COMPLETED — The job has completed successfully.
- CANCELLED — The job has been cancelled by the user.
- FAILED — The job failed to run and terminated.
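After the status reaches completed, the job details include the result ID and the location of the output files. The following minimal sketch reads those fields from the details response shown above; replace <job-id> with your job ID. Downloading and using the results is covered in Use the Results of Your Job.
# Read the result ID and output location from a completed job's details.
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>"
job = requests.get(endpoint).json()
if job["status"] == "completed":
    print(f"Result ID: {job['result']}")               # for example, evaluation_result-1234ABCD5678EFGH
    print(f"Output files: {job['output_files_url']}")  # for example, hf://datasets/evaluation-results/<job-id>
else:
    print(f"Job not finished yet, status: {job['status']}")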
Delete an Evaluation Job#
To delete an evaluation job, send a DELETE request to the jobs endpoint. You must provide the ID of the job, as shown in the following code.
curl -X "DELETE" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>" \
-H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>"
response = requests.delete(endpoint).json()
response
Create Job Response#
When you create an evaluation job, the response is similar to the following. This is also the structure of the response for each job when you list all jobs or get the details for a single job.
For the full response reference, refer to Evaluator API.
{
"created_at": "2025-03-19T22:50:15.684382",
"updated_at": "2025-03-19T22:50:15.684385",
"id": "eval-UVW123XYZ456",
"namespace": "my-organization",
"description": null,
"target": {
//target details
},
"config": {
// config details
},
"result": null,
"output_files_url": null,
"status_details": {
"message": null,
"task_status": {},
"progress": null
},
"status": "created",
"project": null,
"custom_fields": {},
"ownership": null
}