Run and Manage Evaluation Jobs#
After you create an evaluation target and an evaluation configuration, you are ready to run an evaluation job in NVIDIA NeMo Evaluator.
Evaluation Job Workflow#
Use the following procedure to run an evaluation in NeMo Evaluator.
1. (Optional) If you are using a custom dataset for evaluation, upload it to NeMo Data Store. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.
2. Prepare a target for your evaluation. You can create a new target or reuse one that you prepared previously. Record the ID of the target that you want to use for your evaluation. For more information, refer to Create and Manage Evaluation Targets.
3. Prepare a configuration for your evaluation. You can create a new configuration or reuse one that you prepared previously. Record the ID of the configuration that you want to use for your evaluation. For more information, refer to Create and Manage Evaluation Configurations.
4. Run your evaluation by creating a job that includes the target ID and configuration ID from the previous steps. For more information, refer to Example Evaluation Job.
5. Get your results. For more information, refer to Use the Results of Your Job.
Tip
To see what targets and configurations are supported together, refer to Combine Evaluation Targets and Configurations.
Evaluator API URL#
To create an evaluation job, send a POST request to the evaluation/jobs API.
The URL of the Evaluator API depends on where you deploy Evaluator and how you configure it. For more information, refer to NeMo Evaluator Deployment Guide.
The examples in this documentation specify {EVALUATOR_HOSTNAME} in the code. Do the following to store the evaluator hostname so that you can use it in your code.
Important
Replace <your evaluator service endpoint> with your address, such as evaluator.internal.your-company.com, before you run this code.
export EVALUATOR_HOSTNAME="<your evaluator service endpoint>"
import requests
EVALUATOR_HOSTNAME = "<your evaluator service endpoint>"
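To confirm that the hostname is set correctly, you can send a request to the jobs endpoint and check the HTTP status code. The following is a minimal sketch that reuses the EVALUATOR_HOSTNAME variable from the setup above and assumes the service is reachable over plain HTTP, as in the other examples in this documentation.
import requests
# Quick connectivity check: request the evaluation jobs endpoint and print the HTTP status code.
response = requests.get(f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs")
print(f"Evaluator API responded with HTTP {response.status_code}")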
Example Evaluation Job#
Use the following code to create an evaluation job.
curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"namespace": "my-organization",
"target": "<my-target-namespace/my-target-name>",
"config": "<my-config-namespace/my-config-name>"
}'
data = {
"namespace": "my-organization",
"target": "<my-target-namespace/my-target-name>",
"config": "<my-config-namespace/my-config-name>"
}
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"
# Make the API call
response = requests.post(endpoint, json=data).json()
# Get the job_id so we can refer to it later
job_id = response['id']
print(f"Job ID: {job_id}")
# Get the status. You should see `CREATED`, `PENDING`, or `RUNNING`.
job_status = response['status']
print(f"Job status: {job_status}")
To see a sample response, refer to Create Job Response.
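If you want to wait for the job to finish before you continue, you can poll the job until it reaches a terminal status. The following is a minimal sketch, not part of the API itself: it reuses the EVALUATOR_HOSTNAME and job_id variables from the previous example, uses the job details endpoint described in Get the Details of an Evaluation Job, and treats the terminal status values listed in that section as the stop condition. The sample responses in this documentation show lowercase status values, so the check lowercases the value before comparing, and the polling interval is an arbitrary choice.
import time
# Poll the job details until the job reaches a terminal state.
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/{job_id}"
while True:
    job = requests.get(endpoint).json()
    status = job["status"].lower()
    print(f"Job status: {status}")
    if status in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)  # wait before polling again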
Job JSON Reference#
When you create a job for an evaluation, you send a JSON data structure that contains the information for your configuration.
Important
Each job is uniquely identified by a job_id that the Evaluator service creates, for example eval-1234ABCD5678EFGH.
The following table contains selected field reference for the JSON data. For the full API reference, refer to Evaluator API.
| Name | Description | Type |
|---|---|---|
| config | The unique namespace/name of the evaluation configuration to use for the job. | String |
| id | The ID of the job. The ID is returned in the response when you create a job. | String |
| namespace | An arbitrary organization name, a vendor name, or any other text. If you don't specify a namespace, the default namespace is used. | String |
| target | The unique namespace/name of the evaluation target to use for the job. | String |
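For reference, the following sketch shows the same request body used in Example Evaluation Job, with a comment for each field described in the table above.
# Request body for creating an evaluation job (POST /v1/evaluation/jobs).
data = {
    "namespace": "my-organization",                    # arbitrary grouping text; a default is used if omitted
    "target": "<my-target-namespace/my-target-name>",  # the evaluation target to run against
    "config": "<my-config-namespace/my-config-name>",  # the evaluation configuration to apply
}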
Combine Evaluation Targets and Configurations#
Because NeMo Evaluator separates the target and the configuration, you can create a configuration once, and reuse it multiple times with different targets. You can also create a target once, and reuse it multiple times with different evaluations. The following table gives examples of how to combine targets and configurations for an evaluation job.
| Evaluation Type | Example Configurations | Example Targets | Custom Data Options |
|---|---|---|---|
| BigCode Evaluation Harness | Config | — | |
| LM Evaluation Harness | Config | — | |
| Similarity Metrics | Config | | |
| LLM As A Judge | — | | |
| Retriever Pipeline | | | |
| RAG Pipeline | Config (Standard Data) | Target (Embedding only) | |
Caution
Before you can run a retriever evaluation, or a RAG evaluation that has a retriever step, you must set up a Milvus document store. For more information, refer to Configure Milvus.
Expected Evaluation Duration#
The time that an evaluation job takes can vary from a few minutes to many hours, depending on the target model, the configuration, and other factors. The following table contains some example evaluation durations.
Important
These are only a few examples; actual durations depend on your models, hardware, and dataset sizes.
| Example Evaluation | Example Models | Example Hardware | Example Dataset | Example Expected Time for Evaluation |
|---|---|---|---|---|
| LM Eval Harness (gsm8k task) | Inference: meta/llama-3.1-8b-instruct | 1 A100 | (academic) gsm8k dataset | 5 - 10 Hours |
| Bigcode | Inference: meta/llama-3.1-8b-instruct | 1 A100 | (academic) humaneval dataset | 1 - 5 Hours |
| Similarity metrics | Offline generated answers | — | 20 answers | Minutes |
| Similarity metrics | Inference: meta/llama-3.1-8b-instruct | 1 A100 | 113 questions / prompts | Minutes - 1 Hour |
| LLM-as-a-Judge | Inference: meta/llama-3.1-8b-instruct and judge: meta/llama-3.1-70b-instruct | 5 A100s | (academic) mtbench dataset | 1 - 5 Hours |
| LLM-as-a-Judge (judgement only with custom dataset) | Judge: meta/llama-3.1-70b-instruct | 4 A100s | 2 answers | Minutes |
| Retriever Evaluation (embedding only) | Embedding: nvidia/nv-embedqa-e5-v5 | 1 A100 | (academic) fiqa dataset | Minutes |
| Retriever Evaluation (embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5 and reranking: nvidia/nv-rerankqa-mistral-4b-v3 | 2 A100s | (academic) fiqa dataset | Minutes |
| RAG (with pre-retrieved contexts) | Inference: meta/llama-3.1-70b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 9 A100s | 3 questions | Minutes |
| RAG (with pre-generated answers) | Judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 5 A100s | 322 questions | Minutes - 1 Hour |
| RAG (retriever with embedding only) | Embedding: nvidia/nv-embedqa-e5-v5, inference: meta/llama-3.1-8b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 7 A100s | (academic) fiqa dataset | 1 - 5 Hours |
| RAG (retriever with embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5, reranking: nvidia/nv-rerankqa-mistral-4b-v3, inference: meta/llama-3.1-70b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 11 A100s | (academic) fiqa dataset | 1 - 5 Hours |
Common Job Tasks#
The following sections demonstrate common tasks you might want to do with your Evaluation jobs.
Tip
These APIs support filtering, sorting, and pagination. For details, refer to Filter and Sort Responses from the NVIDIA NeMo Evaluator API.
List all Evaluation Jobs#
To list all evaluation jobs, send a GET request to the jobs endpoint, as shown in the following code.
After you submit the request, a list of all the evaluation jobs is returned, with details for each job.
To filter the jobs that the API returns, refer to Filter Jobs.
curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs" \
-H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"
response = requests.get(endpoint).json()
response
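To work with the list programmatically, you can iterate over the returned jobs. The following minimal sketch assumes the list response wraps the jobs in a data field; adjust the key if your response is structured differently.
# Print the ID and status of each job in the list.
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"
jobs = requests.get(endpoint).json()
for job in jobs.get("data", []):  # assumption: jobs are returned under a "data" key
    print(job["id"], job["status"])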
Get the Status of an Evaluation Job#
To get the status of an evaluation job, send a GET request to the status endpoint for the job, as shown in the following code.
curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/status" \
-H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/status"
response = requests.get(endpoint).json()
# Get the status.
job_status = response['status']
print(f"Job status: {job_status}")
The response is similar to the following.
{
"message": "Job completed successfully",
"task_status": {},
"progress": null
}
The message is one of the following:
- Job is pending
- Job is running
- Job completed successfully
- Unable to launch the evaluation job because there is a problem in the target, config, or environment
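If you automate status checks, you may want to detect launch failures early. The following minimal sketch inspects the message field returned by the status endpoint shown above, using the message values listed in this section, and prints task_status for troubleshooting; replace <job-id> with your job ID.
# Inspect the status response for launch problems.
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/status"
status_info = requests.get(endpoint).json()
message = status_info.get("message") or ""
if message.startswith("Unable to launch"):
    # The target, config, or environment has a problem; task_status may contain details.
    print(f"Launch failed: {message}")
    print(status_info.get("task_status"))
else:
    print(f"Status message: {message}")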
Get the Details of an Evaluation Job#
To get the full details of an evaluation job, send a GET request to the jobs endpoint, as shown in the following code.
curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>" \
-H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>"
response = requests.get(endpoint).json()
# Get the status.
job_status = response['status']
print(f"Job status: {job_status}")
The response is similar to the following.
{
"created_at": "2025-03-19T22:50:15.684382",
"updated_at": "2025-03-19T22:50:15.684385",
"id": "eval-UVW123XYZ456",
"namespace": "my-organization",
"description": null,
"target": {
//target details
},
"config": {
// config details
},
"result": "evaluation_result-1234ABCD5678EFGH",
"output_files_url": "hf://datasets/evaluation-results/eval-UVW123XYZ456",
"status_details": {
"message": "Job completed successfully",
"task_status": {},
"progress": null
},
"status": "completed",
"project": null,
"custom_fields": {},
"ownership": null
}
The status field is one of the following:
- CREATED — The job is created, but not yet scheduled.
- PENDING — The job is waiting for resource allocation.
- RUNNING — The job is currently running.
- COMPLETED — The job has completed successfully.
- CANCELLED — The job has been cancelled by the user.
- FAILED — The job failed to run and terminated.
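After the status reaches completed, the job details include the result ID and the location of the output files. The following minimal sketch reads those fields from the details response shown above; replace <job-id> with your job ID. Downloading and using the results is covered in Use the Results of Your Job.
# Read the result ID and output location from a completed job's details.
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>"
job = requests.get(endpoint).json()
if job["status"] == "completed":
    print(f"Result ID: {job['result']}")               # for example, evaluation_result-1234ABCD5678EFGH
    print(f"Output files: {job['output_files_url']}")  # for example, hf://datasets/evaluation-results/<job-id>
else:
    print(f"Job not finished yet, status: {job['status']}")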
Delete an Evaluation Job#
To delete an evaluation job, send a DELETE request to the jobs endpoint. You must provide the ID of the job, as shown in the following code.
curl -X "DELETE" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>" \
-H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>"
response = requests.delete(endpoint).json()
response
Create Job Response#
When you create an evaluation job, the response is similar to the following. This is also the structure of the response for each job when you list all jobs or get the details for a single job.
For the full response reference, refer to Evaluator API.
{
"created_at": "2025-03-19T22:50:15.684382",
"updated_at": "2025-03-19T22:50:15.684385",
"id": "eval-UVW123XYZ456",
"namespace": "my-organization",
"description": null,
"target": {
//target details
},
"config": {
// config details
},
"result": null,
"output_files_url": null,
"status_details": {
"message": null,
"task_status": {},
"progress": null
},
"status": "created",
"project": null,
"custom_fields": {},
"ownership": null
}