Run and Manage Evaluation Jobs#

After you create an evaluation target and an evaluation configuration, you are ready to run an evaluation job in NVIDIA NeMo Evaluator.

Evaluation Job Workflow#

Use the following procedure to run an evaluation in NeMo Evaluator.

  1. (Optional) If you are using a custom dataset for evaluation, upload it to NeMo Data Store. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.

  2. Prepare a target for your evaluation. You can create a new target or reuse an existing one that you prepared previously. Record the ID of the target that you want to use for your evaluation. For more information, refer to Create and Manage Evaluation Targets.

  3. Prepare a configuration for your evaluation. You can create a new configuration or reuse an existing one that you prepared previously. Record the ID of the configuration that you want to use for your evaluation. For more information, refer to Create and Manage Evaluation Configurations.

  4. Run your evaluation by creating a job that includes the target ID and configuration ID from the previous steps. For more information, refer to Example Evaluation Job.

  5. Get your results. For more information, refer to Use the Results of Your Job. A condensed sketch of steps 4 and 5 follows this list.
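The following minimal sketch condenses steps 4 and 5 in Python. It assumes that you have already recorded a target ID and a configuration ID and that the Evaluator hostname is set as described later in Evaluator API URL; the namespace, target, and configuration names are placeholders.

import requests

EVALUATOR_HOSTNAME = "<your evaluator service endpoint>"
BASE_URL = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation"

# Step 4: create the job from an existing target and configuration.
job = requests.post(
    f"{BASE_URL}/jobs",
    json={
        "namespace": "my-organization",
        "target": "my-target-namespace/my-target-name",  # ID from step 2
        "config": "my-config-namespace/my-config-name",  # ID from step 3
    },
).json()
print(f"Created job {job['id']} with status {job['status']}")

# Step 5: check on the job. When it completes, retrieve the results as
# described in Use the Results of Your Job.
details = requests.get(f"{BASE_URL}/jobs/{job['id']}").json()
print(details["status_details"]["message"])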

Tip

To see what targets and configurations are supported together, refer to Combine Evaluation Targets and Configurations.

Evaluator API URL#

To create an evaluation job, send a POST request to the evaluation/jobs API. The URL of the evaluator API depends on where you deploy evaluator and how you configure it. For more information, refer to NeMo Evaluator Deployment Guide.

The examples in this documentation use {EVALUATOR_HOSTNAME} in the code. Store the evaluator hostname as follows so that you can use it in your code.

Important

Replace <your evaluator service endpoint> with your address, such as evaluator.internal.your-company.com, before you run this code.

export EVALUATOR_HOSTNAME="<your evaluator service endpoint>"
import requests

EVALUATOR_HOSTNAME = "<your evaluator service endpoint>" 

Example Evaluation Job#

Use the following code to create an evaluation job.

curl -X "POST" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '
    {
      "namespace": "my-organization",
      "target": "<my-target-namespace/my-target-name>",
      "config": "<my-config-namespace/my-config-name>"
    }'
data = {
   "namespace": "my-organization",
   "target": "<my-target-namespace/my-target-name>",
   "config": "<my-config-namespace/my-config-name>"
}

endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"

# Make the API call
response = requests.post(endpoint, json=data).json()

# Get the job_id so we can refer to it later
job_id = response['id']
print(f"Job ID: {job_id}")

# Get the status. You should see `CREATED`, `PENDING`, or `RUNNING`.
job_status = response['status']
print(f"Job status: {job_status}")

To see a sample response, refer to Create Job Response.

Job JSON Reference#

When you create a job for an evaluation, you send a JSON data structure that contains the information for your configuration.

Important

Each job is uniquely identified by a job_id that the Evaluator service creates, for example eval-1234ABCD5678EFGH.

The following table describes selected fields of the JSON data. For the full API reference, refer to Evaluator API.

| Name | Description | Type |
|---|---|---|
| config | The unique namespace/name that identifies the configuration to use for the evaluation. For more information, refer to Create and Manage Evaluation Configurations. | String |
| id | The ID of the job. The ID is returned in the response when you create a job. | String |
| namespace | An arbitrary organization name, a vendor name, or any other text. If you don’t specify a namespace, the default is default. | String |
| target | The unique namespace/name that identifies the target to use for the evaluation. For more information, refer to Create and Manage Evaluation Targets. | String |

Combine Evaluation Targets and Configurations#

Because NeMo Evaluator separates the target and the configuration, you can create a configuration once, and reuse it multiple times with different targets. You can also create a target once, and reuse it multiple times with different evaluations. The following table gives examples of how to combine targets and configurations for an evaluation job.

Caution

Before you can run a retriever evaluation, or a RAG evaluation that has a retriever step, you must set up a Milvus document store. For more information, refer to Configure Milvus.
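To make the reuse of a configuration concrete, the following sketch submits one job per target while keeping the configuration fixed. The namespace, target, and configuration names are placeholders; substitute the IDs that you recorded earlier.

import requests

EVALUATOR_HOSTNAME = "<your evaluator service endpoint>"
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"

# One configuration, reused against several targets (placeholder names).
config = "my-config-namespace/my-config-name"
targets = [
    "my-target-namespace/model-a",
    "my-target-namespace/model-b",
]

for target in targets:
    job = requests.post(
        endpoint,
        json={"namespace": "my-organization", "target": target, "config": config},
    ).json()
    print(f"Created job {job['id']} for target {target}")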

Expected Evaluation Duration#

The time that an evaluation job takes can vary from a few minutes to many hours, depending on the target model, the configuration, and other factors. The following table contains some example evaluation durations.

Important

These are only a few representative examples; actual durations depend on your models, hardware, and datasets.

| Example Evaluation | Example Models | Example Hardware | Example Dataset | Example Expected Duration |
|---|---|---|---|---|
| LM Eval Harness (gsm8k task) | Inference: meta/llama-3.1-8b-instruct | 1 A100 | (academic) gsm8k dataset | 5 - 10 Hours |
| Bigcode | Inference: meta/llama-3.1-8b-instruct | 1 A100 | (academic) humaneval dataset | 1 - 5 Hours |
| Similarity metrics | Offline generated answers | | 20 answers | Minutes |
| Similarity metrics | Inference: meta/llama-3.1-8b-instruct | 1 A100 | 113 questions / prompts | Minutes - 1 Hour |
| LLM-as-a-Judge | Inference: meta/llama-3.1-8b-instruct and judge: meta/llama-3.1-70b-instruct | 5 A100s | (academic) mtbench dataset | 1 - 5 Hours |
| LLM-as-a-Judge (judgement only with custom dataset) | Judge: meta/llama-3.1-70b-instruct | 4 A100s | 2 answers | Minutes |
| Retriever Evaluation (embedding only) | Embedding: nvidia/nv-embedqa-e5-v5 | 1 A100 | (academic) fiqa dataset | Minutes |
| Retriever Evaluation (embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5 and reranking: nvidia/nv-rerankqa-mistral-4b-v3 | 2 A100s | (academic) fiqa dataset | Minutes |
| RAG (with pre-retrieved contexts) | Inference: meta/llama-3.1-70b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 9 A100s | 3 questions | Minutes |
| RAG (with pre-generated answers) | Judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 5 A100s | 322 questions | Minutes - 1 Hour |
| RAG (retriever with embedding only) | Embedding: nvidia/nv-embedqa-e5-v5, inference: meta/llama-3.1-8b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 7 A100s | (academic) fiqa dataset | 1 - 5 Hours |
| RAG (retriever with embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5, reranking: nvidia/nv-rerankqa-mistral-4b-v3, inference: meta/llama-3.1-70b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 11 A100s | (academic) fiqa dataset | 1 - 5 Hours |

Common Job Tasks#

The following sections demonstrate common tasks that you might perform with your evaluation jobs.

Tip

These APIs support filtering, sorting, and pagination. For details, refer to Filter and Sort Responses from the NVIDIA NeMo Evaluator API.

List all Evaluation Jobs#

To list all evaluation jobs, send a GET request to the jobs endpoint, as shown in the following code. After you submit the request, a list of all the evaluation jobs is returned, with details for each job. To filter the jobs that the API returns, refer to Filter Jobs.

curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs" \
  -H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"
response = requests.get(endpoint).json()
response
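To narrow the list, you can pass query parameters for filtering, sorting, and pagination; refer to Filter and Sort Responses from the NVIDIA NeMo Evaluator API for the exact parameter names. The parameter names in the following sketch are assumptions for illustration only.

import requests

EVALUATOR_HOSTNAME = "<your evaluator service endpoint>"
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs"

# Hypothetical query parameters; confirm the exact names and syntax in
# Filter and Sort Responses from the NVIDIA NeMo Evaluator API.
params = {
    "page_size": 10,                         # assumed pagination parameter
    "filter[namespace]": "my-organization",  # assumed filter syntax
}

response = requests.get(endpoint, params=params).json()
response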

Get the Status of an Evaluation Job#

To get the status of an evaluation job, send a GET request to the jobs endpoint, as shown in the following code.

curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/status" \
  -H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>/status"
response = requests.get(endpoint).json()

# Get the status message.
job_status_message = response['message']
print(f"Job status: {job_status_message}")

The response is similar to the following.

{
    "message": "Job completed successfully",
    "task_status": {},
    "progress": null
}

The message is one of the following:

  • Job is pending

  • Job is running

  • Job completed successfully

  • Unable to launch the evaluation job because there is a problem in the target, config, or environment

Get the Details of an Evaluation Job#

To get the full details of an evaluation job, send a GET request to the jobs endpoint, as shown in the following code.

curl -X "GET" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>" \
  -H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>"
response = requests.get(endpoint).json()

# Get the status.
job_status = response['status']
print(f"Job status: {job_status}")

The response is similar to the following.

{
    "created_at": "2025-03-19T22:50:15.684382",
    "updated_at": "2025-03-19T22:50:15.684385",
    "id": "eval-UVW123XYZ456",
    "namespace": "my-organization",
    "description": null,
    "target": {
        //target details
    },
    "config": {
        // config details
    },
    "result": "evaluation_result-1234ABCD5678EFGH",
    "output_files_url": "hf://datasets/evaluation-results/eval-UVW123XYZ456",
    "status_details": {
        "message": "Job completed successfully",
        "task_status": {},
        "progress": null
    },
    "status": "completed",
    "project": null,
    "custom_fields": {},
    "ownership": null
}

The status field is one of the following:

  • CREATED — The job is created, but not yet scheduled.

  • PENDING — The job is waiting for resource allocation.

  • RUNNING — The job is currently running.

  • COMPLETED — The job has completed successfully.

  • CANCELLED — The job has been cancelled by the user.

  • FAILED — The job failed to run and terminated.
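Because COMPLETED, CANCELLED, and FAILED are terminal, a simple way to wait for a job is to poll the job details endpoint shown above until the status reaches one of them. The following is a minimal sketch; the polling interval is arbitrary, and the comparison is case-insensitive because the sample response reports the status in lowercase.

import time
import requests

EVALUATOR_HOSTNAME = "<your evaluator service endpoint>"
job_id = "<job-id>"
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/{job_id}"

# Poll until the job reaches a terminal status.
TERMINAL_STATUSES = {"COMPLETED", "CANCELLED", "FAILED"}

while True:
    job = requests.get(endpoint).json()
    status = job["status"].upper()
    print(f"Job {job_id} status: {status}")
    if status in TERMINAL_STATUSES:
        break
    time.sleep(30)  # arbitrary polling interval

if status == "COMPLETED":
    # output_files_url appears in the job details after successful completion.
    print(f"Output files: {job['output_files_url']}")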

Delete an Evaluation Job#

To delete an evaluation job, send a DELETE request to the jobs endpoint. You must provide the ID of the job as shown in the following code.

curl -X "DELETE" "http://${EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>" \
  -H 'accept: application/json'
endpoint = f"http://{EVALUATOR_HOSTNAME}/v1/evaluation/jobs/<job-id>"
response = requests.delete(endpoint).json()
response

Create Job Response#

When you create an evaluation job, the response is similar to the following. This is also the structure of the response for each job when you list all jobs or get the details for a single job.

For the full response reference, refer to Evaluator API.

{
    "created_at": "2025-03-19T22:50:15.684382",
    "updated_at": "2025-03-19T22:50:15.684385",
    "id": "eval-UVW123XYZ456",
    "namespace": "my-organization",
    "description": null,
    "target": {
        //target details
    },
    "config": {
        // config details
    },
    "result": null,
    "output_files_url": null,
    "status_details": {
        "message": null,
        "task_status": {},
        "progress": null
    },
    "status": "created",
    "project": null,
    "custom_fields": {},
    "ownership": null
}