Run an LLM Judge Eval#
Learn how to run an LLM Judge evaluation over a custom dataset.
Evaluation type: custom, using the LLM as a Judge flow
Dataset: HelpSteer2
Tip
This tutorial takes around 3 minutes to complete.
Prerequisites#
Set up Evaluator with Docker Compose and deploy meta/llama-3.2-3b-instruct.
If you do not have a GPU to deploy a model, use a hosted model, such as Llama 3.2 3B Instruct from build.nvidia.com, as detailed below.
Store your service URLs as variables to use them in your code.
Base URL is the main endpoint for interacting with the NeMo Microservices Platform.
Inference base URL is the inference endpoint for models.
NeMo Data Store URL is the service address for dataset storage; it exposes a Hugging-Face-compatible API.
from nemo_microservices import NeMoMicroservices

# Set variables and initialize the client
client = NeMoMicroservices(
    base_url="http://localhost:8080",
    inference_base_url="http://localhost:8000",
)
nemo_data_store_url = "http://localhost:3000"
HF_TOKEN = "<your readonly Hugging Face token>"
export NEMO_MICROSERVICES_BASE_URL="http://localhost:8080"
export NIM_BASE_URL="http://localhost:8000"
export NEMO_DATASTORE_URL="http://localhost:3000"
export HF_TOKEN="<your HF token>"
export DATASET_ID="default/helpsteer2"
Update the URLs accordingly for your deployment:
If you have Evaluator deployed following the Demo Cluster Setup on minikube
Base URL: http://nemo.test
Inference base URL: http://nim.test
If you have Evaluator deployed to a Kubernetes cluster (Evaluator individually or the production platform), update the URLs to the ingress setup.
If you are using a hosted model from build.nvidia.com
Inference base URL: https://integrate.api.nvidia.com
Verify service URLs before starting the tutorial.
jobs = client.v2.evaluation.jobs.list()
print(jobs)

import requests

resp = requests.get(f"{client.inference_base_url}/v1/models")
print(resp)
curl ${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs

{"object":"list","data":[],"pagination":{}}

curl ${NIM_BASE_URL}/v1/models

{
  "object": "list",
  "data": [
    {
      "id": "meta/llama-3.2-3b-instruct",
      "object": "model",
      "created": 1760457209,
      "owned_by": "system",
      "root": "meta/llama-3.2-3b-instruct",
      "parent": null,
      "max_model_len": 131072,
      "permission": []
    }
  ]
}
1. Prepare Your Dataset#
First, we’ll prepare a custom dataset from HelpSteer2 by extracting only the prompt and response columns for evaluation. Later, we will compare the LLM judge’s predictions with the original HelpSteer2 metrics.
Download and process the dataset.
import requests
import pandas as pd

# Download the HelpSteer2 dataset from Hugging Face
df = pd.read_json("hf://datasets/nvidia/HelpSteer2/train.jsonl.gz", lines=True)

# Extract only the prompt and response columns for evaluation
df = df[["prompt", "response"]].head(30)

# Save to a local file
file_name = "helpsteer2.jsonl"
df.to_json(file_name, orient="records", lines=True)

print(f"Dataset prepared with {len(df)} samples")
print("Sample data:")
print(df.head())
Upload dataset to NeMo Data Store.
import os
from huggingface_hub import HfApi

hf_api = HfApi(endpoint=f"{nemo_data_store_url}/v1/hf", token=HF_TOKEN)
dataset_id = "default/helpsteer2"

# Create the dataset repo if it doesn't exist
hf_api.create_repo(repo_id=dataset_id, repo_type="dataset", exist_ok=True)

# Upload the file
result = hf_api.upload_file(
    path_or_fileobj=file_name,
    path_in_repo=file_name,
    repo_id=dataset_id,
    repo_type="dataset",
    revision="main",
    commit_message=f"Eval dataset in {dataset_id}"
)
print(f"Dataset uploaded: {result}")
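Optionally, confirm the upload before moving on. The following is a minimal sketch that lists the files stored in the dataset repository through the same Hugging-Face-compatible API; it assumes hf_api and dataset_id from the snippet above are still in scope.

# Optional check: list the files stored in the dataset repo on NeMo Data Store.
# Assumes hf_api and dataset_id are defined as in the upload step above.
files = hf_api.list_repo_files(repo_id=dataset_id, repo_type="dataset")
print(f"Files in {dataset_id}: {files}")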
2. Submit the Evaluation Job#
Configure the judge model.
judge_model = {
"api_endpoint": {
"url": f"{client.inference_base_url}/v1/chat/completions",
"model_id": "meta/llama-3.2-3b-instruct"
}
}
If you have Evaluator deployed following the Demo Cluster Setup on minikube, configure the judge model by ID.
judge_model = "meta/llama-3.1-8b-instruct"
If you use a hosted model such as one from build.nvidia.com, configure the judge model with the URL and API key for the hosted model.
judge_model = {
"api_endpoint": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"model_id": "meta/llama-3.2-3b-instruct",
"api_key": "<your build.nvidia.com API key>"
}
}
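Before creating the job, you can optionally send a single request to the judge endpoint to confirm it is reachable. The following is a minimal sketch against the OpenAI-compatible chat completions API; it assumes the dictionary form of judge_model shown above and is not required for the tutorial.

import requests

# Optional sanity check: send one chat completion to the judge endpoint.
# Assumes judge_model is the dictionary form shown above (not the model ID string).
endpoint = judge_model["api_endpoint"]
headers = {"Content-Type": "application/json"}
if "api_key" in endpoint:
    headers["Authorization"] = f"Bearer {endpoint['api_key']}"

resp = requests.post(
    endpoint["url"],
    headers=headers,
    json={
        "model": endpoint["model_id"],
        "messages": [{"role": "user", "content": "Reply with OK."}],
        "max_tokens": 8,
    },
)
print(resp.status_code, resp.json()["choices"][0]["message"]["content"])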
Configure and create your job.
config = {
"type": "custom",
"tasks": {
"my-helpsteer2-task": {
"type": "data",
"metrics": {
"my-llm-judge-metric": {
"type": "llm-judge",
"params": {
"model": judge_model,
"template": {
"messages": [
{"role": "system", "content": "You are an expert evaluator for answers to user queries. Your task is to assess responses to user queries based on helpfulness, relevance, accuracy, and clarity."},
{"role": "user", "content": "Calculate the following metrics for the response: User Query: {{item.prompt}} Model Response: {{item.response}} Metrics: 1. Helpfulness (0-4): How well does the response help the user? 2. Correctness (0-4): Is the information correct? 3. Coherence (0-4): Is the response logically consistent and well-structured? 4. Complexity (0-4): How sophisticated is the response? 5. Verbosity (0-4): Is the response appropriately detailed? Instructions: Assign a score from 0 (poor) to 4 (excellent) for each metric."}
]
},
"structured_output": {
"schema": {
"type": "object",
"properties": {
"helpfulness": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"correctness": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"coherence": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"complexity": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"verbosity": {
"type": "integer",
"minimum": 0,
"maximum": 4
}
},
"required": ["helpfulness", "correctness", "coherence", "complexity", "verbosity"],
"additionalProperties": False
}
},
"scores": {
"helpfulness": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "helpfulness"
}
},
"correctness": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "correctness"
}
},
"coherence": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "coherence"
}
},
"complexity": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "complexity"
}
},
"verbosity": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "verbosity"
}
}
}
}
}
}
}
}
}
target = {"type": "dataset", "dataset": {"files_url": f"hf://datasets/{dataset_id}"}}
job = client.v2.evaluation.jobs.create(
spec={
"target": target,
"config": config
}
)
job_id = job.id
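It can be helpful to print the job ID and the full create response for reference before moving on.

# Print the job ID and the full create response for reference.
print(f"Created evaluation job: {job_id}")
print(job)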
Configure the judge model.
export JUDGE_MODEL='{
"api_endpoint": {
"url": "'${NIM_BASE_URL}'",
"model_id": "meta/llama-3.2-3b-instruct"
}
}'
If you have Evaluator deployed following the Demo Cluster Setup on minikube, configure the judge model by ID.
export JUDGE_MODEL='"meta/llama-3.1-8b-instruct"'
If you use a hosted model such as one from build.nvidia.com, configure the judge model with the URL and API key for the hosted model.
export JUDGE_MODEL='{
"api_endpoint": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"model_id": "meta/llama-3.2-3b-instruct",
"api_key": "<your build.nvidia.com API key>"
}
}'
curl -X POST "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"spec": {
"target": {"type": "dataset", "dataset": {"files_url": "hf://datasets/'${DATASET_ID}'"}},
"config": {
"type": "custom",
"tasks": {
"my-helpsteer2-task": {
"type": "data",
"metrics": {
"my-llm-judge-metric": {
"type": "llm-judge",
"params": {
"model": '${JUDGE_MODEL}',
"template": {
"messages": [
{"role": "system", "content": "You are an expert evaluator for answers to user queries. Your task is to assess responses to user queries based on helpfulness, relevance, accuracy, and clarity."},
{"role": "user", "content": "Calculate the following metrics for the response: User Query: {{item.prompt}} Model Response: {{item.response}} Metrics: 1. Helpfulness (0-4): How well does the response help the user? 2. Correctness (0-4): Is the information correct? 3. Coherence (0-4): Is the response logically consistent and well-structured? 4. Complexity (0-4): How sophisticated is the response? 5. Verbosity (0-4): Is the response appropriately detailed? Instructions: Assign a score from 0 (poor) to 4 (excellent) for each metric."}
]
},
"structured_output": {
"schema": {
"type": "object",
"properties": {
"helpfulness": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"correctness": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"coherence": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"complexity": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"verbosity": {
"type": "integer",
"minimum": 0,
"maximum": 4
}
},
"required": ["helpfulness", "correctness", "coherence", "complexity", "verbosity"],
"additionalProperties": false
}
},
"scores": {
"helpfulness": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "helpfulness"
}
},
"correctness": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "correctness"
}
},
"coherence": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "coherence"
}
},
"complexity": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "complexity"
}
},
"verbosity": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "verbosity"
}
}
}
}
}
}
}
}
}
}
}'
Jobs are uniquely identified by an id, which appears in the JSON response. It is helpful to store the job ID in a variable for the subsequent steps:
export JOB_ID=<id returned in the JSON response>
echo "Job ID: $JOB_ID"
3. Get the Status of Your Evaluation Job#
To get the status of the evaluation job that you submitted in the previous step, use the following code.
job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
print(f"Job status: {job_status}")
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/status" \
-H 'accept: application/json'
You receive a response similar to the following, which contains the status of each step and the percentage of progress completed so far. For more information, refer to Get Evaluation Job Status.
{
"job_id": "job-uddkbn85bw7fnuyf1vq626",
"status": "active",
"status_details": {},
"error_details": null,
"steps": [
{
"name": "evaluation",
"status": "created",
"status_details": {},
"error_details": {},
"tasks": []
},
{
"name": "target-dataset",
"status": "completed",
"tasks": [
{
"id": "733048a672454e17ba31daa5de9f8029",
"status": "completed",
"status_details": {},
"error_details": {},
"error_stack": null
}
]
}
]
}
Monitor the job until it completes.
import time

job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
while job_status.status in ("active", "pending", "created"):
    job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
    time.sleep(10)
print(job_status)
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/status" \
-H 'accept: application/json'
{
"job_id": "job-uddkbn85bw7fnuyf1vq626",
"status": "completed",
"status_details": {
"samples_processed": 1329,
"progress": 100
},
"error_details": null,
"steps": [
{
"name": "results",
"status": "completed"
},
{
"name": "evaluation",
"status": "completed"
}
]
}
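Once the polling loop in the Python example exits, the job is in a terminal state. A short guard like the following sketch makes a failed run obvious before you fetch results; status values other than completed are an assumption and may vary by release.

# Stop if the job reached a terminal state other than "completed".
# Terminal failure status values are an assumption and may vary by release.
if job_status.status != "completed":
    raise RuntimeError(f"Evaluation job {job.id} ended with status: {job_status.status}")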
4. View Evaluation Job Results#
Once the job has completed successfully, you can examine the results of the evaluation to analyze the LLM judge’s assessments.
As JSON#
View results as JSON.
# v2 - Get structured evaluation results
results = client.v2.evaluation.jobs.results.evaluation_results.retrieve(job_id)
print(results.model_dump_json(indent=2, exclude_none=True))
# v2 - List available result types
available_results = client.v2.evaluation.jobs.results.list(job_id)
print(f"Available results: {[r.result_name for r in available_results.data]}")
# Get structured evaluation results
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results/evaluation-results/download" \
-H 'accept: application/json'
# List available result types
curl -X GET "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results" \
-H 'accept: application/json'
As Download#
Download results to a local file.
# v2 - Download job artifacts (includes logs, intermediate files, etc.)
artifacts_zip = client.v2.evaluation.jobs.results.artifacts.retrieve(job_id)
artifacts_zip.write_to_file("evaluation_artifacts.zip")
print("Saved artifacts to evaluation_artifacts.zip")
# v2 - Download evaluation results separately
eval_results = client.v2.evaluation.jobs.results.evaluation_results.retrieve(job_id)
with open("evaluation_results.json", "w") as f:
f.write(eval_results.model_dump_json(indent=2, exclude_none=True))
print("Saved results to evaluation_results.json")
# Download job artifacts
curl -X GET "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results/artifacts/download" \
-H 'accept: application/zip' \
-o evaluation_artifacts.zip
# Download evaluation results
curl -X GET "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results/evaluation-results/download" \
-H 'accept: application/json'
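Because HelpSteer2 ships human-annotated scores for the same five attributes, you can sanity-check the judge against them, as noted in step 1. The sketch below only recomputes the mean human labels for the 30 evaluated rows; compare those means with the aggregate judge scores reported in evaluation_results.json (no assumption is made here about the results file's schema).

import pandas as pd

# Reload the same 30 HelpSteer2 rows, this time keeping the human-annotated labels.
df = pd.read_json("hf://datasets/nvidia/HelpSteer2/train.jsonl.gz", lines=True).head(30)
label_cols = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Mean human scores (0-4) for the samples sent to the judge.
# Compare these against the aggregate scores in evaluation_results.json.
print(df[label_cols].mean().round(2))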