Run an LLM Judge Eval#
Learn how to run an LLM Judge evaluation over a custom dataset.
Evaluation type: custom, using the LLM as a Judge flow
Dataset: HelpSteer2
Tip
This tutorial takes around 3 minutes to complete.
Prerequisites#
Set up Evaluator with Docker Compose and deploy meta/llama-3.2-3b-instruct.
If you do not have a GPU to deploy a model, use a hosted model, such as Llama 3.2 3B Instruct from build.nvidia.com, as detailed below.
Store your service URLs as variables to use them in your code.
Base URL is the main endpoint for interacting with the NeMo Microservices Platform.
Inference base URL is the inference endpoint for models.
NeMo Data Store URL is the service address for dataset storage; it exposes a Hugging-Face-compatible API.
from nemo_microservices import NeMoMicroservices

# Set variables and initialize the client
client = NeMoMicroservices(
    base_url="http://localhost:8080",
    inference_base_url="http://localhost:8000",
)
nemo_data_store_url = "http://localhost:3000"
HF_TOKEN = "<your readonly Hugging Face token>"
export NEMO_MICROSERVICES_BASE_URL="http://localhost:8080"
export NIM_BASE_URL="http://localhost:8000"
export NEMO_DATASTORE_URL="http://localhost:3000"
export HF_TOKEN="<your HF token>"
export DATASET_ID="default/helpsteer2"
Update the URLs accordingly for your deployment:
If you have Evaluator deployed following the Demo Cluster Setup on minikube
Base URL: http://nemo.test
Inference base URL: http://nim.test
If you have Evaluator deployed to a Kubernetes cluster (Evaluator individually or the production platform), update the URLs to the ingress setup.
If you are using a hosted model from build.nvidia.com
Inference base URL: https://integrate.api.nvidia.com
Verify service URLs before starting the tutorial.
jobs = client.v2.evaluation.jobs.list()
print(jobs)

import requests

resp = requests.get(f"{client.inference_base_url}/v1/models")
print(resp)
curl ${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs

{"object":"list","data":[],"pagination":{}}

curl ${NIM_BASE_URL}/v1/models

{
  "object": "list",
  "data": [
    {
      "id": "meta/llama-3.2-3b-instruct",
      "object": "model",
      "created": 1760457209,
      "owned_by": "system",
      "root": "meta/llama-3.2-3b-instruct",
      "parent": null,
      "max_model_len": 131072,
      "permission": []
    }
  ]
}
1. Prepare Your Dataset#
First, we’ll prepare a custom dataset from HelpSteer2 by extracting only the prompt and response columns for evaluation. Later, we will compare the LLM judge’s predictions with the original HelpSteer2 metrics.
Download and process the dataset.
import requests
import pandas as pd

# Download the HelpSteer2 dataset from Hugging Face
df = pd.read_json("hf://datasets/nvidia/HelpSteer2/train.jsonl.gz", lines=True)

# Extract only the prompt and response columns for evaluation
df = df[["prompt", "response"]].head(30)

# Save to a local file
file_name = "helpsteer2.jsonl"
df.to_json(file_name, orient="records", lines=True)

print(f"Dataset prepared with {len(df)} samples")
print("Sample data:")
print(df.head())
Upload dataset to NeMo Data Store.
import os
from huggingface_hub import HfApi

hf_api = HfApi(endpoint=f"{nemo_data_store_url}/v1/hf", token=HF_TOKEN)
dataset_id = "default/helpsteer2"

# Create the dataset repo if it doesn't exist
hf_api.create_repo(repo_id=dataset_id, repo_type="dataset", exist_ok=True)

# Upload the file
result = hf_api.upload_file(
    path_or_fileobj=file_name,
    path_in_repo=file_name,
    repo_id=dataset_id,
    repo_type="dataset",
    revision="main",
    commit_message=f"Eval dataset in {dataset_id}"
)
print(f"Dataset uploaded: {result}")
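Optionally, confirm the upload before moving on. The following is a minimal sketch that lists the files stored in the dataset repository through the same Hugging-Face-compatible API; it assumes hf_api and dataset_id from the snippet above are still in scope.

# Optional check: list the files stored in the dataset repo on NeMo Data Store.
# Assumes hf_api and dataset_id are defined as in the upload step above.
files = hf_api.list_repo_files(repo_id=dataset_id, repo_type="dataset")
print(f"Files in {dataset_id}: {files}")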
2. Submit the Evaluation Job#
Configure the judge model.
judge_model = {
"api_endpoint": {
"url": f"{client.inference_base_url}/v1/chat/completions",
"model_id": "meta/llama-3.2-3b-instruct"
}
}
If you have Evaluator deployed following the Demo Cluster Setup on minikube, configure the judge model by ID.
judge_model = "meta/llama-3.1-8b-instruct"
If you use a hosted model such as one from build.nvidia.com, configure the judge model with the URL and API key for the hosted model.
judge_model = {
"api_endpoint": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"model_id": "meta/llama-3.2-3b-instruct",
"api_key": "<your build.nvidia.com API key>"
}
}
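Before creating the job, you can optionally send a single request to the judge endpoint to confirm it is reachable. The following is a minimal sketch against the OpenAI-compatible chat completions API; it assumes the dictionary form of judge_model shown above and is not required for the tutorial.

import requests

# Optional sanity check: send one chat completion to the judge endpoint.
# Assumes judge_model is the dictionary form shown above (not the model ID string).
endpoint = judge_model["api_endpoint"]
headers = {"Content-Type": "application/json"}
if "api_key" in endpoint:
    headers["Authorization"] = f"Bearer {endpoint['api_key']}"

resp = requests.post(
    endpoint["url"],
    headers=headers,
    json={
        "model": endpoint["model_id"],
        "messages": [{"role": "user", "content": "Reply with OK."}],
        "max_tokens": 8,
    },
)
print(resp.status_code, resp.json()["choices"][0]["message"]["content"])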
Configure and create your job.
config = {
"type": "custom",
"tasks": {
"my-helpsteer2-task": {
"type": "data",
"metrics": {
"my-llm-judge-metric": {
"type": "llm-judge",
"params": {
"model": judge_model,
"template": {
"messages": [
{"role": "system", "content": "You are an expert evaluator for answers to user queries. Your task is to assess responses to user queries based on helpfulness, relevance, accuracy, and clarity."},
{"role": "user", "content": "Calculate the following metrics for the response: User Query: {{item.prompt}} Model Response: {{item.response}} Metrics: 1. Helpfulness (0-4): How well does the response help the user? 2. Correctness (0-4): Is the information correct? 3. Coherence (0-4): Is the response logically consistent and well-structured? 4. Complexity (0-4): How sophisticated is the response? 5. Verbosity (0-4): Is the response appropriately detailed? Instructions: Assign a score from 0 (poor) to 4 (excellent) for each metric."}
]
},
"structured_output": {
"schema": {
"type": "object",
"properties": {
"helpfulness": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"correctness": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"coherence": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"complexity": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"verbosity": {
"type": "integer",
"minimum": 0,
"maximum": 4
}
},
"required": ["helpfulness", "correctness", "coherence", "complexity", "verbosity"],
"additionalProperties": False
}
},
"scores": {
"helpfulness": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "helpfulness"
}
},
"correctness": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "correctness"
}
},
"coherence": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "coherence"
}
},
"complexity": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "complexity"
}
},
"verbosity": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "verbosity"
}
}
}
}
}
}
}
}
}
target = {"type": "dataset", "dataset": {"files_url": f"hf://datasets/{dataset_id}"}}
job = client.v2.evaluation.jobs.create(
spec={
"target": target,
"config": config
}
)
job_id = job.id
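It can be helpful to print the job ID and the full create response for reference before moving on.

# Print the job ID and the full create response for reference.
print(f"Created evaluation job: {job_id}")
print(job)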
Configure the judge model.
export JUDGE_MODEL='{
"api_endpoint": {
"url": "'${NIM_BASE_URL}'",
"model_id": "meta/llama-3.2-3b-instruct"
}
}'
If you have Evaluator deployed following the Demo Cluster Setup on minikube, configure the judge model by ID.
export JUDGE_MODEL='"meta/llama-3.1-8b-instruct"'
If you use a hosted model such as one from build.nvidia.com, configure the judge model with the URL and API key for the hosted model.
export JUDGE_MODEL='{
"api_endpoint": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"model_id": "meta/llama-3.2-3b-instruct",
"api_key": "<your build.nvidia.com API key>"
}
}'
curl -X POST "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"spec": {
"target": {"type": "dataset", "dataset": {"files_url": "hf://datasets/'${DATASET_ID}'"}},
"config": {
"type": "custom",
"tasks": {
"my-helpsteer2-task": {
"type": "data",
"metrics": {
"my-llm-judge-metric": {
"type": "llm-judge",
"params": {
"model": '${JUDGE_MODEL}',
"template": {
"messages": [
{"role": "system", "content": "You are an expert evaluator for answers to user queries. Your task is to assess responses to user queries based on helpfulness, relevance, accuracy, and clarity."},
{"role": "user", "content": "Calculate the following metrics for the response: User Query: {{item.prompt}} Model Response: {{item.response}} Metrics: 1. Helpfulness (0-4): How well does the response help the user? 2. Correctness (0-4): Is the information correct? 3. Coherence (0-4): Is the response logically consistent and well-structured? 4. Complexity (0-4): How sophisticated is the response? 5. Verbosity (0-4): Is the response appropriately detailed? Instructions: Assign a score from 0 (poor) to 4 (excellent) for each metric."}
]
},
"structured_output": {
"schema": {
"type": "object",
"properties": {
"helpfulness": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"correctness": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"coherence": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"complexity": {
"type": "integer",
"minimum": 0,
"maximum": 4
},
"verbosity": {
"type": "integer",
"minimum": 0,
"maximum": 4
}
},
"required": ["helpfulness", "correctness", "coherence", "complexity", "verbosity"],
"additionalProperties": false
}
},
"scores": {
"helpfulness": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "helpfulness"
}
},
"correctness": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "correctness"
}
},
"coherence": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "coherence"
}
},
"complexity": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "complexity"
}
},
"verbosity": {
"type": "integer",
"parser": {
"type": "json",
"json_path": "verbosity"
}
}
}
}
}
}
}
}
}
}
}'
Jobs are uniquely identified by an id, which appears in the JSON response. It is helpful to store the job ID in a variable for the subsequent steps:
export JOB_ID=<id returned in the JSON response>
echo "Job ID: $JOB_ID"
3. Get the Status of Your Evaluation Job#
To get the status of the evaluation job that you submitted in the previous step, use the following code.
job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
print(f"Job status: {job_status}")
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/status" \
-H 'accept: application/json'
You receive a response similar to the following, which contains the status of each step and the percentage of progress completed so far. For more information, refer to Get Evaluation Job Status.
{
"job_id": "job-uddkbn85bw7fnuyf1vq626",
"status": "active",
"status_details": {},
"error_details": null,
"steps": [
{
"name": "evaluation",
"status": "created",
"status_details": {},
"error_details": {},
"tasks": []
},
{
"name": "target-dataset",
"status": "completed",
"tasks": [
{
"id": "733048a672454e17ba31daa5de9f8029",
"status": "completed",
"status_details": {},
"error_details": {},
"error_stack": null
}
]
}
]
}
Monitor the job until it completes.
import time

job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
while job_status.status in ("active", "pending", "created"):
    job_status = client.v2.evaluation.jobs.status.retrieve(job.id)
    time.sleep(10)
print(job_status)
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/status" \
-H 'accept: application/json'
{
"job_id": "job-uddkbn85bw7fnuyf1vq626",
"status": "completed",
"status_details": {
"samples_processed": 1329,
"progress": 100
},
"error_details": null,
"steps": [
{
"name": "results",
"status": "completed"
},
{
"name": "evaluation",
"status": "completed"
}
]
}
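Once the polling loop in the Python example exits, the job is in a terminal state. A short guard like the following sketch makes a failed run obvious before you fetch results; status values other than completed are an assumption and may vary by release.

# Stop if the job reached a terminal state other than "completed".
# Terminal failure status values are an assumption and may vary by release.
if job_status.status != "completed":
    raise RuntimeError(f"Evaluation job {job.id} ended with status: {job_status.status}")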
4. View Evaluation Job Results#
Once the job has completed successfully, you can examine the results of the evaluation to analyze the LLM judge’s assessments.
As JSON#
View results as JSON.
# v2 - Get structured evaluation results
results = client.v2.evaluation.jobs.results.evaluation_results.retrieve(job_id)
print(results.model_dump_json(indent=2, exclude_none=True))
# v2 - List available result types
available_results = client.v2.evaluation.jobs.results.list(job_id)
print(f"Available results: {[r.result_name for r in available_results.data]}")
# Get structured evaluation results
curl -X "GET" "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results/evaluation-results/download" \
-H 'accept: application/json'
# List available result types
curl -X GET "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results" \
-H 'accept: application/json'
As Download#
Download results to a local file.
# v2 - Download job artifacts (includes logs, intermediate files, etc.)
artifacts_zip = client.v2.evaluation.jobs.results.artifacts.retrieve(job_id)
artifacts_zip.write_to_file("evaluation_artifacts.zip")
print("Saved artifacts to evaluation_artifacts.zip")
# v2 - Download evaluation results separately
eval_results = client.v2.evaluation.jobs.results.evaluation_results.retrieve(job_id)
with open("evaluation_results.json", "w") as f:
f.write(eval_results.model_dump_json(indent=2, exclude_none=True))
print("Saved results to evaluation_results.json")
# Download job artifacts
curl -X GET "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results/artifacts/download" \
-H 'accept: application/zip' \
-o evaluation_artifacts.zip
# Download evaluation results
curl -X GET "${NEMO_MICROSERVICES_BASE_URL}/v2/evaluation/jobs/${JOB_ID}/results/evaluation-results/download" \
-H 'accept: application/json'
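Because HelpSteer2 ships human-annotated scores for the same five attributes, you can sanity-check the judge against them, as noted in step 1. The sketch below only recomputes the mean human labels for the 30 evaluated rows; compare those means with the aggregate judge scores reported in evaluation_results.json (no assumption is made here about the results file's schema).

import pandas as pd

# Reload the same 30 HelpSteer2 rows, this time keeping the human-annotated labels.
df = pd.read_json("hf://datasets/nvidia/HelpSteer2/train.jsonl.gz", lines=True).head(30)
label_cols = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Mean human scores (0-4) for the samples sent to the judge.
# Compare these against the aggregate scores in evaluation_results.json.
print(df[label_cols].mean().round(2))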