Evaluate Response Quality with LLM-as-a-Judge
LLM-as-a-Judge is a technique where you use a large language model to evaluate the outputs of another model. Instead of relying solely on automated metrics or manual review, you prompt a capable LLM to score responses based on criteria you define, such as helpfulness, accuracy, or tone.
This tutorial shows you how to build, validate, and iterate on LLM judge metrics using HelpSteer2, NVIDIA’s human-annotated dataset. By comparing your judge’s scores against human annotations, you can measure how well your judge aligns with human judgment and improve it through prompt iteration.
What you will learn:
- Create and test LLM judge metrics
- Load rows from a registered fileset into evaluator SDK requests
- Validate judge accuracy against human annotations
- Iterate on prompts to improve correlation with humans
- Visualize score distributions with histograms and percentiles
Quality dimensions evaluated:
This tutorial takes approximately 20 minutes to complete.
Prerequisites
- Install and start NeMo Platform using the Setup guide.
For PyPI installs, use the nemo-platform wrapper package. If you are working from a source checkout, run make bootstrap from the repository root instead.
- Install the Python libraries used in this tutorial:
Key Concepts
Before you begin, here is a quick overview of the resources you will use:
- Evaluator resource: The plugin SDK resource mounted at client.evaluator. Use it to run metrics locally or submit durable platform jobs.
- Metric: An inline Python object that defines how to score model outputs. In this tutorial, we create LLM judge metrics that prompt a model to rate responses.
- Fileset: A dataset registered with NeMo Platform. The evaluator plugin SDK accepts fileset references directly, so this tutorial passes the registered HelpSteer2 split to evaluations as a FilesetRef.
- Workspace: A container that isolates your resources. Secrets, filesets, and jobs belong to a workspace.
- Job: A durable remote platform task created with evaluator.submit(...).
- Evaluation: The process of scoring model outputs using one or more metrics. Use evaluator.run(...) for local in-process execution and evaluator.submit(...) for durable jobs.
1. Initialize the SDK and Create a Workspace
Create a dedicated workspace for this tutorial to keep your resources isolated from other projects:
2. Create Secrets
Create a platform secret for remote jobs and keep the same API key available in your local environment for local runs.
Get your NVIDIA_API_KEY to access the models on NVIDIA Build:
- NVIDIA Build API Key
- Steps: click “Generate API Key”
Local runs resolve Model(api_key_secret="NVIDIA_API_KEY") from your local environment. Remote jobs run in the platform job runtime, so they use the platform secret name created above.
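Because local runs read the key from the environment, it can help to fail early with a clear message when the variable is missing. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper for illustration, not part of the NeMo Platform SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical pre-flight helper; the SDK performs its own lookup later.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in the shell that launches "
            "your notebook or Python process before running local evaluations."
        )
    return value
```

Calling `require_env("NVIDIA_API_KEY")` at the top of the notebook surfaces a missing key before any judge request is made.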
3. Configure the Judge Models
Import the evaluator SDK types and configure judge models for local and remote execution. The judge is the LLM that evaluates responses.
This tutorial uses nvidia/nemotron-3-nano-30b-a3b from NVIDIA Build.
When using hosted APIs, keep parallelism low to avoid rate-limit errors. You can increase it for locally deployed models.
4. Register and Load the Dataset
Register HelpSteer2 as a fileset. The fileset remains the durable dataset registration.
HelpSteer2 contains prompt-response pairs with human ratings for helpfulness, correctness, coherence, complexity, and verbosity. Each rating is on a 0-4 scale. We’ll use these human scores as ground truth to validate our LLM judge.
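To make the row shape concrete, here is a small sketch that checks one row against the description above. It assumes the text columns are named `prompt` and `response` and that the five ratings are stored as integers under their dimension names. The example row is made up for illustration and is not taken from the dataset.

```python
RATING_FIELDS = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one HelpSteer2-style row."""
    problems = []
    # Assumed text columns: "prompt" and "response".
    for field in ("prompt", "response"):
        if not isinstance(row.get(field), str) or not row[field].strip():
            problems.append(f"missing or empty text field: {field}")
    # Each human rating should be an integer on the 0-4 scale.
    for field in RATING_FIELDS:
        value = row.get(field)
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 4:
            problems.append(f"{field} should be an integer from 0 to 4, got {value!r}")
    return problems


# Made-up row for illustration only.
example_row = {
    "prompt": "How do I reverse a list in Python?",
    "response": "Use my_list.reverse() in place, or my_list[::-1] for a copy.",
    "helpfulness": 4,
    "correctness": 4,
    "coherence": 4,
    "complexity": 1,
    "verbosity": 1,
}
print(validate_row(example_row))
```

Running a check like this on a handful of rows before building metrics catches column-name mismatches early.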
Create a fileset reference for the validation split. The evaluator SDK resolves the selected fileset path when it runs, so you do not need to download the split into memory yourself:
5. Create an Initial Helpfulness Metric
Now let’s create our first LLM judge metric. We’ll start with a simple prompt for the helpfulness dimension, then improve it based on validation results.
A metric definition includes:
- Model: Which LLM to use as the judge
- Prompt template: Instructions for the judge, with {{item...}} fields filled from your dataset rows
- Score definition: The name, scale, and how to parse the judge’s output
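To illustrate what filling template fields from a dataset row means, here is a rough standalone sketch. It assumes placeholders of the form `{{item.<field>}}`. The evaluator SDK does its own rendering, so `render_prompt` is a hypothetical stand-in and may not match the SDK's exact syntax or behavior.

```python
import re

# Matches {{item.field}} with optional spaces inside the braces (assumed syntax).
_PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_prompt(template: str, item: dict) -> str:
    """Replace each {{item.<field>}} placeholder with the row's value."""

    def substitute(match: re.Match) -> str:
        field = match.group(1)
        if field not in item:
            # Fail loudly so a typo in the template is not sent to the judge.
            raise KeyError(f"template references missing field: {field}")
        return str(item[field])

    return _PLACEHOLDER.sub(substitute, template)


template = "Rate the response.\nPrompt: {{item.prompt}}\nResponse: {{item.response}}"
print(render_prompt(template, {"prompt": "What is 2 + 2?", "response": "4"}))
```

Rendering one or two rows by hand like this is a quick way to read the exact text the judge will see.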
Use low temperature for evaluation tasks. Low or zero temperature reduces variability in the judge’s output, which matters for reproducible scoring. The same response is then far more likely to get the same score across runs, making it easier to validate your judge and compare prompt versions.
6. Test with Local Evaluation
Before running a durable job, test your metric with a few examples using evaluator.run(...). This runs locally in-process and returns results immediately, which is useful for prompt iteration.
Expected output:
The first response is comprehensive and helpful, while the second is unhelpfully brief. If your judge produces reasonable scores here, you’re ready to run a larger evaluation.
7. Run Evaluation and Validate Against Ground Truth
Now let’s evaluate a larger sample and compare the judge’s predictions against human annotations. This tells us how well our judge aligns with human judgment.
Monitor the job and download the result when it completes:
Extract Scores
Row-level results contain the original dataset item, captured requests, and extracted metric scores:
Calculate Correlation with Human Annotations
To measure judge quality, we compare its scores against human annotations using three metrics:
If your V1 results show low correlation, that is expected. The basic prompt often produces scores that don’t align well with human judgment. We’ll improve this in the next step.
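The comparison can be sketched with the standard library alone. The three metrics are not listed in the text above, so this sketch assumes Pearson r (referenced under Troubleshooting) together with Spearman rank correlation and mean absolute error as common choices. The score lists are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side never varies
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to tie-averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def mean_absolute_error(xs, ys):
    """Average absolute gap between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up paired scores for illustration only.
human = [4, 3, 0, 2, 4, 1, 3, 2]
judge = [4, 4, 1, 3, 4, 2, 3, 3]
print(f"Pearson r:  {pearson(human, judge):.3f}")
print(f"Spearman:   {spearman(human, judge):.3f}")
print(f"MAE:        {mean_absolute_error(human, judge):.3f}")
```

If SciPy is installed, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` compute the same two correlations and also report p-values.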
8. Improve the Prompt and Compare
The basic prompt may lack specificity. Let’s create an improved version that aligns with HelpSteer2’s annotation guidelines, which define helpfulness as “the overall helpfulness of the response to the prompt.”
Run evaluation with the improved prompt:
Compare Both Versions
More complex prompts do not always perform better. The best prompt depends on the model, task, and how well it aligns with the original annotation guidelines. If your V1 prompt outperforms V2, that’s a valid result. Use what works best for your use case.
9. Visualize Score Distributions
Visualizations help you understand how your judge differs from humans. Does it tend to score higher? Lower? Cluster around certain values?
If you run this tutorial as a headless script instead of a notebook, configure a non-interactive Matplotlib backend before importing pyplot:
In that mode, save the figure with plt.savefig(...) and skip plt.show().
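A minimal headless sketch is shown below. It selects the `Agg` backend before importing pyplot and writes the chart to a file. The function name, output path, and bin edges are illustrative choices, with bins centered on the integer scores 0 through 4.

```python
import matplotlib

matplotlib.use("Agg")  # select a non-interactive backend before importing pyplot

import matplotlib.pyplot as plt


def save_score_histogram(human, judge, path):
    """Write a side-by-side histogram of human and judge scores to `path`."""
    bins = [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5]  # one bin per integer score 0-4
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist([human, judge], bins=bins, label=["Human", "Judge"])
    ax.set_xlabel("Helpfulness score")
    ax.set_ylabel("Number of responses")
    ax.set_xticks(range(5))
    ax.legend()
    fig.tight_layout()
    fig.savefig(path)  # no plt.show() in headless mode
    plt.close(fig)  # release the figure's memory
    return path
```

For example, `save_score_histogram(human_scores, judge_scores, "score_distributions.png")` writes a PNG next to the script, where the two score lists are whatever you extracted from the evaluation results.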
The code above generates a chart showing score distributions. Your results will vary depending on the model used and the specific samples evaluated.
Percentile Analysis
Percentiles reveal how scores are distributed across the range:
If your judge’s distribution looks very different from humans, such as always scoring 3-4 while humans use the full range, adjust your prompt to calibrate the scoring criteria.
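As a sketch of the percentile comparison, the helper below uses linear interpolation between sorted values, which is one common percentile definition. The chosen percentiles and the score lists are assumptions for illustration.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted values."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def percentile_table(human, judge, qs=(10, 25, 50, 75, 90)):
    """Return (q, human_percentile, judge_percentile) rows for side-by-side reading."""
    return [(q, percentile(human, q), percentile(judge, q)) for q in qs]


# Made-up scores: the judge here never goes below 3, unlike the humans.
human = [0, 1, 2, 3, 4]
judge = [3, 3, 3, 4, 4]
for q, h, j in percentile_table(human, judge):
    print(f"p{q:<3} human={h:.2f} judge={j:.2f}")
```

A judge whose lower percentiles sit well above the human ones, as in this made-up data, is compressing its scores toward the top of the scale.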
10. Clean Up
To delete the workspace, you must first delete all resources within it. Delete jobs first, then filesets, secrets, and the workspace.
Workspaces cannot be deleted while they contain resources. The code above deletes resources in dependency order.
Troubleshooting
Connection refused or “Cannot connect to host”
The platform isn’t running. Start it with:
Wait for all services to be healthy before running the tutorial. Check health status with:
Workspace already exists
If you’re re-running the tutorial, delete the existing workspace first:
Local NVIDIA Build authentication fails
Local evaluator runs resolve api_key_secret from environment variables. For NVIDIA Build, make sure NVIDIA_API_KEY is exported in the environment where the notebook or Python process is running:
The local model should use the environment variable name:
Remote jobs should use the platform secret name instead:
Job stuck in “pending” or “running” for too long
Check the job status from the job resource:
Remote jobs can report progress: 100.0 and all samples processed before the platform job status changes to completed. Wait for job.wait_until_done() to return before downloading results or treating the job as terminal.
Common causes:
- Judge model not deployed or unreachable
- Remote job is using a missing platform secret
- Rate limiting from external APIs
Low correlation with human annotations
If your Pearson r is below 0.4:
- Refine your prompt: Add more specific scoring criteria and examples
- Check score distribution: If the judge clusters around one value, the prompt may be too vague
- Try a different model: Larger judge models often correlate better with humans
- Verify data alignment: Ensure ground truth rows match evaluation results
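Two of these checks are easy to script. The sketch below measures how strongly the judge clusters on one score and pairs ground-truth rows with result rows by a shared key. The `id` key and the row dictionaries are hypothetical; substitute whatever field uniquely identifies a row in your data.

```python
from collections import Counter


def mode_share(scores):
    """Return the most common score and the fraction of rows that received it."""
    value, count = Counter(scores).most_common(1)[0]
    return value, count / len(scores)


def align_by_key(ground_truth, results, key="id"):
    """Pair rows that share `key`; also report result keys with no ground truth."""
    truth_by_key = {row[key]: row for row in ground_truth}
    pairs, missing = [], []
    for row in results:
        match = truth_by_key.get(row[key])
        if match is None:
            missing.append(row[key])
        else:
            pairs.append((match, row))
    return pairs, missing


# Made-up rows for illustration only.
truth = [{"id": "a", "helpfulness": 4}, {"id": "b", "helpfulness": 1}]
results = [{"id": "b", "score": 3}, {"id": "c", "score": 3}]
pairs, missing = align_by_key(truth, results)
print("paired:", [(t["id"], t["helpfulness"], r["score"]) for t, r in pairs])
print("unmatched result keys:", missing)
print("mode share:", mode_share([r["score"] for r in results]))
```

Pairing by key rather than by list position guards against silently comparing the wrong rows when results come back in a different order.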
JSON parsing errors in scores
If scores show None or the job fails with parsing errors:
- Verify the prompt explicitly asks for JSON output
- Check that json_path in the parser matches the key in your expected JSON
- Lower the temperature to reduce malformed outputs
- Add “Respond with JSON only” to your system prompt
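When debugging parse failures, it can help to run raw judge outputs through a small standalone check. The sketch below is a diagnostic, not the SDK's parser. It assumes the judge is asked for a JSON object with a `score` key on the 0-4 scale, and `extract_score` is a hypothetical helper.

```python
import json
import re


def extract_score(raw_output, key="score", low=0, high=4):
    """Pull a numeric score out of a judge reply, or return None if unusable."""
    # Take everything from the first "{" to the last "}", which tolerates
    # surrounding chatter or Markdown fences around a single JSON object.
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    value = payload.get(key) if isinstance(payload, dict) else None
    # Reject missing keys, strings, and booleans (bool is a subclass of int).
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return value if low <= value <= high else None


samples = [
    '{"score": 3}',
    'Sure! Here is my rating:\n```json\n{"score": 2}\n```',
    "I would give this a 3 out of 4.",
    '{"rating": 3}',
]
for text in samples:
    print(repr(text[:30]), "->", extract_score(text))
```

Outputs that come back as None here point to the failure mode: no JSON at all, a key that does not match what the parser expects, or a value outside the scale.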
Hugging Face dataset access issues
For gated or private datasets, create a secret with your Hugging Face token:
Then reference it in the fileset:
Summary
In this tutorial, you learned how to:
- Create LLM judge metrics that prompt a model to score responses
- Use registered fileset references for plugin SDK execution
- Test quickly with local evaluation before running durable jobs
- Validate against ground truth by comparing with human annotations
- Iterate on prompts to improve correlation
- Visualize distributions to understand scoring patterns
Key takeaway: Prompt engineering matters for judge accuracy. Always validate your judge against human-labeled data when available, and iterate on your prompts to maximize alignment with human judgment.
Next Steps
- Experiment with rubric scores: Use categorical rubrics instead of numeric ranges for more interpretable criteria
- Try different judge models: Larger models often correlate better with human judgment
- Explore other evaluation types: RAG evaluation or agentic evaluation
Related
- LLM-as-a-Judge Reference - Complete guide to judge configuration
- SDK Resources - Evaluator plugin SDK resource reference
- Manage Metrics - Using evaluator SDK metric objects