Evaluate with LLM-as-a-Judge#
Use another LLM to evaluate outputs from your model or dataset with flexible scoring criteria. LLM-as-a-Judge is ideal for evaluating creative, complex, or domain-specific tasks where traditional metrics fall short.
Overview#
LLM-as-a-Judge evaluation works by sending your data to a “judge” LLM that scores responses according to criteria you define. You can evaluate:
Model outputs: Score how well a model responds to prompts
Pre-generated data: Evaluate existing question-answer pairs or conversations
Custom criteria: Define your own scoring rubrics or numerical ranges
NeMo Evaluator supports two evaluation modes:
| Mode | Use Case | Response |
|---|---|---|
| Live Evaluation | Rapid prototyping, developing metrics, testing configurations. Dataset is limited to 10 rows. | Immediate (synchronous) |
| Job Evaluation | Production workloads, full datasets | Asynchronous (poll for completion) |
Prerequisites#
Before running LLM-as-a-Judge evaluations:
Workspace: Create a workspace first. All resources (metrics, secrets, jobs) are scoped to a workspace.
Judge LLM endpoint: Have access to an LLM that will serve as your judge (for example, a NIM endpoint or an OpenAI-compatible API)
API key (if required): If your judge endpoint requires authentication, create a secret to store the API key. The secret must be in the same workspace where you run evaluations.
Initialize the SDK:
import os
from nemo_platform import NeMoPlatform
client = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
Live Evaluation#
Tip
The model field accepts both inline model definitions and model references
(for example, "my-workspace/my-model"). Refer to Model Configuration for details.
Live evaluation is designed for rapid iteration when developing and refining your evaluation metrics. Use it to quickly test different judge prompts, scoring criteria, and data formats before committing to a full evaluation job. Results return immediately, making it easy to experiment and debug.
Basic Example with Range Scores#
Evaluate responses using numerical range scores (for example, 1-5 scale):
result = client.evaluation.metrics.evaluate(
dataset={
"rows": [
{
"input": "What is the capital of France?",
"output": "The capital of France is Paris."
},
{
"input": "How do I make coffee?",
"output": "Boil water, add grounds to filter, pour water over grounds, let it drip."
}
]
},
metric={
"type": "llm-judge",
"model": {
"url": "<judge-nim-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
"scores": [
{
"name": "helpfulness",
"description": "How helpful is the response (1=not helpful, 5=extremely helpful)",
"minimum": 1,
"maximum": 5,
},
{
"name": "accuracy",
"description": "How accurate is the response (1=incorrect, 5=completely accurate)",
"minimum": 1,
"maximum": 5,
}
],
"prompt_template": {
"messages": [
{
"role": "system",
"content": "You are an expert judge. Rate each response on two dimensions (1-5 scale):\n- helpfulness: How useful is the response?\n- accuracy: How factually correct is the response?\n\nRespond with JSON: {\"helpfulness\": <1-5>, \"accuracy\": <1-5>}"
},
{
"role": "user",
"content": "Question: {{input}}\n\nResponse: {{output}}\n\nRate this response."
}
]
}
}
)
# Aggregate statistics across all rows
print(f"Metric: {result.metric}")
for score in result.aggregate_scores:
print(f" {score.name}: mean={score.mean:.2f}, count={score.count}")
# Per-row scores - useful for debugging and understanding individual results
print("\nPer-row scores:")
for row_result in result.row_scores:
print(f" Row {row_result.index}: {row_result.scores}")
The response includes both aggregate scores (statistics across all rows) and row scores (individual scores per row). For live evaluations, row scores are particularly valuable as they let you inspect exactly how the judge scored each input, making it easy to debug your metric configuration.
Example Response
# result.model_dump()
{
"metric": "quality-judge",
"aggregate_scores": [
{
"name": "helpfulness",
"count": 2,
"mean": 4.5,
"min": 4.0,
"max": 5.0
},
{
"name": "accuracy",
"count": 2,
"mean": 4.0,
"min": 3.0,
"max": 5.0
}
],
"row_scores": [
{
"index": 0,
"row": {"input": "What is the capital of France?", "output": "The capital of France is Paris."},
"scores": {"helpfulness": 5, "accuracy": 5}
},
{
"index": 1,
"row": {"input": "How do I make coffee?", "output": "Boil water, add grounds..."},
"scores": {"helpfulness": 4, "accuracy": 3}
}
]
}
Example with Rubric Scores#
Use rubric scores when you want categorical labels with explicit descriptions:
result = client.evaluation.metrics.evaluate(
dataset={
"rows": [
{"input": "Tell me a joke", "output": "Why did the chicken cross the road? To get to the other side!"},
{"input": "Explain quantum physics", "output": "I don't know."}
]
},
metric={
"type": "llm-judge",
"model": {
"url": "<judge-nim-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
"scores": [
{
"name": "quality",
"description": "Overall quality of the response",
"rubric": [
{"label": "poor", "value": 0, "description": "Response is unhelpful or incorrect"},
{"label": "acceptable", "value": 1, "description": "Response is partially correct"},
{"label": "good", "value": 2, "description": "Response is correct and helpful"},
{"label": "excellent", "value": 3, "description": "Response is comprehensive and insightful"}
]
},
{
"name": "completeness",
"description": "How complete is the answer",
"rubric": [
{"label": "incomplete", "value": 0, "description": "Missing key information"},
{"label": "partial", "value": 1, "description": "Covers main points but lacks detail"},
{"label": "complete", "value": 2, "description": "Fully addresses the question"}
]
}
],
"prompt_template": {
"messages": [
{
"role": "system",
"content": "You are an expert judge. Rate each response:\n- quality: poor | acceptable | good | excellent\n- completeness: incomplete | partial | complete\n\nRespond with JSON: {\"quality\": \"<label>\", \"completeness\": \"<label>\"}"
},
{
"role": "user",
"content": "Question: {{input}}\n\nResponse: {{output}}\n\nRate this response."
}
]
}
}
)
Example Response with Rubric Distribution
# Request with additional aggregate fields
result = client.evaluation.metrics.evaluate(
dataset=dataset,
metric=metric,
aggregate_fields=["rubric_distribution", "mode_category"]
)
# result.aggregate_scores[0] for "quality"
{
"name": "quality",
"count": 2,
"mean": 1.5,
"rubric_distribution": [
{"label": "poor", "value": 0, "count": 1},
{"label": "acceptable", "value": 1, "count": 0},
{"label": "good", "value": 2, "count": 0},
{"label": "excellent", "value": 3, "count": 1}
],
"mode_category": "poor"
}
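If you need these statistics outside the API, the same distribution and mode can be recomputed client-side from per-row labels. A minimal sketch; the `labels` list is hypothetical data standing in for the judge's per-row "quality" labels from the example response above:

```python
from collections import Counter

# Hypothetical per-row labels for the "quality" score, matching the
# rubric_distribution in the example response above (one "poor", one "excellent").
labels = ["poor", "excellent"]

distribution = Counter(labels)                     # label -> count
mode_category = distribution.most_common(1)[0][0]  # most frequent label

print(dict(distribution))  # {'poor': 1, 'excellent': 1}
print(mode_category)       # 'poor' (first-seen label wins ties)
```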
Custom Aggregate Fields#
By default, aggregate scores include count, mean, min, and max. Request additional statistics:
result = client.evaluation.metrics.evaluate(
dataset=dataset,
metric=metric,
aggregate_fields=["std_dev", "variance", "percentiles", "histogram"]
)
# Access extended statistics
for score in result.aggregate_scores:
print(f"{score.name}:")
print(f" Mean: {score.mean:.3f}")
print(f" Std Dev: {score.std_dev:.3f}")
print(f" Variance: {score.variance:.3f}")
if score.percentiles:
print(f" Median (p50): {score.percentiles.p50:.3f}")
print(f" p90: {score.percentiles.p90:.3f}")
Job-Based Evaluation#
For larger datasets or production workloads, use job-based evaluation. Jobs run asynchronously and support datasets of any size.
Create an Evaluation Job#
Evaluate pre-generated outputs stored in a dataset:
job = client.evaluation.metric_jobs.create(
spec={
"metric": {
"type": "llm-judge",
"model": {
"url": "<judge-nim-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
"scores": [
{
"name": "quality",
"description": "Overall quality of the response",
"rubric": [
{"label": "poor", "value": 0, "description": "Response is unhelpful"},
{"label": "good", "value": 1, "description": "Response is helpful"},
{"label": "excellent", "value": 2, "description": "Response is exceptional"}
],
"parser": {"type": "json", "json_path": "quality"}
}
],
"prompt_template": {
"messages": [
{
"role": "system",
"content": "Rate the response quality: poor, good, or excellent.\nRespond with JSON: {\"quality\": \"<label>\"}"
},
{
"role": "user",
"content": "Question: {{input}}\n\nResponse: {{output}}"
}
]
}
},
"dataset": {
"files_url": "hf://datasets/<workspace>/<dataset-name>"
},
"params": {
"parallelism": 16,
"limit_samples": 100 # Optional: limit for testing
}
}
)
print(f"Job created: {job.name} ({job.id})")
Reference a previously created metric by its unique workspace and metric name:
# First, create and store the metric
client.evaluation.metrics.create(
name="my-quality-judge",
type="llm-judge",
model={
"url": "<judge-nim-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
scores=[
{
"name": "quality",
"minimum": 1,
"maximum": 5,
"parser": {"type": "json", "json_path": "quality"}
}
],
prompt_template={
"messages": [
{"role": "system", "content": "Rate quality 1-5. Respond: {\"quality\": <1-5>}"},
{"role": "user", "content": "{{input}}\n{{output}}"}
]
}
)
# Then use it in a job by metric reference (workspace/metric-name)
job = client.evaluation.metric_jobs.create(
spec={
"metric": "default/my-quality-judge",
"dataset": {"files_url": "hf://datasets/<workspace>/<dataset-name>"},
"params": {"parallelism": 16}
}
)
Use inline rows for quick testing before running on full datasets:
job = client.evaluation.metric_jobs.create(
spec={
"metric": {
"type": "llm-judge",
"model": {
"url": "<judge-nim-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
"scores": [
{
"name": "informativeness",
"rubric": [
{"label": "uninformative", "value": 0},
{"label": "informative", "value": 1}
],
"parser": {"type": "json", "json_path": "informativeness"}
}
],
"prompt_template": {
"messages": [
{"role": "system", "content": "Is this response informative? Reply: {\"informativeness\": \"uninformative\" or \"informative\"}"},
{"role": "user", "content": "{{output}}"}
]
}
},
"dataset": {
"rows": [
{"output": "Paris is the capital of France."},
{"output": "I don't know."}
]
}
}
)
Monitor Job Progress#
import time
while True:
job_status = client.evaluation.metric_jobs.get_status(job.name)
print(f"Status: {job_status.status}")
if job_status.status in ["completed", "error", "cancelled"]:
break
time.sleep(5)
Retrieve Results#
# List available results for the job
results_list = client.evaluation.metric_jobs.results.list(job.name)
print(f"Available results: {[r.name for r in results_list.data]}")
# Get a specific result
result = client.evaluation.metric_jobs.results.retrieve(
name="evaluation_results",
job=job.name
)
print(result.model_dump_json(indent=2, exclude_none=True))
Example Job Results
{
"aggregate_scores": [
{
"name": "quality",
"count": 100,
"mean": 1.2,
"min": 0,
"max": 2
}
]
}
Score Configuration#
LLM-as-a-Judge supports two types of scores: range scores (numerical ratings) and rubric scores (categorical classifications).
Choosing Between Range and Rubric Scores#
We recommend using rubric scores over range scores for most evaluation tasks. Classification-based rubrics (for example, pass/fail, safe/unsafe, poor/good/excellent) typically outperform numerical scoring (1-10) because a well-defined rubric:
Reduces ambiguity: Categorical labels with explicit descriptions are easier for judge models to apply consistently than numerical scales
Aligns with human reasoning: People naturally think in categories rather than precise numerical gradations
Avoids calibration issues: Numerical scores suffer from inconsistent calibration; one judge's "7" may be another's "5"
Provides actionable insights: Clear categories (for example, "needs_improvement", "acceptable", "excellent") are more actionable than abstract numbers
Produces more reliable metrics: Classification tasks yield more consistent and reproducible results across different judge models
Use range scores when:
You need fine-grained distinctions that do not map well to categories
You are measuring continuous quantities (for example, latency, word count)
Downstream analysis requires numerical operations on scores
Use rubric scores when:
You are evaluating quality dimensions (helpfulness, accuracy, safety)
Clear decision boundaries exist (pass/fail, compliant/non-compliant)
Results will guide human decisions or workflows
Range Scores#
Use range scores for numerical ratings on a continuous scale:
{
"name": "relevance",
"description": "How relevant is the response (1=irrelevant, 5=highly relevant)",
"minimum": 1,
"maximum": 5,
"parser": {"type": "json", "json_path": "relevance"}
}
Rubric Scores#
Use rubric scores for categorical evaluations with explicit criteria:
{
"name": "sentiment",
"description": "Sentiment of the response",
"rubric": [
{"label": "negative", "value": -1, "description": "Response has negative tone"},
{"label": "neutral", "value": 0, "description": "Response is neutral"},
{"label": "positive", "value": 1, "description": "Response has positive tone"}
],
"parser": {"type": "json", "json_path": "sentiment"}
}
Tip
Rubric scores use structured outputs by default, which constrains the judge model to output valid JSON. This significantly reduces parsing errors.
Score Parsers#
Configure how scores are extracted from judge responses:
| Parser Type | Use Case | Example Pattern |
|---|---|---|
| json | Judge outputs JSON (default) | {"type": "json", "json_path": "quality"} |
| regex | Extract from free-form text | {"type": "regex", "pattern": "SCORE: (\\d+)"} |
By default, range and rubric scores use the JSON parser, with the score name serving as the json_path that extracts the value.
# JSON parser (default)
"parser": {"type": "json", "json_path": "quality"}
# Regex parser (for models that do not support structured output)
"parser": {"type": "regex", "pattern": "QUALITY: (\\w+)"}
# Regex parser with method='search' (finds pattern anywhere in text)
"parser": {"type": "regex", "pattern": "SCORE: (\\d+)", "method": "search"}
Tip
Regex method options:
match (default): Matches the pattern only at the beginning of the text. Use when your prompt instructs the judge to output the score first.
search: Finds the pattern anywhere in the text. Uses the first match of the regex found in the judge output.
For example, with method: "search" and pattern SCORE: (\d+), the parser can extract the score from:
The response is accurate and well-written. SCORE: 5
This would fail with the default match method since “SCORE:” is not at the beginning. If multiple matches exist, search returns the first occurrence.
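The two methods mirror Python's own `re.match` versus `re.search` semantics, which you can verify locally. This is a standalone illustration of the parsing behavior, not a call into the platform:

```python
import re

pattern = r"SCORE: (\d+)"
text = "The response is accurate and well-written. SCORE: 5"

# search: the pattern may appear anywhere in the judge output
found = re.search(pattern, text)
print(found.group(1))  # 5

# match: the pattern must start at the very beginning of the text,
# so this judge output cannot be parsed with the default method
print(re.match(pattern, text))  # None
```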
Custom Judge Prompts#
Customize the judge prompt to match your evaluation criteria. Use Jinja2 templating to access data fields and score definitions.
Template Variables#
| Variable | Description |
|---|---|
| {{input}} | Input field from dataset row |
| {{output}} | Output field from dataset row |
| {{<field_name>}} | Any field from the dataset row |
| {{response}} | Model-generated response (when evaluating a model) |
| scores | Dictionary of score definitions |
Example: Custom Judge Template#
JUDGE_TEMPLATE = """You are an expert evaluator assessing AI assistant responses.
Evaluate the response on these criteria:
{% for score_name, score in scores.items() %}
- {{ score_name }}{% if score.description %}: {{ score.description }}{% endif %}
{% if score.rubric %}
Options: {% for r in score.rubric %}{{ r.label }}{% if not loop.last %}, {% endif %}{% endfor %}
{% endif %}
{% endfor %}
Respond with JSON containing your ratings.
"""
metric = {
"type": "llm-judge",
"model": {
"url": "<judge-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
"scores": [
{
"name": "clarity",
"description": "How clear and understandable is the response",
"rubric": [
{"label": "confusing", "value": 0, "description": "Hard to understand"},
{"label": "clear", "value": 1, "description": "Easy to understand"},
{"label": "crystal_clear", "value": 2, "description": "Exceptionally well explained"}
],
"parser": {"type": "json", "json_path": "clarity"}
}
],
"prompt_template": {
"messages": [
{"role": "system", "content": JUDGE_TEMPLATE},
{"role": "user", "content": "Question: {{input}}\n\nResponse: {{output}}"}
]
}
}
Managing Secrets for Authenticated Endpoints#
If your judge model endpoint requires an API key, store it as a secret. The secret is automatically resolved from the same workspace as your evaluation.
Create a Secret#
# Create a secret with your API key
client.secrets.create(
name="judge-api-key",
data="your-api-key-here"
)
Reference the Secret in Your Metric#
metric = {
"type": "llm-judge",
"model": {
"url": "https://api.example.com/v1",
"name": "gpt-4",
"format": "openai",
"api_key_secret": "judge-api-key" # Just the secret name
},
# ... scores and prompt_template
}
Inference Parameters#
Control judge model behavior with inference parameters:
"prompt_template": {
"messages": [...],
"temperature": 0.1, # Lower for more consistent scoring
"max_tokens": 1024, # Increase if judge needs more space
"timeout": 30, # Request timeout in seconds
"stop": ["<|end_of_text|>"] # Stop sequences
}
Note
The default max_tokens for judge models is 1024. Set a value appropriate for your judge model and its expected outputs: structured output is used by default to format the judge's response, so ensure max_tokens can accommodate the full JSON output. Incomplete JSON outputs cause parsing errors and result in NaN score values.
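To see why truncation leads to NaN, consider what happens when the judge's JSON is cut off mid-object. This is a standalone illustration using Python's json module, not platform code:

```python
import json

complete = '{"helpfulness": 4, "accuracy": 5}'
truncated = '{"helpfulness": 4, "accu'  # judge hit max_tokens mid-output

print(json.loads(complete))  # {'helpfulness': 4, 'accuracy': 5}

try:
    json.loads(truncated)
except json.JSONDecodeError:
    # No value can be recovered, so the score is recorded as NaN
    print("parse failed -> NaN score")
```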
Reasoning Model Configuration#
For reasoning-enabled models (like Nemotron), configure reasoning parameters:
metric = {
"type": "llm-judge",
"model": {
"url": "<nim-url>/v1",
"name": "nvidia/llama-3.3-nemotron-super-49b-v1",
"format": "nim"
},
# ... scores ...
"system_prompt": "detailed thinking on",
"reasoning": {
"end_token": "</think>"
},
"prompt_template": {
"messages": [...],
"temperature": 0.1,
"max_tokens": 4096
}
}
Limitations#
Judge Model Quality: Evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) typically produce more consistent results.
NaN Scores: If the judge output cannot be parsed, the score is marked as NaN. Common causes:
Insufficient max_tokens (check for "finish_reason": "length" in results)
Judge model not following output format instructions
Use structured outputs or explicit format instructions to reduce NaN rates.
Structured Output Requirement: Rubric scores require the judge model to support guided decoding. If your judge does not support this, use regex parsers with explicit format instructions.
Live Evaluation Limits: Live evaluations are limited to 10 rows. Use job-based evaluation for larger datasets.
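When you do hit NaN scores, it helps to locate the offending rows before tweaking your prompt or max_tokens. A minimal sketch, assuming per-row results shaped like the live-evaluation examples above; `find_unparsed_rows` is a hypothetical helper, not part of the SDK:

```python
import math
from types import SimpleNamespace

def find_unparsed_rows(row_scores):
    """Return (row index, score name) pairs whose value failed to parse."""
    bad = []
    for row_result in row_scores:
        for name, value in row_result.scores.items():
            if isinstance(value, float) and math.isnan(value):
                bad.append((row_result.index, name))
    return bad

# Hypothetical results: row 1's judge output could not be parsed.
rows = [
    SimpleNamespace(index=0, scores={"quality": 2.0}),
    SimpleNamespace(index=1, scores={"quality": float("nan")}),
]
print(find_unparsed_rows(rows))  # [(1, 'quality')]
```

Inspect the flagged rows' raw judge outputs to decide whether the fix is a larger max_tokens, a stricter output-format instruction, or a different parser.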
See also
Model Configuration - Inline models vs model references
Evaluation Results - Understanding and downloading results
Agentic Evaluation - Evaluate agent workflows
RAG Evaluation - Evaluate retrieval-augmented generation