Evaluate with LLM-as-a-Judge#

Use another LLM to evaluate outputs from your model or dataset with flexible scoring criteria. LLM-as-a-Judge is ideal for evaluating creative, complex, or domain-specific tasks where traditional metrics fall short.

Overview#

LLM-as-a-Judge evaluation works by sending your data to a “judge” LLM that scores responses according to criteria you define. You can evaluate:

  • Model outputs: Score how well a model responds to prompts

  • Pre-generated data: Evaluate existing question-answer pairs or conversations

  • Custom criteria: Define your own scoring rubrics or numerical ranges

NeMo Evaluator supports two evaluation modes:

| Mode | Use Case | Response |
|------|----------|----------|
| Live Evaluation | Rapid prototyping, developing metrics, testing configurations. Dataset is limited to 10 rows. | Immediate (synchronous) |
| Job Evaluation | Production workloads, full datasets | Async (poll for completion) |

Prerequisites#

Before running LLM-as-a-Judge evaluations:

  1. Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.

  2. Judge LLM endpoint: Have access to an LLM that will serve as your judge (for example, a NIM endpoint or OpenAI-compatible API).

  3. API key (if required): If your judge endpoint requires authentication, create a secret to store the API key. The secret must be in the same workspace where you run evaluations.

  4. Initialize the SDK:

import os
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Live Evaluation#

Tip

The model field accepts both inline model definitions and model references (for example, "my-workspace/my-model"). Refer to Model Configuration for details.

Live evaluation is designed for rapid iteration when developing and refining your evaluation metrics. Use it to quickly test different judge prompts, scoring criteria, and data formats before committing to a full evaluation job. Results return immediately, making it easy to experiment and debug.
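If your judge model is already registered on the platform, the model field in any of the examples that follow can be a reference string instead of an inline definition (a minimal sketch; "my-workspace/my-model" is the placeholder reference from the tip above):

metric = {
    "type": "llm-judge",
    # Model reference in the form "<workspace>/<model-name>" instead of an inline url/name/format block
    "model": "my-workspace/my-model",
    # ... scores and prompt_template as in the examples below
}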

Basic Example with Range Scores#

Evaluate responses using numerical range scores (for example, 1-5 scale):

result = client.evaluation.metrics.evaluate(
    dataset={
        "rows": [
            {
                "input": "What is the capital of France?",
                "output": "The capital of France is Paris."
            },
            {
                "input": "How do I make coffee?",
                "output": "Boil water, add grounds to filter, pour water over grounds, let it drip."
            }
        ]
    },
    metric={
        "type": "llm-judge",
        "model": {
            "url": "<judge-nim-url>/v1",
            "name": "meta/llama-3.1-70b-instruct",
            "format": "nim"
        },
        "scores": [
            {
                "name": "helpfulness",
                "description": "How helpful is the response (1=not helpful, 5=extremely helpful)",
                "minimum": 1,
                "maximum": 5,
            },
            {
                "name": "accuracy",
                "description": "How accurate is the response (1=incorrect, 5=completely accurate)",
                "minimum": 1,
                "maximum": 5,
            }
        ],
        "prompt_template": {
            "messages": [
                {
                    "role": "system",
                    "content": "You are an expert judge. Rate each response on two dimensions (1-5 scale):\n- helpfulness: How useful is the response?\n- accuracy: How factually correct is the response?\n\nRespond with JSON: {\"helpfulness\": <1-5>, \"accuracy\": <1-5>}"
                },
                {
                    "role": "user",
                    "content": "Question: {{input}}\n\nResponse: {{output}}\n\nRate this response."
                }
            ]
        }
    }
)

# Aggregate statistics across all rows
print(f"Metric: {result.metric}")
for score in result.aggregate_scores:
    print(f"  {score.name}: mean={score.mean:.2f}, count={score.count}")

# Per-row scores - useful for debugging and understanding individual results
print("\nPer-row scores:")
for row_result in result.row_scores:
    print(f"  Row {row_result.index}: {row_result.scores}")

The response includes both aggregate scores (statistics across all rows) and row scores (individual scores per row). For live evaluations, row scores are particularly valuable as they let you inspect exactly how the judge scored each input, making it easy to debug your metric configuration.

Example Response
# result.model_dump()
{
    "metric": "quality-judge",
    "aggregate_scores": [
        {
            "name": "helpfulness",
            "count": 2,
            "mean": 4.5,
            "min": 4.0,
            "max": 5.0
        },
        {
            "name": "accuracy",
            "count": 2,
            "mean": 4.0,
            "min": 3.0,
            "max": 5.0
        }
    ],
    "row_scores": [
        {
            "index": 0,
            "row": {"input": "What is the capital of France?", "output": "The capital of France is Paris."},
            "scores": {"helpfulness": 5, "accuracy": 5}
        },
        {
            "index": 1,
            "row": {"input": "How do I make coffee?", "output": "Boil water, add grounds..."},
            "scores": {"helpfulness": 4, "accuracy": 3}
        }
    ]
}

Example with Rubric Scores#

Use rubric scores when you want categorical labels with explicit descriptions:

result = client.evaluation.metrics.evaluate(
    dataset={
        "rows": [
            {"input": "Tell me a joke", "output": "Why did the chicken cross the road? To get to the other side!"},
            {"input": "Explain quantum physics", "output": "I don't know."}
        ]
    },
    metric={
        "type": "llm-judge",
        "model": {
            "url": "<judge-nim-url>/v1",
            "name": "meta/llama-3.1-70b-instruct",
            "format": "nim"
        },
        "scores": [
            {
                "name": "quality",
                "description": "Overall quality of the response",
                "rubric": [
                    {"label": "poor", "value": 0, "description": "Response is unhelpful or incorrect"},
                    {"label": "acceptable", "value": 1, "description": "Response is partially correct"},
                    {"label": "good", "value": 2, "description": "Response is correct and helpful"},
                    {"label": "excellent", "value": 3, "description": "Response is comprehensive and insightful"}
                ]
            },
            {
                "name": "completeness",
                "description": "How complete is the answer",
                "rubric": [
                    {"label": "incomplete", "value": 0, "description": "Missing key information"},
                    {"label": "partial", "value": 1, "description": "Covers main points but lacks detail"},
                    {"label": "complete", "value": 2, "description": "Fully addresses the question"}
                ]
            }
        ],
        "prompt_template": {
            "messages": [
                {
                    "role": "system",
                    "content": "You are an expert judge. Rate each response:\n- quality: poor | acceptable | good | excellent\n- completeness: incomplete | partial | complete\n\nRespond with JSON: {\"quality\": \"<label>\", \"completeness\": \"<label>\"}"
                },
                {
                    "role": "user",
                    "content": "Question: {{input}}\n\nResponse: {{output}}\n\nRate this response."
                }
            ]
        }
    }
)
Example Response with Rubric Distribution
# Request with additional aggregate fields
result = client.evaluation.metrics.evaluate(
    dataset=dataset,
    metric=metric,
    aggregate_fields=["rubric_distribution", "mode_category"]
)

# result.aggregate_scores[0] for "quality"
{
    "name": "quality",
    "count": 2,
    "mean": 1.5,
    "rubric_distribution": [
        {"label": "poor", "value": 0, "count": 1},
        {"label": "acceptable", "value": 1, "count": 0},
        {"label": "good", "value": 2, "count": 0},
        {"label": "excellent", "value": 3, "count": 1}
    ],
    "mode_category": "poor"
}
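
To work with the rubric distribution programmatically rather than reading the raw dump, you can iterate over the dumped result (a minimal sketch; it assumes result comes from the rubric example above with the extra aggregate fields requested, and that the full dump has the same shape as the Example Response earlier, with aggregate score entries like the one above):

# Walk the dumped structure (same shape as the example entries shown above).
for agg in result.model_dump()["aggregate_scores"]:
    if agg.get("rubric_distribution"):
        print(f"{agg['name']} (mode: {agg.get('mode_category')})")
        for bucket in agg["rubric_distribution"]:
            print(f"  {bucket['label']}: {bucket['count']}")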

Custom Aggregate Fields#

By default, aggregate scores include count, mean, min, and max. Request additional statistics:

result = client.evaluation.metrics.evaluate(
    dataset=dataset,
    metric=metric,
    aggregate_fields=["std_dev", "variance", "percentiles", "histogram"]
)

# Access extended statistics
for score in result.aggregate_scores:
    print(f"{score.name}:")
    print(f"  Mean: {score.mean:.3f}")
    print(f"  Std Dev: {score.std_dev:.3f}")
    print(f"  Variance: {score.variance:.3f}")
    if score.percentiles:
        print(f"  Median (p50): {score.percentiles.p50:.3f}")
        print(f"  p90: {score.percentiles.p90:.3f}")

Job-Based Evaluation#

For larger datasets or production workloads, use job-based evaluation. Jobs run asynchronously and support datasets of any size.

Create an Evaluation Job#

Evaluate pre-generated outputs stored in a dataset:

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": {
            "type": "llm-judge",
            "model": {
                "url": "<judge-nim-url>/v1",
                "name": "meta/llama-3.1-70b-instruct",
                "format": "nim"
            },
            "scores": [
                {
                    "name": "quality",
                    "description": "Overall quality of the response",
                    "rubric": [
                        {"label": "poor", "value": 0, "description": "Response is unhelpful"},
                        {"label": "good", "value": 1, "description": "Response is helpful"},
                        {"label": "excellent", "value": 2, "description": "Response is exceptional"}
                    ],
                    "parser": {"type": "json", "json_path": "quality"}
                }
            ],
            "prompt_template": {
                "messages": [
                    {
                        "role": "system",
                        "content": "Rate the response quality: poor, good, or excellent.\nRespond with JSON: {\"quality\": \"<label>\"}"
                    },
                    {
                        "role": "user",
                        "content": "Question: {{input}}\n\nResponse: {{output}}"
                    }
                ]
            }
        },
        "dataset": {
            "files_url": "hf://datasets/<workspace>/<dataset-name>"
        },
        "params": {
            "parallelism": 16,
            "limit_samples": 100  # Optional: limit for testing
        }
    }
)

print(f"Job created: {job.name} ({job.id})")

Reference a previously created metric by its unique workspace and metric name:

# First, create and store the metric
client.evaluation.metrics.create(
    name="my-quality-judge",
    type="llm-judge",
    model={
        "url": "<judge-nim-url>/v1",
        "name": "meta/llama-3.1-70b-instruct",
        "format": "nim"
    },
    scores=[
        {
            "name": "quality",
            "minimum": 1,
            "maximum": 5,
            "parser": {"type": "json", "json_path": "quality"}
        }
    ],
    prompt_template={
        "messages": [
            {"role": "system", "content": "Rate quality 1-5. Respond: {\"quality\": <1-5>}"},
            {"role": "user", "content": "{{input}}\n{{output}}"}
        ]
    }
)

# Then use it in a job by metric reference (workspace/metric-name)
job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "default/my-quality-judge",
        "dataset": {"files_url": "hf://datasets/<workspace>/<dataset-name>"},
        "params": {"parallelism": 16}
    }
)

Use inline rows for quick testing before running on full datasets:

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": {
            "type": "llm-judge",
            "model": {
                "url": "<judge-nim-url>/v1",
                "name": "meta/llama-3.1-70b-instruct",
                "format": "nim"
            },
            "scores": [
                {
                    "name": "informativeness",
                    "rubric": [
                        {"label": "uninformative", "value": 0},
                        {"label": "informative", "value": 1}
                    ],
                    "parser": {"type": "json", "json_path": "informativeness"}
                }
            ],
            "prompt_template": {
                "messages": [
                    {"role": "system", "content": "Is this response informative? Reply: {\"informativeness\": \"uninformative\" or \"informative\"}"},
                    {"role": "user", "content": "{{output}}"}
                ]
            }
        },
        "dataset": {
            "rows": [
                {"output": "Paris is the capital of France."},
                {"output": "I don't know."}
            ]
        }
    }
)

Monitor Job Progress#

import time

while True:
    job_status = client.evaluation.metric_jobs.get_status(job.name)
    print(f"Status: {job_status.status}")

    if job_status.status in ["completed", "error", "cancelled"]:
        break

    time.sleep(5)
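
For production workloads you may want to bound how long you wait. A minimal sketch that wraps the same status call with a timeout (the 30-minute limit is an arbitrary example value):

import time

TIMEOUT_SECONDS = 30 * 60  # arbitrary example limit
deadline = time.monotonic() + TIMEOUT_SECONDS

while time.monotonic() < deadline:
    job_status = client.evaluation.metric_jobs.get_status(job.name)
    if job_status.status in ["completed", "error", "cancelled"]:
        break
    time.sleep(10)
else:
    # The loop exhausted the deadline without reaching a terminal status
    raise TimeoutError(f"Job {job.name} did not finish within {TIMEOUT_SECONDS} seconds")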

Retrieve Results#

# List available results for the job
results_list = client.evaluation.metric_jobs.results.list(job.name)
print(f"Available results: {[r.name for r in results_list.data]}")

# Get a specific result
result = client.evaluation.metric_jobs.results.retrieve(
    name="evaluation_results",
    job=job.name
)
print(result.model_dump_json(indent=2, exclude_none=True))
Example Job Results
{
    "aggregate_scores": [
        {
            "name": "quality",
            "count": 100,
            "mean": 1.2,
            "min": 0,
            "max": 2
        }
    ]
}

Score Configuration#

LLM-as-a-Judge supports two types of scores: range scores (numerical ratings) and rubric scores (categorical classifications).

Choosing Between Range and Rubric Scores#

We recommend using rubric scores over range scores for most evaluation tasks. Classification-based rubrics (for example, pass/fail, safe/unsafe, poor/good/excellent) typically outperform numerical scoring (1-10) because they:

  • Reduce ambiguity: Categorical labels with explicit descriptions are easier for judge models to apply consistently than numerical scales

  • Align with human reasoning: People naturally think in categories rather than precise numerical gradations

  • Avoid calibration issues: Numerical scores suffer from inconsistent calibration; one judge’s “7” may be another’s “5”

  • Provide actionable insights: Clear categories (for example, “needs_improvement”, “acceptable”, “excellent”) are more actionable than abstract numbers

  • Yield more reliable metrics: Classification tasks produce more consistent and reproducible results across different judge models

Use range scores when:

  • You need fine-grained distinctions that do not map well to categories

  • You are measuring continuous quantities (for example, latency, word count)

  • Downstream analysis requires numerical operations on scores

Use rubric scores when:

  • You are evaluating quality dimensions (helpfulness, accuracy, safety)

  • Clear decision boundaries exist (pass/fail, compliant/non-compliant)

  • Results will guide human decisions or workflows

Range Scores#

Use range scores for numerical ratings on a continuous scale:

{
    "name": "relevance",
    "description": "How relevant is the response (1=irrelevant, 5=highly relevant)",
    "minimum": 1,
    "maximum": 5,
    "parser": {"type": "json", "json_path": "relevance"}
}

Rubric Scores#

Use rubric scores for categorical evaluations with explicit criteria:

{
    "name": "sentiment",
    "description": "Sentiment of the response",
    "rubric": [
        {"label": "negative", "value": -1, "description": "Response has negative tone"},
        {"label": "neutral", "value": 0, "description": "Response is neutral"},
        {"label": "positive", "value": 1, "description": "Response has positive tone"}
    ],
    "parser": {"type": "json", "json_path": "sentiment"}
}

Tip

Rubric scores use structured outputs by default, which constrains the judge model to output valid JSON. This significantly reduces parsing errors.

Score Parsers#

Configure how scores are extracted from judge responses:

| Parser Type | Use Case | Example Pattern |
|-------------|----------|-----------------|
| json | Judge outputs JSON (default) | {"type": "json", "json_path": "score_name"} |
| regex | Extract from free-form text | {"type": "regex", "pattern": "SCORE: (\\d+)"} |

By default, the JSON parser is used for both range and rubric scores, with the score name as the json_path used to extract the value.

# JSON parser (default)
"parser": {"type": "json", "json_path": "quality"}

# Regex parser (for models that do not support structured output)
"parser": {"type": "regex", "pattern": "QUALITY: (\\w+)"}

# Regex parser with method='search' (finds pattern anywhere in text)
"parser": {"type": "regex", "pattern": "SCORE: (\\d+)", "method": "search"}

Tip

Regex method options:

  • match (default): Matches the pattern only at the beginning of the text. Use when your prompt instructs the judge to output the score first.

  • search: Finds the pattern anywhere in the text. Uses the first match of the regex found in the judge output.

For example, with method: "search" and pattern SCORE: (\d+), the parser can extract the score from:

The response is accurate and well-written. SCORE: 5

This would fail with the default match method since “SCORE:” is not at the beginning. If multiple matches exist, search returns the first occurrence.
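
The match/search distinction mirrors Python's standard re module semantics. A quick illustration of the difference on the example above (this shows the regex behavior only, not the parser's internal implementation):

import re

judge_output = "The response is accurate and well-written. SCORE: 5"
pattern = r"SCORE: (\d+)"

# method="match": the pattern must appear at the very start of the text -> no match here
print(re.match(pattern, judge_output))    # None

# method="search": the pattern may appear anywhere -> the first match is used
m = re.search(pattern, judge_output)
print(m.group(1) if m else None)          # 5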


Custom Judge Prompts#

Customize the judge prompt to match your evaluation criteria. Use Jinja2 templating to access data fields and score definitions.

Template Variables#

| Variable | Description |
|----------|-------------|
| {{input}} | Input field from dataset row |
| {{output}} | Output field from dataset row |
| item.<field> | Any field from the dataset row |
| sample.output_text | Model-generated response (when evaluating a model) |
| scores | Dictionary of score definitions |

Example: Custom Judge Template#

JUDGE_TEMPLATE = """You are an expert evaluator assessing AI assistant responses.

Evaluate the response on these criteria:
{% for score_name, score in scores.items() %}
- {{ score_name }}{% if score.description %}: {{ score.description }}{% endif %}
{% if score.rubric %}
  Options: {% for r in score.rubric %}{{ r.label }}{% if not loop.last %}, {% endif %}{% endfor %}
{% endif %}
{% endfor %}

Respond with JSON containing your ratings.
"""

metric = {
    "type": "llm-judge",
    "model": {
        "url": "<judge-url>/v1",
        "name": "meta/llama-3.1-70b-instruct",
        "format": "nim"
    },
    "scores": [
        {
            "name": "clarity",
            "description": "How clear and understandable is the response",
            "rubric": [
                {"label": "confusing", "value": 0, "description": "Hard to understand"},
                {"label": "clear", "value": 1, "description": "Easy to understand"},
                {"label": "crystal_clear", "value": 2, "description": "Exceptionally well explained"}
            ],
            "parser": {"type": "json", "json_path": "clarity"}
        }
    ],
    "prompt_template": {
        "messages": [
            {"role": "system", "content": JUDGE_TEMPLATE},
            {"role": "user", "content": "Question: {{input}}\n\nResponse: {{output}}"}
        ]
    }
}
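
Before running an evaluation, you can sanity-check how a custom template renders by filling it in locally with Jinja2 (a sketch for previewing only; it assumes scores is passed to the template as a name-to-definition mapping, as described in the variables table above, whereas the platform renders the template server-side):

from jinja2 import Template

# Local preview only: build the scores mapping the template expects from the metric definition above.
scores_by_name = {s["name"]: s for s in metric["scores"]}
print(Template(JUDGE_TEMPLATE).render(scores=scores_by_name))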

Managing Secrets for Authenticated Endpoints#

If your judge model endpoint requires an API key, store it as a secret. The secret is automatically resolved from the same workspace as your evaluation.

Create a Secret#

# Create a secret with your API key
client.secrets.create(
    name="judge-api-key",
    data="your-api-key-here"
)

Reference the Secret in Your Metric#

metric = {
    "type": "llm-judge",
    "model": {
        "url": "https://api.example.com/v1",
        "name": "gpt-4",
        "format": "openai",
        "api_key_secret": "judge-api-key"  # Just the secret name
    },
    # ... scores and prompt_template
}

Inference Parameters#

Control judge model behavior with inference parameters:

"prompt_template": {
    "messages": [...],
    "temperature": 0.1,      # Lower for more consistent scoring
    "max_tokens": 1024,      # Increase if judge needs more space
    "timeout": 30,           # Request timeout in seconds
    "stop": ["<|end_of_text|>"]  # Stop sequences
}

Note

The default max_tokens value for judge models is 1024. Set a value appropriate for your judge model and its expected outputs; for example, because structured output is used by default to format the judge's response, make sure max_tokens is large enough to accommodate the full JSON output. Incomplete JSON outputs cause parsing errors and result in NaN score values.
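
If you suspect truncation or parsing failures, you can scan a live-evaluation result for NaN scores (a minimal sketch; result is the return value of client.evaluation.metrics.evaluate, as in the live examples above, and the dumped structure matches the example responses on this page):

import math

# Flag rows where the judge output could not be parsed into a score.
for row in result.model_dump()["row_scores"]:
    bad = [name for name, value in row["scores"].items()
           if isinstance(value, float) and math.isnan(value)]
    if bad:
        print(f"Row {row['index']}: unparsed scores: {bad}")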

Reasoning Model Configuration#

For reasoning-enabled models (like Nemotron), configure reasoning parameters:

metric = {
    "type": "llm-judge",
    "model": {
        "url": "<nim-url>/v1",
        "name": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "format": "nim"
    },
    # ... scores ...
    "system_prompt": "'detailed thinking on'",
    "reasoning": {
        "end_token": "</think>"
    },
    "prompt_template": {
        "messages": [...],
        "temperature": 0.1,
        "max_tokens": 4096
    }
}

Limitations#

  1. Judge Model Quality: Evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) typically produce more consistent results.

  2. NaN Scores: If the judge output cannot be parsed, the score is marked as NaN. Common causes and mitigations:

    • Insufficient max_tokens (check for "finish_reason": "length" in results)

    • Judge model not following output format instructions

    • Use structured outputs or explicit format instructions to reduce NaN rates

  3. Structured Output Requirement: Rubric scores require the judge model to support guided decoding. If your judge does not support this, use regex parsers with explicit format instructions.

  4. Live Evaluation Limits: Live evaluations are limited to 10 rows. Use job-based evaluation for larger datasets.

See also