# LLM-as-Judge
Use LLM-as-Judge to evaluate subjective qualities like truthfulness, safety, and response quality using a judge model.
## Quick Example
```python
from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput
from nemo_evaluator.contrib.byob.judge import judge_score

@benchmark(
    name="qa-judge",
    dataset="qa.jsonl",
    prompt="Answer: {question}",
    extra={
        "judge": {
            "url": "https://integrate.api.nvidia.com/v1",
            "model_id": "meta/llama-3.1-70b-instruct",
            "api_key": "NVIDIA_API_KEY",
        },
    },
)
@scorer
def qa_judge(sample: ScorerInput) -> dict:
    return judge_score(sample, template="binary_qa", criteria="Factual accuracy")
```
The judge configuration is passed through the `extra` parameter of `@benchmark` and becomes available inside the scorer via `sample.config`.
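For illustration, a minimal sketch of that plumbing, using a stand-in class (`FakeSample` is invented here; the real `ScorerInput` comes from `nemo_evaluator.contrib.byob`):

```python
from dataclasses import dataclass, field

# Stand-in for ScorerInput, only to illustrate how the extra={"judge": ...}
# dict surfaces as sample.config inside a scorer. Not a framework class.
@dataclass
class FakeSample:
    response: str = ""
    target: str = ""
    metadata: dict = field(default_factory=dict)
    config: dict = field(default_factory=dict)

# The dict passed via extra={"judge": {...}} is what a scorer reads back:
sample = FakeSample(config={"judge": {"url": "http://judge:8000/v1",
                                      "model_id": "judge-a"}})
assert sample.config["judge"]["model_id"] == "judge-a"
```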
## Judge Configuration

The `extra={"judge": {...}}` dict configures the judge model endpoint. Field names align with the nemo-skills `extra.judge` convention used across NeMo Evaluator containers.
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | str | required | Base URL of the judge model endpoint |
| `model_id` | str | required | Judge model identifier |
| `api_key` | str | | Environment variable name containing the API key |
| `temperature` | float | | Sampling temperature (0.0 for deterministic output) |
| `top_p` | float | | Nucleus sampling parameter |
| `max_tokens` | int | | Maximum tokens for the judge response |
| `parallelism` | int | | Max concurrent judge requests (informational) |
| `max_retries` | int | | Maximum retry attempts on transient failures (429, 5xx) |
| `timeout` | int | | Request timeout in seconds |
Note
The `api_key` field is the name of an environment variable, not the key itself. For example, setting `"api_key": "NVIDIA_API_KEY"` causes the framework to read the actual key from `os.environ["NVIDIA_API_KEY"]` at runtime.
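As a sketch of that lookup (the helper name `resolve_api_key` is hypothetical, not a framework function):

```python
import os

# Hypothetical helper mirroring the documented behaviour: the "api_key"
# field names an environment variable, and the secret is read at runtime.
def resolve_api_key(judge_cfg: dict) -> str:
    env_var = judge_cfg.get("api_key", "")
    return os.environ.get(env_var, "")

os.environ["NVIDIA_API_KEY"] = "nvapi-example-key"  # normally set in your shell
cfg = {"api_key": "NVIDIA_API_KEY"}
assert resolve_api_key(cfg) == "nvapi-example-key"
```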
## Built-in Templates
BYOB ships with four judge prompt templates. Each template instructs the judge model to provide chain-of-thought reasoning before emitting a structured grade line.
| Template | Grade Pattern | Score Mapping | Use Case |
|---|---|---|---|
| `binary_qa` | | | Binary correct/incorrect |
| | | | Correct/partial/incorrect |
| `likert_5` | | | 5-point quality scale |
| | | | Safety evaluation |
All built-in templates use the placeholders `{question}`, `{response}`, `{reference}`, and `{criteria}`. The framework fills these automatically from `sample.metadata`, `sample.response`, `sample.target`, and the `criteria` argument.
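How that filling could look, sketched with plain `str.format` (the framework's actual template rendering may differ):

```python
# Illustrative only: fill the four standard placeholders from the
# sample's fields, as described above.
template = (
    "Question: {question}\n"
    "Response: {response}\n"
    "Reference: {reference}\n"
    "Criteria: {criteria}\n"
)

metadata = {"question": "What is 2 + 2?"}
response, target = "4", "4"

judge_prompt = template.format(
    question=metadata["question"],   # from sample.metadata
    response=response,               # from sample.response
    reference=target,                # from sample.target
    criteria="Factual accuracy",     # the criteria argument
)
assert judge_prompt.startswith("Question: What is 2 + 2?")
```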
## Custom Templates

Pass a raw string as the `template` argument to use your own prompt. Use `grade_pattern` and `score_mapping` to tell the parser how to extract and map the grade. Any extra placeholders are filled via `**template_kwargs`.
```python
CUSTOM_TEMPLATE = """\
Question: {question}
Response: {response}
Reference: {reference}
Custom Criteria: {my_criteria}
Output GRADE: PASS or GRADE: FAIL
"""

@scorer
def custom_judge(sample: ScorerInput) -> dict:
    return judge_score(
        sample,
        template=CUSTOM_TEMPLATE,
        grade_pattern=r"GRADE:\s*(PASS|FAIL)",
        score_mapping={"PASS": 1.0, "FAIL": 0.0},
        my_criteria="Check for factual accuracy and completeness",
    )
```
Tip
The `{question}` placeholder is automatically resolved from `sample.metadata["question"]` (falling back to `sample.metadata["prompt"]`). The `{response}` and `{reference}` placeholders map to `sample.response` and `sample.target` respectively.
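The documented fallback can be sketched as follows (`resolve_question` is a hypothetical helper, not part of the public API):

```python
def resolve_question(metadata: dict) -> str:
    # Prefer metadata["question"], then fall back to metadata["prompt"],
    # as described in the tip above.
    return metadata.get("question") or metadata.get("prompt") or ""

assert resolve_question({"question": "Q1"}) == "Q1"
assert resolve_question({"prompt": "P1"}) == "P1"
```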
## judge_score API

The `judge_score` function is the primary entry point for judge-based scoring. Import it from `nemo_evaluator.contrib.byob.judge`.
```python
def judge_score(
    sample: ScorerInput,
    template: str = "binary_qa",
    criteria: str = "",
    grade_pattern: Optional[str] = None,
    score_mapping: Optional[Dict[str, float]] = None,
    judge_key: str = "judge",
    response_format: Optional[Dict[str, Any]] = None,
    **template_kwargs: Any,
) -> dict:
```
| Parameter | Description |
|---|---|
| `sample` | The `ScorerInput` for the current sample |
| `template` | Built-in template name (e.g. `"binary_qa"`) or a raw template string |
| `criteria` | Evaluation criteria injected into the template's `{criteria}` placeholder |
| `grade_pattern` | Regex with one capture group for the grade. Defaults to the built-in pattern for named templates |
| `score_mapping` | Dict mapping grade strings to numeric scores. Defaults to the built-in mapping for named templates |
| `judge_key` | Key in `extra` selecting which judge configuration to use (default `"judge"`) |
| `response_format` | Optional dict for constrained decoding |
| `**template_kwargs` | Extra variables passed to the template. Override default variables or fill custom placeholders |
Returns: `{"judge_score": float, "judge_grade": str}`
Fallback values on failure:
| Failure | `judge_score` | `judge_grade` |
|---|---|---|
| HTTP or network error | 0.0 | |
| Grade not parseable from response | 0.0 | |
Warning
If the grade string is not found in `score_mapping` and is not a valid number, the score defaults to 0.0. Always verify that your `grade_pattern` and `score_mapping` cover all possible judge outputs.
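A defensive sketch of the parse-and-map step described above (the helper name and the `"UNPARSEABLE"` label are assumptions, not the framework's actual fallback strings):

```python
import re

def parse_and_map(judge_text: str, grade_pattern: str,
                  score_mapping: dict) -> tuple:
    # Extract the grade, map it to a score, and fall back to 0.0 when
    # the grade is unmapped and not numeric -- mirroring the warning above.
    match = re.search(grade_pattern, judge_text)
    if match is None:
        return 0.0, "UNPARSEABLE"       # assumed label, not the real one
    grade = match.group(1)
    if grade in score_mapping:
        return score_mapping[grade], grade
    try:
        return float(grade), grade      # numeric grades pass through
    except ValueError:
        return 0.0, grade               # unmapped, non-numeric -> 0.0

score, grade = parse_and_map("Reasoning...\nGRADE: PASS",
                             r"GRADE:\s*(PASS|FAIL)",
                             {"PASS": 1.0, "FAIL": 0.0})
assert (score, grade) == (1.0, "PASS")
```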
## Multi-Judge Setup

Use multiple judge models by assigning different keys in `extra`. Each key holds an independent judge configuration.
```python
@benchmark(
    name="multi-judge",
    dataset="qa.jsonl",
    prompt="Answer: {question}",
    extra={
        "judge": {"url": "http://judge1:8000/v1", "model_id": "judge-a"},
        "judge_1": {"url": "http://judge2:8000/v1", "model_id": "judge-b"},
    },
)
@scorer
def multi(sample: ScorerInput) -> dict:
    a = judge_score(sample, template="binary_qa")
    b = judge_score(sample, template="likert_5", judge_key="judge_1")
    return {**a, "quality": b["judge_score"]}
```
The `judge_key` parameter in `judge_score` selects which configuration to use. The default key is `"judge"`. Name additional judges `"judge_1"`, `"judge_2"`, and so on.
Tip
Each judge endpoint gets its own HTTP session with retry logic, so transient failures on one judge do not block the other.
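A stdlib-only sketch of the retry idea (the framework's actual HTTP layer and backoff policy may differ):

```python
import time

def with_retries(call, max_retries: int = 3,
                 transient=(429, 500, 502, 503, 504)):
    # Retry the callable while it returns a transient status, up to
    # max_retries extra attempts, with a small exponential backoff.
    for attempt in range(max_retries + 1):
        status, body = call()
        if status not in transient or attempt == max_retries:
            return status, body
        time.sleep(0.01 * (2 ** attempt))

calls = {"n": 0}
def flaky_judge():
    # Fails twice with 503, then succeeds -- stands in for a judge request.
    calls["n"] += 1
    return (503, "busy") if calls["n"] < 3 else (200, "ok")

assert with_retries(flaky_judge) == (200, "ok")
assert calls["n"] == 3
```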
## See Also

- Scorers – Built-in scorers and custom scoring functions
- Bring Your Own Benchmark (BYOB) – BYOB overview and quickstart