LLM-as-a-Judge Evaluation Flow#
Use another LLM (the judge) to evaluate model outputs against flexible scoring criteria. This approach is suitable for creative or complex tasks and can be adapted for domain-specific evaluations. LLM-as-a-Judge supports single-mode evaluation only; pairwise model comparisons are not supported.
Prerequisites#
Set up or select an existing evaluation target.
Configure a judge LLM for evaluation metrics.
Limitations#
The judge model must have at least 70B parameters (preferably more than 405B); otherwise, metric evaluation fails because the judge output does not match the specified metric template. Refer to Troubleshooting Unsupported Judge Model for more details.
Judge Model Configuration#
Judge configuration placement depends on the evaluation flow. The configuration supports both standard and reasoning-enabled models.
- Agentic and some academic flows: configure the judge at the task level under `tasks.<task>.params.judge`.
- LLM-as-a-Judge metrics (when `metric.type` is `llm-judge`): configure the judge model under `tasks.<task>.metrics.<metric>.params.model`.
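To make the two placements concrete, the skeleton below shows where the judge block sits in each case. It is written as a Python dict purely for illustration; the actual configuration payloads are JSON, and every value other than the key paths is a placeholder.

# Skeleton only: the two places a judge configuration can live.

# 1. Agentic and some academic flows: judge at the task level.
agentic_flow_task = {
    "tasks": {
        "<task>": {
            "params": {
                "judge": {  # judge model configuration goes here
                    "model": {"api_endpoint": {"url": "<nim_url>", "model_id": "<judge_model_id>"}}
                }
            }
        }
    }
}

# 2. LLM-as-a-Judge metrics: judge model under the metric's params.
llm_judge_metric_task = {
    "tasks": {
        "<task>": {
            "metrics": {
                "<metric>": {
                    "type": "llm-judge",
                    "params": {
                        "model": {"api_endpoint": {"url": "<nim_url>", "model_id": "<judge_model_id>"}}
                    },
                }
            }
        }
    }
}

The sections that follow show the full judge model body that goes inside either placement.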
Standard Judge Configuration#
{
"extra": {
"judge_sanity_check": false
},
"model": {
"api_endpoint": {
"url": "<nim_url>",
"model_id": "meta/llama-3.1-70b-instruct",
"api_key": "<OPTIONAL_API_KEY>"
},
"prompt": {
"inference_params": {
"temperature": 1,
"max_tokens": 1024,
"max_retries": 10,
"request_timeout": 10,
"stop": ["<|end_of_text|>", "<|eot|>"]
}
}
}
}
Reasoning Judge Configuration#
For reasoning-enabled models (such as Nemotron), configure the judge with reasoning parameters. Refer to Advanced Reasoning for more details.
Use `system_prompt: "'detailed thinking on'"` and `reasoning_params.end_token: "</think>"` to enable reasoning and trim reasoning traces from the output. The `end_token` parameter is supported for Nemotron reasoning models when configured.
{
"extra": {
"judge_sanity_check": false
},
"model": {
"api_endpoint": {
"url": "<nim_url>",
"model_id": "nvidia/llama-3.3-nemotron-super-49b-v1",
"api_key": "<OPTIONAL_API_KEY>"
},
"prompt": {
"system_prompt": "'detailed thinking on'",
"reasoning_params": {
"end_token": "</think>"
},
"inference_params": {
"temperature": 0.1,
"max_tokens": 1024,
"max_retries": 10,
"request_timeout": 10
}
}
}
}
Use `reasoning_params.effort` to control reasoning depth ("low", "medium", or "high").
{
"extra": {
"judge_sanity_check": false
},
"model": {
"api_endpoint": {
"url": "<openai_url>",
"model_id": "o1-preview",
"api_key": "<OPENAI_API_KEY>",
"format": "openai"
},
"prompt": {
"reasoning_params": {
"effort": "medium"
},
"inference_params": {
"max_tokens": 1024,
"max_retries": 10,
"request_timeout": 10
}
}
}
}
Task Types#
LLM-as-a-Judge can be used with three different task types:
- `data`: Evaluate existing prompt/response pairs directly (no model inference needed). Use this when you already have model outputs and want to judge them.
- `completion`: Generate completions from a target model, then evaluate them with an LLM judge. Use this for completion-style tasks where you want to prompt a model and then judge the outputs.
- `chat-completion`: Generate chat responses from a target model, then evaluate them with an LLM judge. Use this for conversational tasks where you want to prompt a model in chat format and then judge the responses.

Choose `data` when you already have model outputs to evaluate, or `completion`/`chat-completion` when you need to generate new outputs from a target model first.
Data task type: use when you have existing prompt/response pairs to evaluate directly.
{
"type": "custom",
"name": "my-configuration-llm-judge-data",
"namespace": "my-organization",
"tasks": {
"my-data-task": {
"type": "data",
"metrics": {
"accuracy": {
"type": "llm-judge",
"params": {
"model": {
"api_endpoint": {
"url": "<my-judge-nim-url>",
"model_id": "<my-judge-model-id>"
}
},
"template": {
"messages": [
{
"role": "system",
"content": "Your task is to evaluate the semantic similarity between two responses."
},
{
"role": "user",
"content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{item.model_output}}.\n\n"
}
]
},
"scores": {
"similarity": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "SIMILARITY: (\\d+)"
}
}
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}
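The `parser` block in the `similarity` score above pulls the numeric rating out of the judge's reply with a regular expression. The snippet below is a rough illustration of that extraction (not the evaluator's internal code); the judge reply shown is hypothetical.

import re

# Hypothetical judge reply for the data task above.
judge_reply = "SIMILARITY: 8. Both responses state that Paris is the capital of France."

# Same pattern as the "similarity" score parser; group 1 captures the digits.
match = re.search(r"SIMILARITY: (\d+)", judge_reply)

# The score "type" is "int", so the captured text is cast to an integer.
similarity = int(match.group(1)) if match else None
print(similarity)  # 8

If the judge never emits text matching the pattern, no score can be parsed, which is the failure mode described under Limitations above.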
Completion task type: use when you want to generate completions from a target model and then judge them.
{
"type": "custom",
"name": "my-configuration-llm-judge-completion",
"namespace": "my-organization",
"tasks": {
"my-completion-task": {
"type": "completion",
"params": {
"template": "Answer this question: {{item.question}}\nAnswer:"
},
"metrics": {
"quality": {
"type": "llm-judge",
"params": {
"model": {
"api_endpoint": {
"url": "<my-judge-nim-url>",
"model_id": "<my-judge-model-id>"
}
},
"template": {
"messages": [
{
"role": "system",
"content": "Your task is to evaluate the quality of an answer to a question."
},
{
"role": "user",
"content": "Rate the quality from 1-10. Format: QUALITY: X\n\nQUESTION: {{item.question}}\nANSWER: {{output_text}}\nEXPECTED: {{item.expected_answer}}"
}
]
},
"scores": {
"quality": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "QUALITY: (\\d+)"
}
}
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}
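The `{{item.*}}` placeholders in the judge template are filled from the dataset record, and `{{output_text}}` is filled with the target model's generated completion. The sketch below uses Jinja2 only to illustrate what the rendered judge prompt could look like; the rendering machinery the evaluator actually uses is not shown here.

from jinja2 import Template  # illustration only

# User-turn template from the "quality" metric above.
judge_user_template = Template(
    "Rate the quality from 1-10. Format: QUALITY: X\n\n"
    "QUESTION: {{ item.question }}\nANSWER: {{ output_text }}\nEXPECTED: {{ item.expected_answer }}"
)

# One dataset record plus a hypothetical completion from the target model.
item = {"question": "What is the capital of France?", "expected_answer": "Paris"}
output_text = "The capital of France is Paris."

print(judge_user_template.render(item=item, output_text=output_text))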
Chat-completion task type: use when you want to generate chat responses from a target model and then judge them.
{
"type": "custom",
"name": "my-configuration-llm-judge-chat",
"namespace": "my-organization",
"tasks": {
"my-chat-task": {
"type": "chat-completion",
"params": {
"template": {
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "{{item.user_message}}"
}
]
}
},
"metrics": {
"helpfulness": {
"type": "llm-judge",
"params": {
"model": {
"api_endpoint": {
"url": "<my-judge-nim-url>",
"model_id": "<my-judge-model-id>"
}
},
"template": {
"messages": [
{
"role": "system",
"content": "Your task is to evaluate how helpful an assistant's response is."
},
{
"role": "user",
"content": "Rate helpfulness from 1-5. Format: HELPFUL: X\n\nUSER: {{item.user_message}}\nASSISTANT: {{output_text}}"
}
]
},
"scores": {
"helpfulness": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "HELPFUL: (\\d)"
}
}
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}
Dataset record example for the data task type:
{
"reference_answer": "Paris",
"model_output": "The capital of France is Paris"
}
Dataset record example for the completion task type:
{
"question": "What is the capital of France?",
"expected_answer": "Paris"
}
Dataset record example for the chat-completion task type:
{
"user_message": "What is the capital of France?"
}
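Each dataset referenced by `files_url` is a JSONL file with one record per line in the shapes shown above. The sketch below writes such a file and uploads it with the Hugging Face Hub client pointed at the NeMo Data Store; the endpoint, token, and repository name are placeholders, and whether your deployment accepts uploads this way may vary.

import json
import huggingface_hub as hh

# Records for a "completion" task (matching the format example above).
records = [
    {"question": "What is the capital of France?", "expected_answer": "Paris"},
    {"question": "What is 2+2?", "expected_answer": "4"},
]

with open("dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Placeholders: replace with your NeMo Data Store URL and dataset name.
api = hh.HfApi(endpoint="<NeMo Data Store URL>", token="mock")
api.create_repo(repo_id="default/<my-dataset>", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="dataset.jsonl",
    path_in_repo="dataset.jsonl",
    repo_id="default/<my-dataset>",
    repo_type="dataset",
)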
Data Task Type Result:
{
"tasks": {
"my-data-task": {
"metrics": {
"accuracy": {
"scores": {
"similarity": {
"value": 8,
"stats": {
"count": 100,
"mean": 7.5,
"min": 3,
"max": 10
}
}
}
}
}
}
}
}
Completion Task Type Result:
{
"tasks": {
"my-completion-task": {
"metrics": {
"quality": {
"scores": {
"quality": {
"value": 8,
"stats": {
"count": 50,
"mean": 7.2,
"min": 4,
"max": 10
}
}
}
}
}
}
}
}
Chat-Completion Task Type Result:
{
"tasks": {
"my-chat-task": {
"metrics": {
"helpfulness": {
"scores": {
"helpfulness": {
"value": 4,
"stats": {
"count": 75,
"mean": 4.1,
"min": 2,
"max": 5
}
}
}
}
}
}
}
}
Download Results#
The following Python script can be used to download the generated results:
import huggingface_hub as hh

# Replace the placeholders with your NeMo Data Store URL, the evaluation job ID,
# and the local directory to download results into.
url = "<NeMo Data Store URL>"
token = "mock"
repo_name = "<evaluation id>"
download_path = "<Path where results will be downloaded>"

# Evaluation results are stored as a dataset repo under the "nvidia" namespace.
repo_name = f"nvidia/{repo_name}"
api = hh.HfApi(endpoint=url, token=token)
api.snapshot_download(
    repo_id=repo_name,
    repo_type="dataset",
    local_dir=download_path,
    local_dir_use_symlinks=False,
)
Results Directory Structure#
results/
├── results.yml # Main evaluation results file
├── requests.json # Detailed request logs for LLM-as-a-Judge calls
└── metadata.json # Job metadata and configuration
Results Files Description#
results.yml: Contains the main evaluation results in the standardized format with tasks, metrics, and scores
requests.json: Contains detailed logs of all LLM requests and responses made during the evaluation, useful for debugging and analysis
metadata.json: Contains job configuration, timestamps, and other metadata about the evaluation run
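Assuming `results.yml` mirrors the task/metric/score structure shown in the result examples above (the exact layout can differ by release), a short script like this prints the aggregate statistics after download:

import yaml  # pip install pyyaml

with open("<download_path>/results.yml") as f:
    results = yaml.safe_load(f)

# Walk tasks -> metrics -> scores and print each score's value and stats.
for task_name, task in results.get("tasks", {}).items():
    for metric_name, metric in task.get("metrics", {}).items():
        for score_name, score in metric.get("scores", {}).items():
            stats = score.get("stats", {})
            print(f"{task_name}/{metric_name}/{score_name}: "
                  f"value={score.get('value')}, mean={stats.get('mean')}, count={stats.get('count')}")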
Judge LLM Output Format
The Judge LLM must provide ratings in the format `[[rating]]`. If the required format is not followed, a warning will appear in the `.csv` file. Adjust inference parameters or use a different Judge LLM if needed.
Custom Dataset Format#
question.jsonl#
For LLM-as-a-judge, the question.jsonl
file contains questions to be evaluated by the LLM judge. Each line in this file represents a single question with its metadata, including a unique identifier, category, and the conversation turns.
| Field | Type | Required | Description |
|---|---|---|---|
| `question_id` | integer | Yes | Unique identifier for the question. |
| `category` | string | Yes | Category or topic of the question (e.g., 'math', 'general'). |
| `turns` | list of strings | Yes | List of user turns (questions or conversation turns). For single-turn, use a single-element list. |
{"question_id": 1, "category": "general", "turns": ["What is the capital of France?"]}
{"question_id": 2, "category": "math", "turns": ["What is 2+2?"]}
judge_prompts.jsonl#
For LLM-as-a-judge, the judge_prompts.jsonl
file contains the prompt templates used by the LLM judge to evaluate model responses. Each line in this file represents a different prompt configuration with system instructions and templates.
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Name of the prompt template (e.g., 'single-v1'). |
| `type` | string | Yes | Type of prompt (e.g., 'single'). |
| `system_prompt` | string | Yes | System message for the judge LLM (instructions for judging). |
| `prompt_template` | string | Yes | Template for the user prompt, with placeholders for question, answer, etc. |
| `description` | string | No | Description of the prompt's intended use. |
| `category` | string or list of strings | No | Category or categories this prompt applies to. |
| `output_format` | string | Yes | Required output format for the judge LLM (e.g., '[[rating]]'). |
{"name": "single-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": "general", "output_format": "[[rating]]"}
reference.jsonl#
For LLM-as-a-judge, the reference.jsonl
file contains reference answers or ground truth for questions. This file is optional but useful for evaluations where you want to compare model responses against known correct answers.
| Field | Type | Required | Description |
|---|---|---|---|
| `question_id` | integer or string | Yes | The question_id this reference is associated with. |
| `choices` | list of objects | Yes | List of reference answers or context objects. Each object typically has an `index` and a `turns` list containing the reference answers. |
{"question_id": 1, "choices": [{"index": 0, "turns": ["Paris"]}]}
{"question_id": 2, "choices": [{"index": 0, "turns": ["4"]}]}