LLM-as-a-Judge Evaluation Flow#
Use another LLM to evaluate outputs with flexible scoring criteria. This approach is suitable for creative or complex tasks and can be adapted for domain-specific evaluations. LLM-as-a-Judge does not support pairwise model comparisons; only single-mode evaluation is supported.
Example Job Execution#
You can execute an Evaluation Job using either the Python SDK or cURL as follows, replacing <my-eval-config> with one of the configs shown on this page:
Note
See Job Target and Configuration Matrix for details on target / config compatibility.
from nemo_microservices import NeMoMicroservices
client = NeMoMicroservices(
base_url="http(s)://<your evaluator service endpoint>"
)
job = client.v2.evaluation.jobs.create(
spec={
"target": {
"type": "model",
"name": "my-target-dataset-1",
"namespace": "my-organization",
"model": {
"api_endpoint": {
# Replace NIM_BASE_URL with your specific deployment
"url": f"{NIM_BASE_URL}/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
},
},
"config": <my-eval-config>
}
)
curl -X "POST" "$EVALUATOR_BASE_URL/v2/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"spec": {
"target": {
"type": "model",
"name": "my-target-dataset-1",
"namespace": "my-organization",
"model": {
"api_endpoint": {
"url": "'"${NIM_BASE_URL}"'/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
}
},
"config": <my-eval-config>
}
}'
from nemo_microservices import NeMoMicroservices
client = NeMoMicroservices(
base_url="http(s)://<your evaluator service endpoint>"
)
job = client.evaluation.jobs.create(
namespace="my-organization",
target={
"type": "model",
"namespace": "my-organization",
"model": {
"api_endpoint": {
# Replace NIM_BASE_URL with your specific deployment
"url": f"{NIM_BASE_URL}/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
},
},
config=<my-eval-config>
)
curl -X "POST" "$EVALUATOR_BASE_URL/v1/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"namespace": "my-organization",
"target": {
"type": "model",
"namespace": "my-organization",
"model": {
"api_endpoint": {
"url": "'"${NIM_BASE_URL}"'/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
}
},
"config": <my-eval-config>
}'
For a full example, see Run an Academic LM Harness Eval
Example Job Configuration#
LLM-as-a-Judge is a flexible, customizable evaluation. You can specify the judge prompt, tailor it to the format of your evaluation dataset, and customize how scores are parsed.
{
"type": "custom",
"tasks": {
"my-task": {
"type": "chat-completion",
"metrics": {
"my-judge-metric": {
"type": "llm-judge",
"params": {
"model": {
// judge model configuration
"api_endpoint": {
"url": "<nim_url>",
"model_id": "meta/llama-3.1-70b-instruct",
"api_key": "<OPTIONAL_API_KEY>"
}
},
"template": {
// required
},
"scores": {
// required
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}
LLM-as-a-Judge Core Concepts#
The judge template consists of the judge prompt, the judge response format, and jinja template(s). Each component is specific to the metric you want to evaluate and to your evaluation dataset.
Judge prompt: The judge prompt defines the metric that you evaluate your target against.
As an example, a judge prompt to evaluate the similarity between two items could be:
Your task is to evaluate the semantic similarity between two responses.
Judge output format: The judge must be guided to produce output in a consistent format that the score parser can parse.
Your task is to evaluate the semantic similarity between two responses. Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.
Jinja templating: Use jinja templating to format the judge prompt with content from the target sample being evaluated.
Target model: The following is an example judge prompt with jinja templating to compare a column from your dataset to the output text from the target model response. For this example, the dataset contains a column named answer representing a golden example, and output_text, the special template variable for a model response.
Your task is to evaluate the semantic similarity between two responses. Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.\nEXPECTED: {{item.answer}}\nACTUAL: {{output_text}}
Target dataset: The following example is a judge prompt with jinja templating to compare multiple columns of your dataset. For this example, the dataset contains columns named answer representing a golden example and response, which can represent an offline model response.
Your task is to evaluate the semantic similarity between two responses. Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.\nEXPECTED: {{item.answer}}\nACTUAL: {{item.response}}
You can use jinja filters to modify the content.
{{ item.answer | trim | lower }}
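Putting these pieces together, the following is a sketch of how the similarity example above maps onto the metric-level template and scores fields from the configuration skeleton earlier on this page. The scores block uses the regex parser described under Score Parser below; adjust the prompt and column names for your dataset.
"template": {
"messages": [
{
"role": "user",
"content": "Your task is to evaluate the semantic similarity between two responses. Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.\nEXPECTED: {{item.answer}}\nACTUAL: {{output_text}}"
}
]
},
"scores": {
"similarity": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "SIMILARITY: (\\d+)"
}
}
}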
Limitations#
Evaluation with LLM-as-a-Judge depends on the judge model's ability to follow formatting instructions and produce consistent, parsable output for metric scores.
If a score cannot be parsed from the judge output, the score for that sample is marked NaN (not a number). We recommend using structured output for judge models smaller than 70B parameters to improve response formatting and reduce NaN results.
Task Types#
LLM-as-a-Judge is a metric that can be used with different task types:
chat-completion - Generate chat responses from a target model, then evaluate with an LLM judge. Use this for conversational tasks where you want to prompt a model in chat format and then judge the responses.
data - Evaluate existing prompt/response pairs directly (no model inference needed). Use this when you already have model outputs and want to judge them.
Choose chat-completion when you need to generate new outputs from a target model first, or choose data when you already have model outputs to evaluate.
Chat-Completion Task
Use when you want to generate chat responses from a target model and then judge them.
The task's params.template renders the inference request sent to the target model. The template can use jinja templating to pull content from the task dataset.
{
"type": "custom",
"tasks": {
"my-chat-task": {
"type": "chat-completion",
"params": {
"template": {
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "{{item.user_message}}"
}
]
}
},
"metrics": {
"helpfulness": {
"type": "llm-judge",
"params": {
"model": {
"api_endpoint": {
"url": "<my-judge-nim-url>",
"model_id": "<my-judge-model-id>"
}
},
"template": {
"messages": [
{
"role": "system",
"content": "Your task is to evaluate how helpful an assistant's response is."
},
{
"role": "user",
"content": "Rate helpfulness from 1-5. Format: HELPFUL: X\n\nUSER: {{item.user_message}}\nASSISTANT: {{output_text}}"
}
]
},
"scores": {
"helpfulness": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "HELPFUL: (\\d)"
}
}
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}
Example dataset for Chat-Completion Task:
{
"user_message": "What is the capital of France?"
}
Completion Task
To evaluate completions generated by a target model instead of chat conversations, use the task type completion and provide the task params.template as a prompt string.
{
"type": "custom",
"tasks": {
"my-chat-task": {
"type": "chat-completion",
"params": {
"template": "Answer this question: {{item.question}}\nAnswer:"
},
"metrics": {
// llm-judge metric configuration: requires model, template, scores
}
}
}
}
Example dataset for Completion Task:
{
"question": "What is the capital of France?",
"expected_answer": "Paris"
}
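The snippet above elides the metric block. As a fuller sketch, a completion task can reuse the same llm-judge metric structure as the chat-completion example, comparing the expected_answer column from the dataset above to the model's completion. This assumes output_text also refers to the target model's response for completion tasks; the judge URL, judge model ID, and dataset path are placeholders.
{
"type": "custom",
"tasks": {
"my-completion-task": {
"type": "completion",
"params": {
"template": "Answer this question: {{item.question}}\nAnswer:"
},
"metrics": {
"accuracy": {
"type": "llm-judge",
"params": {
"model": {
"api_endpoint": {
"url": "<my-judge-nim-url>",
"model_id": "<my-judge-model-id>"
}
},
"template": {
"messages": [
{
"role": "user",
"content": "Your task is to evaluate the semantic similarity between two responses. Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.\nEXPECTED: {{item.expected_answer}}\nACTUAL: {{output_text}}"
}
]
},
"scores": {
"similarity": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "SIMILARITY: (\\d+)"
}
}
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}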
Data Task
Use when you have existing prompt/response pairs to evaluate directly.
{
"type": "custom",
"tasks": {
"my-data-task": {
"type": "data",
"metrics": {
"accuracy": {
"type": "llm-judge",
"params": {
"model": {
"api_endpoint": {
"url": "<my-judge-nim-url>",
"model_id": "<my-judge-model-id>"
}
},
"template": {
"messages": [
{
"role": "system",
"content": "Your task is to evaluate the semantic similarity between two responses."
},
{
"role": "user",
"content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{item.model_output}}.\n\n"
}
]
},
"scores": {
"similarity": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "SIMILARITY: (\\d+)"
}
}
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}
Example dataset for Data Task:
{
"reference_answer": "Paris",
"model_output": "The capital of France is Paris"
}
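To illustrate how the jinja template and the dataset combine, with the example record above the data task's judge template renders the user message roughly as follows (the system message is sent separately):
Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.

RESPONSE 1: Paris

RESPONSE 2: The capital of France is Paris.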
Chat-Completion Task Result:
{
"tasks": {
"my-chat-task": {
"metrics": {
"helpfulness": {
"scores": {
"helpfulness": {
"value": 4,
"stats": {
"count": 75,
"mean": 4.1,
"min": 2,
"max": 5
}
}
}
}
}
}
}
}
Data Task Result:
{
"tasks": {
"my-data-task": {
"metrics": {
"accuracy": {
"scores": {
"similarity": {
"value": 8,
"stats": {
"count": 100,
"mean": 7.5,
"min": 3,
"max": 10
}
}
}
}
}
}
}
}
Score Parser#
Build a score parser tailored to your judge model and evaluation task. A score type must be a numerical value or a boolean.
Regex#
Use a regular expression to parse the score from the judge model output. Build a regex that accounts for the judge output format specified in your configuration.
For example, the following pattern will match the formatted judge response SIMILARITY: 4 as outlined in the judge prompt:
Your task is to evaluate the semantic similarity between two responses. Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.
"scores": {
"similarity": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "SIMILARITY: (\\d+)"
}
}
}
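For example, with this pattern a hypothetical judge reply such as The answers are close. SIMILARITY: 8 parses to a score of 8, while a reply that never emits the SIMILARITY: prefix cannot be parsed and that sample is recorded as NaN, as described in Limitations.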
Structured Output#
Some models perform better as a judge when configured with structured output.
Use the JSON score parser in conjunction with structured_output to specify the output format of the judge model and the JSON path to use as the score.
structured_output supports JSON schema.
The score parser.json_path must be defined by the JSON schema.
"structured_output": {
"schema": {
"type": "object",
"properties": {
"similarity": { "type": "number" }
},
"required": ["similarity"],
"additionalProperties": false
}
},
"scores": {
"similarity": {
"type": "number",
"parser": {
"type": "json",
"json_path": "similarity"
}
}
}
The JSON score parser leverages NIM structured generation to format the output of the judge model.
Important
Evaluator does not support structured output with OpenAI yet.
Important
Structured output and JSON score parser may not work well with reasoning models as the judge.
Judge Model Configuration#
Judge configuration placement depends on the evaluation flow. The configuration supports both standard and reasoning-enabled models.
LLM-as-a-Judge metrics (when metric.type is llm-judge): configure the judge model under tasks.<task>.metrics.<metric>.params.model.
Agentic and some academic flows: configure the judge at the task level under tasks.<task>.params.judge (see the placement sketch below).
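For the task-level placement, the following minimal sketch only illustrates where the judge object sits; the task type and the fields inside judge depend on the specific evaluation flow.
{
"tasks": {
"my-task": {
"params": {
"judge": {
// flow-specific judge configuration
}
}
}
}
}
The judge model configuration itself (API endpoint and optional prompt parameters) is shown in the next example.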
{
"model": {
"api_endpoint": {
"url": "<nim_url>",
"model_id": "meta/llama-3.1-70b-instruct",
"api_key": "<OPTIONAL_API_KEY>"
},
"prompt": {
"inference_params": {
// optional inference parameters e.g.
// temperature, max_tokens, max_retries, request_timeout, stop (tokens)
"temperature": 1,
"max_tokens": 1024,
"max_retries": 10,
"request_timeout": 10,
"stop": ["<|end_of_text|>", "<|eot|>"]
}
}
}
}
Reasoning Judge Configuration#
For reasoning-enabled models (such as Nemotron), configure the judge with reasoning parameters. Refer to Advanced Reasoning for more details.
Use system_prompt: "'detailed thinking on'" and reasoning_params.end_token: "</think>" to enable reasoning and trim reasoning traces from the output.
The end_token parameter is supported for Nemotron reasoning models when configured.
{
"model": {
"api_endpoint": {
"url": "<nim_url>",
"model_id": "nvidia/llama-3.3-nemotron-super-49b-v1",
"api_key": "<OPTIONAL_API_KEY>"
},
"prompt": {
"system_prompt": "'detailed thinking on'",
"reasoning_params": {
"end_token": "</think>"
}
}
}
}
Use reasoning_params.effort to control reasoning depth ("low", "medium", or "high").
{
"model": {
"api_endpoint": {
"url": "<openai_url>",
"model_id": "o1-preview",
"api_key": "<OPENAI_API_KEY>",
"format": "openai"
},
"prompt": {
"reasoning_params": {
"effort": "medium"
}
}
}
}