Metrics for Custom Evaluation#
This section describes the available metrics for custom evaluation jobs, including configuration examples and explanations.
Metrics Matrix#
Compare metric options at a glance using the following table.

| Metric | Description | Use Cases | Score |
|---|---|---|---|
| String-Check | Compares strings using basic operations (equals, contains, and so on) | Exact-match or substring checks against a known label; outputs that are not expected to be highly creative | 1 if the check passes, 0 otherwise |
| BLEU | Compares n-gram precision between machine and reference text | Comparing generations against reference answers, such as translation or other low-creativity tasks | Sentence-level and corpus-level BLEU scores |
| LLM-Judge | Uses an LLM to evaluate outputs based on custom criteria | Open-ended or highly generative tasks that need judgment beyond string overlap | Configurable (for example, an integer rating extracted by a regex parser) |
| Tool-Calling | Validates tool/function calls against ground truth | Models that generate structured, OpenAI-compatible tool/function calls | 1.0 or 0.0 per sample for function names, and for function names plus arguments |
Similarity Metrics#
Custom evaluation with BLEU and string-check metrics is ideal for cases where the LLM generations are not expected to be highly creative. For more information, refer to Similarity Metrics Evaluation Type.
String-Check Metric#
The string-check metric allows for simple string comparisons between a reference text and the output text. Various string operations such as 'equals', 'not equals', 'contains', 'not contains', 'startswith', and 'endswith' can be performed. The operation is carried out between two templates that can use the same variables as the task templates.
A single score is provided as the output by the string-check metric:

- The score is 1 if the check is successful.
- The score is 0 otherwise.
Example Configuration with String Check Metric
{
"type": "string-check",
"params": {
"check": [
"{{sample.output_text}}",
"equals",
"{{item.label}}"
]
}
}
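Conceptually, the metric renders the two templates against each dataset row and applies the chosen operation to the resulting strings. The following Python sketch is illustrative only; it is not the evaluator's implementation, and the operation names simply mirror the list above.

# Illustrative sketch only -- not the evaluator service implementation.
def string_check(left: str, operation: str, right: str) -> int:
    """Return 1 if the check passes, 0 otherwise."""
    ops = {
        "equals": lambda a, b: a == b,
        "not equals": lambda a, b: a != b,
        "contains": lambda a, b: b in a,
        "not contains": lambda a, b: b not in a,
        "startswith": lambda a, b: a.startswith(b),
        "endswith": lambda a, b: a.endswith(b),
    }
    return int(ops[operation](left, right))

# For example, the rendered {{sample.output_text}} checked against the rendered {{item.label}}:
print(string_check("The answer is Paris.", "contains", "Paris"))  # 1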
BLEU Metric#
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human. For more information, refer to Wikipedia and IBM Documentation.
The bleu metric for custom evaluation requires the references parameter to be set, pointing to where the ground truths are located in the input dataset.
Example Configuration with BLEU Metric
{
"bleu": {
"type": "bleu",
"params": {
"references": [
"{{reference_answer}}"
]
}
}
}
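If you want to sanity-check BLEU values outside the evaluator, for example while preparing references, the sacrebleu package computes comparable sentence-level and corpus-level scores. This snippet is an assumption about local tooling (pip install sacrebleu) and is not part of the evaluation service.

# Local sanity check only -- assumes the sacrebleu package is installed.
import sacrebleu

hypotheses = ["The cat sat on the mat."]         # model outputs
references = ["The cat is sitting on the mat."]  # ground truths

# Sentence-level BLEU for a single hypothesis/reference pair.
print(sacrebleu.sentence_bleu(hypotheses[0], [references[0]]).score)

# Corpus-level BLEU over the whole set.
print(sacrebleu.corpus_bleu(hypotheses, [references]).score)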
LLM-as-a-Judge#
With LLM-as-a-Judge, an LLM can be evaluated by using another LLM as a judge. LLM-as-a-Judge can be used for any use case, even highly generative ones, but the choice of judge is crucial in getting reliable evaluations. For more information, refer to LLM-as-a-Judge.
The llm-judge metric is complex and involves prompting the LLM with a specific task, then parsing the output to extract a score. This metric allows you to specify the model to use, the template for the task, and the parser for the output.

Use the parser field to specify the type of parser to use. Currently, only regex is supported. The parser extracts the score from the output text based on a specified pattern: the first group in the pattern is used as the value for the score.

If you don't specify a parser, the output text is used as the score and is converted to int or float.
Example Configuration for LLM-as-a-Judge
{
"type": "llm-judge",
"params": {
"model": "model-name",
"template": {
"messages": [{
"role": "system",
"content": "You are an expert in evaluating the quality of responses."
}, {
"role": "user",
"content": "Please evaluate the following response: {{sample.output_text}}"
}]
},
"parser": {
"type": "regex",
"pattern": "Rating: (\\d+)"
}
}
}
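To make the parser behavior concrete, the following sketch shows how a regex pattern like the one above turns the judge output into a numeric score. It is illustrative only and not the evaluator's implementation.

# Illustrative sketch of regex score extraction.
import re

judge_output = "The response is clear and accurate. Rating: 4"
match = re.search(r"Rating: (\d+)", judge_output)

# The first capture group becomes the score, which is then converted to a number.
score = int(match.group(1)) if match else None
print(score)  # 4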
The following example launches an evaluation job that performs scoring using the string-check, BLEU, and LLM-Judge metrics. The request is shown first as a cURL command and then as an equivalent Python snippet.
curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"namespace": "my-organization",
"target": {
"type": "model",
"model": {
"api_endpoint": {
"url": "<my-nim-url>/completions",
"model_id": "<my-model-id>"
}
}
},
"config": {
"type": "custom",
"tasks": {
"qa": {
"type": "completion",
"params": {
"template": {
"prompt": "Answer very briefly (no explanation) this question: {{question}}.\nAnswer: ",
"max_tokens": 30
}
},
"metrics": {
"accuracy": {
"type": "string-check",
"params": {
"check": [
"{{sample.output_text}}",
"contains",
"{{item.answer}}"
]
}
},
"bleu": {
"type": "bleu",
"params": {
"references": [
"{{reference_answer}}"
]
}
},
"accuracy-2": {
"type": "llm-judge",
"params": {
"model": {
"api_endpoint": {
"url": "https://<my-judge-nim-url>",
"model_id": "<my-judge-model-id>"
}
},
"template": {
"messages": [
{
"role": "system",
"content": "Your task is to evaluate the semantic similarity between two responses."
},
{
"role": "user",
"content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{sample.output_text}}.\n\n"
}
]
},
"scores": {
"similarity": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "SIMILARITY: (\\d)"
}
}
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-math-dataset>"
}
}
}
}
}'
import requests

# EVALUATOR_SERVICE_URL is assumed to be set to your evaluator service URL,
# matching ${EVALUATOR_SERVICE_URL} in the cURL example above.
data = {
"namespace": "my-organization",
"target": {
"type": "model",
"model": {
"api_endpoint": {
"url": "<my-nim-url>/completions",
"model_id": "<my-model-id>"
}
}
},
"config": {
"type": "custom",
"tasks": {
"qa": {
"type": "completion",
"params": {
"template": {
"prompt": "Answer very briefly (no explanation) this question: {{question}}.\nAnswer: ",
"max_tokens": 30
}
},
"metrics": {
"accuracy": {
"type": "string-check",
"params": {
"check": [
"{{sample.output_text}}",
"contains",
"{{item.answer}}"
]
}
},
"bleu": {
"type": "bleu",
"params": {
"references": [
"{{reference_answer}}"
]
}
},
"accuracy-2": {
"type": "llm-judge",
"params": {
"model": {
"api_endpoint": {
"url": "<my-judge-nim-url>",
"model_id": "<my-judge-model-id>"
}
},
"template": {
"messages": [
{
"role": "system",
"content": "Your task is to evaluate the semantic similarity between two responses."
},
{
"role": "user",
"content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{sample.output_text}}.\n\n"
}
]
},
"scores": {
"similarity": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "SIMILARITY: (\\d)"
}
}
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-math-dataset>"
}
}
}
}
}
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs"
# Make the API call
response = requests.post(endpoint, json=data).json()
# Get the job_id so we can refer to it later
job_id = response['id']
print(f"Job ID: {job_id}")
Example Response
{
"created_at": "2025-03-08T01:12:40.986445",
"updated_at": "2025-03-08T01:12:40.986445",
"id": "evaluation_result-P7SA6X9b5DP9xuYXzTMvo8",
"job": "eval-EgWCHn3bosKsr5jGiKPT9w",
"tasks": {
"qa": {
"metrics": {
"accuracy": {
"scores": {
"string-check": {
"value": 0.833333333333333,
"stats": {
"count": 6,
"sum": 5,
"mean": 0.833333333333333
}
}
}
},
"bleu": {
"scores": {
"sentence": {
"value": 1.39085669840765,
"stats": {
"count": 6,
"sum": 8.34514019044593,
"mean": 1.39085669840765
}
},
"corpus": {
"value": 0
}
}
},
"accuracy-2": {
"scores": {
"similarity": {
"value": 2.5,
"stats": {
"count": 6,
"sum": 15,
"mean": 2.5
}
}
}
}
}
}
},
"groups": {},
"namespace": "default",
"custom_fields": {}
}
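Once you have a results payload like the example above parsed into a Python dictionary, you can read the individual metric values directly from the nested structure.

# `results` is assumed to hold the results payload shown above, parsed as a dict
# (for example, results = response.json() on the corresponding API response).
string_check_accuracy = results["tasks"]["qa"]["metrics"]["accuracy"]["scores"]["string-check"]["value"]
sentence_bleu = results["tasks"]["qa"]["metrics"]["bleu"]["scores"]["sentence"]["value"]
judge_similarity = results["tasks"]["qa"]["metrics"]["accuracy-2"]["scores"]["similarity"]["value"]

print(string_check_accuracy, sentence_bleu, judge_similarity)  # 0.83..., 1.39..., 2.5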
Tool-Calling#
The tool-calling metric evaluates the accuracy of tool/function calls generated by a model against ground truth tool calls. This metric is particularly useful for evaluating models that generate structured tool/function calls in OpenAI-compatible format. For more information, refer to Function calling.
Metric Scores#
The metric provides two scores; a conceptual sketch of the comparison follows the list.

- function_name_accuracy: Score of 1.0 if all function names match exactly (order insensitive); 0.0 otherwise.
- function_name_and_args_accuracy: Score of 1.0 if all function names match exactly (order insensitive) AND their arguments match exactly; 0.0 otherwise.
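The following Python sketch shows, for a single sample, how these two scores behave conceptually; it is illustrative only and is not the evaluator's implementation.

# Conceptual, per-sample sketch of tool-calling scoring -- not the evaluator's implementation.
import json

def score_tool_calls(predicted: list, ground_truth: list) -> tuple:
    """Return (function_name_accuracy, function_name_and_args_accuracy) for one sample."""
    # Order insensitive: compare sorted names and sorted (name, arguments) pairs.
    pred_names = sorted(c["function"]["name"] for c in predicted)
    true_names = sorted(c["function"]["name"] for c in ground_truth)

    pred_full = sorted((c["function"]["name"], json.dumps(c["function"]["arguments"], sort_keys=True))
                       for c in predicted)
    true_full = sorted((c["function"]["name"], json.dumps(c["function"]["arguments"], sort_keys=True))
                       for c in ground_truth)

    return float(pred_names == true_names), float(pred_full == true_full)

ground_truth = [{"function": {"name": "calculate_triangle_area",
                              "arguments": {"base": 10, "height": 5, "unit": "units"}}}]
print(score_tool_calls(ground_truth, ground_truth))  # (1.0, 1.0)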
Parameters#
For tool calling evaluation, you must provide:
- messages: the chat messages to pass to the model.
- tools: the list of tools to provide to the model (must be in OpenAI-compatible tools format).
- tool_choice: (optional) setting that controls whether the model calls tools and which tools it can call. For more information, refer to Tool choice.
Format Examples#
Consider the following when working with the tool-calling metric:

- All comparisons are case sensitive.
- Comparison is order insensitive, which supports parallel tool calls that the model returns out of order.
- Function names must use underscores (_) instead of dots (.) to comply with the OpenAI format.
- The custom tool-calling evaluation only supports the chat completions API. This means the task type can only be chat-completion.
The tool-calling metric requires a pointer to where the ground truths for tool_calls are located in your dataset. This is set using the tool_calls_ground_truth metric parameter.
{
"metrics": {
"my-tool-calling-accuracy": {
"type": "tool-calling",
"params": {
"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"
}
}
}
}
The tool_calls | tojson template syntax ensures proper JSON serialization of the tool calls data from your dataset.
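Assuming Jinja2-style templating (which the {{ ... | tojson }} syntax suggests; tojson is a built-in Jinja2 filter), the template renders the dataset's tool_calls list into a JSON string, as in this illustrative sketch:

# Illustrative only: shows what "{{ item.tool_calls | tojson }}" renders to,
# assuming Jinja2-style templating.
from jinja2 import Template

item = {
    "tool_calls": [
        {"function": {"name": "calculate_triangle_area",
                      "arguments": {"base": 10, "height": 5, "unit": "units"}}}
    ]
}

rendered = Template("{{ item.tool_calls | tojson }}").render(item=item)
print(rendered)  # a JSON string containing the ground-truth tool calls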
The ground truth must be formatted as an array of objects with a function property containing name and arguments. For example:
[
{
"function": {
"name": "calculate_triangle_area",
"arguments": {
"base": 10,
"height": 5,
"unit": "units"
}
}
}
]
Here’s an example of how to structure your dataset for tool calling evaluation:
[
{
"messages": [
{
"role": "user",
"content": "Find the area of a triangle with a base of 10 units and height of 5 units."
}
],
"tools": [
{
"type": "function",
"function": {
"name": "calculate_triangle_area",
"description": "Calculate the area of a triangle given its base and height.",
"parameters": {
"type": "object",
"properties": {
"base": {
"type": "integer",
"description": "The base of the triangle."
},
"height": {
"type": "integer",
"description": "The height of the triangle."
},
"unit": {
"type": "string",
"description": "The unit of measure (defaults to 'units' if not specified)"
}
},
"required": [
"base",
"height"
]
}
}
}
],
"tool_calls": [
{
"function": {
"name": "calculate_triangle_area",
"arguments": {
"base": 10,
"height": 5,
"unit": "units"
}
}
}
]
}
]
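Putting the pieces together, a task configuration for a dataset like this might look like the following sketch. The metric block matches the example earlier in this section; the template fields (messages, tools, tool_choice) and the dataset placeholder are assumptions about how a chat-completion task consumes this dataset, so verify the exact schema against the task configuration reference for your deployment.

{
  "custom-tool-calling": {
    "type": "chat-completion",
    "params": {
      "template": {
        "messages": "{{ item.messages | tojson }}",
        "tools": "{{ item.tools | tojson }}",
        "tool_choice": "auto"
      }
    },
    "metrics": {
      "tool-calling-accuracy": {
        "type": "tool-calling",
        "params": {
          "tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"
        }
      }
    },
    "dataset": {
      "files_url": "hf://datasets/default/<my-tool-calling-dataset>"
    }
  }
}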
After evaluation, the results look similar to the following:
{
"tasks": {
"custom-tool-calling": {
"metrics": {
"tool-calling-accuracy": {
"scores": {
"function_name_accuracy": {
"value": 0.3475,
"stats": {
"count": 400,
"sum": 139,
"mean": 0.3475
}
},
"function_name_and_args_accuracy": {
"value": 0.1625,
"stats": {
"count": 400,
"sum": 65,
"mean": 0.1625
}
}
}
}
}
}
}
}