Metrics for Custom Evaluation#
This section describes the available metrics for custom evaluation jobs, including configuration examples and explanations.
Metrics Matrix#
Compare metric options at a glance using the following table.

| Metric | Description | Use Cases | Score |
|---|---|---|---|
| String-Check | Compares strings using basic operations (equals, contains, and so on) | Exact-match or substring checks against a known label; outputs that are not expected to be highly creative | 1 if the check passes, 0 otherwise |
| BLEU | Compares n-gram precision between machine and reference text | Comparing generations against reference answers, such as translation or other low-creativity tasks | Sentence-level and corpus-level BLEU scores |
| LLM-Judge | Uses an LLM to evaluate outputs based on custom criteria | Open-ended or highly generative tasks that need judgment beyond string overlap | Configurable (for example, an integer rating extracted by a regex parser) |
| Tool-Calling | Validates tool/function calls against ground truth | Models that generate structured, OpenAI-compatible tool/function calls | 1.0 or 0.0 per sample for function names, and for function names plus arguments |
Similarity Metrics#
Custom evaluation with BLEU and string-check metrics is ideal for cases where the LLM generations are not expected to be highly creative. For more information, refer to Similarity Metrics Evaluation Type.
String-Check Metric#
The string-check metric allows for simple string comparisons between a reference text and the output text. Various string operations such as 'equals', 'not equals', 'contains', 'not contains', 'startswith', and 'endswith' can be performed. The operation is carried out between two templates that can use the same variables as the task templates.
A single score is provided as the output by the string-check metric:

- The score is 1 if the check is successful.
- The score is 0 otherwise.
Example Configuration with String Check Metric
{
"type": "string-check",
"params": {
"check": [
"{{sample.output_text}}",
"equals",
"{{item.label}}"
]
}
}
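Conceptually, the metric renders the two templates against each dataset row and applies the chosen operation to the resulting strings. The following Python sketch is illustrative only; it is not the evaluator's implementation, and the operation names simply mirror the list above.

# Illustrative sketch only -- not the evaluator service implementation.
def string_check(left: str, operation: str, right: str) -> int:
    """Return 1 if the check passes, 0 otherwise."""
    ops = {
        "equals": lambda a, b: a == b,
        "not equals": lambda a, b: a != b,
        "contains": lambda a, b: b in a,
        "not contains": lambda a, b: b not in a,
        "startswith": lambda a, b: a.startswith(b),
        "endswith": lambda a, b: a.endswith(b),
    }
    return int(ops[operation](left, right))

# For example, the rendered {{sample.output_text}} checked against the rendered {{item.label}}:
print(string_check("The answer is Paris.", "contains", "Paris"))  # 1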
BLEU Metric#
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human. For more information, refer to Wikipedia and IBM Documentation.
The bleu metric for custom evaluation requires the references parameter to be set, pointing to where the ground truths are located in the input dataset.
Example Configuration with BLEU Metric
{
"bleu": {
"type": "bleu",
"params": {
"references": [
"{{reference_answer}}"
]
}
}
}
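If you want to sanity-check BLEU values outside the evaluator, for example while preparing references, the sacrebleu package computes comparable sentence-level and corpus-level scores. This snippet is an assumption about local tooling (pip install sacrebleu) and is not part of the evaluation service.

# Local sanity check only -- assumes the sacrebleu package is installed.
import sacrebleu

hypotheses = ["The cat sat on the mat."]         # model outputs
references = ["The cat is sitting on the mat."]  # ground truths

# Sentence-level BLEU for a single hypothesis/reference pair.
print(sacrebleu.sentence_bleu(hypotheses[0], [references[0]]).score)

# Corpus-level BLEU over the whole set.
print(sacrebleu.corpus_bleu(hypotheses, [references]).score)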
LLM-as-a-Judge#
With LLM-as-a-Judge, an LLM can be evaluated by using another LLM as a judge. LLM-as-a-Judge can be used for any use case, even highly generative ones, but the choice of judge is crucial in getting reliable evaluations. For more information, refer to LLM-as-a-Judge.
The llm-judge metric is complex and involves prompting the LLM with a specific task, then parsing the output to extract a score. This metric allows you to specify the model to use, the template for the task, and the parser for the output.

Use the parser field to specify the type of parser to use. Currently, only regex is supported. The parser extracts the score from the output text based on a specified pattern: the first group in the pattern is used as the value for the score.

If you don't specify a parser, the output text is used as the score and is converted to int or float.
Example Configuration for LLM-as-a-Judge
{
"type": "llm-judge",
"params": {
"model": "model-name",
"template": {
"messages": [{
"role": "system",
"content": "You are an expert in evaluating the quality of responses."
}, {
"role": "user",
"content": "Please evaluate the following response: {{sample.output_text}}"
}]
},
"parser": {
"type": "regex",
"pattern": "Rating: (\\d+)"
}
}
}
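To make the parser behavior concrete, the following sketch shows how a regex pattern like the one above turns the judge output into a numeric score. It is illustrative only and not the evaluator's implementation.

# Illustrative sketch of regex score extraction.
import re

judge_output = "The response is clear and accurate. Rating: 4"
match = re.search(r"Rating: (\d+)", judge_output)

# The first capture group becomes the score, which is then converted to a number.
score = int(match.group(1)) if match else None
print(score)  # 4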
The following example launches an evaluation job that performs scoring using the string-check, BLEU, and LLM-Judge metrics. The request is shown first as a cURL command and then as an equivalent Python snippet.
curl -X "POST" "${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"namespace": "my-organization",
"target": {
"type": "model",
"model": {
"api_endpoint": {
"url": "<my-nim-url>/completions",
"model_id": "<my-model-id>"
}
}
},
"config": {
"type": "custom",
"tasks": {
"qa": {
"type": "completion",
"params": {
"template": {
"prompt": "Answer very briefly (no explanation) this question: {{question}}.\nAnswer: ",
"max_tokens": 30
}
},
"metrics": {
"accuracy": {
"type": "string-check",
"params": {
"check": [
"{{sample.output_text}}",
"contains",
"{{item.answer}}"
]
}
},
"bleu": {
"type": "bleu",
"params": {
"references": [
"{{reference_answer}}"
]
}
},
"accuracy-2": {
"type": "llm-judge",
"params": {
"model": {
"api_endpoint": {
"url": "https://<my-judge-nim-url>",
"model_id": "<my-judge-model-id>"
}
},
"template": {
"messages": [
{
"role": "system",
"content": "Your task is to evaluate the semantic similarity between two responses."
},
{
"role": "user",
"content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{sample.output_text}}.\n\n"
}
]
},
"scores": {
"similarity": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "SIMILARITY: (\\d)"
}
}
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-math-dataset>"
}
}
}
}
}'
import requests

# EVALUATOR_SERVICE_URL is assumed to be set to your evaluator service URL,
# matching ${EVALUATOR_SERVICE_URL} in the cURL example above.
data = {
"namespace": "my-organization",
"target": {
"type": "model",
"model": {
"api_endpoint": {
"url": "<my-nim-url>/completions",
"model_id": "<my-model-id>"
}
}
},
"config": {
"type": "custom",
"tasks": {
"qa": {
"type": "completion",
"params": {
"template": {
"prompt": "Answer very briefly (no explanation) this question: {{question}}.\nAnswer: ",
"max_tokens": 30
}
},
"metrics": {
"accuracy": {
"type": "string-check",
"params": {
"check": [
"{{sample.output_text}}",
"contains",
"{{item.answer}}"
]
}
},
"bleu": {
"type": "bleu",
"params": {
"references": [
"{{reference_answer}}"
]
}
},
"accuracy-2": {
"type": "llm-judge",
"params": {
"model": {
"api_endpoint": {
"url": "<my-judge-nim-url>",
"model_id": "<my-judge-model-id>"
}
},
"template": {
"messages": [
{
"role": "system",
"content": "Your task is to evaluate the semantic similarity between two responses."
},
{
"role": "user",
"content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{sample.output_text}}.\n\n"
}
]
},
"scores": {
"similarity": {
"type": "int",
"parser": {
"type": "regex",
"pattern": "SIMILARITY: (\\d)"
}
}
}
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-math-dataset>"
}
}
}
}
}
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs"
# Make the API call
response = requests.post(endpoint, json=data).json()
# Get the job_id so we can refer to it later
job_id = response['id']
print(f"Job ID: {job_id}")
Example Response
{
"created_at": "2025-03-08T01:12:40.986445",
"updated_at": "2025-03-08T01:12:40.986445",
"id": "evaluation_result-P7SA6X9b5DP9xuYXzTMvo8",
"job": "eval-EgWCHn3bosKsr5jGiKPT9w",
"tasks": {
"qa": {
"metrics": {
"accuracy": {
"scores": {
"string-check": {
"value": 0.833333333333333,
"stats": {
"count": 6,
"sum": 5,
"mean": 0.833333333333333
}
}
}
},
"bleu": {
"scores": {
"sentence": {
"value": 1.39085669840765,
"stats": {
"count": 6,
"sum": 8.34514019044593,
"mean": 1.39085669840765
}
},
"corpus": {
"value": 0
}
}
},
"accuracy-2": {
"scores": {
"similarity": {
"value": 2.5,
"stats": {
"count": 6,
"sum": 15,
"mean": 2.5
}
}
}
}
}
}
},
"groups": {},
"namespace": "default",
"custom_fields": {}
}
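Once you have a results payload like the example above parsed into a Python dictionary, you can read the individual metric values directly from the nested structure.

# `results` is assumed to hold the results payload shown above, parsed as a dict
# (for example, results = response.json() on the corresponding API response).
string_check_accuracy = results["tasks"]["qa"]["metrics"]["accuracy"]["scores"]["string-check"]["value"]
sentence_bleu = results["tasks"]["qa"]["metrics"]["bleu"]["scores"]["sentence"]["value"]
judge_similarity = results["tasks"]["qa"]["metrics"]["accuracy-2"]["scores"]["similarity"]["value"]

print(string_check_accuracy, sentence_bleu, judge_similarity)  # 0.83..., 1.39..., 2.5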
Tool-Calling#
The tool-calling metric evaluates the accuracy of tool/function calls generated by a model against ground truth tool calls. This metric is particularly useful for evaluating models that generate structured tool/function calls in OpenAI-compatible format. For more information, refer to Function calling.
Metric Scores#
The metric provides two scores; a conceptual sketch of the comparison follows the list.

- function_name_accuracy: Score of 1.0 if all function names match exactly (order insensitive); 0.0 otherwise.
- function_name_and_args_accuracy: Score of 1.0 if all function names match exactly (order insensitive) AND their arguments match exactly; 0.0 otherwise.
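The following Python sketch shows, for a single sample, how these two scores behave conceptually; it is illustrative only and is not the evaluator's implementation.

# Conceptual, per-sample sketch of tool-calling scoring -- not the evaluator's implementation.
import json

def score_tool_calls(predicted: list, ground_truth: list) -> tuple:
    """Return (function_name_accuracy, function_name_and_args_accuracy) for one sample."""
    # Order insensitive: compare sorted names and sorted (name, arguments) pairs.
    pred_names = sorted(c["function"]["name"] for c in predicted)
    true_names = sorted(c["function"]["name"] for c in ground_truth)

    pred_full = sorted((c["function"]["name"], json.dumps(c["function"]["arguments"], sort_keys=True))
                       for c in predicted)
    true_full = sorted((c["function"]["name"], json.dumps(c["function"]["arguments"], sort_keys=True))
                       for c in ground_truth)

    return float(pred_names == true_names), float(pred_full == true_full)

ground_truth = [{"function": {"name": "calculate_triangle_area",
                              "arguments": {"base": 10, "height": 5, "unit": "units"}}}]
print(score_tool_calls(ground_truth, ground_truth))  # (1.0, 1.0)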
Parameters#
For tool calling evaluation, you must provide:
- messages: the chat messages to pass to the model.
- tools: the list of tools to provide to the model (must be in OpenAI-compatible tools format).
- tool_choice: (optional) setting that controls whether the model calls tools and which tools it can call. For more information, refer to Tool choice.
Format Examples#
Consider the following when working with the tool-calling metric:

- All comparisons are case sensitive.
- Comparison is order insensitive, which supports parallel tool calls that the model returns out of order.
- Function names must use underscores (_) instead of dots (.) to comply with the OpenAI format.
- The custom tool-calling evaluation only supports the chat completions API. This means the task type can only be chat-completion.
The tool-calling metric requires a pointer to where the ground truths for tool_calls are located in your dataset. This is set using the tool_calls_ground_truth metric parameter.
{
"metrics": {
"my-tool-calling-accuracy": {
"type": "tool-calling",
"params": {
"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"
}
}
}
}
The tool_calls | tojson template syntax ensures proper JSON serialization of the tool calls data from your dataset.
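Assuming Jinja2-style templating (which the {{ ... | tojson }} syntax suggests; tojson is a built-in Jinja2 filter), the template renders the dataset's tool_calls list into a JSON string, as in this illustrative sketch:

# Illustrative only: shows what "{{ item.tool_calls | tojson }}" renders to,
# assuming Jinja2-style templating.
from jinja2 import Template

item = {
    "tool_calls": [
        {"function": {"name": "calculate_triangle_area",
                      "arguments": {"base": 10, "height": 5, "unit": "units"}}}
    ]
}

rendered = Template("{{ item.tool_calls | tojson }}").render(item=item)
print(rendered)  # a JSON string containing the ground-truth tool calls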
The ground truth must be formatted as an array of objects with a function property containing name and arguments. For example:
[
{
"function": {
"name": "calculate_triangle_area",
"arguments": {
"base": 10,
"height": 5,
"unit": "units"
}
}
}
]
Here’s an example of how to structure your dataset for tool calling evaluation:
[
{
"messages": [
{
"role": "user",
"content": "Find the area of a triangle with a base of 10 units and height of 5 units."
}
],
"tools": [
{
"type": "function",
"function": {
"name": "calculate_triangle_area",
"description": "Calculate the area of a triangle given its base and height.",
"parameters": {
"type": "object",
"properties": {
"base": {
"type": "integer",
"description": "The base of the triangle."
},
"height": {
"type": "integer",
"description": "The height of the triangle."
},
"unit": {
"type": "string",
"description": "The unit of measure (defaults to 'units' if not specified)"
}
},
"required": [
"base",
"height"
]
}
}
}
],
"tool_calls": [
{
"function": {
"name": "calculate_triangle_area",
"arguments": {
"base": 10,
"height": 5,
"unit": "units"
}
}
}
]
}
]
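Putting the pieces together, a task configuration for a dataset like this might look like the following sketch. The metric block matches the example earlier in this section; the template fields (messages, tools, tool_choice) and the dataset placeholder are assumptions about how a chat-completion task consumes this dataset, so verify the exact schema against the task configuration reference for your deployment.

{
  "custom-tool-calling": {
    "type": "chat-completion",
    "params": {
      "template": {
        "messages": "{{ item.messages | tojson }}",
        "tools": "{{ item.tools | tojson }}",
        "tool_choice": "auto"
      }
    },
    "metrics": {
      "tool-calling-accuracy": {
        "type": "tool-calling",
        "params": {
          "tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"
        }
      }
    },
    "dataset": {
      "files_url": "hf://datasets/default/<my-tool-calling-dataset>"
    }
  }
}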
After evaluation, the results look similar to the following:
{
"tasks": {
"custom-tool-calling": {
"metrics": {
"tool-calling-accuracy": {
"scores": {
"function_name_accuracy": {
"value": 0.3475,
"stats": {
"count": 400,
"sum": 139,
"mean": 0.3475
}
},
"function_name_and_args_accuracy": {
"value": 0.1625,
"stats": {
"count": 400,
"sum": 65,
"mean": 0.1625
}
}
}
}
}
}
}
}