Template Evaluation Flow#
Template evaluation provides maximum flexibility for evaluating your models on proprietary, domain-specific, or novel tasks. With template evaluation, you can bring your own datasets, define your own prompts and templates using Jinja2, and select or implement the metrics that matter most for your use case. This approach is ideal when:
You want to evaluate on tasks, data, or formats not covered by academic benchmarks.
You need to measure model performance using custom or business-specific criteria.
You want to experiment with new evaluation methodologies, metrics, or workflows.
You need to create custom prompts and templates for specific use cases.
Prerequisites#
Set up a new evaluation target or select an existing one.
Tip
For detailed information on using Jinja2 templates in your template evaluations, including template objects, syntax, and examples, see the Templating Reference.
Template Evaluation Types#
| Evaluation | Use Case | Metrics | Example |
|---|---|---|---|
| Chat/Completion Tasks | Flexible chat/completion evaluation with custom prompts and metrics | BLEU, string-check, custom metrics | Evaluate Q&A, summarization, or custom chat flows |
| Tool-Calling | Evaluate function/tool call accuracy (OpenAI-compatible) | Tool-calling accuracy | Evaluate function-calling or API tasks |
Chat/Completion Tasks#
Custom chat/completion evaluation allows you to assess model performance on flexible conversational or completion-based tasks using your own prompts, templates, and metrics. This is ideal for Q&A, summarization, or any scenario where you want to evaluate how well a model generates responses to user inputs, beyond standard academic benchmarks. You can define the structure of the conversation, specify expected outputs, and use metrics like BLEU or string-check to measure quality. The following configuration defines a qa task with BLEU, ROUGE, string-check, F1, and exact-match metrics:
{
"type": "custom",
"params": {
"parallelism": 8
},
"tasks": {
"qa": {
"type": "chat-completion",
"params": {
"template": {
"messages": [
{"role": "user", "content": "{{item.question}}"},
{"role": "assistant", "content": "{{item.answer}}"}
],
"max_tokens": 20,
"temperature": 0.7,
"top_p": 0.9
}
},
"metrics": {
"bleu": {
"type": "bleu",
"params": {
"references": ["{{item.reference_answer | trim}}"]
}
},
"rouge": {
"type": "rouge",
"params": {
"ground_truth": "{{item.reference_answer | trim}}"
}
},
"string-check": {
"type": "string-check",
"params": {
"check": [
"{{item.reference_answer | trim}}",
"equals",
"{{output_text | trim}}"
]
}
},
"f1": {
"type": "f1",
"params": {
"ground_truth": "{{item.reference_answer | trim}}"
}
},
"em": {
"type": "em",
"params": {
"ground_truth": "{{item.reference_answer | trim}}"
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}
"question","answer","reference_answer"
"What is the capital of France?","Paris","The answer is Paris"
"What is 2+2?","4","The answer is 4"
"Square root of 256?","16","The answer is 16"
Example results:
{
"tasks": {
"qa": {
"metrics": {
"bleu": {
"scores": {
"sentence": {
"value": 32.3,
"stats": {
"count": 200,
"sum": 6460.66,
"mean": 32.3
}
},
"corpus": {
"value": 14.0
}
}
},
"rouge": {
"scores": {
"rouge_1_score": {
"value": 0.238671638808714,
"stats": {
"count": 10,
"sum": 2.38671638808714,
"mean": 0.238671638808714
}
},
"rouge_2_score": {
"value": 0.14953146173038,
"stats": {
"count": 10,
"sum": 1.4953146173038,
"mean": 0.14953146173038
}
},
"rouge_3_score": {
"value": 0.118334587614537,
"stats": {
"count": 10,
"sum": 1.18334587614537,
"mean": 0.118334587614537
}
},
"rouge_L_score": {
"value": 0.198059156106409,
"stats": {
"count": 10,
"sum": 1.98059156106409,
"mean": 0.198059156106409
}
}
}
},
"string-check": {
"scores": {
"string-check": {
"value": 0.255,
"stats": {
"count": 200,
"sum": 51.0,
"mean": 0.255
}
}
}
},
"f1": {
"scores": {
"f1_score": {
"value": 0.226293156870275,
"stats": {
"count": 10,
"sum": 2.26293156870275,
"mean": 0.226293156870275
}
}
}
},
"em": {
"scores": {
"em_score": {
"value": 0,
"stats": {
"count": 10,
"sum": 0,
"mean": 0
}
}
}
}
}
}
}
}
Tool-Calling#
Tool-calling evaluation measures the accuracy of function/tool calls against ground-truth calls and supports the OpenAI-compatible function-calling format. The following configuration defines a tool-calling accuracy metric over a custom dataset:
{
"type": "custom",
"name": "my-configuration-tool-calling-1",
"namespace": "my-organization",
"tasks": {
"my-tool-calling-task": {
"type": "chat-completion",
"params": {
"template": {
"messages": [
{"role": "user", "content": "{{item.messages[0].content}}"}
]
}
},
"metrics": {
"tool-calling-accuracy": {
"type": "tool-calling",
"params": {
"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}
Each dataset record supplies the conversation in messages and the ground-truth calls in tool_calls, for example:
{
"messages": [
{"role": "user", "content": "Book a table for 2 at 7pm."},
{"role": "assistant", "content": "Booking a table...", "tool_calls": [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]}
],
"tool_calls": [
{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}
]
}
Example results:
{
"tasks": {
"my-tool-calling-task": {
"metrics": {
"tool-calling-accuracy": {
"scores": {
"function_name_accuracy": {
"value": 1.0
},
"function_name_and_args_accuracy": {
"value": 1.0
}
}
}
}
}
}
}
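For intuition about the two scores above, here is a minimal Python sketch of how they could be computed from predicted and ground-truth calls. It is illustrative only and assumes calls are compared positionally; the service's actual implementation may differ.

def tool_call_scores(predicted, ground_truth):
    # Fraction of ground-truth calls whose function name matches, and whose
    # name and arguments both match. Illustrative logic only.
    if not ground_truth:
        return {"function_name_accuracy": 1.0, "function_name_and_args_accuracy": 1.0}
    name_hits = args_hits = 0
    for expected, actual in zip(ground_truth, predicted):
        if expected["function"]["name"] == actual["function"]["name"]:
            name_hits += 1
            if expected["function"]["arguments"] == actual["function"]["arguments"]:
                args_hits += 1
    total = len(ground_truth)
    return {
        "function_name_accuracy": name_hits / total,
        "function_name_and_args_accuracy": args_hits / total,
    }

predicted = [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]
expected = [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]
print(tool_call_scores(predicted, expected))
# {'function_name_accuracy': 1.0, 'function_name_and_args_accuracy': 1.0}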
Templating for Tasks#
This section explains how to use Jinja2 templates for prompts and tasks in template evaluation jobs.
Available Template Objects#
When rendering templates, two default objects are available:
item: Represents the current item from the dataset.
sample: Contains data related to the output from the model. sample.output_text represents the completion text for completion models and the content of the first message for chat models.
The properties on the item object are derived from the dataset's column names (for CSVs) or keys (for JSONs):
All non-alphanumeric characters are replaced with underscores.
Column names are converted to lowercase.
In case of conflicts, suffixes (_1, _2, etc.) are appended to the property names (see the sketch below).
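A minimal Python sketch of these naming rules, for illustration only (the service's exact implementation may differ):

import re

def to_property_names(columns):
    # Replace non-alphanumeric characters with underscores, lowercase,
    # and append _1, _2, ... when two columns collapse to the same name.
    seen = {}
    names = []
    for column in columns:
        name = re.sub(r"[^0-9a-zA-Z]", "_", column).lower()
        if name in seen:
            seen[name] += 1
            name = f"{name}_{seen[name]}"
        else:
            seen[name] = 0
        names.append(name)
    return names

print(to_property_names(["Question", "Reference Answer", "reference-answer"]))
# ['question', 'reference_answer', 'reference_answer_1']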
Templates for Chat Models#
Prompt templates structure tasks for evaluating model performance and follow the NIM/OpenAI format for chat-completion tasks. Templates use Jinja2 syntax; variables are written in double curly brackets, for example, {{item.review}}.
Example Template for Chat-Completion Task#
{
"messages": [{
"role": "system",
"content": "You are an expert in analyzing the sentiment of movie reviews."
}, {
"role": "user",
"content": "Determine if the following review is positive or negative: {{item.review}}"
}]
}
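To see how such a template resolves at evaluation time, here is a minimal sketch using the jinja2 Python package; the dataset row is hypothetical and the rendering code is illustrative, not part of the evaluation service.

from jinja2 import Template

# Hypothetical dataset row; a "review" column becomes item.review.
item = {"review": "A beautifully shot film with a forgettable plot."}

user_content = Template(
    "Determine if the following review is positive or negative: {{item.review}}"
).render(item=item)
print(user_content)
# Determine if the following review is positive or negative: A beautifully shot film with a forgettable plot.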
Simple Chat Templating#
If your custom data is structured as prompt and ideal_response, you can present it as a single-turn chat.
{
"messages": [{
"role": "system",
"content": "You are an expert in analyzing the sentiment of movie reviews."
}, {
"role": "user",
"content": "Determine if the following review is positive or negative: {{item.prompt}}"
}]
}
You can include this template in a call to a /chat/completions endpoint:
{
"config": {
"type": "custom",
"tasks": {
"qa": {
"type": "completion",
"params": {
"template": {
"messages": [{
"role": "system",
"content": "You are a helpful, respectful and honest assistant. \nExtract from the following context the minimal span word for word that best answers the question.\n."
}, {
"role": "user",
"content": "Context: {{item.prompt}}"
}]
}
},
"metrics": {
"accuracy": {
"type": "string-check",
"params": {
"check": [
"{{sample.output_text}}",
"contains",
"{{item.ideal_response}}"
]
}
}
},
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
}
}
}
},
"target": {
"type": "model",
"model": {
"api_endpoint": {
"url": "<my-nim-url>/v1/chat/completions",
"model_id": "<my-model-id>"
}
}
}
}
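The accuracy metric above scores 1 when the model output contains the ideal response. As a rough illustration of the check triple (value, operator, reference), here is a hedged Python sketch; the actual operator semantics (for example, case handling) are defined by the service and may differ.

def string_check(value: str, operator: str, reference: str) -> int:
    # Assumed semantics: "equals" is exact string equality,
    # "contains" is a substring test of reference inside value.
    if operator == "equals":
        return int(value == reference)
    if operator == "contains":
        return int(reference in value)
    raise ValueError(f"unsupported operator: {operator}")

print(string_check("The minimal span is Paris", "contains", "Paris"))  # 1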
Messages Data Template#
If your custom data is already formatted as JSON, you can configure your template similar to the following:
{
"messages": "{{ item.messages | tojson }}"
}
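For reference, a minimal sketch of what the tojson filter produces, using the jinja2 Python package and a hypothetical dataset row whose messages key already holds a chat transcript:

from jinja2 import Template

item = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Book a table for 2 at 7pm."},
    ]
}

# Renders the list as a JSON string suitable for the messages field,
# e.g. [{"role": "system", ...}, {"role": "user", ...}]
print(Template("{{ item.messages | tojson }}").render(item=item))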
Metrics#
Template evaluation supports a wide range of metrics for different evaluation scenarios:
| Metric | Description | Range | Use Case | Key Parameters |
|---|---|---|---|---|
| bleu | Computes BLEU; 100 represents a perfect match; higher is better. | 0.0–100.0 | Translation, summarization | references |
| rouge | Computes ROUGE scores; higher is better. | 0.0–1.0 | Summarization, text generation | ground_truth |
| string-check | Compares generated text to a reference and returns 0 or 1. | 0.0–1.0 | Q&A, classification | check |
| f1 | Computes F1 score per item and corpus; higher indicates greater similarity. | 0.0–1.0 | Classification, Q&A | ground_truth |
| em | Exact Match after normalization (case-insensitive, punctuation/articles removed, whitespace normalized). | 0.0–1.0 | Q&A, classification | ground_truth |
| | Parses the last number and compares to a reference using numeric ops or tolerance. | 0.0–1.0 | Extraction, math, structured outputs | |
| tool-calling | Evaluates correctness of function/tool calls (names and arguments). | 0.0–1.0 | Function calling evaluation | tool_calls_ground_truth |
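To make the exact-match normalization concrete, here is a minimal Python sketch that follows the rules in the table (lowercase, strip punctuation and English articles, collapse whitespace); the service's exact implementation may differ.

import re
import string

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, remove English articles, collapse whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> int:
    return int(normalize(prediction) == normalize(ground_truth))

print(exact_match("Paris.", "the Paris"))  # 1: both normalize to "paris"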
Custom Dataset Format#
Template evaluation supports custom datasets in various formats:
CSV files: Simple tabular data with headers
JSON files: Structured data with nested objects
JSONL files: Line-delimited JSON objects
The dataset format depends on your specific use case and the template structure you’re using. For detailed examples, see the configuration examples above.
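For instance, the CSV rows from the chat/completion example above could equivalently be supplied as JSONL, one object per line:

{"question": "What is the capital of France?", "answer": "Paris", "reference_answer": "The answer is Paris"}
{"question": "What is 2+2?", "answer": "4", "reference_answer": "The answer is 4"}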