Template Evaluation Flow#

Template evaluation provides maximum flexibility for evaluating your models on proprietary, domain-specific, or novel tasks. With template evaluation, you can bring your own datasets, define your own prompts and templates using Jinja2, and select or implement the metrics that matter most for your use case. This approach is ideal when:

  • You want to evaluate on tasks, data, or formats not covered by academic benchmarks.

  • You need to measure model performance using custom or business-specific criteria.

  • You want to experiment with new evaluation methodologies, metrics, or workflows.

  • You need to create custom prompts and templates for specific use cases.

Prerequisites#

Tip

For detailed information on using Jinja2 templates in your template evaluations, including template objects, syntax, and examples, see the Templating Reference.


Template Evaluation Types#

Template Evaluation Comparison#

| Evaluation | Use Case | Metrics | Example |
| --- | --- | --- | --- |
| Chat/Completion Tasks | Flexible chat/completion evaluation with custom prompts and metrics | BLEU, string-check, custom metrics | Evaluate Q&A, summarization, or custom chat flows |
| Tool-Calling | Evaluate function/tool call accuracy (OpenAI-compatible) | Tool-calling accuracy | Evaluate function-calling or API tasks |


Chat/Completion Tasks#

Custom chat/completion evaluation allows you to assess model performance on flexible conversational or completion-based tasks using your own prompts, templates, and metrics. This is ideal for Q&A, summarization, or any scenario where you want to evaluate how well a model generates responses to user inputs, beyond standard academic benchmarks. You can define the structure of the conversation, specify expected outputs, and use metrics like BLEU or string-check to measure quality.

Example configuration:

{
  "type": "custom",
  "params": {
    "parallelism": 8
  },
  "tasks": {
    "qa": {
      "type": "chat-completion",
      "params": {
        "template": {
          "messages": [
            {"role": "user", "content": "{{item.question}}"},
            {"role": "assistant", "content": "{{item.answer}}"}
          ],
          "max_tokens": 20,
          "temperature": 0.7,
          "top_p": 0.9
        }
      },
      "metrics": {
        "bleu": {
          "type": "bleu",
          "params": {
            "references": ["{{item.reference_answer | trim}}"]
          }
        },
        "rouge": {
          "type": "rouge",
          "params": {
            "ground_truth": "{{item.reference_answer | trim}}"
          }
        },
        "string-check": {
          "type": "string-check",
          "params": {
            "check": [
              "{{item.reference_answer | trim}}",
              "equals",
              "{{output_text | trim}}"
            ]
          }
        },
        "f1": {
            "type": "f1",
            "params": {
                "ground_truth": "{{item.reference_answer | trim}}"
            }
        },
        "em": {
            "type": "em",
            "params": {
                "ground_truth": "{{item.reference_answer | trim}}"
            }
        }
      },
      "dataset": {
        "files_url": "hf://datasets/default/<my-dataset>"
      }
    }
  }
}
"question","answer","reference_answer"
"What is the capital of France?","Paris","The answer is Paris"
"What is 2+2?","4","The answer is 4"
"Square root of 256?","16","The answer is 16"

Example results:

{
  "tasks": {
    "qa": {
      "metrics": {
        "bleu": {
          "scores": {
            "sentence": {
              "value": 32.3,
              "stats": {
                "count": 200,
                "sum": 6460.66,
                "mean": 32.3
              }
            },
            "corpus": {
              "value": 14.0
            }
          }
        },
        "rouge": {
          "scores": {
            "rouge_1_score": {
              "value": 0.238671638808714,
              "stats": {
                "count": 10,
                "sum": 2.38671638808714,
                "mean": 0.238671638808714
              }
            },
            "rouge_2_score": {
              "value": 0.14953146173038,
              "stats": {
                "count": 10,
                "sum": 1.4953146173038,
                "mean": 0.14953146173038
              }
            },
            "rouge_3_score": {
              "value": 0.118334587614537,
              "stats": {
                "count": 10,
                "sum": 1.18334587614537,
                "mean": 0.118334587614537
              }
            },
            "rouge_L_score": {
              "value": 0.198059156106409,
              "stats": {
                "count": 10,
                "sum": 1.98059156106409,
                "mean": 0.198059156106409
              }
            }
          }
        },
        "string-check": {
          "scores": {
            "string-check": {
              "value": 0.255,
              "stats": {
                "count": 200,
                "sum": 51.0,
                "mean": 0.255
              }
            }
          }
        },
        "f1": {
          "scores": {
            "f1_score": {
              "value": 0.226293156870275,
              "stats": {
                "count": 10,
                "sum": 2.26293156870275,
                "mean": 0.226293156870275
              }
            }
          }
        },
        "em": {
          "scores": {
            "em_score": {
              "value": 0,
              "stats": {
                "count": 10,
                "sum": 0,
                "mean": 0
              }
            }
          }
        }
      }
    }
  }
}

Tool-Calling#

Tool-calling evaluation measures how accurately the model generates function/tool calls by comparing them against ground-truth calls. It supports the OpenAI-compatible function-calling format.

Example configuration:

{
    "type": "custom",
    "name": "my-configuration-tool-calling-1",
    "namespace": "my-organization",
    "tasks": {
        "my-tool-calling-task": {
            "type": "chat-completion",
            "params": {
                "template": {
                    "messages": [
                        {"role": "user", "content": "{{item.messages[0].content}}"}
                    ]
                }
            },
            "metrics": {
                "tool-calling-accuracy": {
                    "type": "tool-calling",
                    "params": {
                        "tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"
                    }
                }
            },
            "dataset": {
                "files_url": "hf://datasets/default/<my-dataset>"
            }
        }
    }
}

Example dataset record (JSONL):

{
  "messages": [
    {"role": "user", "content": "Book a table for 2 at 7pm."},
    {"role": "assistant", "content": "Booking a table...", "tool_calls": [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]}
  ],
  "tool_calls": [
    {"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}
  ]
}

Example results:

{
  "tasks": {
    "my-tool-calling-task": {
      "metrics": {
        "tool-calling-accuracy": {
          "scores": {
            "function_name_accuracy": {
              "value": 1.0
            },
            "function_name_and_args_accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Templating for Tasks#

This section explains how to use Jinja2 templates for prompts and tasks in template evaluation jobs.

Available Template Objects#

When rendering templates, two default objects are available:

  • item: Represents the current item from the dataset.

  • sample: Contains data related to the model’s output. sample.output_text holds the completion text for completion models and the content of the first response message for chat models.

The properties on the item object are derived from the dataset’s column names (for CSV files) or keys (for JSON files), as shown in the example after this list:

  • All non-alphanumeric characters are replaced with underscores.

  • Column names are converted to lowercase.

  • In case of naming conflicts, suffixes (_1, _2, and so on) are appended to the property names.
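
For example, a dataset column named Question Text (a hypothetical name) becomes the property item.question_text, which you can reference in a chat template:

{
    "messages": [
        {"role": "user", "content": "{{item.question_text}}"}
    ]
}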

Templates for Chat Models#

Prompt templates structure chat-completion tasks for evaluating model performance, following the NIM/OpenAI message format. Templates use the Jinja2 templating syntax; variables are written in double curly brackets, for example, {{item.review}}.

Example Template for Chat-Completion Task#

{
    "messages": [{
        "role": "system",
        "content": "You are an expert in analyzing the sentiment of movie reviews."
    }, {
        "role": "user",
        "content": "Determine if the following review is positive or negative: {{item.review}}"
    }]
}
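
For example, if item.review is "A suspenseful, well-acted thriller" (a made-up review), the rendered user message sent to the model is:

{
    "role": "user",
    "content": "Determine if the following review is positive or negative: A suspenseful, well-acted thriller"
}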

Simple Chat Templating#

If your custom data contains prompt and ideal_response fields, you can format it as a single-turn chat.

{ 
    "messages": [{
        "role": "system", 
        "content": "You are an expert in analyzing the sentiment of movie reviews."
    }, { 
        "role": "user", 
        "content": "Determine if the following review is positive or negative: {{item.prompt}}"
    }] 
} 

You can include this template in a call to a /chat/completions endpoint.

   {
    "config": {
      "type": "custom",
      "tasks": {
        "qa": {
          "type": "completion",
          "params": {
            "template": {
              "messages": [{
                "role": "system",
                "content": "You are a helpful, respectful and honest assistant. \nExtract from the following context the minimal span word for word that best answers the question.\n."
              }, { 
                "role": "user",
                "content": "Context: {{item.prompt}}"
              }] 
            }
          },
          "metrics": {
            "accuracy": {
              "type": "string-check",
              "params": {
                "check": [
                  "{{sample.output_text}}",
                  "contains",
                  "{{item.ideal_response}}"
                ]
              }
            }
          },
          "dataset": {
            "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
          }
        }
      }
    },
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "<my-nim-url>/v1/chat/completions",
          "model_id": "<my-model-id>"
        }
      }
    }
  }

Messages Data Template#

If your custom data already stores the chat messages as JSON, you can configure your template to pass them through directly:

{
    "messages": "{{ item.messages | tojson }}"
}
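
With this template, each dataset item must provide its own messages list. A hypothetical JSONL record could look like this:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Summarize this review in one sentence."}]}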

Metrics#

Template evaluation supports a wide range of metrics for different evaluation scenarios:

Template Evaluation Metrics#

| Metric | Description | Range | Use Case | Key Parameters |
| --- | --- | --- | --- | --- |
| bleu | Computes BLEU; 100 represents a perfect match; higher is better. | 0.0–100.0 | Translation, summarization | references (templated list); optional candidate (else uses output) |
| rouge | Computes ROUGE scores; higher is better. | 0.0–1.0 | Summarization, text generation | ground_truth; optional prediction_ref (else uses output) |
| string-check | Compares generated text to a reference and returns 0 or 1. | 0.0–1.0 | Q&A, classification | check = [left, op, right]; ops: equals, !=, contains, startswith, endswith. Example: ["{{item.reference_answer \| trim}}", "equals", "{{output_text \| trim}}"] |
| f1 | Computes F1 score per item and corpus; higher indicates greater similarity. | 0.0–1.0 | Classification, Q&A | ground_truth; optional prediction_ref (else uses output) |
| em | Exact match after normalization (case-insensitive; punctuation and articles removed; whitespace normalized). | 0.0–1.0 | Q&A, classification | ground_truth; optional prediction_ref (else uses output) |
| number-check | Parses the last number in the output and compares it to a reference using numeric ops or a tolerance. | 0.0–1.0 | Extraction, math, structured outputs | check = [left, op, right]; ops: ==, !=, >=, >, <=, <. Tolerance form: ["absolute difference", left, right, "epsilon", <number>] |
| tool-calling-accuracy | Evaluates correctness of function/tool calls (names and arguments). | 0.0–1.0 | Function calling evaluation | tool_calls_ground_truth (templated), for example: "{{ item.tool_calls \| tojson }}" |
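
The number-check metric does not appear in the configuration examples above. The following is a minimal sketch of its metrics entry, assuming a hypothetical reference_number column in the dataset; it plugs into a task's metrics block just like the metrics in the chat/completion example:

"metrics": {
    "number-check": {
        "type": "number-check",
        "params": {
            "check": [
                "{{output_text | trim}}",
                "==",
                "{{item.reference_number}}"
            ]
        }
    }
}

To tolerate small numeric deviations, substitute the tolerance form from the table, for example ["absolute difference", "{{output_text | trim}}", "{{item.reference_number}}", "epsilon", 0.01].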

Custom Dataset Format#

Template evaluation supports custom datasets in various formats:

  • CSV files: Simple tabular data with headers

  • JSON files: Structured data with nested objects

  • JSONL files: Line-delimited JSON objects

The dataset format depends on your specific use case and the template structure you’re using. For detailed examples, see the configuration examples above.
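
For example, the CSV dataset from the chat/completion example could equivalently be supplied as a JSONL file with one object per line:

{"question": "What is the capital of France?", "answer": "Paris", "reference_answer": "The answer is Paris"}
{"question": "What is 2+2?", "answer": "4", "reference_answer": "The answer is 4"}
{"question": "Square root of 256?", "answer": "16", "reference_answer": "The answer is 16"}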