Template Evaluation Flow#
Template evaluation provides maximum flexibility for evaluating your models on proprietary, domain-specific, or novel tasks. With template evaluation, you can bring your own datasets, define your own prompts and templates using Jinja2, and select or implement the metrics that matter most for your use case. This approach is ideal when:
You want to evaluate on tasks, data, or formats not covered by academic benchmarks.
You need to measure model performance using custom or business-specific criteria.
You want to experiment with new evaluation methodologies, metrics, or workflows.
You need to create custom prompts and templates for specific use cases.
Prerequisites#
Set up a new evaluation target or select an existing one.
Tip
For detailed information on using Jinja2 templates in your template evaluations, including template objects, syntax, and examples, see the Templating Reference.
Template Evaluation Types#
| Evaluation | Use Case | Metrics | Example |
|---|---|---|---|
| Chat/Completion Tasks | Flexible chat/completion evaluation with custom prompts and metrics | BLEU, string-check, custom metrics | Evaluate Q&A, summarization, or custom chat flows |
| Tool-Calling | Evaluate function/tool call accuracy (OpenAI-compatible) | Tool-calling accuracy | Evaluate function-calling or API tasks |
Chat/Completion Tasks#
Custom chat/completion evaluation allows you to assess model performance on flexible conversational or completion-based tasks using your own prompts, templates, and metrics. This is ideal for Q&A, summarization, or any scenario where you want to evaluate how well a model generates responses to user inputs, beyond standard academic benchmarks. You can define the structure of the conversation, specify expected outputs, and use metrics like BLEU or string-check to measure quality. The following configuration defines a qa task with BLEU, ROUGE, string-check, F1, and exact-match metrics:
{
"type": "custom",
"params": {
"parallelism": 8
},
"tasks": {
"qa": {
"type": "chat-completion",
"params": {
"template": {
"messages": [
{"role": "user", "content": "{{item.question}}"},
{"role": "assistant", "content": "{{item.answer}}"}
],
"max_tokens": 20,
"temperature": 0.7,
"top_p": 0.9
}
},
"metrics": {
"bleu": {
"type": "bleu",
"params": {
"references": ["{{item.reference_answer | trim}}"]
}
},
"rouge": {
"type": "rouge",
"params": {
"ground_truth": "{{item.reference_answer | trim}}"
}
},
"string-check": {
"type": "string-check",
"params": {
"check": [
"{{item.reference_answer | trim}}",
"equals",
"{{output_text | trim}}"
]
}
},
"f1": {
"type": "f1",
"params": {
"ground_truth": "{{item.reference_answer | trim}}"
}
},
"em": {
"type": "em",
"params": {
"ground_truth": "{{item.reference_answer | trim}}"
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}
"question","answer","reference_answer"
"What is the capital of France?","Paris","The answer is Paris"
"What is 2+2?","4","The answer is 4"
"Square root of 256?","16","The answer is 16"
Example results:
{
"tasks": {
"qa": {
"metrics": {
"bleu": {
"scores": {
"sentence": {
"value": 32.3,
"stats": {
"count": 200,
"sum": 6460.66,
"mean": 32.3
}
},
"corpus": {
"value": 14.0
}
}
},
"rouge": {
"scores": {
"rouge_1_score": {
"value": 0.238671638808714,
"stats": {
"count": 10,
"sum": 2.38671638808714,
"mean": 0.238671638808714
}
},
"rouge_2_score": {
"value": 0.14953146173038,
"stats": {
"count": 10,
"sum": 1.4953146173038,
"mean": 0.14953146173038
}
},
"rouge_3_score": {
"value": 0.118334587614537,
"stats": {
"count": 10,
"sum": 1.18334587614537,
"mean": 0.118334587614537
}
},
"rouge_L_score": {
"value": 0.198059156106409,
"stats": {
"count": 10,
"sum": 1.98059156106409,
"mean": 0.198059156106409
}
}
}
},
"string-check": {
"scores": {
"string-check": {
"value": 0.255,
"stats": {
"count": 200,
"sum": 51.0,
"mean": 0.255
}
}
}
},
"f1": {
"scores": {
"f1_score": {
"value": 0.226293156870275,
"stats": {
"count": 10,
"sum": 2.26293156870275,
"mean": 0.226293156870275
}
}
}
},
"em": {
"scores": {
"em_score": {
"value": 0,
"stats": {
"count": 10,
"sum": 0,
"mean": 0
}
}
}
}
}
}
}
}
Tool-Calling#
Tool-calling evaluation measures the accuracy of function/tool calls against ground-truth calls and supports the OpenAI-compatible function-calling format. The following configuration defines a tool-calling accuracy metric over a custom dataset:
{
"type": "custom",
"name": "my-configuration-tool-calling-1",
"namespace": "my-organization",
"tasks": {
"my-tool-calling-task": {
"type": "chat-completion",
"params": {
"template": {
"messages": [
{"role": "user", "content": "{{item.messages[0].content}}"}
]
}
},
"metrics": {
"tool-calling-accuracy": {
"type": "tool-calling",
"params": {
"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"
}
}
},
"dataset": {
"files_url": "hf://datasets/default/<my-dataset>"
}
}
}
}
Each dataset record supplies the conversation in messages and the ground-truth calls in tool_calls, for example:
{
"messages": [
{"role": "user", "content": "Book a table for 2 at 7pm."},
{"role": "assistant", "content": "Booking a table...", "tool_calls": [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]}
],
"tool_calls": [
{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}
]
}
Example results:
{
"tasks": {
"my-tool-calling-task": {
"metrics": {
"tool-calling-accuracy": {
"scores": {
"function_name_accuracy": {
"value": 1.0
},
"function_name_and_args_accuracy": {
"value": 1.0
}
}
}
}
}
}
}
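For intuition about the two scores above, here is a minimal Python sketch of how they could be computed from predicted and ground-truth calls. It is illustrative only and assumes calls are compared positionally; the service's actual implementation may differ.

def tool_call_scores(predicted, ground_truth):
    # Fraction of ground-truth calls whose function name matches, and whose
    # name and arguments both match. Illustrative logic only.
    if not ground_truth:
        return {"function_name_accuracy": 1.0, "function_name_and_args_accuracy": 1.0}
    name_hits = args_hits = 0
    for expected, actual in zip(ground_truth, predicted):
        if expected["function"]["name"] == actual["function"]["name"]:
            name_hits += 1
            if expected["function"]["arguments"] == actual["function"]["arguments"]:
                args_hits += 1
    total = len(ground_truth)
    return {
        "function_name_accuracy": name_hits / total,
        "function_name_and_args_accuracy": args_hits / total,
    }

predicted = [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]
expected = [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]
print(tool_call_scores(predicted, expected))
# {'function_name_accuracy': 1.0, 'function_name_and_args_accuracy': 1.0}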
Templating for Tasks#
This section explains how to use Jinja2 templates for prompts and tasks in template evaluation jobs.
Available Template Objects#
When rendering templates, two default objects are available:
item: Represents the current item from the dataset.
sample: Contains data related to the output from the model. sample.output_text represents the completion text for completion models and the content of the first message for chat models.
The properties on the item object are derived from the dataset's column names (for CSVs) or keys (for JSONs):
All non-alphanumeric characters are replaced with underscores.
Column names are converted to lowercase.
In case of conflicts, suffixes (_1, _2, etc.) are appended to the property names (see the sketch below).
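A minimal Python sketch of these naming rules, for illustration only (the service's exact implementation may differ):

import re

def to_property_names(columns):
    # Replace non-alphanumeric characters with underscores, lowercase,
    # and append _1, _2, ... when two columns collapse to the same name.
    seen = {}
    names = []
    for column in columns:
        name = re.sub(r"[^0-9a-zA-Z]", "_", column).lower()
        if name in seen:
            seen[name] += 1
            name = f"{name}_{seen[name]}"
        else:
            seen[name] = 0
        names.append(name)
    return names

print(to_property_names(["Question", "Reference Answer", "reference-answer"]))
# ['question', 'reference_answer', 'reference_answer_1']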
Templates for Chat Models#
Prompt templates structure tasks for evaluating model performance and follow the NIM/OpenAI format for chat-completion tasks. Templates use Jinja2 syntax; variables are written in double curly brackets, for example, {{item.review}}.
Example Template for Chat-Completion Task#
{
"messages": [{
"role": "system",
"content": "You are an expert in analyzing the sentiment of movie reviews."
}, {
"role": "user",
"content": "Determine if the following review is positive or negative: {{item.review}}"
}]
}
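To see how such a template resolves at evaluation time, here is a minimal sketch using the jinja2 Python package; the dataset row is hypothetical and the rendering code is illustrative, not part of the evaluation service.

from jinja2 import Template

# Hypothetical dataset row; a "review" column becomes item.review.
item = {"review": "A beautifully shot film with a forgettable plot."}

user_content = Template(
    "Determine if the following review is positive or negative: {{item.review}}"
).render(item=item)
print(user_content)
# Determine if the following review is positive or negative: A beautifully shot film with a forgettable plot.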
Simple Chat Templating#
If your custom data is structured as prompt and ideal_response, you can present it as a single-turn chat.
{
"messages": [{
"role": "system",
"content": "You are an expert in analyzing the sentiment of movie reviews."
}, {
"role": "user",
"content": "Determine if the following review is positive or negative: {{item.prompt}}"
}]
}
You can include this template in a call to a /chat/completions endpoint:
{
"config": {
"type": "custom",
"tasks": {
"qa": {
"type": "completion",
"params": {
"template": {
"messages": [{
"role": "system",
"content": "You are a helpful, respectful and honest assistant. \nExtract from the following context the minimal span word for word that best answers the question.\n."
}, {
"role": "user",
"content": "Context: {{item.prompt}}"
}]
}
},
"metrics": {
"accuracy": {
"type": "string-check",
"params": {
"check": [
"{{sample.output_text}}",
"contains",
"{{item.ideal_response}}"
]
}
}
},
"dataset": {
"files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
}
}
}
},
"target": {
"type": "model",
"model": {
"api_endpoint": {
"url": "<my-nim-url>/v1/chat/completions",
"model_id": "<my-model-id>"
}
}
}
}
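The accuracy metric above scores 1 when the model output contains the ideal response. As a rough illustration of the check triple (value, operator, reference), here is a hedged Python sketch; the actual operator semantics (for example, case handling) are defined by the service and may differ.

def string_check(value: str, operator: str, reference: str) -> int:
    # Assumed semantics: "equals" is exact string equality,
    # "contains" is a substring test of reference inside value.
    if operator == "equals":
        return int(value == reference)
    if operator == "contains":
        return int(reference in value)
    raise ValueError(f"unsupported operator: {operator}")

print(string_check("The minimal span is Paris", "contains", "Paris"))  # 1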
Messages Data Template#
If your custom data is already formatted as JSON, you can configure your template similar to the following:
{
"messages": "{{ item.messages | tojson }}"
}
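For reference, a minimal sketch of what the tojson filter produces, using the jinja2 Python package and a hypothetical dataset row whose messages key already holds a chat transcript:

from jinja2 import Template

item = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Book a table for 2 at 7pm."},
    ]
}

# Renders the list as a JSON string suitable for the messages field,
# e.g. [{"role": "system", ...}, {"role": "user", ...}]
print(Template("{{ item.messages | tojson }}").render(item=item))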
Metrics#
Template evaluation supports a wide range of metrics for different evaluation scenarios:
| Metric | Description | Range | Use Case | Key Parameters |
|---|---|---|---|---|
| bleu | Computes BLEU; 100 represents a perfect match; higher is better. | 0.0–100.0 | Translation, summarization | references |
| rouge | Computes ROUGE scores; higher is better. | 0.0–1.0 | Summarization, text generation | ground_truth |
| string-check | Compares generated text to a reference and returns 0 or 1. | 0.0–1.0 | Q&A, classification | check |
| f1 | Computes F1 score per item and corpus; higher indicates greater similarity. | 0.0–1.0 | Classification, Q&A | ground_truth |
| em | Exact Match after normalization (case-insensitive, punctuation/articles removed, whitespace normalized). | 0.0–1.0 | Q&A, classification | ground_truth |
| | Parses the last number and compares to a reference using numeric ops or tolerance. | 0.0–1.0 | Extraction, math, structured outputs | |
| tool-calling | Evaluates correctness of function/tool calls (names and arguments). | 0.0–1.0 | Function calling evaluation | tool_calls_ground_truth |
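To make the exact-match normalization concrete, here is a minimal Python sketch that follows the rules in the table (lowercase, strip punctuation and English articles, collapse whitespace); the service's exact implementation may differ.

import re
import string

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, remove English articles, collapse whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> int:
    return int(normalize(prediction) == normalize(ground_truth))

print(exact_match("Paris.", "the Paris"))  # 1: both normalize to "paris"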
Custom Dataset Format#
Template evaluation supports custom datasets in various formats:
CSV files: Simple tabular data with headers
JSON files: Structured data with nested objects
JSONL files: Line-delimited JSON objects
The dataset format depends on your specific use case and the template structure you’re using. For detailed examples, see the configuration examples above.
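For instance, the CSV rows from the chat/completion example above could equivalently be supplied as JSONL, one object per line:

{"question": "What is the capital of France?", "answer": "Paris", "reference_answer": "The answer is Paris"}
{"question": "What is 2+2?", "answer": "4", "reference_answer": "The answer is 4"}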