Agent Evaluation#
The VSS Agent includes a comprehensive evaluation framework with three specialized evaluators, each targeting a different aspect of agent behavior. The evaluators support customizable prompts and configurable metrics, and work with both Blueprints and Developer Profiles.
The evaluation is based on the NVIDIA NeMo Agent Toolkit (NAT), specifically the Evaluate Workflows module.
Evaluators#
The evaluation includes three customized evaluators, each targeting a specific aspect of agent behavior.
Report Evaluator: Assesses the quality of agent-generated reports with fine-grained scoring at the field, section, and overall report level.
Question-Answering (QA) Evaluator: Assesses the semantic accuracy of agent answers against ground truth, focusing on factual correctness.
Trajectory Evaluator: Assesses the agent’s execution path, including tool selection, parameter accuracy, and workflow efficiency.
| Evaluator | Example Query |
|---|---|
| Report | “Generate a report for the video {video_name}” |
| QA | “Did a worker drop any boxes in the video {video_name}?” |
| Trajectory | Applicable to any query. |
Report Evaluator#
The Report Evaluator extracts the report from the agent’s response and assesses its quality by comparing each field against ground truth references using a hierarchical, bottom-up evaluation approach.
Purpose#
Bottom-Up Evaluation: The evaluation follows a hierarchical structure. Fields are evaluated first and given a score, then sections, and so on until reaching the overall report score. Each field or section supports its own customizable evaluation metric. See Evaluation Method for details.
Dynamic Field Auto-Discovery: The evaluator automatically handles fields not explicitly defined in config, allowing flexible report formats and field naming while preserving semantic matching. See Dynamic Field Discovery for details.
Configuration-Driven: Schema changes require only config updates, no code changes. Each Blueprint or Developer Profile defines its own metrics YAML file and evaluation data. See Metrics Configuration for details.
Evaluation Method#
Field-Level Scoring#
For each section, the evaluator processes all fields:
Fields with explicit metrics: Uses the specified metric (exact_match, llm_judge, regex, f1, non_empty) to compare the ground truth with the generated value.
Supported Field-Level Metrics:

| Metric | Description |
|---|---|
| exact_match | Exact string comparison with normalized whitespace |
| f1 | Token-based F1 score between predicted and reference values |
| regex | Pattern matching against a regular expression |
| non_empty | Validates that the field contains non-empty content |
| llm_judge | LLM-based semantic similarity evaluation |
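As a rough illustration, the non-LLM metrics in the table above can be sketched in a few lines of Python. This is an illustrative approximation only; the function names and normalization details are assumptions, not the evaluator's actual implementation:

```python
import re

def exact_match(pred: str, ref: str) -> float:
    # Exact string comparison with normalized whitespace.
    norm = lambda s: " ".join(s.split())
    return 1.0 if norm(pred) == norm(ref) else 0.0

def f1(pred: str, ref: str) -> float:
    # Token-based F1 between predicted and reference values.
    p, r = pred.split(), ref.split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def regex(pred: str, pattern: str) -> float:
    # Pattern matching against a regular expression.
    return 1.0 if re.search(pattern, pred) else 0.0

def non_empty(pred: str) -> float:
    # Validates that the field contains non-empty content.
    return 1.0 if pred.strip() else 0.0
```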
Fields without explicit metrics (discovered at runtime using dynamic field discovery):
1. Extracts all fields from the generated section that do not have explicit metrics defined in the config.
2. Makes a batch LLM call to score all discovered fields against the ground truth section.
3. The LLM returns structured output:
{ "<field-name>": {"score": <score>, "reference": "<matched-ground-truth-field>"}, ... }
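Consuming that structured output is straightforward; a minimal sketch (the helper name is an assumption, not part of the toolkit's API):

```python
def score_discovered_fields(llm_output: dict) -> float:
    # Average the scores the LLM judge assigned to discovered fields.
    if not llm_output:
        return 0.0
    return sum(item["score"] for item in llm_output.values()) / len(llm_output)

# Shape matches the structured output above.
llm_output = {
    "Vehicle (1234)": {"score": 0.5, "reference": "Vehicle 1234"},
    "Vehicle (5678)": {"score": 0.0, "reference": None},
}
```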
Section-Level Scoring#
Each section can be scored in two ways:
As a single field: Use any field-level metric (e.g., llm_judge) to evaluate the entire section holistically by comparing the complete ground truth and generated sections.
As an aggregate: Use method: average to compute the mean score from all child fields.
Dynamic Field Discovery#
When allow_dynamic_field_discovery: true is set for a section, the evaluator automatically
discovers and evaluates fields not explicitly defined in the configuration. This handles cases
where field names may vary between generated reports and ground truth.
Example: The “Vehicles Involved” section might contain:
Ground Truth: Vehicle 1234 (White Truck), Vehicle 4321 (Dark Blue Truck)
Generated: Vehicle (1234) (Blue Truck), Vehicle (5678) (Dark Blue Car)
The field-level LLM judge semantically matches generated fields to ground truth and returns:
{
"Vehicle (1234)": {"score": 0.5, "reference": "Vehicle 1234"},
"Vehicle (5678)": {"score": 0.0, "reference": null}
}
- Vehicle (1234): 0.5, matched to Vehicle 1234 but with the wrong color.
- Vehicle (5678): 0.0, no matching vehicle in the ground truth.
Evaluator Configuration#
evaluators:
report_evaluator:
_type: report_evaluator
eval_metrics_config_path: <path-to-eval-metrics-config>
evaluation_method_id: report
object_store: report_object_store
report_url_pattern: '<report-url-regex-pattern>'
include_vlm_output: true
vlm_related_fields:
- "<section-name-1>"
- "<section-name-2>"
metric_configs:
llm_judge:
llm_name: eval_llm_judge
max_retries: 2
llm_judge_reasoning: true
single_field_comparison_prompt: |
# Custom prompt for single field comparison
multi_field_discovery_prompt: |
# Custom prompt for multi-field discovery
Configuration parameters:
- eval_metrics_config_path: Path to the metrics YAML file that defines report structure and evaluation methods. See Metrics Configuration for details.
- object_store: Name of the object store for retrieving generated reports.
- evaluation_method_id: Identifier used to match against the evaluation_method field in the dataset. See dataset configuration for details.
- report_url_pattern: Regex pattern to extract report URLs from agent responses.
- include_vlm_output: When true, outputs a separate average score for VLM-related fields; used with vlm_related_fields.
- vlm_related_fields: List of section names to include in the VLM accuracy score; used with include_vlm_output: true.
- metric_configs: Configuration for field-level evaluation metrics.
  - llm_judge: LLM judge settings:
    - llm_name: Reference to the LLM configuration (see LLM Configuration for details)
    - max_retries: Number of retry attempts on LLM failures
    - llm_judge_reasoning: Enable reasoning in the LLM judge
    - single_field_comparison_prompt: Custom prompt for the LLM judge used for single-field evaluation
    - multi_field_discovery_prompt: Custom prompt for the LLM judge used for dynamic field discovery
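For instance, report_url_pattern is applied to the agent's response text with a regex search. A sketch, where both the pattern and the response are hypothetical (the real pattern depends on where your deployment hosts generated reports):

```python
import re

# Hypothetical pattern and response text, for illustration only.
report_url_pattern = r"https?://\S+/reports/\S+\.json"
response = "Report generated: http://storage.local/reports/incident_001.json"

# Extract the report URL from the agent's response.
match = re.search(report_url_pattern, response)
report_url = match.group(0) if match else None
```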
LLM Configuration#
Configure the LLM judge in the llms section of your config file:
llms:
eval_llm_judge:
_type: nim
model_name: <model-name>
base_url: <base-url-for-the-llm>
max_tokens: 2048
temperature: 0.0
Note
To enable reasoning output, set llm_judge_reasoning: true in the evaluator config (not thinking: true in the LLM config)
for wider reasoning model support.
Note
When llm_judge_reasoning is enabled, the evaluators currently require reasoning models that support
outputting reasoning separately in their responses.
Metrics Configuration#
The metrics configuration should mirror the structure of your report template, defining sections
and fields that match the expected report output. Each field or section can have its own evaluation metric defined. Each section can use method: average to aggregate
child scores. Set allow_dynamic_field_discovery: true for sections with variable field names.
Overall Report:
method: average
fields:
title:
method: exact_match
Basic Information:
method: average
fields:
Report Identifier:
method: non_empty
Date of Incident:
method: llm_judge
Vehicles Involved:
method: llm_judge
allow_dynamic_field_discovery: true # Handles dynamic vehicle field names
# Additional sections...
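The bottom-up aggregation this configuration implies can be sketched recursively. This is illustrative only and assumes leaf field scores have already been computed; it is not the evaluator's internal implementation:

```python
def aggregate(node: dict) -> float:
    # A leaf carries a precomputed "score"; an aggregate node with
    # method: average scores as the mean of its children.
    if "score" in node:
        return node["score"]
    children = [aggregate(child) for child in node.get("fields", {}).values()]
    return sum(children) / len(children) if children else 0.0

# Mirrors the metrics config structure above, with example leaf scores.
report = {
    "fields": {
        "title": {"score": 1.0},
        "Basic Information": {
            "fields": {
                "Report Identifier": {"score": 1.0},
                "Date of Incident": {"score": 0.5},
            }
        },
    }
}
```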
Customized Question-Answering (QA) Evaluator#
The Customized Question-Answering (QA) Evaluator assesses the semantic accuracy of agent responses against ground truth answers, focusing on factual correctness rather than exact text matching.
Purpose#
Evaluate answer accuracy for question-answering tasks.
Support semantic equivalence over exact matching.
Handle various question types (Yes/No, counting, temporal, descriptive).
Evaluator Configuration#
evaluators:
qa_evaluator:
_type: customized_qa_evaluator
llm_name: eval_llm_judge
evaluation_method_id: qa
llm_judge_reasoning: true
custom_prompt_template: |
You are an expert evaluator assessing an AI Agent's response accuracy.
Question Asked: {question}
Agent's Answer: {answer}
Ground Truth Answer: {reference}
# Add evaluation criteria and scoring guidelines...
Custom Prompt Variables#
- {question}: The original user query
- {answer}: The agent’s response
- {reference}: Ground truth answer
Example Evaluation Criteria#
The default prompt includes the following evaluation criteria:
Factual Correctness: Does the answer convey the same factual information?
Completeness: Does the answer include all key information from the ground truth?
Semantic Equivalence: Is the answer semantically equivalent to the ground truth?
You can customize the prompt by modifying the custom_prompt_template field in the evaluator configuration.
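Filling the prompt variables is plain string templating; a minimal sketch using an abbreviated version of the template above (the question and answers are made up for illustration):

```python
# Abbreviated version of the custom_prompt_template shown above.
template = (
    "You are an expert evaluator assessing an AI Agent's response accuracy.\n"
    "Question Asked: {question}\n"
    "Agent's Answer: {answer}\n"
    "Ground Truth Answer: {reference}"
)

# The evaluator fills the three variables before calling the LLM judge.
prompt = template.format(
    question="Did a worker drop any boxes?",
    answer="Yes, one box was dropped.",
    reference="A worker dropped one box.",
)
```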
Customized Trajectory Evaluator#
The Customized Trajectory Evaluator assesses the agent’s execution path, focusing on tool selection, parameter accuracy, and overall workflow efficiency. It extends NAT’s built-in Trajectory Evaluator with features such as a customizable LLM judge prompt and agent-selected-tools filtering.
Purpose#
Evaluate tool selection accuracy and appropriateness
Verify parameter correctness against tool schemas
Assess workflow efficiency
Support agent-selected tools filtering to exclude internal tool calls
Agent-Selected Tools Filtering#
When track_agent_selected_tools_only: true, the evaluator filters the trajectory to include
only tools explicitly selected by the agent, excluding internal tools and LLMs called within tools. This focuses on assessing the agent’s planning and tool-calling capabilities.
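Conceptually, the filtering keeps only trajectory steps whose tool the agent itself selected; a sketch (the internal tool name below is hypothetical, and this is not the evaluator's actual code):

```python
def filter_agent_selected(trajectory: list[dict], internal_tools: set[str]) -> list[dict]:
    # Keep only steps the agent explicitly selected; drop internal tools
    # and LLM calls made from within tools.
    return [step for step in trajectory if step["tool"] not in internal_tools]

# "internal_summarizer_llm" is an illustrative internal step name.
trajectory = [
    {"tool": "vst_video_list", "tool_input": "{}"},
    {"tool": "internal_summarizer_llm", "tool_input": "..."},
]
filtered = filter_agent_selected(trajectory, internal_tools={"internal_summarizer_llm"})
```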
Evaluator Configuration#
evaluators:
trajectory_evaluator:
_type: customized_trajectory_evaluator
llm_name: eval_llm_judge
evaluation_method_id: trajectory
track_agent_selected_tools_only: true
llm_judge_reasoning: true
custom_prompt_template: |
You are an expert evaluator assessing an AI agent's performance on tool calling.
Question: {question}
Available Tools and Their Schemas:
{tool_schemas}
Agent's Actions and Tool Calls:
{agent_trajectory}
Agent's Final Answer:
{answer}
Reference/Expected Output:
{reference}
# Add evaluation criteria and scoring guidelines...
Custom Prompt Variables#
- {question}: The original user query
- {tool_schemas}: Available tools with their parameter schemas
- {agent_trajectory}: Sequence of tool calls and observations
- {answer}: The agent’s final response
- {reference}: Expected output (if provided)
Example Evaluation Criteria#
The default prompt includes the following evaluation criteria:
Tool Selection: Did the agent select appropriate tools for the task?
Parameter Accuracy: Were tool parameters correct according to the tool schemas?
Data Retrieval: Did the agent successfully retrieve the necessary data?
Completeness: Did the agent gather all required information to answer the question?
Efficiency: Did the agent avoid unnecessary or redundant tool calls?
You can customize the prompt by modifying the custom_prompt_template field in the evaluator configuration.
In addition to the customized evaluators above, you can use NAT’s built-in evaluators or create your own custom evaluators. For more details, please refer to the NAT Evaluation Documentation.
Configuring Evaluation Dataset#
Dataset Configuration#
eval:
general:
output_dir: <evaluation-output-directory>
dataset:
_type: json
file_path: <evaluation-dataset-file-path>
structure:
id_key: "id"
question_key: "query"
answer_key: "ground_truth"
Dataset File Format#
The evaluation dataset defines test cases and specifies which evaluators to use via the
evaluation_method field, which matches against each evaluator’s evaluation_method_id. A single dataset can include cases for multiple evaluators. For example:
[
{
"id": "1",
"query": "Generate a report for the video {video_name}",
"ground_truth": "<path-to-ground-truth-file>",
"evaluation_method": ["report"]
},
{
"id": "2",
"query": "Did a worker drop any boxes in the video {video_name}?",
"ground_truth": "<expected-answer>",
"evaluation_method": ["qa", "trajectory"]
},
{
"id": "3",
"query": "What videos are available?",
"evaluation_method": ["trajectory"]
}
]
Dataset Fields#
| Field | Required | Description |
|---|---|---|
| id | Yes | Unique identifier for the test case |
| query | Yes | The question or command to send to the agent |
| ground_truth | Required for QA and report evaluation | Expected answer (QA evaluation) or path to ground truth JSON file (report evaluation) |
| evaluation_method | Yes | List of evaluator IDs specifying which evaluators the test case is run against |
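The matching between test cases and evaluators might look like the following sketch (illustrative only; NAT performs this routing internally):

```python
def cases_for(dataset: list[dict], evaluation_method_id: str) -> list[dict]:
    # A test case is run by an evaluator only if the evaluator's
    # evaluation_method_id appears in the case's evaluation_method list.
    return [c for c in dataset if evaluation_method_id in c.get("evaluation_method", [])]

# Mirrors the dataset example above.
dataset = [
    {"id": "1", "evaluation_method": ["report"]},
    {"id": "2", "evaluation_method": ["qa", "trajectory"]},
    {"id": "3", "evaluation_method": ["trajectory"]},
]
```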
Report Ground Truth Format#
For QA evaluation, ground_truth is a string answer. For report evaluation, ground_truth should be a path to a JSON file whose structure mirrors your metrics configuration.
{
"title": "<report-title>",
"Basic Information": {
"Report Identifier": "<report-id>",
"Date of Incident": "<date>"
},
"Vehicles Involved": {
"Vehicle 1": {
"Vehicle ID": "<vehicle-id>",
"Vehicle Type": "<vehicle-type>",
"Vehicle Color": "<vehicle-color>"
}
}
}
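Because sections with dynamic field discovery may legitimately use different field names, a strict schema check is not required; still, a quick sanity check (hypothetical helper, not part of the toolkit) can confirm that explicitly configured fields have counterparts in the ground truth file:

```python
def missing_fields(metrics: dict, ground_truth: dict, prefix: str = "") -> list[str]:
    # List metrics-config fields that have no counterpart in the ground truth.
    missing = []
    for name, spec in metrics.get("fields", {}).items():
        if name not in ground_truth:
            missing.append(prefix + name)
        elif isinstance(spec, dict) and "fields" in spec:
            missing += missing_fields(spec, ground_truth.get(name, {}), prefix + name + "/")
    return missing

metrics = {
    "fields": {
        "title": {"method": "exact_match"},
        "Basic Information": {"fields": {"Report Identifier": {}, "Date of Incident": {}}},
    }
}
ground_truth = {"title": "Incident Report", "Basic Information": {"Report Identifier": "R-1"}}
```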
Running Evaluation#
Evaluation Configuration#
Add the evaluation configuration to the agent’s config file. Multiple evaluators can run against the same dataset.
eval:
general:
output_dir: <evaluation-output-directory>
max_concurrency: 10
dataset:
_type: json
file_path: <evaluation-dataset-file-path>
structure:
id_key: "id"
question_key: "query"
answer_key: "ground_truth"
evaluators:
trajectory_evaluator:
_type: customized_trajectory_evaluator
llm_name: eval_llm_judge
evaluation_method_id: trajectory
track_agent_selected_tools_only: true
qa_evaluator:
_type: customized_qa_evaluator
llm_name: eval_llm_judge
evaluation_method_id: qa
# Add other evaluators here ...
Execution#
The evaluation can be run using the nat eval command:
nat eval --config_file <agent-config-file>
Note
If the evaluation dataset includes queries referencing videos, ensure the videos are uploaded to the deployment before starting the evaluation.
Experiment Tracking (Optional)#
The evaluation supports the Weights & Biases (W&B) Weave dashboard for experiment tracking and visualization. Refer to the NAT documentation on visualizing results for more details.
To enable experiment tracking, follow these steps:
Install Weave Dependencies.
uv sync --group eval
Log in to wandb:
wandb login --host <wandb-host>
export WANDB_BASE_URL="<wandb-base-url>"
Enable in Config:
general:
  telemetry:
    tracing:
      weave:
        _type: weave
        project: "<your-weave-project>"
Understanding Evaluation Output#
Evaluation results are saved to the configured output_dir and include:
- workflow_output.json: Raw agent responses for each query
- report_evaluator_output.json: Report evaluation scores and details
- qa_evaluator_output.json: QA evaluation scores and details
- trajectory_evaluator_output.json: Trajectory evaluation scores and details
If additional evaluators are included, the results for each are saved to {evaluator_name}_evaluator_output.json.
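Each output file shares the same average_score / eval_output_items shape, so post-processing is simple. A sketch parsing an inline example (normally you would read the file from output_dir; the IDs and scores below are made up):

```python
import json

# Inline stand-in for a file such as report_evaluator_output.json.
raw = """
{
  "average_score": 0.85,
  "eval_output_items": [
    {"id": "report_001", "score": 0.9},
    {"id": "report_002", "score": 0.8}
  ]
}
"""
result = json.loads(raw)
# Map each test case ID to its score for quick inspection.
scores = {item["id"]: item["score"] for item in result["eval_output_items"]}
```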
Report Evaluator Output#
{
"average_score": 0.9,
"average_vlm_field_score": 0.9,
"eval_output_items": [
{
"id": "report_001",
"score": 0.9,
"vlm_field_score": 0.8,
"reasoning": {
"sections": {
"title": {
"section_score": 1.0,
"method": "exact_match",
"actual_value": "<generated-title>",
"reference_value": "<expected-title>",
"error": null,
"field_scores": {}
},
"Basic Information": {
"section_score": 1.0,
"method": "average",
"actual_value": {
"Report Identifier": "<generated-report-id>",
"Date of Incident": "<generated-date>"
},
"reference_value": {
"Report Identifier": "<expected-report-id>",
"Date of Incident": "<expected-date>"
},
"error": null,
"field_scores": {
"Report Identifier": {
"section_score": 1.0,
"method": "non_empty",
"actual_value": "<generated-report-id>",
"reference_value": "<expected-report-id>",
"error": null,
"field_scores": {}
},
"Date of Incident": {
"section_score": 1.0,
"method": "llm_judge",
"actual_value": "<generated-date>",
"reference_value": "<expected-date>",
"error": null,
"field_scores": {}
}
}
}
},
"metadata": {
"reference_file": "<path-to-ground-truth-file>",
"actual_file": "<path-to-generated-report>"
}
}
}
]
}
Field Descriptions:
- average_score: Mean score across all evaluated reports
- average_vlm_field_score: Mean vlm_field_score across all evaluated reports
- eval_output_items: Array of evaluation results for each test case
  - id: Unique identifier matching the dataset entry
  - score: Overall report score (0.0 to 1.0)
  - vlm_field_score: Mean score across all evaluated VLM-related fields
  - reasoning: Evaluation details
    - sections: Section-level evaluation results, each containing:
      - section_score: Score for the section or field (0.0 to 1.0)
      - method: Evaluation method used (e.g., exact_match, llm_judge, average, non_empty). Fields evaluated with dynamic field discovery enabled have llm_judge_with_field_discovery as the method.
      - actual_value: The value from the generated report
      - reference_value: The expected value from ground truth
      - error: Error message if evaluation failed for the section or field, otherwise null
      - field_scores: Nested field evaluations within a section (recursive structure)
  - metadata: File paths for the reference and generated reports
    - reference_file: Path to the ground truth file
    - actual_file: Path to the generated report
QA Evaluator Output#
{
"average_score": 0.8,
"eval_output_items": [
{
"id": "vqa_001",
"score": 1.0,
"reasoning": {
"reasoning": "The agent correctly identified that a worker dropped one box. The answer matches the ground truth semantically.",
"question": "Did a worker drop any boxes in <video-name>?",
"generated_answer": "<agent-answer>",
"ground_truth": "<expected-answer>"
}
}
]
}
Field Descriptions:
- average_score: Mean score across all evaluated QA pairs
- eval_output_items: Array of evaluation results for each test case
  - id: Unique identifier matching the dataset entry
  - score: QA accuracy score (0.0 to 1.0)
  - reasoning: Evaluation details
    - reasoning: LLM judge’s reasoning for the score
    - question: The original query sent to the agent
    - generated_answer: The agent’s response
    - ground_truth: The expected answer from the dataset
Trajectory Evaluator Output#
{
"average_score": 0.89,
"eval_output_items": [
{
"id": "traj_001",
"score": 1.0,
"reasoning": {
"reasoning": "The agent used the vst_video_list tool, which is the correct tool for retrieving the list of videos. Tool selection is correct, parameters are accurate...",
"trajectory": [
[
{
"tool": "<llm-model-name>",
"tool_input": "What videos are available?",
"log": ""
},
"\n\nTool calls: [{'id': '...', 'function': {'name': 'vst_video_list', 'arguments': '{}'}}]"
],
[
{
"tool": "vst_video_list",
"tool_input": "{}",
"log": "\n\nTool calls: [{'id': '...', 'function': {'name': 'vst_video_list', 'arguments': '{}'}}]"
},
{"video_list": [{"name": "<video-name>", "duration": "<video-duration>"}]}
]
],
"track_agent_selected_tools_only": true
}
}
]
}
Field Descriptions:
- average_score: Mean score across all evaluated trajectories
- eval_output_items: Array of evaluation results for each test case
  - id: Unique identifier matching the dataset entry
  - score: Trajectory quality score (0.0 to 1.0)
  - reasoning: Evaluation details
    - reasoning: LLM judge’s reasoning for the score
    - trajectory: The agent’s execution path, including the inputs and outputs of LLM calls and tool calls
    - track_agent_selected_tools_only: Whether filtering was applied to show only agent-selected tools
Skipped Test Cases#
When a test case’s evaluation_method does not include a specific evaluator, that evaluator outputs a skipped result:
{
"id": "test_001",
"score": null,
"reasoning": "Skipped: not marked for <evaluator-name> evaluation"
}
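Assuming skipped items (score null) are left out of aggregates, a per-evaluator average can be recomputed from the output items like this (illustrative sketch, not the toolkit's code):

```python
def average_score(items: list[dict]) -> float:
    # Skipped test cases carry score null (None) and are excluded from the mean.
    scored = [item["score"] for item in items if item["score"] is not None]
    return sum(scored) / len(scored) if scored else 0.0

# Example items: one skipped, two scored.
items = [
    {"id": "test_001", "score": None, "reasoning": "Skipped: not marked for qa evaluation"},
    {"id": "test_002", "score": 1.0},
    {"id": "test_003", "score": 0.5},
]
```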