Agent Evaluation#
The VSS Agent includes a comprehensive evaluation framework with three specialized evaluators, each targeting a different aspect of agent behavior. The evaluators support customizable prompts and configurable metrics, and work with both Blueprints and Developer Profiles.
The evaluation is based on the NVIDIA NeMo Agent Toolkit (NAT), specifically the Evaluate Workflows module.
Evaluators#
The evaluation includes three customized evaluators, each targeting a specific aspect of agent behavior.
Report Evaluator: Assesses the quality of agent-generated reports with fine-grained scoring at the field, section, and overall report level.
Question-Answering (QA) Evaluator: Assesses the semantic accuracy of agent answers against ground truth, focusing on factual correctness.
Trajectory Evaluator: Assesses the agent’s execution path, including tool selection, parameter accuracy, and workflow efficiency.
| Evaluator | Example Query |
|---|---|
| Report | “Generate a report for the video {video_name}” |
| QA | “Did a worker drop any boxes in the video {video_name}?” |
| Trajectory | Applicable to any query. |
Report Evaluator#
The Report Evaluator extracts the report from the agent’s response and assesses its quality by comparing each field against ground truth references using a hierarchical, bottom-up evaluation approach.
Purpose#
Bottom-Up Evaluation: The evaluation follows a hierarchical structure. Fields are evaluated first and given a score, then sections, and so on until reaching the overall report score. Each field or section supports its own customizable evaluation metric. See Evaluation Method for details.
Dynamic Field Auto-Discovery: The evaluator automatically handles fields not explicitly defined in config, allowing flexible report formats and field naming while preserving semantic matching. See Dynamic Field Discovery for details.
Configuration-Driven: Schema changes require only config updates, no code changes. Each Blueprint or Developer Profile defines its own metrics YAML file and evaluation data. See Metrics Configuration for details.
Evaluation Method#
Field-Level Scoring#
For each section, the evaluator processes all fields:
Fields with explicit metrics: Uses the specified metric (exact_match, llm_judge, regex, f1, non_empty) to compare the ground truth with the generated value.
Supported Field-Level Metrics:

| Metric | Description |
|---|---|
| exact_match | Exact string comparison with normalized whitespace |
| f1 | Token-based F1 score between predicted and reference values |
| regex | Pattern matching against a regular expression |
| non_empty | Validates that the field contains non-empty content |
| llm_judge | LLM-based semantic similarity evaluation |
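As a rough illustration, the non-LLM metrics in the table above can be sketched in a few lines of Python. This is an illustrative approximation only; the function names and normalization details are assumptions, not the evaluator's actual implementation:

```python
import re

def exact_match(pred: str, ref: str) -> float:
    # Exact string comparison with normalized whitespace.
    norm = lambda s: " ".join(s.split())
    return 1.0 if norm(pred) == norm(ref) else 0.0

def f1(pred: str, ref: str) -> float:
    # Token-based F1 between predicted and reference values.
    p, r = pred.split(), ref.split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def regex(pred: str, pattern: str) -> float:
    # Pattern matching against a regular expression.
    return 1.0 if re.search(pattern, pred) else 0.0

def non_empty(pred: str) -> float:
    # Validates that the field contains non-empty content.
    return 1.0 if pred.strip() else 0.0
```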
Fields without explicit metrics (discovered at runtime using dynamic field discovery):
1. Extracts all fields from the generated section that do not have explicit metrics defined in the config.
2. Makes a batch LLM call to score all discovered fields against the ground truth section.
3. The LLM returns structured output:
{ "<field-name>": {"score": <score>, "reference": "<matched-ground-truth-field>"}, ... }
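Consuming that structured output is straightforward; a minimal sketch (the helper name is an assumption, not part of the toolkit's API):

```python
def score_discovered_fields(llm_output: dict) -> float:
    # Average the scores the LLM judge assigned to discovered fields.
    if not llm_output:
        return 0.0
    return sum(item["score"] for item in llm_output.values()) / len(llm_output)

# Shape matches the structured output above.
llm_output = {
    "Vehicle (1234)": {"score": 0.5, "reference": "Vehicle 1234"},
    "Vehicle (5678)": {"score": 0.0, "reference": None},
}
```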
Section-Level Scoring#
Each section can be scored in two ways:
As a single field: Use any field-level metric (e.g., llm_judge) to evaluate the entire section holistically by comparing the complete ground truth and generated sections.
As an aggregate: Use method: average to compute the mean score from all child fields.
Dynamic Field Discovery#
When allow_dynamic_field_discovery: true is set for a section, the evaluator automatically
discovers and evaluates fields not explicitly defined in the configuration. This handles cases
where field names may vary between generated reports and ground truth.
Example: The “Vehicles Involved” section might contain:
Ground Truth: Vehicle 1234 (White Truck), Vehicle 4321 (Dark Blue Truck)
Generated: Vehicle (1234) (Blue Truck), Vehicle (5678) (Dark Blue Car)
The field-level LLM judge semantically matches generated fields to ground truth and returns:
{
"Vehicle (1234)": {"score": 0.5, "reference": "Vehicle 1234"},
"Vehicle (5678)": {"score": 0.0, "reference": null}
}
- Vehicle (1234): 0.5, matched to Vehicle 1234 but with the wrong color.
- Vehicle (5678): 0.0, no matching vehicle in the ground truth.
Evaluator Configuration#
evaluators:
report_evaluator:
_type: report_evaluator
eval_metrics_config_path: <path-to-eval-metrics-config>
evaluation_method_id: report
object_store: report_object_store
report_url_pattern: '<report-url-regex-pattern>'
include_vlm_output: true
vlm_related_fields:
- "<section-name-1>"
- "<section-name-2>"
metric_configs:
llm_judge:
llm_name: eval_llm_judge
max_retries: 2
llm_judge_reasoning: true
single_field_comparison_prompt: |
# Custom prompt for single field comparison
multi_field_discovery_prompt: |
# Custom prompt for multi-field discovery
Configuration parameters:
- eval_metrics_config_path: Path to the metrics YAML file that defines report structure and evaluation methods. See Metrics Configuration for details.
- object_store: Name of the object store for retrieving generated reports.
- evaluation_method_id: Identifier used to match against the evaluation_method field in the dataset. See dataset configuration for details.
- report_url_pattern: Regex pattern to extract report URLs from agent responses.
- include_vlm_output: When true, outputs a separate average score for VLM-related fields; used with vlm_related_fields.
- vlm_related_fields: List of section names to include in the VLM accuracy score; used with include_vlm_output: true.
- metric_configs: Configuration for field-level evaluation metrics.
  - llm_judge: LLM judge settings:
    - llm_name: Reference to the LLM configuration (see LLM Configuration for details)
    - max_retries: Number of retry attempts on LLM failures
    - llm_judge_reasoning: Enable reasoning in the LLM judge
    - single_field_comparison_prompt: Custom prompt for the LLM judge used for single-field evaluation
    - multi_field_discovery_prompt: Custom prompt for the LLM judge used for dynamic field discovery
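For instance, report_url_pattern is applied to the agent's response text with a regex search. A sketch, where both the pattern and the response are hypothetical (the real pattern depends on where your deployment hosts generated reports):

```python
import re

# Hypothetical pattern and response text, for illustration only.
report_url_pattern = r"https?://\S+/reports/\S+\.json"
response = "Report generated: http://storage.local/reports/incident_001.json"

# Extract the report URL from the agent's response.
match = re.search(report_url_pattern, response)
report_url = match.group(0) if match else None
```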
LLM Configuration#
Configure the LLM judge in the llms section of your config file:
llms:
eval_llm_judge:
_type: nim
model_name: <model-name>
base_url: <base-url-for-the-llm>
max_tokens: 2048
temperature: 0.0
Note
To enable reasoning output, set llm_judge_reasoning: true in the evaluator config (not thinking: true in the LLM config)
for wider reasoning model support.
Note
When llm_judge_reasoning is enabled, the evaluators currently require reasoning models that support
outputting reasoning separately in their responses.
Metrics Configuration#
The metrics configuration should mirror the structure of your report template, defining sections
and fields that match the expected report output. Each field or section can have its own evaluation metric defined. Each section can use method: average to aggregate
child scores. Set allow_dynamic_field_discovery: true for sections with variable field names.
Overall Report:
method: average
fields:
title:
method: exact_match
Basic Information:
method: average
fields:
Report Identifier:
method: non_empty
Date of Incident:
method: llm_judge
Vehicles Involved:
method: llm_judge
allow_dynamic_field_discovery: true # Handles dynamic vehicle field names
# Additional sections...
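The bottom-up aggregation this configuration implies can be sketched recursively. This is illustrative only and assumes leaf field scores have already been computed; it is not the evaluator's internal implementation:

```python
def aggregate(node: dict) -> float:
    # A leaf carries a precomputed "score"; an aggregate node with
    # method: average scores as the mean of its children.
    if "score" in node:
        return node["score"]
    children = [aggregate(child) for child in node.get("fields", {}).values()]
    return sum(children) / len(children) if children else 0.0

# Mirrors the metrics config structure above, with example leaf scores.
report = {
    "fields": {
        "title": {"score": 1.0},
        "Basic Information": {
            "fields": {
                "Report Identifier": {"score": 1.0},
                "Date of Incident": {"score": 0.5},
            }
        },
    }
}
```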
Customized Question-Answering (QA) Evaluator#
The Customized Question-Answering (QA) Evaluator assesses the semantic accuracy of agent responses against ground truth answers, focusing on factual correctness rather than exact text matching.
Purpose#
Evaluate answer accuracy for question-answering tasks.
Support semantic equivalence over exact matching.
Handle various question types (Yes/No, counting, temporal, descriptive).
Evaluator Configuration#
evaluators:
qa_evaluator:
_type: customized_qa_evaluator
llm_name: eval_llm_judge
evaluation_method_id: qa
llm_judge_reasoning: true
custom_prompt_template: |
You are an expert evaluator assessing an AI Agent's response accuracy.
Question Asked: {question}
Agent's Answer: {answer}
Ground Truth Answer: {reference}
# Add evaluation criteria and scoring guidelines...
Custom Prompt Variables#
- {question}: The original user query
- {answer}: The agent’s response
- {reference}: Ground truth answer
Example Evaluation Criteria#
The default prompt includes the following evaluation criteria:
Factual Correctness: Does the answer convey the same factual information?
Completeness: Does the answer include all key information from the ground truth?
Semantic Equivalence: Is the answer semantically equivalent to the ground truth?
You can customize the prompt by modifying the custom_prompt_template field in the evaluator configuration.
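Filling the prompt variables is plain string templating; a minimal sketch using an abbreviated version of the template above (the question and answers are made up for illustration):

```python
# Abbreviated version of the custom_prompt_template shown above.
template = (
    "You are an expert evaluator assessing an AI Agent's response accuracy.\n"
    "Question Asked: {question}\n"
    "Agent's Answer: {answer}\n"
    "Ground Truth Answer: {reference}"
)

# The evaluator fills the three variables before calling the LLM judge.
prompt = template.format(
    question="Did a worker drop any boxes?",
    answer="Yes, one box was dropped.",
    reference="A worker dropped one box.",
)
```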
Customized Trajectory Evaluator#
The Customized Trajectory Evaluator assesses the agent’s execution path, focusing on tool selection, parameter accuracy, and overall workflow efficiency. It extends NAT’s built-in Trajectory Evaluator with features such as a customizable LLM judge prompt and agent-selected-tools filtering.
Purpose#
Evaluate tool selection accuracy and appropriateness
Verify parameter correctness against tool schemas
Assess workflow efficiency
Support agent-selected tools filtering to exclude internal tool calls
Agent-Selected Tools Filtering#
When track_agent_selected_tools_only: true, the evaluator filters the trajectory to include
only tools explicitly selected by the agent, excluding internal tools and LLMs called within tools. This focuses on assessing the agent’s planning and tool-calling capabilities.
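Conceptually, the filtering keeps only trajectory steps whose tool the agent itself selected; a sketch (the internal tool name below is hypothetical, and this is not the evaluator's actual code):

```python
def filter_agent_selected(trajectory: list[dict], internal_tools: set[str]) -> list[dict]:
    # Keep only steps the agent explicitly selected; drop internal tools
    # and LLM calls made from within tools.
    return [step for step in trajectory if step["tool"] not in internal_tools]

# "internal_summarizer_llm" is an illustrative internal step name.
trajectory = [
    {"tool": "vst_video_list", "tool_input": "{}"},
    {"tool": "internal_summarizer_llm", "tool_input": "..."},
]
filtered = filter_agent_selected(trajectory, internal_tools={"internal_summarizer_llm"})
```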
Evaluator Configuration#
evaluators:
trajectory_evaluator:
_type: customized_trajectory_evaluator
llm_name: eval_llm_judge
evaluation_method_id: trajectory
track_agent_selected_tools_only: true
llm_judge_reasoning: true
custom_prompt_template: |
You are an expert evaluator assessing an AI agent's performance on tool calling.
Question: {question}
Available Tools and Their Schemas:
{tool_schemas}
Agent's Actions and Tool Calls:
{agent_trajectory}
Agent's Final Answer:
{answer}
Reference/Expected Output:
{reference}
# Add evaluation criteria and scoring guidelines...
Custom Prompt Variables#
- {question}: The original user query
- {tool_schemas}: Available tools with their parameter schemas
- {agent_trajectory}: Sequence of tool calls and observations
- {answer}: The agent’s final response
- {reference}: Expected output (if provided)
Example Evaluation Criteria#
The default prompt includes the following evaluation criteria:
Tool Selection: Did the agent select appropriate tools for the task?
Parameter Accuracy: Were tool parameters correct according to the tool schemas?
Data Retrieval: Did the agent successfully retrieve the necessary data?
Completeness: Did the agent gather all required information to answer the question?
Efficiency: Did the agent avoid unnecessary or redundant tool calls?
You can customize the prompt by modifying the custom_prompt_template field in the evaluator configuration.
In addition to the customized evaluators above, you can use NAT’s built-in evaluators or create your own custom evaluators. For more details, please refer to the NAT Evaluation Documentation.
Configuring Evaluation Dataset#
Dataset Configuration#
eval:
general:
output_dir: <evaluation-output-directory>
dataset:
_type: json
file_path: <evaluation-dataset-file-path>
structure:
id_key: "id"
question_key: "query"
answer_key: "ground_truth"
Dataset File Format#
The evaluation dataset defines test cases and specifies which evaluators to use via the
evaluation_method field, which matches against each evaluator’s evaluation_method_id. A single dataset can include cases for multiple evaluators. For example:
[
{
"id": "1",
"query": "Generate a report for the video {video_name}",
"ground_truth": "<path-to-ground-truth-file>",
"evaluation_method": ["report"]
},
{
"id": "2",
"query": "Did a worker drop any boxes in the video {video_name}?",
"ground_truth": "<expected-answer>",
"evaluation_method": ["qa", "trajectory"]
},
{
"id": "3",
"query": "What videos are available?",
"evaluation_method": ["trajectory"]
}
]
Dataset Fields#
| Field | Required | Description |
|---|---|---|
| id | Yes | Unique identifier for the test case |
| query | Yes | The question or command to send to the agent |
| ground_truth | Required for QA and report evaluation | Expected answer (QA evaluation) or path to ground truth JSON file (report evaluation) |
| evaluation_method | Yes | List of evaluator IDs specifying which evaluators the test case is run against |
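The matching between test cases and evaluators might look like the following sketch (illustrative only; NAT performs this routing internally):

```python
def cases_for(dataset: list[dict], evaluation_method_id: str) -> list[dict]:
    # A test case is run by an evaluator only if the evaluator's
    # evaluation_method_id appears in the case's evaluation_method list.
    return [c for c in dataset if evaluation_method_id in c.get("evaluation_method", [])]

# Mirrors the dataset example above.
dataset = [
    {"id": "1", "evaluation_method": ["report"]},
    {"id": "2", "evaluation_method": ["qa", "trajectory"]},
    {"id": "3", "evaluation_method": ["trajectory"]},
]
```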
Report Ground Truth Format#
For QA evaluation, ground_truth is a string answer. For report evaluation, ground_truth should be a path to a JSON file whose structure mirrors your metrics configuration.
{
"title": "<report-title>",
"Basic Information": {
"Report Identifier": "<report-id>",
"Date of Incident": "<date>"
},
"Vehicles Involved": {
"Vehicle 1": {
"Vehicle ID": "<vehicle-id>",
"Vehicle Type": "<vehicle-type>",
"Vehicle Color": "<vehicle-color>"
}
}
}
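Because sections with dynamic field discovery may legitimately use different field names, a strict schema check is not required; still, a quick sanity check (hypothetical helper, not part of the toolkit) can confirm that explicitly configured fields have counterparts in the ground truth file:

```python
def missing_fields(metrics: dict, ground_truth: dict, prefix: str = "") -> list[str]:
    # List metrics-config fields that have no counterpart in the ground truth.
    missing = []
    for name, spec in metrics.get("fields", {}).items():
        if name not in ground_truth:
            missing.append(prefix + name)
        elif isinstance(spec, dict) and "fields" in spec:
            missing += missing_fields(spec, ground_truth.get(name, {}), prefix + name + "/")
    return missing

metrics = {
    "fields": {
        "title": {"method": "exact_match"},
        "Basic Information": {"fields": {"Report Identifier": {}, "Date of Incident": {}}},
    }
}
ground_truth = {"title": "Incident Report", "Basic Information": {"Report Identifier": "R-1"}}
```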
Running Evaluation#
Evaluation Configuration#
Add the evaluation configuration to the agent’s config file. Multiple evaluators can run against the same dataset.
eval:
general:
output_dir: <evaluation-output-directory>
max_concurrency: 10
dataset:
_type: json
file_path: <evaluation-dataset-file-path>
structure:
id_key: "id"
question_key: "query"
answer_key: "ground_truth"
evaluators:
trajectory_evaluator:
_type: customized_trajectory_evaluator
llm_name: eval_llm_judge
evaluation_method_id: trajectory
track_agent_selected_tools_only: true
qa_evaluator:
_type: customized_qa_evaluator
llm_name: eval_llm_judge
evaluation_method_id: qa
# Add other evaluators here ...
Execution#
The evaluation can be run using the nat eval command:
nat eval --config_file <agent-config-file>
Note
If the evaluation dataset includes queries referencing videos, ensure the videos are uploaded to the deployment before starting the evaluation.
Experiment Tracking (Optional)#
The evaluation supports the Weights & Biases (W&B) Weave dashboard for experiment tracking and visualization. Refer to the NAT documentation on visualizing results for more details.
To enable experiment tracking, follow these steps:
Install Weave Dependencies.
uv sync --group eval
Log in to wandb:
wandb login --host <wandb-host>
export WANDB_BASE_URL="<wandb-base-url>"
Enable in Config:
general:
  telemetry:
    tracing:
      weave:
        _type: weave
        project: "<your-weave-project>"
Understanding Evaluation Output#
Evaluation results are saved to the configured output_dir and include:
- workflow_output.json: Raw agent responses for each query
- report_evaluator_output.json: Report evaluation scores and details
- qa_evaluator_output.json: QA evaluation scores and details
- trajectory_evaluator_output.json: Trajectory evaluation scores and details
If additional evaluators are included, the results for each are saved to {evaluator_name}_evaluator_output.json.
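Each output file shares the same average_score / eval_output_items shape, so post-processing is simple. A sketch parsing an inline example (normally you would read the file from output_dir; the IDs and scores below are made up):

```python
import json

# Inline stand-in for a file such as report_evaluator_output.json.
raw = """
{
  "average_score": 0.85,
  "eval_output_items": [
    {"id": "report_001", "score": 0.9},
    {"id": "report_002", "score": 0.8}
  ]
}
"""
result = json.loads(raw)
# Map each test case ID to its score for quick inspection.
scores = {item["id"]: item["score"] for item in result["eval_output_items"]}
```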
Report Evaluator Output#
{
"average_score": 0.9,
"average_vlm_field_score": 0.9,
"eval_output_items": [
{
"id": "report_001",
"score": 0.9,
"vlm_field_score": 0.8,
"reasoning": {
"sections": {
"title": {
"section_score": 1.0,
"method": "exact_match",
"actual_value": "<generated-title>",
"reference_value": "<expected-title>",
"error": null,
"field_scores": {}
},
"Basic Information": {
"section_score": 1.0,
"method": "average",
"actual_value": {
"Report Identifier": "<generated-report-id>",
"Date of Incident": "<generated-date>"
},
"reference_value": {
"Report Identifier": "<expected-report-id>",
"Date of Incident": "<expected-date>"
},
"error": null,
"field_scores": {
"Report Identifier": {
"section_score": 1.0,
"method": "non_empty",
"actual_value": "<generated-report-id>",
"reference_value": "<expected-report-id>",
"error": null,
"field_scores": {}
},
"Date of Incident": {
"section_score": 1.0,
"method": "llm_judge",
"actual_value": "<generated-date>",
"reference_value": "<expected-date>",
"error": null,
"field_scores": {}
}
}
}
},
"metadata": {
"reference_file": "<path-to-ground-truth-file>",
"actual_file": "<path-to-generated-report>"
}
}
}
]
}
Field Descriptions:
- average_score: Mean score across all evaluated reports
- average_vlm_field_score: Mean vlm_field_score across all evaluated reports
- eval_output_items: Array of evaluation results for each test case
  - id: Unique identifier matching the dataset entry
  - score: Overall report score (0.0 to 1.0)
  - vlm_field_score: Mean score across all evaluated VLM-related fields
  - reasoning: Evaluation details
    - sections: Section-level evaluation results, each containing:
      - section_score: Score for the section or field (0.0 to 1.0)
      - method: Evaluation method used (e.g., exact_match, llm_judge, average, non_empty). Fields evaluated with dynamic field discovery enabled have llm_judge_with_field_discovery as the method.
      - actual_value: The value from the generated report
      - reference_value: The expected value from ground truth
      - error: Error message if evaluation failed for the section or field, otherwise null
      - field_scores: Nested field evaluations within a section (recursive structure)
  - metadata: File paths for the reference and generated reports
    - reference_file: Path to the ground truth file
    - actual_file: Path to the generated report
QA Evaluator Output#
{
"average_score": 0.8,
"eval_output_items": [
{
"id": "vqa_001",
"score": 1.0,
"reasoning": {
"reasoning": "The agent correctly identified that a worker dropped one box. The answer matches the ground truth semantically.",
"question": "Did a worker drop any boxes in <video-name>?",
"generated_answer": "<agent-answer>",
"ground_truth": "<expected-answer>"
}
}
]
}
Field Descriptions:
- average_score: Mean score across all evaluated QA pairs
- eval_output_items: Array of evaluation results for each test case
  - id: Unique identifier matching the dataset entry
  - score: QA accuracy score (0.0 to 1.0)
  - reasoning: Evaluation details
    - reasoning: LLM judge’s reasoning for the score
    - question: The original query sent to the agent
    - generated_answer: The agent’s response
    - ground_truth: The expected answer from the dataset
Trajectory Evaluator Output#
{
"average_score": 0.89,
"eval_output_items": [
{
"id": "traj_001",
"score": 1.0,
"reasoning": {
"reasoning": "The agent used the vst_video_list tool, which is the correct tool for retrieving the list of videos. Tool selection is correct, parameters are accurate...",
"trajectory": [
[
{
"tool": "<llm-model-name>",
"tool_input": "What videos are available?",
"log": ""
},
"\n\nTool calls: [{'id': '...', 'function': {'name': 'vst_video_list', 'arguments': '{}'}}]"
],
[
{
"tool": "vst_video_list",
"tool_input": "{}",
"log": "\n\nTool calls: [{'id': '...', 'function': {'name': 'vst_video_list', 'arguments': '{}'}}]"
},
{"video_list": [{"name": "<video-name>", "duration": "<video-duration>"}]}
]
],
"track_agent_selected_tools_only": true
}
}
]
}
Field Descriptions:
- average_score: Mean score across all evaluated trajectories
- eval_output_items: Array of evaluation results for each test case
  - id: Unique identifier matching the dataset entry
  - score: Trajectory quality score (0.0 to 1.0)
  - reasoning: Evaluation details
    - reasoning: LLM judge’s reasoning for the score
    - trajectory: The agent’s execution path, including the inputs and outputs of LLM calls and tool calls
    - track_agent_selected_tools_only: Whether filtering was applied to show only agent-selected tools
Skipped Test Cases#
When a test case’s evaluation_method does not include a specific evaluator, that evaluator outputs a skipped result:
{
"id": "test_001",
"score": null,
"reasoning": "Skipped: not marked for <evaluator-name> evaluation"
}
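Assuming skipped items (score null) are left out of aggregates, a per-evaluator average can be recomputed from the output items like this (illustrative sketch, not the toolkit's code):

```python
def average_score(items: list[dict]) -> float:
    # Skipped test cases carry score null (None) and are excluded from the mean.
    scored = [item["score"] for item in items if item["score"] is not None]
    return sum(scored) / len(scored) if scored else 0.0

# Example items: one skipped, two scored.
items = [
    {"id": "test_001", "score": None, "reasoning": "Skipped: not marked for qa evaluation"},
    {"id": "test_002", "score": 1.0},
    {"id": "test_003", "score": 0.5},
]
```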