Agentic Evaluation Types#

Agentic evaluation types assess the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning.

Agentic Evaluation Comparison#

| Option | Use Case | Data Format | Example |
|---|---|---|---|
| Topic Adherence | Measures topic focus in multi-turn conversations | user_input, reference_topics | “Is the agent’s answer about ‘technology’?” |
| Tool Call Accuracy | Evaluates tool/function call correctness | user_input (with tool_calls), reference_tool_calls | “Did the agent call the restaurant booking tool with correct args?” |
| Agent Goal Accuracy with Reference | Assesses goal completion against a reference | user_input, response, reference | “Did the agent book a table as requested?” |
| Agent Goal Accuracy without Reference | Assesses goal completion without a reference | user_input, response | “Did the agent complete the requested task?” |
| Answer Accuracy | Checks factual correctness | user_input, response, reference | “Did the agent answer ‘Paris’ for ‘What is the capital of France?’” |

Prerequisites#

Note

Performance Tuning: You can improve evaluation performance by setting config.params.parallelism to control the number of concurrent requests. A typical default value is 16, but you may need to adjust it based on your model’s capacity and rate limits.


Options#

Topic Adherence#

Topic adherence requires a judge model. The metric_mode parameter selects the aggregation mode; the examples below use f1.

Standard judge configuration:

{
  "type": "agentic",
  "name": "my-agentic-config-topic-adherence",
  "namespace": "my-organization",
  "params": {
    "parallelism": 16
  },
  "tasks": {
    "task1": {
      "type": "topic_adherence",
      "params": {
        "metric_mode": "f1",
        "judge": {
          "model": {
            "api_endpoint": {
              "url": "<nim_url>",
              "model_id": "meta/llama-3.1-70b-instruct",
              "api_key": "<OPTIONAL_JUDGE_API_KEY>"
            },
            "prompt": {
              "inference_params": {
                "temperature": 0.1,
                "max_tokens": 1024,
                "max_retries": 10,
                "request_timeout": 10
              }
            }
          },
          "extra": {
            "judge_sanity_check": true
          }
        }
      }
    }
  }
}
Reasoning judge configuration:

{
  "type": "agentic",
  "name": "my-agentic-config-topic-adherence-reasoning",
  "namespace": "my-organization",
  "params": {
    "parallelism": 16
  },
  "tasks": {
    "task1": {
      "type": "topic_adherence",
      "params": {
        "metric_mode": "f1",
        "judge": {
          "model": {
            "api_endpoint": {
              "url": "<nim_url>",
              "model_id": "nvidia/llama-3.3-nemotron-super-49b-v1",
              "api_key": "<OPTIONAL_JUDGE_API_KEY>"
            },
            "prompt": {
              "system_prompt": "'detailed thinking on'",
              "reasoning_params": {
                "end_token": "</think>"
              },
              "inference_params": {
                "temperature": 0.1,
                "max_tokens": 1024,
                "max_retries": 10,
                "request_timeout": 10
              }
            }
          }
        }
      }
    }
  }
}
Example data:

{
  "user_input": [
    {"content": "how to keep healthy?", "type": "human"},
    {"content": "Sure. Eat more fruit", "type": "ai"}
  ],
  "reference_topics": ["technology"]
}
Example results:

{
  "tasks": {
    "task1": {
      "metrics": {
        "topic_adherence(mode=f1)": {
          "scores": {
            "topic_adherence(mode=f1)": {
              "value": 0.53
            }
          }
        }
      }
    }
  }
}
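
In f1 mode, the reported score is the harmonic mean of the judge’s precision and recall over the reference topics. The following sketch shows only that aggregation step, with hypothetical precision and recall values; in RAGAS, the per-turn topic classification itself is performed by the judge LLM.

# Illustrative sketch of f1-mode aggregation. The precision and recall
# values below are hypothetical, not produced by the example above.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# e.g. 40% of the agent's turns stayed on the reference topics
# (precision) and 80% of the reference topics were covered (recall):
print(round(f1(0.4, 0.8), 2))  # 0.53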

Agent Goal Accuracy with Reference#

Agent goal accuracy with reference requires a judge model.

Standard judge configuration:

{
  "type": "agentic",
  "name": "my-agentic-config-goal-accuracy",
  "namespace": "my-organization",
  "tasks": {
    "task1": {
      "type": "goal_accuracy_with_reference",
      "params": {
        "judge": {
          "model": {
            "api_endpoint": {
              "url": "<nim_url>",
              "model_id": "meta/llama-3.3-70b-instruct",
              "api_key": "<OPTIONAL_JUDGE_API_KEY>"
            },
            "prompt": {
              "inference_params": {
                "max_tokens": 1024,
                "max_retries": 10,
                "request_timeout": 10,
                "temperature": 0.1
              }
            }
          }
        }
      }
    }
  }
}
Reasoning judge configuration:

{
  "type": "agentic",
  "name": "my-agentic-config-goal-accuracy-reasoning",
  "namespace": "my-organization",
  "tasks": {
    "task1": {
      "type": "goal_accuracy_with_reference",
      "params": {
        "judge": {
          "model": {
            "api_endpoint": {
              "url": "<nim_url>",
              "model_id": "nvidia/llama-3.3-nemotron-super-49b-v1",
              "api_key": "<OPTIONAL_JUDGE_API_KEY>"
            },
            "prompt": {
              "system_prompt": "'detailed thinking on'",
              "reasoning_params": {
                "end_token": "</think>"
              },
              "inference_params": {
                "max_tokens": 1024,
                "max_retries": 10,
                "request_timeout": 10,
                "temperature": 0.1
              }
            }
          }
        }
      }
    }
  }
}
Example data:

{
  "user_input": [
    { "content": "Hey, book a table at the nearest best Chinese restaurant for 8:00pm", "type": "user" },
    { "content": "Sure, let me find the best options for you.", "type": "assistant", "tool_calls": [ { "name": "restaurant_search", "args": { "cuisine": "Chinese", "time": "8:00pm" } } ] },
    { "content": "Found a few options: 1. Golden Dragon, 2. Jade Palace", "type": "tool" },
    { "content": "I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?", "type": "assistant" },
    { "content": "Let's go with Golden Dragon.", "type": "user" },
    { "content": "Great choice! I'll book a table for 8:00pm at Golden Dragon.", "type": "assistant", "tool_calls": [ { "name": "restaurant_book", "args": { "name": "Golden Dragon", "time": "8:00pm" } } ] },
    { "content": "Table booked at Golden Dragon for 8:00pm.", "type": "tool" },
    { "content": "Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!", "type": "assistant" },
    { "content": "thanks", "type": "user" }
  ],
  "reference": "Table booked at one of the chinese restaurants at 8 pm"
}
Example results:

{
  "tasks": {
    "task1": {
      "metrics": {
        "agent_goal_accuracy": {
          "scores": {
            "agent_goal_accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Agent Goal Accuracy without Reference#

Agent goal accuracy without reference requires a judge model.

Example configuration:

{
  "type": "agentic",
  "name": "my-agentic-config-goal-accuracy-no-ref",
  "namespace": "my-organization",
  "tasks": {
    "task1": {
      "type": "goal_accuracy_without_reference",
      "params": {
        "judge": {
          "model": {
            "api_endpoint": {
              "url": "<nim_url>",
              "model_id": "meta/llama-3.3-70b-instruct",
              "api_key": "<OPTIONAL_JUDGE_API_KEY>"
            },
            "prompt": {
              "inference_params": {
                "max_tokens": 1024,
                "max_retries": 10,
                "request_timeout": 10,
                "temperature": 0.1
              }
            }
          }
        }
      }
    }
  }
}
Example data:

{
  "user_input": [
    { "content": "Set a reminder for my dentist appointment tomorrow at 2pm", "type": "user" },
    { "content": "I'll set that reminder for you.", "type": "assistant", "tool_calls": [ { "name": "set_reminder", "args": { "title": "Dentist appointment", "date": "tomorrow", "time": "2pm" } } ] },
    { "content": "Reminder set successfully.", "type": "tool" },
    { "content": "Your reminder for the dentist appointment tomorrow at 2pm has been set.", "type": "assistant" }
  ]
}
Example results:

{
  "tasks": {
    "task1": {
      "metrics": {
        "agent_goal_accuracy": {
          "scores": {
            "agent_goal_accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Tool Call Accuracy#

Tool call accuracy compares the agent’s tool calls against reference_tool_calls and does not require a judge model.

Example configuration:

{
  "type": "agentic",
  "name": "my-agentic-config-tool-call-accuracy",
  "namespace": "my-organization",
  "tasks": {
    "task1": {
      "type": "tool_call_accuracy"
    }
  }
}
Example data:

{
  "user_input": [
    {"content": "What's the weather like in New York right now?", "type": "human"},
    {"content": "The current temperature in New York is 75°F and it's partly cloudy.", "type": "ai", "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]},
    {"content": "Can you translate that to Celsius?", "type": "human"},
    {"content": "Let me convert that to Celsius for you.", "type": "ai", "tool_calls": [{"name": "temperature_conversion", "args": {"temperature_fahrenheit": 75}}]},
    {"content": "75°F is approximately 23.9°C.", "type": "tool"},
    {"content": "75°F is approximately 23.9°C.", "type": "ai"}
  ],
  "reference_tool_calls": [
    {"name": "weather_check", "args": {"location": "New York"}},
    {"name": "temperature_conversion", "args": {"temperature_fahrenheit": 75}}
  ]
}
Example results:

{
  "tasks": {
    "task1": {
      "metrics": {
        "tool_call_accuracy": {
          "scores": {
            "tool_call_accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
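
Conceptually, the metric compares the agent’s tool calls, in order, against reference_tool_calls. The following simplified sketch illustrates an exact-match comparison; the actual RAGAS metric also handles sequence alignment and argument comparison in more detail.

# Simplified, illustrative tool-call comparison: fraction of reference
# calls matched exactly by name and arguments, in order. Not the exact
# RAGAS implementation.
from typing import Any

def tool_call_score(predicted: list[dict[str, Any]],
                    reference: list[dict[str, Any]]) -> float:
    if not reference:
        return 0.0
    matched = sum(
        1 for pred, ref in zip(predicted, reference)
        if pred["name"] == ref["name"] and pred["args"] == ref["args"]
    )
    return matched / len(reference)

predicted = [
    {"name": "weather_check", "args": {"location": "New York"}},
    {"name": "temperature_conversion", "args": {"temperature_fahrenheit": 75}},
]
print(tool_call_score(predicted, predicted))  # 1.0, calls match exactly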

Answer Accuracy#

Answer accuracy requires a judge model.

Standard judge configuration:

{
  "type": "agentic",
  "name": "my-agentic-config-answer-accuracy",
  "namespace": "my-organization",
  "tasks": {
    "task1": {
      "type": "answer_accuracy",
      "params": {
        "judge": {
          "extra": {
            "judge_sanity_check": false
          },
          "model": {
            "api_endpoint": {
              "url": "<nim_url>",
              "model_id": "meta/llama-3.1-70b-instruct",
              "api_key": "<OPTIONAL_API_KEY>"
            },
            "prompt": {
              "inference_params": {
                "temperature": 1,
                "max_tokens": 1024,
                "max_retries": 10,
                "request_timeout": 10,
                "stop": ["<|end_of_text|>", "<|eot|>"]
              }
            }
          }
        }
      }
    }
  }
}
Reasoning judge configuration:

{
  "type": "agentic",
  "name": "my-agentic-config-answer-accuracy-reasoning",
  "namespace": "my-organization",
  "tasks": {
    "task1": {
      "type": "answer_accuracy",
      "params": {
        "judge": {
          "extra": {
            "judge_sanity_check": false
          },
          "model": {
            "api_endpoint": {
              "url": "<nim_url>",
              "model_id": "nvidia/llama-3.3-nemotron-super-49b-v1",
              "api_key": "<OPTIONAL_JUDGE_API_KEY>"
            },
            "prompt": {
              "system_prompt": "'detailed thinking on'",
              "reasoning_params": {
                "end_token": "</think>"
              },
              "inference_params": {
                "temperature": 0.1,
                "max_tokens": 1024,
                "max_retries": 10,
                "request_timeout": 10
              }
            }
          }
        }
      }
    }
  }
}
Example data:

{
  "user_input": "What is the capital of France?",
  "response": "Paris",
  "reference": "Paris"
}
Example results:

{
  "tasks": {
    "task1": {
      "metrics": {
        "answer_accuracy": {
          "scores": {
            "answer_accuracy": {
                "value": 1.0
            }
          }
        }
      }
    }
  }
}

LLM as a Judge Schema#

Configure a judge model for each task at tasks.<task_name>.params.judge. The judge model configuration supports both standard and reasoning-enabled models.

Standard Judge Configuration#

{
  "extra": {
    "judge_sanity_check": false
  },
  "model": {
    "api_endpoint": {
      "url": "<nim_url>",
      "model_id": "meta/llama-3.1-70b-instruct",
      "api_key": "<OPTIONAL_API_KEY>"
    },
    "prompt": {
      "inference_params": {
        "temperature": 1,
        "max_tokens": 1024,
        "max_retries": 10,
        "request_timeout": 10,
        "stop": ["<|end_of_text|>", "<|eot|>"]
      }
    }
  }
}

Reasoning Judge Configuration#

For reasoning-enabled models (such as the Nemotron series), configure the judge with reasoning parameters:

Nemotron Reasoning Models#

{
  "extra": {
    "judge_sanity_check": false
  },
  "model": {
    "api_endpoint": {
      "url": "<nim_url>",
      "model_id": "nvidia/llama-3.3-nemotron-super-49b-v1",
      "api_key": "<OPTIONAL_API_KEY>"
    },
    "prompt": {
      "system_prompt": "'detailed thinking on'",
      "reasoning_params": {
        "end_token": "</think>"
      },
      "inference_params": {
        "temperature": 0.1,
        "max_tokens": 1024,
        "max_retries": 10,
        "request_timeout": 10
      }
    }
  }
}

OpenAI Reasoning Models#

{
  "extra": {
    "judge_sanity_check": false
  },
  "model": {
    "api_endpoint": {
      "url": "<openai_url>",
      "model_id": "o1-preview",
      "api_key": "<OPENAI_API_KEY>",
      "format": "openai"
    },
    "prompt": {
      "reasoning_params": {
        "effort": "medium"
      },
      "inference_params": {
        "max_tokens": 1024,
        "max_retries": 10,
        "request_timeout": 10
      }
    }
  }
}

Note

Reasoning Model Configuration: When using reasoning models as judge models in agentic evaluations:

  • Nemotron models: Use system_prompt: "'detailed thinking on'" and reasoning_params.end_token: "</think>" to enable reasoning and trim reasoning traces from the output.

  • OpenAI models: Use reasoning_params.effort to control reasoning depth (“low”, “medium”, or “high”).

  • The end_token parameter is supported for Nemotron reasoning models; it marks where the reasoning trace ends so that the trace can be trimmed before the judgment is parsed (see the sketch after this note).
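
A minimal sketch of what end_token-based trimming does (an illustration of the mechanism, not the evaluator’s exact code):

# Illustrative: discard everything up to and including the reasoning
# end token so only the final judgment is parsed. Mirrors the intent
# of reasoning_params.end_token.

def strip_reasoning(output: str, end_token: str = "</think>") -> str:
    """Return the text after the reasoning end token, if present."""
    _, sep, answer = output.partition(end_token)
    return answer.strip() if sep else output.strip()

raw = "<think>The response matches the reference exactly.</think>Score: 1.0"
print(strip_reasoning(raw))  # Score: 1.0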

Metrics#

Agentic evaluation uses RAGAS metrics to score agent outputs. RAGAS is a library for evaluating retrieval-augmented generation and agentic workflows using standardized, research-backed metrics.

Each task contains the metrics relevant to the agentic evaluation type selected in the job configuration, such as topic adherence, tool call accuracy, agent goal accuracy, or answer accuracy.
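
For orientation, the underlying metrics can also be exercised directly with the ragas Python package. A minimal sketch for tool call accuracy, assuming the current ragas multi-turn API (verify class names and signatures against the ragas documentation):

# Sketch using ragas directly (assumed API; check the ragas docs).
# ToolCallAccuracy is a non-LLM metric, so no judge is needed here.
import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather like in New York right now?"),
        AIMessage(
            content="The current temperature in New York is 75°F.",
            tool_calls=[ToolCall(name="weather_check",
                                 args={"location": "New York"})],
        ),
    ],
    reference_tool_calls=[ToolCall(name="weather_check",
                                   args={"location": "New York"})],
)

score = asyncio.run(ToolCallAccuracy().multi_turn_ascore(sample))
print(score)  # 1.0 when predicted and reference calls match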

Agentic Evaluation Metrics#

| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| topic_adherence(mode=f1) | Measures how well the agent sticks to the reference topics (F1 mode) | 0.0–1.0 | Requires judge LLM |
| tool_call_accuracy | Accuracy of tool call predictions | 0.0–1.0 | |
| agent_goal_accuracy | Accuracy in achieving the agent’s goal with a reference | 0.0–1.0 | With reference; requires judge LLM |
| agent_goal_accuracy | Accuracy in achieving the agent’s goal without a reference | 0.0–1.0 | Without reference; requires judge LLM |
| answer_accuracy | Accuracy of the agent’s answer | 0.0–1.0 | Requires judge LLM |

Limitations#

  • Agentic evaluation works only with cached_outputs targets.

  • The judge model must have at least 70B parameters (preferably more than 405B); otherwise, metric evaluation will fail.

  • Each job computes exactly one metric, and each job can contain only one task. Different metrics can’t be computed on the same dataset or in the same job, because each metric requires a different dataset format.