Agentic Evaluation Types#
Agentic evaluation types assess the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning.
| Option | Use Case | Data Format | Example |
|---|---|---|---|
| Topic Adherence | Measures topic focus in multi-turn conversations | `user_input`, `reference_topics` | “Is the agent’s answer about ‘technology’?” |
| Tool Call Accuracy | Evaluates tool/function call correctness | `user_input` (with `tool_calls`), `reference_tool_calls` | “Did the agent call the restaurant booking tool with the correct arguments?” |
| Agent Goal Accuracy with Reference | Assesses goal completion against a reference | `user_input`, `reference` | “Did the agent book a table as requested?” |
| Answer Accuracy | Checks factual correctness | `user_input`, `response`, `reference` | “Did the agent answer ‘Paris’ to ‘What is the capital of France?’” |
Prerequisites#
Set up or select an existing `cached_outputs` evaluation target.
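If you still need a target, the sketch below shows one way to register a `cached_outputs` target over HTTP. The base URL, the `POST /v1/evaluation/targets` route, and the `files_url` value are illustrative assumptions; check the Evaluator API reference for the exact schema in your deployment.

```python
import requests

EVALUATOR_URL = "http://evaluator.example.com"  # assumed base URL

# Hypothetical target that points at pre-generated (cached) agent outputs.
target = {
    "type": "cached_outputs",
    "name": "my-agent-outputs",
    "namespace": "my-organization",
    "cached_outputs": {
        # Assumed dataset location; use wherever your outputs actually live.
        "files_url": "hf://datasets/my-organization/my-agent-outputs"
    },
}

response = requests.post(f"{EVALUATOR_URL}/v1/evaluation/targets", json=target)
response.raise_for_status()
print(response.json())
```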
Options#
Topic Adherence#
Example configuration:

```json
{
  "type": "agentic",
  "name": "my-agentic-config-topic-adherence",
  "namespace": "my-organization",
  "tasks": {
    "task1": {
      "type": "topic_adherence",
      "params": {
        "metric_mode": "f1",
        "judge": {
          "model": {
            "format": "nvidia-nim",
            "url": "<nim_url>",
            "model_id": "meta/llama-3.1-8b-instruct"
          }
        }
      }
    }
  }
}
```
Example dataset entry:

```json
{
  "user_input": [
    {"content": "how to keep healthy?", "type": "human"},
    {"content": "Sure. Eat more fruit", "type": "ai"}
  ],
  "reference_topics": ["technology"]
}
```
Example results:

```json
{
  "tasks": {
    "task1": {
      "metrics": {
        "topic_adherence(mode=f1)": {
          "scores": {
            "topic_adherence(mode=f1)": {
              "value": 0.53
            }
          }
        }
      }
    }
  }
}
```
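With the configuration and dataset in place, you run the evaluation by submitting a job that pairs the config with your `cached_outputs` target. A minimal sketch, assuming a `POST /v1/evaluation/jobs` route and `namespace/name` references; adapt it to your deployment’s actual API:

```python
import requests

EVALUATOR_URL = "http://evaluator.example.com"  # assumed base URL

# Hypothetical job pairing the topic-adherence config with an existing target.
job = {
    "namespace": "my-organization",
    "config": "my-organization/my-agentic-config-topic-adherence",
    "target": "my-organization/my-agent-outputs",  # assumed target name
}

response = requests.post(f"{EVALUATOR_URL}/v1/evaluation/jobs", json=job)
response.raise_for_status()
print(response.json())  # the returned job ID is used to poll status and results
```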
Agent Goal Accuracy with Reference#
Example configuration:

```json
{
  "type": "agentic",
  "name": "my-agentic-config-goal-accuracy",
  "namespace": "my-organization",
  "tasks": {
    "task1": {
      "type": "agent_goal_accuracy_with_reference",
      "params": {
        "judge": {
          "model": {
            "format": "nvidia-nim",
            "url": "<nim_url>",
            "model_id": "meta/llama-3.3-70b-instruct"
          },
          "inference_params": {
            "max_new_tokens": 1024,
            "max_retries": 10,
            "request_timeout": 10,
            "temperature": 0.1
          }
        }
      }
    }
  }
}
```
Example dataset entry:

```json
{
  "user_input": [
    {"content": "Hey, book a table at the nearest best Chinese restaurant for 8:00pm", "role": "user"},
    {"content": "Sure, let me find the best options for you.", "role": "assistant", "tool_calls": [{"name": "restaurant_search", "args": {"cuisine": "Chinese", "time": "8:00pm"}}]},
    {"content": "Found a few options: 1. Golden Dragon, 2. Jade Palace", "role": "tool"},
    {"content": "I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?", "role": "assistant"},
    {"content": "Let's go with Golden Dragon.", "role": "user"},
    {"content": "Great choice! I'll book a table for 8:00pm at Golden Dragon.", "role": "assistant", "tool_calls": [{"name": "restaurant_book", "args": {"name": "Golden Dragon", "time": "8:00pm"}}]},
    {"content": "Table booked at Golden Dragon for 8:00pm.", "role": "tool"},
    {"content": "Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!", "role": "assistant"},
    {"content": "thanks", "role": "user"}
  ],
  "reference": "Table booked at one of the chinese restaurants at 8 pm"
}
```
Example results:

```json
{
  "tasks": {
    "task1": {
      "metrics": {
        "agent_goal_accuracy": {
          "scores": {
            "agent_goal_accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
```
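`agent_goal_accuracy_with_reference` is a binary metric: the judge LLM reads the whole conversation and decides whether its end state satisfies the reference outcome, producing 1.0 or 0.0. The mock below only illustrates that judging step; the prompt and the `judge` callable are hypothetical stand-ins, not the actual RAGAS implementation.

```python
# Hypothetical prompt; the real RAGAS judge prompt is more elaborate.
JUDGE_PROMPT = """Given the conversation below, decide whether the agent achieved
the desired outcome: "{reference}".
Answer only "1" (achieved) or "0" (not achieved).

Conversation:
{conversation}"""

def goal_accuracy(conversation: str, reference: str, judge) -> float:
    # `judge` is any callable that sends a prompt to the judge LLM and
    # returns its text completion.
    verdict = judge(JUDGE_PROMPT.format(reference=reference, conversation=conversation))
    return 1.0 if verdict.strip().startswith("1") else 0.0

# Mock judge standing in for a real LLM call.
score = goal_accuracy(
    "user: Book a table for 8:00pm.\nassistant: Your table at Golden Dragon is booked for 8:00pm.",
    "Table booked at one of the chinese restaurants at 8 pm",
    judge=lambda prompt: "1",
)
print(score)  # 1.0
```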
Tool Call Accuracy#
Example configuration:

```json
{
  "type": "agentic",
  "name": "my-agentic-config-tool-call-accuracy",
  "namespace": "my-organization",
  "tasks": {
    "task1": {
      "type": "tool_call_accuracy"
    }
  }
}
```
Example dataset entry:

```json
{
  "user_input": [
    {"content": "What's the weather like in New York right now?", "type": "human"},
    {"content": "The current temperature in New York is 75°F and it's partly cloudy.", "type": "ai", "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]},
    {"content": "Can you translate that to Celsius?", "type": "human"},
    {"content": "Let me convert that to Celsius for you.", "type": "ai", "tool_calls": [{"name": "temperature_conversion", "args": {"temperature_fahrenheit": 75}}]},
    {"content": "75°F is approximately 23.9°C.", "type": "tool"},
    {"content": "75°F is approximately 23.9°C.", "type": "ai"}
  ],
  "reference_tool_calls": [
    {"name": "weather_check", "args": {"location": "New York"}},
    {"name": "temperature_conversion", "args": {"temperature_fahrenheit": 75}}
  ]
}
```
Example results:

```json
{
  "tasks": {
    "task1": {
      "metrics": {
        "tool_call_accuracy": {
          "scores": {
            "tool_call_accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
```
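Unlike the other options, `tool_call_accuracy` needs no judge model: the tool calls the agent actually made are compared directly against `reference_tool_calls`. The snippet below is an illustrative re-implementation of that idea (exact match on name and arguments, in order), not the exact RAGAS scoring code.

```python
def tool_call_accuracy(predicted: list[dict], reference: list[dict]) -> float:
    """Illustrative scorer: fraction of reference tool calls matched, in order,
    with an identical name and identical arguments."""
    if not reference:
        return 1.0
    matches = sum(
        1
        for pred, ref in zip(predicted, reference)
        if pred["name"] == ref["name"] and pred["args"] == ref["args"]
    )
    return matches / len(reference)

# The two calls from the example dataset above, matched perfectly.
calls = [
    {"name": "weather_check", "args": {"location": "New York"}},
    {"name": "temperature_conversion", "args": {"temperature_fahrenheit": 75}},
]
print(tool_call_accuracy(calls, calls))  # 1.0
```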
Answer Accuracy#
Example configuration:

```json
{
  "type": "agentic",
  "name": "my-agentic-config-answer-accuracy",
  "namespace": "my-organization",
  "tasks": {
    "task1": {
      "type": "answer_accuracy",
      "params": {
        "judge": {
          "judge_sanity_check": "false",
          "model": {
            "url": "<nim_url>",
            "model_id": "meta/llama-3.1-8b-instruct"
          },
          "inference_params": {
            "temperature": 1,
            "max_new_tokens": 1024,
            "max_retries": 10,
            "request_timeout": 10,
            "stop": ["<|end_of_text|>", "<|eot|>"]
          }
        },
        "ragas_run_config": {
          "max_workers": 10
        }
      }
    }
  }
}
```
Example dataset entry:

```json
{
  "user_input": "What is the capital of France?",
  "response": "Paris",
  "reference": "Paris"
}
```
Example results:

```json
{
  "tasks": {
    "task1": {
      "metrics": {
        "answer_accuracy": {
          "scores": {
            "answer_accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
```
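Results like those above are read back from the completed job. A minimal sketch, assuming a `GET /v1/evaluation/jobs/{job_id}/results` route and a job ID captured at submission; both are assumptions, so confirm the route against the Evaluator API reference.

```python
import requests

EVALUATOR_URL = "http://evaluator.example.com"  # assumed base URL
job_id = "eval-abc123"  # hypothetical job ID returned when the job was created

response = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}/results")
response.raise_for_status()
results = response.json()

# Drill into the score, mirroring the result shape shown above.
value = results["tasks"]["task1"]["metrics"]["answer_accuracy"][
    "scores"]["answer_accuracy"]["value"]
print(value)  # e.g. 1.0
```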
Metrics#
Agentic evaluation uses RAGAS metrics to score agent outputs. RAGAS is a library for evaluating retrieval-augmented generation and agentic workflows using standardized, research-backed metrics.
Each task reports the metric selected in the job configuration: topic adherence, tool call accuracy, agent goal accuracy, or answer accuracy.
| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| `topic_adherence` | Measures how well the agent sticks to the assigned topic (F1 mode) | 0.0–1.0 | Requires judge LLM |
| `tool_call_accuracy` | Accuracy of tool call predictions | 0.0–1.0 | |
| `agent_goal_accuracy` | Accuracy in achieving the agent’s goal | 0.0–1.0 | Requires judge LLM |
| `answer_accuracy` | Accuracy of the agent’s answer | 0.0–1.0 | Requires judge LLM |
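In F1 mode, `topic_adherence` uses the judge to decide which reference topics the conversation actually stays on, then folds precision and recall over those topic sets into an F1 score. The exact RAGAS prompting and counting rules are more involved; the snippet below only illustrates the final arithmetic, with the judge’s topic classification mocked.

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Mocked judge output: the conversation covers "health", while the
# reference topic is "technology", so there is no overlap.
covered = {"health"}
reference = {"technology"}
true_positives = len(covered & reference)
precision = true_positives / len(covered) if covered else 0.0
recall = true_positives / len(reference) if reference else 0.0
print(f1(precision, recall))  # 0.0
```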
Limitations#
- Agentic evaluation works only with `cached_outputs` targets.
- The judge model must have at least 70B parameters (preferably more than 405B); otherwise, metric evaluation fails.
- Each job computes one metric, and each job can contain only one task. Different metrics can’t be computed on the same dataset or in the same job, because each metric requires a different dataset format.