Agentic Evaluation Types#

Agentic evaluation types assess the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning.

Prerequisites#

Before running agentic evaluations, ensure you have:

Dataset Requirements:

Uploaded your dataset to NeMo Data Store using the Hugging Face CLI or SDK (for custom datasets)
Registered your dataset in NeMo Entity Store using the Dataset APIs (for custom datasets)
Set up or selected an existing cached_outputs evaluation target
Formatted your data with cached outputs (pre-generated model responses)

Model Configuration:

Judge LLM configured for evaluation metrics (required for most agentic tasks; Tool Call Accuracy is the exception)
Proper task-specific data fields (varies by agentic task type)

Tip

For a complete dataset creation walkthrough, see the dataset management tutorials or follow the end-to-end evaluation example.

Note

Performance Tuning: Set config.params.parallelism to control the number of concurrent requests and improve evaluation throughput. A typical default value is 16; adjust it to match your model’s capacity and rate limits.
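As a minimal sketch, the snippet below shows where the parallelism setting would sit, assuming the evaluation config is a nested mapping whose params object holds parallelism (as the path config.params.parallelism suggests). The helper name with_parallelism and the otherwise empty config are illustrative and not part of the product API.

```python
import copy
import json


def with_parallelism(config: dict, parallelism: int) -> dict:
    """Return a copy of config with params.parallelism set.

    Hypothetical helper: only the params.parallelism path comes from the text.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be a positive integer")
    updated = copy.deepcopy(config)
    updated.setdefault("params", {})["parallelism"] = parallelism
    return updated


# Start from the typical value of 16, then lower it for a rate-limited endpoint.
base_config = with_parallelism({}, 16)
throttled_config = with_parallelism(base_config, 4)
print(json.dumps(throttled_config, indent=2))
```

Lowering the value trades throughput for fewer rate-limit errors; raising it only helps if the model endpoint can serve that many concurrent requests.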

Agentic Evaluation Comparison#
Option	Use Case	Data Format	Example
Topic Adherence	Measures topic focus in multi-turn conversations	user_input, reference_topics	“Is the agent’s answer about ‘technology’?”
Tool Call Accuracy	Evaluates tool/function call correctness	user_input (with tool_calls)	“Did the agent call the restaurant booking tool with correct args?”
Agent Goal Accuracy with Reference	Assesses goal completion with reference	user_input, response, reference	“Did the agent book a table as requested?”
Agent Goal Accuracy without Reference	Assesses goal completion without reference	user_input, response	“Did the agent complete the requested task?”
Answer Accuracy	Checks factual correctness	user_input, response, reference	“Did the agent answer ‘Paris’ for ‘What is the capital of France’?”
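Because each option expects different fields, it can help to check a dataset record before uploading it. The following is a rough sketch that copies the field names from the Data Format column above; the dictionary keys and the function name missing_fields are illustrative labels, not official task identifiers, and the check does not validate the structure of the values themselves.

```python
# Required fields per agentic task, copied from the "Data Format" column above.
# The dictionary keys are illustrative labels, not official task identifiers.
REQUIRED_FIELDS = {
    "topic_adherence": ("user_input", "reference_topics"),
    "tool_call_accuracy": ("user_input",),
    "agent_goal_accuracy_with_reference": ("user_input", "response", "reference"),
    "agent_goal_accuracy_without_reference": ("user_input", "response"),
    "answer_accuracy": ("user_input", "response", "reference"),
}


def missing_fields(task: str, record: dict) -> list:
    """List the required fields that are absent from one dataset record."""
    return [field for field in REQUIRED_FIELDS[task] if field not in record]


record = {
    "user_input": "What is the capital of France?",
    "response": "Paris",
}
print(missing_fields("answer_accuracy", record))                        # ['reference']
print(missing_fields("agent_goal_accuracy_without_reference", record))  # []
```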

LLM as a Judge Schema#

Configure the judge for a task under tasks.params.judge. The judge model configuration supports both standard and reasoning-enabled models.

Standard Judge Configuration#

{
  "extra": {
    "judge_sanity_check": false
  },
  "model": {
    "api_endpoint": {
      "url": "<nim_url>",
      "model_id": "meta/llama-3.1-70b-instruct",
      "api_key": "<OPTIONAL_API_KEY>"
    },
    "prompt": {
      "inference_params": {
        "temperature": 1,
        "max_tokens": 1024,
        "max_retries": 10,
        "request_timeout": 10,
        "stop": ["<|end_of_text|>", "<|eot|>"]
      }
    }
  }
}
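If you build configurations programmatically, the same judge block can be assembled in code and placed under a task's params.judge entry. This is a sketch only: the function name standard_judge and the bare task dictionary are assumptions made for illustration, and the field values simply mirror the standard configuration above.

```python
import json
from typing import Optional


def standard_judge(url: str, model_id: str, api_key: Optional[str] = None) -> dict:
    """Build a judge block with the same fields as the standard configuration."""
    api_endpoint = {"url": url, "model_id": model_id}
    if api_key:  # the API key is optional, so include it only when given
        api_endpoint["api_key"] = api_key
    return {
        "extra": {"judge_sanity_check": False},
        "model": {
            "api_endpoint": api_endpoint,
            "prompt": {
                "inference_params": {
                    "temperature": 1,
                    "max_tokens": 1024,
                    "max_retries": 10,
                    "request_timeout": 10,
                    "stop": ["<|end_of_text|>", "<|eot|>"],
                }
            },
        },
    }


# Hypothetical task entry: only the params.judge location comes from the text.
task = {"params": {"judge": standard_judge("<nim_url>", "meta/llama-3.1-70b-instruct")}}
print(json.dumps(task, indent=2))
```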

Reasoning Judge Configuration#

For reasoning-enabled models (such as the Nemotron series), configure the judge with reasoning parameters:

Nemotron Reasoning Models#

{
  "extra": {
    "judge_sanity_check": false
  },
  "model": {
    "api_endpoint": {
      "url": "<nim_url>",
      "model_id": "nvidia/llama-3.3-nemotron-super-49b-v1",
      "api_key": "<OPTIONAL_API_KEY>"
    },
    "prompt": {
      "system_prompt": "'detailed thinking on'",
      "reasoning_params": {
        "end_token": "</think>"
      },
      "inference_params": {
        "temperature": 0.1,
        "max_tokens": 1024,
        "max_retries": 10,
        "request_timeout": 10
      }
    }
  }
}

OpenAI Reasoning Models#

{
  "extra": {
    "judge_sanity_check": false
  },
  "model": {
    "api_endpoint": {
      "url": "<openai_url>",
      "model_id": "o1-preview",
      "api_key": "<OPENAI_API_KEY>",
      "format": "openai"
    },
    "prompt": {
      "reasoning_params": {
        "effort": "medium"
      },
      "inference_params": {
        "max_tokens": 1024,
        "max_retries": 10,
        "request_timeout": 10
      }
    }
  }
}

Note

Reasoning Model Configuration: When using reasoning models as judge models in agentic evaluations:

Nemotron models: Use system_prompt: "'detailed thinking on'" and reasoning_params.end_token: "</think>" to enable reasoning and trim reasoning traces from the output.
OpenAI models: Use reasoning_params.effort to control reasoning depth (“low”, “medium”, or “high”).
The end_token parameter is supported for Nemotron reasoning models when configured correctly.
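To illustrate what trimming a reasoning trace with end_token means, the sketch below drops everything up to and including the end token from a raw judge completion. It only illustrates the idea and is not the service's actual implementation; the function name strip_reasoning and the sample completions are made up.

```python
def strip_reasoning(completion: str, end_token: str = "</think>") -> str:
    """Return the text after the last end_token, or the whole text if absent."""
    _, found, answer = completion.rpartition(end_token)
    return answer.strip() if found else completion.strip()


raw = "<think>The agent booked the table as asked.</think>\nVerdict: 1"
print(strip_reasoning(raw))           # Verdict: 1
print(strip_reasoning("Verdict: 0"))  # Verdict: 0
```

Without this kind of trimming, the reasoning text would be passed along with the verdict and could confuse whatever parses the judge's answer.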

Metrics#

Agentic evaluation uses RAGAS metrics to score agent outputs. RAGAS is a library for evaluating retrieval-augmented generation and agentic workflows using standardized, research-backed metrics.

Each task reports the metric for its agentic evaluation type (topic adherence, tool call accuracy, agent goal accuracy, or answer accuracy), as selected in the job configuration.

Agentic Evaluation Metrics#
Metric Name	Description	Value Range	Notes
`topic_adherence(mode=f1)`	Measures how well the agent sticks to the assigned topic (F1 mode)	0.0–1.0	Requires judge LLM
`tool_call_accuracy`	Accuracy of tool call predictions	0.0–1.0	No judge LLM required
`agent_goal_accuracy`	Accuracy in achieving the agent’s goal with reference	0.0–1.0	With reference; requires judge LLM
`agent_goal_accuracy`	Accuracy in achieving the agent’s goal without reference	0.0–1.0	Without reference; requires judge LLM
`answer_accuracy`	Accuracy of the agent’s answer	0.0–1.0	Requires judge LLM
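An F1 score is the harmonic mean of precision and recall, which is why the F1 mode of topic adherence stays within the 0.0–1.0 range. The sketch below shows that arithmetic only; how RAGAS derives the precision and recall of topic adherence from judge verdicts is not reproduced here, and the input values are made up.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.5, 1.0), 3))  # 0.667
print(f1_score(0.0, 0.0))            # 0.0
```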

Limitations#

Agentic evaluation only works with cached_outputs targets.
The judge model must have at least 70B parameters (preferably >405B); otherwise, metric evaluation will fail. See Troubleshooting Unsupported Judge Model for more details.
Each job computes a single metric, and a job can contain only one task. Different metrics can’t be computed in the same job or on the same dataset, because each metric requires a different dataset format.
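The first and last limitations can be checked before a job is submitted. The sketch below is one possible pre-flight check; the job layout it inspects (job["target"]["type"] and job["config"]["tasks"]) is an assumed shape for illustration, not a documented schema, and the function name check_agentic_job is hypothetical.

```python
def check_agentic_job(job: dict) -> list:
    """Return a list of problems based on the limitations listed above.

    The job layout (job["target"]["type"], job["config"]["tasks"]) is an
    assumed shape for illustration, not a documented schema.
    """
    problems = []
    if job.get("target", {}).get("type") != "cached_outputs":
        problems.append("target must be a cached_outputs target")
    tasks = job.get("config", {}).get("tasks", {})
    if len(tasks) != 1:
        problems.append(f"expected exactly one task, found {len(tasks)}")
    return problems


job = {
    "target": {"type": "cached_outputs"},
    "config": {"tasks": {"task-a": {}, "task-b": {}}},
}
print(check_agentic_job(job))  # ['expected exactly one task, found 2']
```

To compute several metrics, you would split such a job into one job per metric, each with its own single task and correctly formatted dataset.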
