Evaluating NVIDIA AgentIQ Workflows Details#
Note
It is recommended that the Evaluating AgentIQ Workflows guide be read before proceeding with this detailed documentation.
AgentIQ provides a set of evaluators to run and evaluate the AgentIQ workflows. In addition to the built-in evaluators, AgentIQ provides a plugin system to add custom evaluators.
Example:
aiq eval --config_file=examples/simple/configs/eval_config.yml
Using Datasets#
Run and evaluate the workflow on a specified dataset. The supported dataset file types are json, jsonl, csv, xls, and parquet.
Download the datasets provided by the AgentIQ examples by running the following commands:
git lfs fetch
git lfs pull
The dataset used for evaluation is specified in the configuration file via eval.general.dataset. For example, to use the langsmith.json dataset, the configuration is as follows:
eval:
  general:
    dataset:
      _type: json
      file_path: examples/simple/data/langsmith.json
Dataset Format#
Each dataset file contains a list of records. Each record is a dictionary with keys as the column names and values as the data. For example, a sample record in a json dataset file is as follows:
{
    "id": "q_1",
    "question": "What is langsmith",
    "answer": "LangSmith is a platform for LLM application development, monitoring, and testing"
},
A dataset entry is either structured or unstructured.
For structured entries, the default column names are id, question, and answer; the evaluation system treats question as the workflow input and answer as the expected output. You can change the column names and their configuration in the config.yml file with eval.general.dataset.structure.
eval:
  general:
    dataset:
      structure:
        id: "my_id"
        question: "my_question"
        answer: "my_answer"
For unstructured entries, the entire dictionary is the input to the workflow, and the evaluation system makes no assumptions about the individual columns. The input and the workflow output are passed to evaluators that can handle unstructured entries, such as the swe-bench evaluator. The following is an example configuration for the swe-bench evaluator:
eval:
  general:
    dataset:
      _type: json
      file_path: examples/swe_bench/data/test_dataset_lite.json
      id_key: instance_id
      structure: # For swe-bench the entire row is the input
        disable: true
Filtering Datasets#
While evaluating large datasets, you can filter the dataset to a
smaller subset by allowing or denying entries with the eval.general.dataset.filter
in the config.yml
file. The filter is a dictionary with keys as the column names and
values as the filter values.
The following is an example configuration that runs evaluation on a subset of the swe-bench-verified dataset. The allowlist restricts the run to the two entries with instance identifiers (instance_id) sympy__sympy-20590 and sympy__sympy-21055.
eval:
  general:
    dataset:
      _type: json
      file_path: examples/swe_bench/data/test_dataset_verified.json
      id_key: instance_id
      structure:
        disable: true
      filter:
        allowlist:
          field:
            instance_id:
              - sympy__sympy-20590
              - sympy__sympy-21055
The swe-bench verified dataset has 500 entries, but the above configuration runs the workflow and evaluation on only those two entries. This is useful for iterative development and troubleshooting of the workflow.
You can also skip entries in the dataset. Here is an example configuration to skip the entries with instance_id sympy__sympy-20590 and sympy__sympy-21055:
eval:
  general:
    dataset:
      _type: json
      file_path: examples/swe_bench/data/test_dataset_verified.json
      id_key: instance_id
      structure:
        disable: true
      filter:
        denylist:
          field:
            instance_id:
              - sympy__sympy-20590
              - sympy__sympy-21055
AgentIQ Built-in Evaluators#
AgentIQ provides the following built-in evaluators:
ragas - An evaluator to run and evaluate RAG-like workflows using the public RAGAS API.
trajectory - An evaluator to run and evaluate the LangChain agent trajectory.
swe_bench - An evaluator to run and evaluate the workflow on the SWE-Bench dataset.
RAGAS Evaluator#
RAGAS is an OSS evaluation framework that enables end-to-end evaluation of RAG workflows. AgentIQ provides an interface to RAGAS to evaluate the performance of RAG-like AgentIQ workflows.
RAGAS provides a set of evaluation metrics that you can configure in the config.yml file by adding an evaluator section with _type ragas.
Example:
eval:
  evaluators:
    rag_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_rag_eval_llm
    rag_groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: nim_rag_eval_llm
    rag_relevance:
      _type: ragas
      metric: ContextRelevance
      llm_name: nim_rag_eval_llm
    rag_factual_correctness:
      _type: ragas
      metric:
        FactualCorrectness:
          kwargs:
            mode: precision
      llm_name: nim_rag_eval_large_llm # requires more tokens
In the example, four ragas evaluators are configured to evaluate various ragas metrics. The metric can be a string or a dictionary. If the metric is a dictionary, the kwargs provided are passed to the metric function.
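Each evaluator also references a judge LLM through llm_name. These names must be defined in the llms section of the same config.yml. Here is a minimal sketch assuming NIM-hosted judge models; the model name and token limits are illustrative placeholders, not a recommendation:
llms:
  nim_rag_eval_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    max_tokens: 8
  nim_rag_eval_large_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    max_tokens: 2048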
The following ragas metrics are recommended for RAG-like workflows:
AnswerAccuracy: Evaluates the accuracy of the answer generated by the workflow against the expected answer or ground truth.
ContextRelevance: Evaluates the relevance of the context retrieved by the workflow against the question.
ResponseGroundedness: Evaluates the groundedness of the response generated by the workflow based on the context retrieved by the workflow.
Agent Trajectory Evaluator#
The trajectory evaluator uses LangChain agent trajectory evaluation to evaluate the workflow. To use the trajectory evaluator, add the following configuration to the config.yml file:
eval:
  evaluators:
    trajectory:
      _type: trajectory
      llm_name: nim_trajectory_eval_llm
SWE-Bench Evaluator#
Workflows can use the swe-bench evaluator to evaluate solutions to swe-bench problems. To evaluate the patch generated by the workflow, the evaluator installs the repository and runs the PASS_TO_PASS and FAIL_TO_PASS tests.
Example:
eval:
  general:
    dataset:
      _type: json
      file_path: examples/swe_bench/data/test_dataset_lite.json
      id_key: instance_id
      structure: # For swe-bench the entire row is the input
        disable: true
  evaluators:
    swe_bench:
      _type: swe_bench
      run_id: aiq_1
The swe-bench evaluator uses unstructured dataset entries. The entire row is provided as input to the workflow.
Adding Custom Evaluators#
You can add custom evaluators to evaluate the workflow output. To add a custom evaluator, you need to implement the evaluator and register it with the AgentIQ evaluator system. See the Custom Evaluator documentation for more information.
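The registration mechanics are covered in that documentation; at its core, though, a custom evaluator is a scoring function that scores each entry and aggregates an average. Here is a minimal, framework-independent sketch of that logic; the function names, record keys, and exact-match metric are illustrative, not part of the AgentIQ API:
import json

def exact_match_score(entry: dict) -> float:
    # Score 1.0 when the generated answer matches the expected answer.
    expected = entry.get("answer", "").strip().lower()
    generated = entry.get("generated_answer", "").strip().lower()
    return 1.0 if expected == generated else 0.0

def evaluate_file(path: str) -> dict:
    # Read a workflow output file and produce output shaped like the
    # built-in evaluators: an average score plus per-entry scores.
    with open(path) as f:
        entries = json.load(f)
    items = [{"id": e.get("id"), "score": exact_match_score(e)} for e in entries]
    average = sum(i["score"] for i in items) / len(items) if items else 0.0
    return {"average_score": average, "eval_output_items": items}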
Running multiple repetitions#
You can run multiple repetitions of the evaluation with the --reps command-line option. For example, to run the evaluation 5 times, run the following command:
aiq eval --config_file=examples/simple/configs/eval_config.yml --reps=5
This will allow you to get an average score across multiple runs and analyze the variation in the generated outputs.
Running evaluation on large datasets#
Similar to how evaluators are run in parallel, entries in the dataset are also processed in parallel. Concurrency is configurable using the eval.general.concurrency
parameter in the config.yml
file. The default value is 8. Increase or decrease the value based on the available resources.
eval:
  general:
    concurrency: 4
Pick up where you left off#
When running evaluation on a large dataset, you can resume an interrupted run from where it left off. This is particularly useful with overloaded services that may time out while running the workflow. When that happens, a workflow interrupted warning is issued and the workflow output generated so far is saved to a file.
You can then re-run the evaluation on that output file with the --skip_completed_entries option.
Pass-1:
aiq eval --config_file=examples/simple/configs/eval_config.yml
Assume this pass results in a workflow interrupted warning. You can then do another pass.
Pass-2:
aiq eval --config_file=examples/simple/configs/eval_config.yml --skip_completed_entries --dataset=.tmp/aiq/examples/simple/workflow_output.json
Running evaluation offline#
You can evaluate a dataset with previously generated answers via the --skip_workflow option. In this case, the dataset has both the expected answer and the generated_answer.
aiq eval --config_file=examples/simple/configs/eval_config.yml --skip_workflow --dataset=.tmp/aiq/examples/simple/workflow_output.json
This assumes that the workflow output was previously generated and stored in the .tmp/aiq/examples/simple/workflow_output.json file.
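Each record in that file carries both the reference answer and the generated answer. For example (abridged from the sample workflow output shown below):
{
    "id": "1",
    "question": "What is langsmith",
    "answer": "LangSmith is a platform for LLM application development, monitoring, and testing",
    "generated_answer": "LangSmith is a platform for LLM (Large Language Model) application development, monitoring, and testing."
}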
Running the workflow over a dataset without evaluation#
You can do this by running aiq eval with a workflow configuration file that includes an eval section with no evaluators.
eval:
  general:
    output_dir: ./.tmp/aiq/examples/simple/
    dataset:
      _type: json
      file_path: examples/simple/data/langsmith.json
Evaluation output#
The output of the workflow is stored as workflow_output.json in the output_dir provided in the config.yml:
eval:
  general:
    output_dir: ./.tmp/aiq/examples/simple/
Here is a sample workflow output snippet generated by running evaluation on the simple example workflow:
{
    "id": "1",
    "question": "What is langsmith",
    "answer": "LangSmith is a platform for LLM application development, monitoring, and testing",
    "generated_answer": "LangSmith is a platform for LLM (Large Language Model) application development, monitoring, and testing. It provides features such as automations, threads, annotating traces, adding runs to a dataset, prototyping, and debugging to support the development lifecycle of LLM applications.",
    "intermediate_steps": [
        {
            >>>>>>>>>>>>>>> SNIPPED >>>>>>>>>>>>>>>>>>>>>>
        }
    ],
    "expected_intermediate_steps": []
},
The output of each evaluator is stored in a distinct file in the same output_dir, named <evaluator_name>_output.json. An evaluator typically provides an average score and a per-entry score. Here is a sample rag_accuracy output:
{
    "average_score": 0.6666666666666666,
    "eval_output_items": [
        {
            "id": 1,
            "score": 0.5,
            "reasoning": {
                "user_input": "What is langsmith"
            }
        },
        {
            "id": 2,
            "score": 0.75,
            "reasoning": {
                "user_input": "How do I prototype with langsmith"
            }
        },
        {
            "id": 3,
            "score": 0.75,
            "reasoning": {
                "user_input": "What are langsmith automations?"
            }
        }
    ]
}
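Because the evaluator output is plain JSON, it is easy to post-process with a short script. Here is a minimal sketch that prints the average score and flags low-scoring entries; the file path assumes the output_dir configured above, and the 0.7 threshold is arbitrary:
import json

# Load a sample evaluator output file from the configured output_dir.
with open(".tmp/aiq/examples/simple/rag_accuracy_output.json") as f:
    results = json.load(f)

print(f"average score: {results['average_score']:.3f}")
for item in results["eval_output_items"]:
    # Flag entries scoring below an illustrative threshold for review.
    if item["score"] < 0.7:
        print(f"entry {item['id']} scored {item['score']}")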
Profiling and Performance Monitoring of AgentIQ Workflows#
You can profile workflows via the AgentIQ evaluation system. For more information, see the Profiler documentation.