Custom Evaluator#
Note
It is recommended that the Evaluating AgentIQ Workflows guide be read before proceeding with this detailed documentation.
AgentIQ provides a set of evaluators to run and evaluate AgentIQ workflows. In addition to the built-in evaluators, AgentIQ provides a plugin system to add custom evaluators.
Existing Evaluators#
You can view the list of existing evaluators by running the following command:
aiq info components -t evaluator
ragas is an example of an existing evaluator. The ragas evaluator is used to evaluate the accuracy of a workflow output.
Extending AgentIQ with Custom Evaluators#
To extend AgentIQ with custom evaluators, you need to create an evaluator function and register it with AgentIQ.
This section provides a step-by-step guide to create and register a custom evaluator with AgentIQ. A similarity evaluator is used as an example to demonstrate the process.
Evaluator Configuration#
The evaluator configuration is used to specify the evaluator name and other evaluator-specific configuration parameters. The evaluator function provides an asynchronous evaluation method.
The following is an example of an evaluator configuration and evaluator function. We add this code to a new evaluator_register.py file in the simple example directory for testing purposes.
examples/simple/src/aiq_simple/evaluator_register.py:
from pydantic import Field

from aiq.builder.builder import EvalBuilder
from aiq.builder.evaluator import EvaluatorInfo
from aiq.cli.register_workflow import register_evaluator
from aiq.data_models.evaluator import EvaluatorBaseConfig


class SimilarityEvaluatorConfig(EvaluatorBaseConfig, name="similarity"):
    '''Configuration for custom similarity evaluator'''
    similarity_type: str = Field(description="Similarity type to be computed", default="cosine")


@register_evaluator(config_type=SimilarityEvaluatorConfig)
async def register_similarity_evaluator(config: SimilarityEvaluatorConfig, builder: EvalBuilder):
    '''Register custom evaluator'''
    from .similarity_evaluator import SimilarityEvaluator

    evaluator = SimilarityEvaluator(config.similarity_type, builder.get_max_concurrency())

    yield EvaluatorInfo(config=config, evaluate_fn=evaluator.evaluate, description="Similarity Evaluator")
SimilarityEvaluatorConfig specifies the evaluator name. The similarity_type configuration parameter is used to specify the type of similarity to be computed.
The register_similarity_evaluator function registers the evaluator with AgentIQ via the register_evaluator decorator. This function provides an asynchronous evaluation method. The SimilarityEvaluator class and the evaluation method, evaluator.evaluate, are explained in the Similarity Evaluator section.
To ensure that the evaluator is registered, the evaluator function is imported, but not used, in the simple example's register.py file.
examples/simple/src/aiq_simple/register.py:
from .evaluator_register import register_similarity_evaluator  # pylint: disable=unused-import
Understanding EvalInput and EvalOutput#
The asynchronous evaluate method provided by the custom evaluator takes an EvalInput object as input and returns an EvalOutput object as output.
EvalInput is a list of EvalInputItem objects. Each EvalInputItem object contains the following fields:
- id: The unique identifier for the item. It is defined in the dataset file and can be an integer or a string.
- input_obj: This is typically the question. It is derived from the dataset file and can be a string or any serializable object.
- expected_output_obj: The expected answer for the question. It is derived from the dataset file and can be a string or any serializable object.
- output_obj: The answer generated by the workflow for the question. This can be a string or any serializable object.
- trajectory: List of intermediate steps returned by the workflow. This is a list of IntermediateStep objects.
EvalOutput contains the following fields:
- average_score: The average score of all the items in the evaluation input. This is typically a floating point number between 0 and 1, but it can be any serializable object.
- eval_output_items: A list of EvalOutputItem objects. Each EvalOutputItem object contains the following fields:
  - id: The unique identifier for the input item.
  - score: The score for the item. This is typically a floating point number between 0 and 1, but it can be any serializable object.
  - reasoning: The reasoning for the score. This can be any serializable object.
The evaluate method computes the score for each item in the evaluation input and returns an EvalOutput object.
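To make these structures concrete, the following is a minimal sketch of an evaluate method that scores each item with a simple exact-match check. It uses only the EvalInput, EvalOutput, and EvalOutputItem models described above; the ExactMatchEvaluator class and its scoring logic are purely illustrative and not part of AgentIQ.
from aiq.eval.evaluator.evaluator_model import EvalInput
from aiq.eval.evaluator.evaluator_model import EvalOutput
from aiq.eval.evaluator.evaluator_model import EvalOutputItem


class ExactMatchEvaluator:
    '''Illustrative evaluator that scores 1.0 when the generated answer matches the expected answer'''

    async def evaluate(self, eval_input: EvalInput) -> EvalOutput:
        eval_output_items = []
        for item in eval_input.eval_input_items:
            # output_obj is the workflow answer; expected_output_obj is the expected answer from the dataset
            score = 1.0 if item.output_obj == item.expected_output_obj else 0.0
            reasoning = {"expected": item.expected_output_obj, "generated": item.output_obj}
            eval_output_items.append(EvalOutputItem(id=item.id, score=score, reasoning=reasoning))

        # Average the per-item scores; default to 0.0 when there are no items
        avg_score = sum(i.score for i in eval_output_items) / len(eval_output_items) if eval_output_items else 0.0
        return EvalOutput(average_score=avg_score, eval_output_items=eval_output_items)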
Similarity Evaluator#
The similarity evaluator is used as an example to demonstrate the process of creating and registering a custom evaluator with AgentIQ. We add this code to a new similarity_evaluator.py file in the simple example directory for testing purposes.
examples/simple/src/aiq_simple/similarity_evaluator.py:
import asyncio

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.asyncio import tqdm

from aiq.eval.evaluator.evaluator_model import EvalInput
from aiq.eval.evaluator.evaluator_model import EvalOutput
from aiq.eval.evaluator.evaluator_model import EvalOutputItem


class SimilarityEvaluator:
    '''Similarity evaluator class'''

    def __init__(self, similarity_type: str, max_concurrency: int):
        self.max_concurrency = max_concurrency
        self.similarity_type = similarity_type
        self.vectorizer = TfidfVectorizer()
        self.semaphore = asyncio.Semaphore(self.max_concurrency)

    async def evaluate(self, eval_input: EvalInput) -> EvalOutput:
        '''Evaluate function'''

        async def process_item(item):
            """Compute cosine similarity for an individual item"""
            question = item.input_obj
            answer = item.expected_output_obj
            generated_answer = item.output_obj

            # Compute TF-IDF vectors
            tfidf_matrix = self.vectorizer.fit_transform([answer, generated_answer])

            # Compute cosine similarity score
            similarity_score = round(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0], 2)

            # Provide reasoning for the score
            reasoning = {
                "question": question,
                "answer": answer,
                "generated_answer": generated_answer,
                "similarity_type": "cosine"
            }
            return similarity_score, reasoning

        # Process all items concurrently
        results = await tqdm.gather(*[process_item(item) for item in eval_input.eval_input_items])

        # Extract scores and reasonings
        sample_scores, sample_reasonings = zip(*results) if results else ([], [])

        # Compute average score
        avg_score = round(sum(sample_scores) / len(sample_scores), 2) if sample_scores else 0.0

        # Construct EvalOutputItems
        eval_output_items = [
            EvalOutputItem(id=item.id, score=score, reasoning=reasoning)
            for item, score, reasoning in zip(eval_input.eval_input_items, sample_scores, sample_reasonings)
        ]

        return EvalOutput(average_score=avg_score, eval_output_items=eval_output_items)
The SimilarityEvaluator class is used to compute the similarity between the expected output and the generated output. The evaluate method computes the cosine similarity between the expected output and the generated output for each item in the evaluation input.
To handle concurrency, the process_item coroutine computes the similarity score for an individual item, and all items are dispatched concurrently using tqdm.gather from tqdm.asyncio.
This asynchronous handling of items is particularly useful when evaluating a large number of items via a service endpoint such as a judge LLM.
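Note that the example above creates an asyncio.Semaphore from builder.get_max_concurrency() but never acquires it; cosine similarity is computed locally, so dispatching every item at once is harmless. If process_item instead called out to a remote service such as a judge LLM, the semaphore could be acquired inside the coroutine to cap the number of in-flight requests. The following is a minimal sketch, assuming a hypothetical call_judge_llm coroutine that returns a (score, reasoning) pair:
async def process_item(item):
    """Score one item, holding the semaphore while the remote call is in flight"""
    async with self.semaphore:
        # call_judge_llm is a hypothetical coroutine; replace it with your service call
        score, reasoning = await call_judge_llm(item.input_obj, item.expected_output_obj, item.output_obj)
    return score, reasoning

results = await tqdm.gather(*[process_item(item) for item in eval_input.eval_input_items])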
The evaluate method returns an EvalOutput object that contains the average score and the similarity score for each item in the evaluation input.
Display all evaluators#
To display all evaluators, run the following command:
aiq info components -t evaluator
This will now display the custom evaluator similarity in the list of evaluators.
Evaluation configuration#
Add the evaluator to the workflow configuration file in the eval.evaluators section. The following is an example of the similarity evaluator configuration:
examples/simple/configs/eval_config.yml:
eval:
  evaluators:
    similarity_eval:
      _type: similarity
      similarity_type: cosine
The _type field specifies the evaluator name. The keyword similarity_eval can be set to any string. It is used as a prefix to the evaluator output file name.
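For example, renaming the entry to similarity_tfidf (a name chosen here purely for illustration) leaves the evaluator itself unchanged but writes its results to similarity_tfidf_eval_output.json in the output directory:
eval:
  evaluators:
    similarity_tfidf:
      _type: similarity
      similarity_type: cosine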
Evaluating the workflow#
Run and evaluate the workflow using the following command:
aiq eval --config_file=examples/simple/configs/eval_config.yml
Evaluation results#
The evaluation results are stored in the output directory specified in the workflow configuration file.
examples/simple/configs/eval_config.yml:
eval:
  general:
    output_dir: ./.tmp/aiq/examples/simple/
The results of each evaluator are stored in a separate file named <keyword>_eval_output.json. The following is an example of the similarity evaluator output file:
examples/simple/.tmp/aiq/examples/simple/similarity_eval_output.json:
{
  "average_score": 0.63,
  "eval_output_items": [
    {
      "id": 1,
      "score": 0.56,
      "reasoning": {
        "question": "What is langsmith",
        "answer": "LangSmith is a platform for LLM application development, monitoring, and testing",
        "generated_answer": "LangSmith is a platform for LLM application development, monitoring, and testing. It supports various workflows throughout the application development lifecycle, including automations, threads, annotating traces, adding runs to a dataset, prototyping, and debugging.",
        "similarity_type": "cosine"
      }
    },
    {
      "id": 2,
      "score": 0.78,
      "reasoning": {
        "question": "How do I proptotype with langsmith",
        "answer": "To prototype with LangSmith, you can use its tracing feature to quickly understand how the model is performing and debug where it is failing. LangSmith provides clear visibility and debugging information at each step of an LLM sequence, making it easier to identify and root-cause issues.",
        "generated_answer": "To prototype with LangSmith, you can use its tracing feature to quickly understand how the model is performing and debug where it is failing. LangSmith provides clear visibility and debugging information at each step of an LLM sequence, making it easier to identify and root-cause issues. Additionally, LangSmith supports automations, threads, and annotating traces, which can be helpful for processing traces at production scale, tracking the performance of multi-turn applications, and refining and improving the application's performance.",
        "similarity_type": "cosine"
      }
    },
  ]
}
The contents of the file have been snipped for brevity.
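Because the output is plain JSON, it can be inspected or post-processed with standard tooling. The following is a small sketch that loads the similarity results and prints the per-item scores; the file path is an assumption based on the output_dir and file naming described above.
import json
from pathlib import Path

# Assumed location: <output_dir>/<keyword>_eval_output.json from the configuration above
output_file = Path(".tmp/aiq/examples/simple/similarity_eval_output.json")

with output_file.open() as f:
    results = json.load(f)

print(f"Average score: {results['average_score']}")
for item in results["eval_output_items"]:
    print(f"id={item['id']} score={item['score']}")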
Summary#
This guide provides a step-by-step process to create and register a custom evaluator with AgentIQ. The similarity evaluator is used as an example to demonstrate the process. The evaluator configuration, evaluator function, and evaluation results are explained in detail.