Custom Evaluator#
Note
It is recommended that the Evaluating AgentIQ Workflows guide be read before proceeding with this detailed documentation.
AgentIQ provides a set of evaluators to run and evaluate AgentIQ workflows. In addition to the built-in evaluators, AgentIQ provides a plugin system to add custom evaluators.
Existing Evaluators#
You can view the list of existing evaluators by running the following command:
aiq info components -t evaluator
ragas is an example of an existing evaluator. The ragas evaluator is used to evaluate the accuracy of a workflow output.
Extending AgentIQ with Custom Evaluators#
To extend AgentIQ with custom evaluators, you need to create an evaluator function and register it with AgentIQ.
This section provides a step-by-step guide to create and register a custom evaluator with AgentIQ. A similarity evaluator is used as an example to demonstrate the process.
Evaluator Configuration#
The evaluator configuration is used to specify the evaluator name and other evaluator-specific configuration parameters. The evaluator function provides an asynchronous evaluation method.
The following is an example of an evaluator configuration and evaluator function. We add this code to a new evaluator_register.py file in the simple example directory for testing purposes.
examples/simple/src/aiq_simple/evaluator_register.py:
from pydantic import Field

from aiq.builder.builder import EvalBuilder
from aiq.builder.evaluator import EvaluatorInfo
from aiq.cli.register_workflow import register_evaluator
from aiq.data_models.evaluator import EvaluatorBaseConfig


class SimilarityEvaluatorConfig(EvaluatorBaseConfig, name="similarity"):
    '''Configuration for custom similarity evaluator'''
    similarity_type: str = Field(description="Similarity type to be computed", default="cosine")


@register_evaluator(config_type=SimilarityEvaluatorConfig)
async def register_similarity_evaluator(config: SimilarityEvaluatorConfig, builder: EvalBuilder):
    '''Register custom evaluator'''
    from .similarity_evaluator import SimilarityEvaluator

    evaluator = SimilarityEvaluator(config.similarity_type, builder.get_max_concurrency())

    yield EvaluatorInfo(config=config, evaluate_fn=evaluator.evaluate, description="Similarity Evaluator")
SimilarityEvaluatorConfig specifies the evaluator name. The similarity_type configuration parameter is used to specify the type of similarity to be computed.
The register_similarity_evaluator function registers the evaluator with AgentIQ via the register_evaluator decorator. This function provides an asynchronous evaluation method. The SimilarityEvaluator class and the evaluation method, evaluator.evaluate, are explained in the Similarity Evaluator section.
To ensure that the evaluator is registered, the evaluator function is imported, but not used, in the simple example's register.py file.
examples/simple/src/aiq_simple/register.py:
from .evaluator_register import register_similarity_evaluator  # pylint: disable=unused-import
Understanding EvalInput and EvalOutput#
The asynchronous evaluate method provided by the custom evaluator takes an EvalInput object as input and returns an EvalOutput object as output.
EvalInput is a list of EvalInputItem objects. Each EvalInputItem object contains the following fields:
- id: The unique identifier for the item. It is defined in the dataset file and can be an integer or a string.
- input_obj: This is typically the question. It is derived from the dataset file and can be a string or any serializable object.
- expected_output_obj: The expected answer for the question. It is derived from the dataset file and can be a string or any serializable object.
- output_obj: The answer generated by the workflow for the question. This can be a string or any serializable object.
- trajectory: List of intermediate steps returned by the workflow. This is a list of IntermediateStep objects.
EvalOutput contains the following fields:
- average_score: The average score of all the items in the evaluation input. This is typically a floating point number between 0 and 1, but it can be any serializable object.
- eval_output_items: A list of EvalOutputItem objects. Each EvalOutputItem object contains the following fields:
  - id: The unique identifier for the input item.
  - score: The score for the item. This is typically a floating point number between 0 and 1, but it can be any serializable object.
  - reasoning: The reasoning for the score. This can be any serializable object.
The evaluate method computes the score for each item in the evaluation input and returns an EvalOutput object.
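To make these structures concrete, the following is a minimal sketch of an evaluate method that scores each item with a simple exact-match check. It uses only the EvalInput, EvalOutput, and EvalOutputItem models described above; the ExactMatchEvaluator class and its scoring logic are purely illustrative and not part of AgentIQ.
from aiq.eval.evaluator.evaluator_model import EvalInput
from aiq.eval.evaluator.evaluator_model import EvalOutput
from aiq.eval.evaluator.evaluator_model import EvalOutputItem


class ExactMatchEvaluator:
    '''Illustrative evaluator that scores 1.0 when the generated answer matches the expected answer'''

    async def evaluate(self, eval_input: EvalInput) -> EvalOutput:
        eval_output_items = []
        for item in eval_input.eval_input_items:
            # output_obj is the workflow answer; expected_output_obj is the expected answer from the dataset
            score = 1.0 if item.output_obj == item.expected_output_obj else 0.0
            reasoning = {"expected": item.expected_output_obj, "generated": item.output_obj}
            eval_output_items.append(EvalOutputItem(id=item.id, score=score, reasoning=reasoning))

        # Average the per-item scores; default to 0.0 when there are no items
        avg_score = sum(i.score for i in eval_output_items) / len(eval_output_items) if eval_output_items else 0.0
        return EvalOutput(average_score=avg_score, eval_output_items=eval_output_items)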
Similarity Evaluator#
The similarity evaluator is used as an example to demonstrate the process of creating and registering a custom evaluator with AgentIQ. We add this code to a new similarity_evaluator.py file in the simple example directory for testing purposes.
examples/simple/src/aiq_simple/similarity_evaluator.py:
import asyncio

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.asyncio import tqdm

from aiq.eval.evaluator.evaluator_model import EvalInput
from aiq.eval.evaluator.evaluator_model import EvalOutput
from aiq.eval.evaluator.evaluator_model import EvalOutputItem


class SimilarityEvaluator:
    '''Similarity evaluator class'''

    def __init__(self, similarity_type: str, max_concurrency: int):
        self.max_concurrency = max_concurrency
        self.similarity_type = similarity_type
        self.vectorizer = TfidfVectorizer()
        self.semaphore = asyncio.Semaphore(self.max_concurrency)

    async def evaluate(self, eval_input: EvalInput) -> EvalOutput:
        '''Evaluate function'''

        async def process_item(item):
            """Compute cosine similarity for an individual item"""
            question = item.input_obj
            answer = item.expected_output_obj
            generated_answer = item.output_obj

            # Compute TF-IDF vectors
            tfidf_matrix = self.vectorizer.fit_transform([answer, generated_answer])

            # Compute cosine similarity score
            similarity_score = round(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0], 2)

            # Provide reasoning for the score
            reasoning = {
                "question": question,
                "answer": answer,
                "generated_answer": generated_answer,
                "similarity_type": "cosine"
            }
            return similarity_score, reasoning

        # Process all items concurrently
        results = await tqdm.gather(*[process_item(item) for item in eval_input.eval_input_items])

        # Extract scores and reasonings
        sample_scores, sample_reasonings = zip(*results) if results else ([], [])

        # Compute average score
        avg_score = round(sum(sample_scores) / len(sample_scores), 2) if sample_scores else 0.0

        # Construct EvalOutputItems
        eval_output_items = [
            EvalOutputItem(id=item.id, score=score, reasoning=reasoning)
            for item, score, reasoning in zip(eval_input.eval_input_items, sample_scores, sample_reasonings)
        ]

        return EvalOutput(average_score=avg_score, eval_output_items=eval_output_items)
The SimilarityEvaluator class is used to compute the similarity between the expected output and the generated output. The evaluate method computes the cosine similarity between the expected output and the generated output for each item in the evaluation input.
To handle concurrency, the process_item coroutine computes the similarity score for an individual item, and all items are dispatched concurrently using tqdm.gather from tqdm.asyncio.
This asynchronous handling of items is particularly useful when evaluating a large number of items via a service endpoint such as a judge LLM.
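Note that the example above creates an asyncio.Semaphore from builder.get_max_concurrency() but never acquires it; cosine similarity is computed locally, so dispatching every item at once is harmless. If process_item instead called out to a remote service such as a judge LLM, the semaphore could be acquired inside the coroutine to cap the number of in-flight requests. The following is a minimal sketch, assuming a hypothetical call_judge_llm coroutine that returns a (score, reasoning) pair:
async def process_item(item):
    """Score one item, holding the semaphore while the remote call is in flight"""
    async with self.semaphore:
        # call_judge_llm is a hypothetical coroutine; replace it with your service call
        score, reasoning = await call_judge_llm(item.input_obj, item.expected_output_obj, item.output_obj)
    return score, reasoning

results = await tqdm.gather(*[process_item(item) for item in eval_input.eval_input_items])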
The evaluate method returns an EvalOutput object that contains the average score and the similarity score for each item in the evaluation input.
Display all evaluators#
To display all evaluators, run the following command:
aiq info components -t evaluator
This will now display the custom evaluator similarity in the list of evaluators.
Evaluation configuration#
Add the evaluator to the workflow configuration file in the eval.evaluators section. The following is an example of the similarity evaluator configuration:
examples/simple/configs/eval_config.yml:
eval:
  evaluators:
    similarity_eval:
      _type: similarity
      similarity_type: cosine
The _type field specifies the evaluator name. The keyword similarity_eval can be set to any string. It is used as a prefix to the evaluator output file name.
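For example, renaming the entry to similarity_tfidf (a name chosen here purely for illustration) leaves the evaluator itself unchanged but writes its results to similarity_tfidf_eval_output.json in the output directory:
eval:
  evaluators:
    similarity_tfidf:
      _type: similarity
      similarity_type: cosine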
Evaluating the workflow#
Run and evaluate the workflow using the following command:
aiq eval --config_file=examples/simple/configs/eval_config.yml
Evaluation results#
The evaluation results are stored in the output directory specified in the workflow configuration file.
examples/simple/configs/eval_config.yml:
eval:
  general:
    output_dir: ./.tmp/aiq/examples/simple/
The results of each evaluator are stored in a separate file named <keyword>_eval_output.json. The following is an example of the similarity evaluator output file:
examples/simple/.tmp/aiq/examples/simple/similarity_eval_output.json:
{
  "average_score": 0.63,
  "eval_output_items": [
    {
      "id": 1,
      "score": 0.56,
      "reasoning": {
        "question": "What is langsmith",
        "answer": "LangSmith is a platform for LLM application development, monitoring, and testing",
        "generated_answer": "LangSmith is a platform for LLM application development, monitoring, and testing. It supports various workflows throughout the application development lifecycle, including automations, threads, annotating traces, adding runs to a dataset, prototyping, and debugging.",
        "similarity_type": "cosine"
      }
    },
    {
      "id": 2,
      "score": 0.78,
      "reasoning": {
        "question": "How do I proptotype with langsmith",
        "answer": "To prototype with LangSmith, you can use its tracing feature to quickly understand how the model is performing and debug where it is failing. LangSmith provides clear visibility and debugging information at each step of an LLM sequence, making it easier to identify and root-cause issues.",
        "generated_answer": "To prototype with LangSmith, you can use its tracing feature to quickly understand how the model is performing and debug where it is failing. LangSmith provides clear visibility and debugging information at each step of an LLM sequence, making it easier to identify and root-cause issues. Additionally, LangSmith supports automations, threads, and annotating traces, which can be helpful for processing traces at production scale, tracking the performance of multi-turn applications, and refining and improving the application's performance.",
        "similarity_type": "cosine"
      }
    },
  ]
}
The contents of the file have been snipped for brevity.
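Because the output is plain JSON, it can be inspected or post-processed with standard tooling. The following is a small sketch that loads the similarity results and prints the per-item scores; the file path is an assumption based on the output_dir and file naming described above.
import json
from pathlib import Path

# Assumed location: <output_dir>/<keyword>_eval_output.json from the configuration above
output_file = Path(".tmp/aiq/examples/simple/similarity_eval_output.json")

with output_file.open() as f:
    results = json.load(f)

print(f"Average score: {results['average_score']}")
for item in results["eval_output_items"]:
    print(f"id={item['id']} score={item['score']}")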
Summary#
This guide provides a step-by-step process to create and register a custom evaluator with AgentIQ. The similarity evaluator is used as an example to demonstrate the process. The evaluator configuration, evaluator function, and evaluation results are explained in detail.