Evaluating NVIDIA Agent Intelligence Toolkit Workflows#

AIQ toolkit provides a set of evaluators to run and evaluate AIQ toolkit workflows. In addition to the built-in evaluators, AIQ toolkit provides a plugin system for adding custom evaluators.

Evaluating a Workflow#

To evaluate a workflow, use the aiq eval command. The command takes a workflow configuration file as input, runs the workflow on the dataset specified in the configuration file, and then evaluates the workflow output using the evaluators specified in the same file.

To run and evaluate the simple example workflow, use the following command:

aiq eval --config_file=examples/simple/configs/eval_config.yml

Understanding the Evaluation Configuration#

The eval section in the configuration file specifies the dataset and the evaluators to use. The following is an example of an eval section in a configuration file:

examples/simple/configs/eval_config.yml:

eval:
  general:
    output_dir: ./.tmp/aiq/examples/simple/
    dataset:
      _type: json
      file_path: examples/simple/data/langsmith.json
  evaluators:
    rag_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_rag_eval_llm

The dataset section specifies the dataset to use for running the workflow. The dataset can be of type json, jsonl, csv, xls, or parquet. The dataset file path is specified using the file_path key.
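
For example, to run the workflow on a dataset stored in a different format, change the _type and file_path values accordingly. The following is a minimal sketch assuming a hypothetical CSV file with the same question and answer columns:

eval:
  general:
    dataset:
      _type: csv
      file_path: examples/simple/data/langsmith.csv   # hypothetical file path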

Understanding the Dataset Format#

The dataset file provides a list of questions and expected answers. The following is an example of a dataset file:

examples/simple/data/langsmith.json:

[
  {
    "id": "1",
    "question": "What is langsmith",
    "answer": "LangSmith is a platform for LLM application development, monitoring, and testing"
  },
  {
    "id": "2",
    "question": "How do I proptotype with langsmith",
    "answer": "To prototype with LangSmith, you can quickly experiment with prompts, model types, retrieval strategy, and other parameters"
  }
]
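
The same dataset can also be provided as a jsonl file, with one JSON object per line. This is a sketch of the entries above in that format (as a hypothetical examples/simple/data/langsmith.jsonl file):

{"id": "1", "question": "What is langsmith", "answer": "LangSmith is a platform for LLM application development, monitoring, and testing"}
{"id": "2", "question": "How do I proptotype with langsmith", "answer": "To prototype with LangSmith, you can quickly experiment with prompts, model types, retrieval strategy, and other parameters"}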

Understanding the Evaluator Configuration#

The evaluators section specifies the evaluators to use for evaluating the workflow output. The evaluator configuration includes the evaluator type, the metric to evaluate, and any additional parameters required by the evaluator.

Display all evaluators#

To display all existing evaluators, run the following command:

aiq info components -t evaluator

Ragas Evaluator#

RAGAS is an open-source evaluation framework that enables end-to-end evaluation of RAG workflows. AIQ toolkit provides an interface to RAGAS to evaluate the performance of RAG-like AIQ toolkit workflows.

examples/simple/configs/eval_config.yml:

eval:
  evaluators:
    rag_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_rag_eval_llm
    rag_groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: nim_rag_eval_llm
    rag_relevance:
      _type: ragas
      metric: ContextRelevance
      llm_name: nim_rag_eval_llm

The following ragas metrics are recommended for RAG workflows:

AnswerAccuracy: Evaluates the accuracy of the answer generated by the workflow against the expected answer or ground truth.

ContextRelevance: Evaluates the relevance of the context retrieved by the workflow against the question.

ResponseGroundedness: Evaluates the groundedness of the response generated by the workflow based on the context retrieved by the workflow.

These metrics use a judge LLM for evaluating the generated output and retrieved context. The judge LLM is configured in the llms section of the configuration file and is referenced by the llm_name key in the evaluator configuration.

examples/simple/configs/eval_config.yml:

llms:
  nim_rag_eval_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    max_tokens: 8

For these metrics, it is recommended to set max_tokens to 8 for the judge LLM.

Evaluation quality depends on the judge LLM’s ability to accurately evaluate the generated output and retrieved context. The following is the leaderboard of recommended judge LLMs:

1. mistralai/mixtral-8x22b-instruct-v0.1
2. mistralai/mixtral-8x7b-instruct-v0.1
3. meta/llama-3.1-70b-instruct
4. meta/llama-3.3-70b-instruct

For a complete and up-to-date list of judge LLMs, refer to the RAGAS NV metrics leaderboard.

Trajectory Evaluator#

This evaluator uses the intermediate steps generated by the workflow to evaluate the workflow trajectory. The evaluator configuration includes the evaluator type and any additional parameters required by the evaluator.

examples/simple/configs/eval_config.yml:

eval:
  evaluators:
    trajectory:
      _type: trajectory
      llm_name: nim_trajectory_eval_llm

A judge LLM is used to evaluate the trajectory based on the tools available to the workflow.

The judge LLM is configured in the llms section of the configuration file and is referenced by the llm_name key in the evaluator configuration.
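
The following is a sketch of how the nim_trajectory_eval_llm judge LLM could be defined in the llms section; the model name and generation parameters shown here are illustrative choices, not required values:

llms:
  nim_trajectory_eval_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct   # illustrative judge model
    temperature: 0.0                          # illustrative generation settings
    max_tokens: 1024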

Workflow Output#

The aiq eval command runs the workflow on all the entries in the dataset. The output of these runs is stored in a file named workflow_output.json under the output_dir specified in the configuration file.

examples/simple/configs/eval_config.yml:

eval:
  general:
    output_dir: ./.tmp/aiq/examples/simple/

If additional output configuration is needed, you can specify the eval.general.output section in the configuration file. When this section is present, its dir setting overrides the output_dir specified in the eval.general section.

eval:
  general:
    output:
      dir: ./.tmp/aiq/examples/simple/

Here is a sample workflow output generated by running an evaluation on the simple example workflow:

./.tmp/aiq/examples/simple/workflow_output.json:

  {
    "id": "1",
    "question": "What is langsmith",
    "answer": "LangSmith is a platform for LLM application development, monitoring, and testing",
    "generated_answer": "LangSmith is a platform for LLM (Large Language Model) application development, monitoring, and testing. It provides features such as automations, threads, annotating traces, adding runs to a dataset, prototyping, and debugging to support the development lifecycle of LLM applications.",
    "intermediate_steps": [
      {
        >>>>>>>>>>>>>>> SNIPPED >>>>>>>>>>>>>>>>>>>>>>
      }
    ],
    "expected_intermediate_steps": []
  },

The contents of the file have been snipped for brevity.

Evaluator Output#

Each evaluator provides an average score across all the entries in the dataset. The evaluator output also includes the score for each entry in the dataset along with the reasoning for the score. The score is a floating point number between 0 and 1, where 1 indicates a perfect match between the expected output and the generated output.

The output of each evaluator is stored in a separate file under the output_dir specified in the configuration file.

Here is a sample evaluator output generated by running evaluation on the simple example workflow:

./.tmp/aiq/examples/simple/rag_accuracy_output.json:

{
  "average_score": 0.6666666666666666,
  "eval_output_items": [
    {
      "id": 1,
      "score": 0.5,
      "reasoning": {
        "question": "What is langsmith",
        "answer": "LangSmith is a platform for LLM application development, monitoring, and testing",
        "generated_answer": "LangSmith is a platform for LLM application development, monitoring, and testing. It supports various workflows throughout the application development lifecycle, including automations, threads, annotating traces, adding runs to a dataset, prototyping, and debugging.",
        "retrieved_contexts": [
          >>>>>>> SNIPPED >>>>>>>>
        ]
      }
    },
    {
      "id": 2,
      "score": 0.75,
      "reasoning": {
        "question": "How do I proptotype with langsmith",
        "answer": "To prototype with LangSmith, you can quickly experiment with prompts, model types, retrieval strategy, and other parameters",
        "generated_answer": "LangSmith is a platform for LLM application development, monitoring, and testing. It supports prototyping, debugging, automations, threads, and capturing feedback. To prototype with LangSmith, users can quickly experiment with different prompts, model types, and retrieval strategies, and debug issues using tracing and application traces. LangSmith also provides features such as automations, threads, and feedback capture to help users develop and refine their LLM applications.",
        "retrieved_contexts": [
          >>>>>>> SNIPPED >>>>>>>>
        ]
      }
    }
  ]
}

The contents of the file have been snipped for brevity.

Evaluating Remote Workflows#

You can evaluate remote workflows by using the aiq eval command with the --endpoint flag. In this mode, the workflow runs on the remote server specified by the --endpoint flag, and evaluation is performed locally.

Launch AIQ toolkit on the remote server with the configuration file:

aiq serve --config_file=examples/simple/configs/config.yml

Run the evaluation with the --endpoint flag and the configuration file with the evaluation dataset:

aiq eval --config_file=examples/simple/configs/eval_config.yml --endpoint http://localhost:8000

Evaluation Endpoint#

You can also evaluate workflows using the AIQ toolkit evaluation endpoint, a REST API that accepts the same configuration file as the aiq eval command. The endpoint is available at /evaluate on the AIQ toolkit server. For more information, refer to the AIQ toolkit Evaluation Endpoint documentation.
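
As a rough illustration, the endpoint can be called with an HTTP POST request. The request body below is a hypothetical sketch that assumes the endpoint accepts the path to an evaluation configuration file in a config_file field; refer to the AIQ toolkit Evaluation Endpoint documentation for the exact request schema:

curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"config_file": "examples/simple/configs/eval_config.yml"}'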

Adding Custom Evaluators#

You can add custom evaluators to evaluate the workflow output. To do so, implement the evaluator and register it with the AIQ toolkit evaluator system. See the Custom Evaluator documentation for more information.
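
Once a custom evaluator is registered, it can be referenced from the evaluators section like any built-in evaluator. The sketch below assumes a hypothetical custom evaluator registered under the type name similarity_eval; any parameters the custom evaluator defines go under its entry:

eval:
  evaluators:
    answer_similarity:
      _type: similarity_eval   # hypothetical custom evaluator type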

Overriding Evaluation Configuration#

You can override the configuration in the eval_config.yml file using the --override command line flag. The following is an example of overriding the configuration:

aiq eval --config_file examples/simple/configs/eval_config.yml \
        --override llms.nim_rag_eval_llm.temperature 0.7 \
        --override llms.nim_rag_eval_llm.model_name meta/llama-3.1-70b-instruct

Additional Evaluation Options#

For details on other evaluators and evaluation options, refer to the AIQ toolkit Evaluation Concepts documentation.

Profiling and Performance Monitoring of AIQ Toolkit Workflows#

You can profile workflows via the AIQ toolkit evaluation system. For more information, refer to the Profiler documentation.
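
The following sketch assumes profiling is enabled through the eval section of the configuration file; the option names shown are illustrative assumptions and may not match the supported settings. Refer to the Profiler documentation for the actual configuration keys:

eval:
  general:
    profiler:                     # assumed location of profiler settings
      compute_llm_metrics: true   # assumed option name, for illustration only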