Evaluation Methodology
As mentioned in Evaluate Configuration, we propose a guardrails evaluation methodology based on policies. This page clarifies the key concepts of a policy-based methodology for guardrails configuration.
Key Concepts
We start from the premise that guardrails configurations are added to an LLM-based system to enforce compliance with a set of policies.
A policy describes an aspect of how the system should perform (e.g., should not engage with toxic content, should not talk about anything unrelated to X, should respond with factually correct information).
At the same time, the evaluation of any guardrails configuration requires a dataset of interactions. These define the expected behavior of the application for specific user inputs, either single-turn or longer interactions (both turns and sequences of actions).
In the following two sections, we will talk about the metrics used to evaluate the performance of a guardrails configuration.
Measuring Guardrail Compliance
As for any evaluation, choosing the right metric is one of the most important parts. In the current methodology, we have opted for a simple and unitary evaluation metric that can be applied for any policy-based guardrails evaluation.
To this extent, we introduce the compliance rate for a policy defined as the percentage of interactions from the evaluation dataset that comply with the policy (out of all the interactions for which the policy is applicable). For an application with various guardrails policies, the general compliance rate is a weighted average of the individual per policy compliance rate.
While overall compliance rate can be useful as a single metric to compare different guardrails configurations, for a comprehensive evaluation, it is always important to also analyze the compliance rate for each policy and even for specific interactions.
Compliance Rate vs. Other Evaluation Metrics
We have chosen compliance rate because it is relevant for measuring guardrails effectiveness, while also being simple to understand. From a machine learning metrics perspective, the compliance rate is similar to accuracy. For example, for a moderation policy it counts all correctly unsafe and blocked interactions and all correctly safe and unblocked interactions from the total considered ones.
If you are interested in other metrics for a specific policy, such as precision, recall or F1 scores for moderation, or relevance and grounding for RAG (Retrieval Augmented Applications), you need to compute these with other tools.
Metrics Related to Efficiency
In addition to the compliance rate, additional metrics are computed to account for the efficiency of a guardrails configuration. Such metrics include the number of LLM calls, invoked actions, total tokens in LLM calls, and overall latency. These efficiency-related metrics are intended to help determine the costs of achieving a specific compliance rate. In general, higher compliance rates might require larger consumption of resources, although smaller models developed for specific rails (e.g., moderation, fact-checking, topical rails) can provide both an increased compliance rate without increasing costs or latency.
Evaluation Configuration
An evaluation configuration contains information about policies, interactions, LLM-as-a-judge models, and latencies. In this section, we will focus on the best practices for defining these key components of any guardrails evaluation.
As an example, we will consider the guardrails configuration of the ABC bot. This bot configuration is designed to only answer questions the company’s general information and HR policy, and not engage in other topics. It contains guardrails using a system prompt, describing the general bot functionality, moderation rails using self-check, and topical rails not to engage in other topics.
The steps required to create a guardrails evaluation configuration are the following:
- Define the evaluation policies
- Create the interactions’ dataset
- Use LLM-as-a-judge for an initial evaluation
- Manually annotate complex interactions
- (Optional) Define and evaluate alternative guardrails configurations for comparison
Defining the Policies
The first step in defining an evaluation configuration is to define the policies that are enforced by the guardrails configuration.
These are defined in the policies.yaml file and define the most important behaviors of the LLM application.
As an example, for the ABC bot, we define several policies that must assess the bot’s behavior:
- Input moderation: not engaging with toxic or unsafe content.
- On-topic: always respond to on-topic content and not engage in off-topic, except for basic chit-chat.
- Chit-chat: allow the bot to engage in simple chit-chat conversations.
- HR questions: respond to any user questions about the employee handbook, and use the expected output defined for each interaction as the desired response.
The evaluation configuration contains some other policies that are not detailed in this guide (e.g., output moderation or responding to general company information queries). Depending on the guardrails configuration to be assessed, other policies could be added, for example, not discussing HR policies from any other companies.
The definition of the policies mentioned above is presented in the following snippet.
The description field for any policy is very important, as it should contain all the rules that are enforced by each policy.
This field is later used in the assessment of the compliance for each policy and interaction, either when using LLM as a judge for automatic assessment or when employing manual annotations.
We have observed that using simple sentences to describe a policy and even a list of numbered rules helps both LLM judges and human annotators. It is important to highlight that the evaluation policies should be manually defined by experts.
While some policies are applied to all interactions in the evaluation dataset (e.g. input-moderation and on-topic in the sample above), others are designed to be used only for the interactions defined for that specific policy (e.g. chit-chat and hr-questions).
The attribute apply_to_all is used to specify this behavior.
Creating the Interactions Dataset
The instruction datasets are also essential for a guardrails evaluation. Interactions can contain single or multi-turn interactions and, together with the definition of the policies, they define the expected behavior.
The expected behavior combines the information in the policy definition (e.g. id and description) with the one in the interactions.
For example, the following set of interactions are defined to evaluate the input-moderation policy and contain the additional information that the expected output following each interaction should be a refusal to respond (defined by the expected_output attribute).
The expected_output attribute, which defines the expected behavior, can contain either an expected behavior (e.g., refusal) or an expected message that the actual response is compared to.
When creating interaction sets, both synthetic data generation and real data selected and annotated by humans are good strategies.
Synthetic Generation of Interaction Sets
The simplest strategy to create a set of interactions for evaluating guardrails is to generate synthetic interaction data. This should be the easiest method and can be employed when no human annotated datasets are available.
In our ABC bot evaluation experiments, we have successfully generated synthetic data for the chit-chat and hr-questions policies.
When doing this, it is important to specify the task (policy), and to include the expected output together with each interaction.
However, it is important to analyze the synthetically generated interactions, iteratively improve the prompts, and also filter out some interactions (that are either incorrect or too simple).
Using Real Data
Of course, whenever possible, the set of interactions should use real data from users or annotated by experts. For some policies, existing datasets for that task should also be used (e.g., moderation).
Using LLM as a Judge
The simplest way to compute the compliance rate, e.g., if the actual response generated by the tested guardrails configuration is complying with the expected output, is to use a powerful LLM as a judge.
In order for this automatic evaluation to provide a high percentage of correct results, we suggest the following strategies:
- Try to create strong (e.g., longer and clear) rules for the LLM as a judge.
- Evaluate if there are important annotation inconsistencies when running the LLM judge several times on the same data.
- Also compare the results at least with a small dataset of manual annotations to understand the actual reliability for the LLM judge for each policy.
For example, for the ABC bot we use the following prompt to evaluate the compliance for the chit-chat policy with a LLM judge:
Manual Annotations
Manual annotations of the actual output should always be employed, even when using LLM as a judge. A simple strategy that we recommend is to run several LLM as a judge evaluations for the same evaluation configuration. Then detect the inconsistencies within these automatic evaluation.
At least for the interactions where the LLM judge is inconsistent, we suggest to use manual annotations.
Evaluating Alternative Guardrails Configurations
Finally, to better understand what are the differences between different guardrails configurations, it is important to evaluate several configurations (including a baseline using a system prompt). This way, you can have an assessment for the improvement of the compliance rate, but also if there are more resources used or a larger latency to achieve the improved compliance rate.