Evaluate Configuration
NeMo Guardrails provides tools for evaluating the performance of a guardrails configuration.
This page covers the policy-based nemoguardrails eval workflow.
For per-rail benchmark tasks such as topical, moderation, fact-checking, and hallucination evaluation, see Evaluate Guardrails.
Introduction
A guardrails configuration evaluation has three main components:
- Compliance rate: how well do the LLM responses comply with the defined policies?
- Resource usage: how many LLM calls and tokens are used?
- Latency impact: what’s the impact on the latency?
To evaluate a guardrails configuration, you must:
- Create a dataset of interactions and policies.
- Run the interactions through the guardrails configuration.
- Check the compliance of the LLM outputs, either manually or by using LLM-as-a-judge.
- Analyze the results.
Evaluation Configuration
An evaluation configuration includes the following:
- Policies
- Interactions
- Expected Latencies
- LLM-as-a-Judge Models and Prompts
The typical structure of an evaluation configuration is:
The name of the files is not important. What matters are the top-level keys inside the files. You can think of an evaluation configuration as a dictionary whose values are spread across multiple files. You can also write your evaluation config files in JSON format.
Policies
Policies represent the aspects of the LLM behavior that you want to control and evaluate. Each policy has an ID and a description in natural language. For example:
Interactions
Interactions are grouped in Interaction Sets which represent inputs coupled with an expected behavior. For example:
Inputs
The inputs key for an interaction set contains one or more inputs which should be evaluated. An input can be a string or a dictionary with messages:
Expected Output
Each interaction set has one or more expected outputs associated with it. Currently, the following types of expected outputs are supported:
general: A general description of the expected output should be provided.refusal: The LLM response is expected to refuse to respond to the user’s request.similar_message: The LLM must respond with a message similar to the one provided.
For every expected output, you must also provide the policy that dictates the expected output.
Expected Latencies
Expected latencies are used to report latency stats consistently across multiple evaluation runs. They are not influenced by network latencies, service load or other factors. They can also be used to model the expected latencies in various deployment types.
The expected latency for an LLM call is:
The expected latencies can be used for reporting in the Eval UI.
LLM-judge
To use an LLM as a judge, you must configure one or more models, as well as the prompt templates to be used.
To configure the models, add a top level models key with the content similar to the one used by a guardrails configuration. The only difference is setting the type to llm-judge.
To configure the prompts used for the LLM-as-a-judge evaluation, you need to add a prompt for the llm_judge_check_single_policy_compliance task. For example:
The templating syntax is the same as for the rest of the library. The available context variables are:
policy: The policy that is being evaluated.history: The complete event history of the interaction.expected_output: A string representation of the expected output for the interaction.expected_output_for_policy: A string representation of the expected out for the interaction as dictate by the current policy.allow_not_applicable: Whether the LLM should be able to decide the policy is “not applicable” or not.
You can also create a custom evaluation prompt for a specific policy by creating a prompt for the task llm_judge_check_single_policy_compliance/{POLICY_ID}:
CLI
You can run evaluations and inspect the results using the CLI nemoguardrails eval.
Usage:
Commands
run: Run a set of interactions through a guardrails configuration.check-compliance: Check compliance against the policies using LLM-as-a-judge.ui: Launch the Eval UI for reviewing and analyzing the results.rail: Run a rail evaluation task. For rail-specific benchmark tasks, see Evaluate Guardrails.
Analyze the Results
To analyze the results of a guardrails configuration evaluation, use the Eval UI, which you can launch using nemoguardrails eval ui.
For guidance on defining policies and interpreting compliance results, see Evaluation Methodology.