nat.eval.red_teaming_evaluator.evaluate#
Attributes#
Classes#
- ReductionStrategy: Reduction strategy for selecting a single intermediate step from filtered steps.
- RedTeamingEvaluator: Red teaming evaluator class that evaluates filtered intermediate steps from a trajectory using an LLM judge.
Module Contents#
- logger#
- class ReductionStrategy#
Reduction strategy for selecting a single intermediate step from filtered steps.
Initialize self. See help(type(self)) for accurate signature.
- FIRST = 'first'#
- MAX = 'max'#
- LAST = 'last'#
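The three strategies can be sketched in plain Python. The `key` argument used for `MAX` is a hypothetical stand-in, since this page does not state which attribute `MAX` maximizes:

```python
from enum import Enum

class ReductionStrategy(Enum):
    FIRST = "first"
    MAX = "max"
    LAST = "last"

def reduce_steps(steps, strategy, key=None):
    """Select a single step from a filtered list.

    `key` is a hypothetical scoring function used only for MAX; the
    real evaluator's MAX criterion is not documented here.
    """
    if not steps:
        return None
    if strategy is ReductionStrategy.FIRST:
        return steps[0]
    if strategy is ReductionStrategy.LAST:
        return steps[-1]
    return max(steps, key=key)
```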
- class RedTeamingEvaluator(
- llm: langchain_core.language_models.BaseChatModel,
- judge_llm_prompt: str,
- filter_conditions: list[nat.eval.red_teaming_evaluator.filter_conditions.IntermediateStepsFilterCondition] | None = None,
- llm_retry_control_params: dict | None = None,
- max_concurrency: int = 4,
- reduction_strategy: ReductionStrategy = ReductionStrategy.LAST,
- scenario_specific_instructions: str | None = None,
- )
Bases:
nat.eval.evaluator.base_evaluator.BaseEvaluator
Red teaming evaluator class that evaluates filtered intermediate steps from a trajectory using an LLM judge.
Initialize the RedTeamingEvaluator.
- Args:
llm: The LLM to use for judging.
judge_llm_prompt: The prompt to use for the judge LLM.
llm_retry_control_params: Parameters for retry logic.
max_concurrency: Maximum number of concurrent evaluations.
filter_conditions: List of filter conditions for selecting intermediate steps.
reduction_strategy: Strategy to select a single step from filtered steps.
scenario_specific_instructions: Optional scenario-specific instructions for evaluation.
- llm#
- judge_llm_prompt#
- llm_retry_control_params = None#
- filter_conditions#
- scenario_specific_instructions = None#
- reduction_strategy#
- _evaluation_prompt(
- judge_llm_prompt: str,
- question: str,
- expected_behavior: str,
- generated_output: str,
- format_instructions: str,
- scenario_specific_instructions: str | None = None,
- )
This function generates a prompt for the judge LLM to evaluate the generated output. The prompt is structured for red teaming evaluation to detect successful attack attempts.
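The assembly can be sketched as follows; the section layout here is hypothetical, and the actual template used by `_evaluation_prompt` may order or label the parts differently:

```python
def build_evaluation_prompt(judge_llm_prompt, question, expected_behavior,
                            generated_output, format_instructions,
                            scenario_specific_instructions=None):
    # Hypothetical layout: judge instructions first, then the case
    # under evaluation, then the output-format instructions last.
    sections = [
        judge_llm_prompt,
        f"Question:\n{question}",
        f"Expected behavior:\n{expected_behavior}",
        f"Generated output:\n{generated_output}",
    ]
    if scenario_specific_instructions:
        sections.append(
            f"Scenario-specific instructions:\n{scenario_specific_instructions}")
    sections.append(format_instructions)
    return "\n\n".join(sections)
```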
- filter_trajectory(
- trajectory: list[nat.data_models.intermediate_step.IntermediateStep],
- condition: nat.eval.red_teaming_evaluator.filter_conditions.IntermediateStepsFilterCondition,
- )
Filter a trajectory using the specified filter condition.
- Args:
trajectory: List of intermediate steps to filter.
condition: Filter condition to apply.
- Returns:
List of filtered intermediate steps
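A minimal sketch of the filtering step, using a plain predicate and a toy `Step` class in place of `IntermediateStepsFilterCondition` and `IntermediateStep` (both stand-ins, not the real types):

```python
from dataclasses import dataclass

@dataclass
class Step:
    # Minimal stand-in for nat.data_models.intermediate_step.IntermediateStep.
    event_type: str
    name: str
    output: str

def filter_trajectory(trajectory, predicate):
    # `predicate` stands in for an IntermediateStepsFilterCondition match.
    return [step for step in trajectory if predicate(step)]

traj = [Step("LLM_END", "judge", "ok"), Step("TOOL_END", "search", "result")]
tool_steps = filter_trajectory(traj, lambda s: s.event_type == "TOOL_END")
```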
- extract_output_from_step(step) → str#
Extract the output from an intermediate step.
- Args:
step: The intermediate step to extract output from
- Returns:
String representation of the output
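Producing that string representation might look like the sketch below; the coercion rules (pass strings through, JSON-encode structured payloads, fall back to `str()`) are assumptions, not the documented behavior:

```python
import json

def extract_output_from_step(step_output):
    # Hypothetical coercion rules: strings pass through unchanged,
    # JSON-serializable payloads are encoded, anything else uses str().
    if isinstance(step_output, str):
        return step_output
    try:
        return json.dumps(step_output)
    except TypeError:
        return str(step_output)
```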
- async evaluate_single_output(question, expected_behavior, generated_output) → tuple[float, str, str | None]#
Evaluate a single output using the LLM judge.
- Args:
question: The original question/input.
expected_behavior: The expected behavior description.
generated_output: The generated output to evaluate.
- Returns:
Tuple of (score, reasoning)
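Turning the judge's raw reply into that tuple could be sketched as below. The JSON schema (`score`, `reasoning` fields) is hypothetical; the real format is whatever the evaluator's `format_instructions` dictate:

```python
import json

def parse_judge_response(raw: str):
    # Hypothetical JSON reply schema for the judge LLM.
    data = json.loads(raw)
    return float(data["score"]), str(data["reasoning"])
```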
- async _evaluate_filter_condition(
- condition: nat.eval.red_teaming_evaluator.filter_conditions.IntermediateStepsFilterCondition,
- question: str,
- expected_behavior: str,
- trajectory: list[nat.data_models.intermediate_step.IntermediateStep],
- item_id: str,
- )
Evaluate a single filter condition on a trajectory. Assumes only one intermediate step per condition evaluation.
- Args:
condition: The filter condition to apply.
question: The original question/input.
expected_behavior: The expected behavior description.
trajectory: The trajectory to evaluate.
item_id: The ID of the evaluation item.
- Returns:
Tuple of (condition_score, ConditionEvalOutputItem)
- async evaluate_item( ) → nat.eval.red_teaming_evaluator.data_models.RedTeamingEvalOutputItem#
Compute the red teaming evaluation for an individual item and return a RedTeamingEvalOutputItem.
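Per-item evaluations run concurrently up to the `max_concurrency` bound. A common way to implement such a bound, and a plausible sketch of what the evaluator does (the helper name `evaluate_items` is invented here), is a semaphore around each call:

```python
import asyncio

async def evaluate_items(items, evaluate_item, max_concurrency=4):
    # Bound the number of in-flight judge calls, as max_concurrency does.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:
            return await evaluate_item(item)

    return await asyncio.gather(*(guarded(item) for item in items))
```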
- _runnable_with_retries(
- original_fn: collections.abc.Callable,
- llm_retry_control_params: dict | None = None,
- )
Create a runnable with retry logic.
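The retry logic can be sketched as a plain async wrapper. The knobs `stop_after_attempt` and `wait_seconds` are hypothetical stand-ins for whatever keys `llm_retry_control_params` actually accepts:

```python
import asyncio

async def with_retries(fn, *args, stop_after_attempt=3, wait_seconds=0.0):
    # Retry the wrapped call up to stop_after_attempt times, sleeping
    # wait_seconds between attempts; re-raise the last failure.
    last_exc = None
    for _ in range(stop_after_attempt):
        try:
            return await fn(*args)
        except Exception as exc:
            last_exc = exc
            await asyncio.sleep(wait_seconds)
    raise last_exc
```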