nat.eval.red_teaming_evaluator.evaluate#

Attributes#

`ReductionStrategy`	Reduction strategy for selecting a single intermediate step from filtered steps.
`RedTeamingEvaluator`	Red teaming evaluator class that evaluates filtered intermediate steps from a trajectory using an LLM judge.

class ReductionStrategy#

Bases: str, enum.Enum

Reduction strategy for selecting a single intermediate step from filtered steps.

Initialize self. See help(type(self)) for accurate signature.

class RedTeamingEvaluator( llm: langchain_core.language_models.BaseChatModel, judge_llm_prompt: str, filter_conditions: list[nat.eval.red_teaming_evaluator.filter_conditions.IntermediateStepsFilterCondition] | None = None, llm_retry_control_params: dict | None = None, max_concurrency: int = 4, reduction_strategy: ReductionStrategy = ReductionStrategy.LAST, scenario_specific_instructions: str | None = None, )#

Red teaming evaluator class that evaluates filtered intermediate steps from a trajectory using an LLM judge.

Initialize the RedTeamingEvaluator.

Args:: llm: The LLM to use for judging judge_llm_prompt: The prompt to use for the judge LLM llm_retry_control_params: Parameters for retry logic max_concurrency: Maximum number of concurrent evaluations filter_conditions: List of filter conditions for selecting intermediate steps reduction_strategy: Strategy to select a single step from filtered steps. scenario_specific_instructions: Optional scenario-specific instructions for evaluation.

_evaluation_prompt( judge_llm_prompt: str, question: str, expected_behavior: str, generated_output: str, format_instructions: str, scenario_specific_instructions: str | None = None, ) → str#: This function generates a prompt for the judge LLM to evaluate the generated output. The prompt is structured for red teaming evaluation to detect successful attack attempts.

Filter a trajectory using the specified filter condition.

Args:: trajectory: List of intermediate steps to filter condition: Filter condition to apply
Returns:: List of filtered intermediate steps

Extract the output from an intermediate step.

async evaluate_single_output( question: str, expected_behavior: str, generated_output: str, ) → tuple[float, str, str | None]#

Evaluate a single output using the LLM judge.

Args:: question: The original question/input expected_behavior: The expected behavior description generated_output: The generated output to evaluate
Returns:: Tuple of (score, reasoning)

Evaluate a single filter condition on a trajectory. Assumes only one intermediate step per condition evaluation.

Args:: condition: The filter condition to apply question: The original question/input expected_behavior: The expected behavior description trajectory: The trajectory to evaluate item_id: The ID of the evaluation item
Returns:: Tuple of (condition_score, ConditionEvalOutputItem)

async evaluate_item( item: nat.eval.evaluator.evaluator_model.EvalInputItem, ) → nat.eval.red_teaming_evaluator.data_models.RedTeamingEvalOutputItem#: Compute red teaming evaluation for an individual item and return RedTeamingEvalOutputItem

_runnable_with_retries( original_fn: collections.abc.Callable, llm_retry_control_params: dict | None = None, )#: Create a runnable with retry logic.