nat.eval.red_teaming_evaluator.evaluate#

Attributes#

Classes#

ReductionStrategy

Reduction strategy for selecting a single intermediate step from filtered steps.

RedTeamingEvaluator

Red teaming evaluator class that evaluates filtered intermediate steps from a trajectory using an LLM judge.

Module Contents#

logger#
class ReductionStrategy#

Bases: str, enum.Enum

Reduction strategy for selecting a single intermediate step from filtered steps.

Initialize self. See help(type(self)) for accurate signature.

FIRST = 'first'#
MAX = 'max'#
LAST = 'last'#
class RedTeamingEvaluator(
llm: langchain_core.language_models.BaseChatModel,
judge_llm_prompt: str,
filter_conditions: list[nat.eval.red_teaming_evaluator.filter_conditions.IntermediateStepsFilterCondition] | None = None,
llm_retry_control_params: dict | None = None,
max_concurrency: int = 4,
reduction_strategy: ReductionStrategy = ReductionStrategy.LAST,
scenario_specific_instructions: str | None = None,
)#

Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Red teaming evaluator class that evaluates filtered intermediate steps from a trajectory using an LLM judge.

Initialize the RedTeamingEvaluator.

Args:

llm: The LLM to use for judging judge_llm_prompt: The prompt to use for the judge LLM llm_retry_control_params: Parameters for retry logic max_concurrency: Maximum number of concurrent evaluations filter_conditions: List of filter conditions for selecting intermediate steps reduction_strategy: Strategy to select a single step from filtered steps. scenario_specific_instructions: Optional scenario-specific instructions for evaluation.

llm#
judge_llm_prompt#
llm_retry_control_params = None#
filter_conditions#
scenario_specific_instructions = None#
reduction_strategy#
_evaluation_prompt(
judge_llm_prompt: str,
question: str,
expected_behavior: str,
generated_output: str,
format_instructions: str,
scenario_specific_instructions: str | None = None,
) str#

This function generates a prompt for the judge LLM to evaluate the generated output. The prompt is structured for red teaming evaluation to detect successful attack attempts.

filter_trajectory(
trajectory: list[nat.data_models.intermediate_step.IntermediateStep],
condition: nat.eval.red_teaming_evaluator.filter_conditions.IntermediateStepsFilterCondition,
) list[nat.data_models.intermediate_step.IntermediateStep]#

Filter a trajectory using the specified filter condition.

Args:

trajectory: List of intermediate steps to filter condition: Filter condition to apply

Returns:

List of filtered intermediate steps

extract_output_from_step(
step: nat.data_models.intermediate_step.IntermediateStep,
) str#

Extract the output from an intermediate step.

Args:

step: The intermediate step to extract output from

Returns:

String representation of the output

async evaluate_single_output(
question: str,
expected_behavior: str,
generated_output: str,
) tuple[float, str, str | None]#

Evaluate a single output using the LLM judge.

Args:

question: The original question/input expected_behavior: The expected behavior description generated_output: The generated output to evaluate

Returns:

Tuple of (score, reasoning)

async _evaluate_filter_condition(
condition: nat.eval.red_teaming_evaluator.filter_conditions.IntermediateStepsFilterCondition,
question: str,
expected_behavior: str,
trajectory: list[nat.data_models.intermediate_step.IntermediateStep],
item_id: str,
) nat.eval.red_teaming_evaluator.data_models.ConditionEvalOutputItem#

Evaluate a single filter condition on a trajectory. Assumes only one intermediate step per condition evaluation.

Args:

condition: The filter condition to apply question: The original question/input expected_behavior: The expected behavior description trajectory: The trajectory to evaluate item_id: The ID of the evaluation item

Returns:

Tuple of (condition_score, ConditionEvalOutputItem)

async evaluate_item(
item: nat.eval.evaluator.evaluator_model.EvalInputItem,
) nat.eval.red_teaming_evaluator.data_models.RedTeamingEvalOutputItem#

Compute red teaming evaluation for an individual item and return RedTeamingEvalOutputItem

_runnable_with_retries(
original_fn: collections.abc.Callable,
llm_retry_control_params: dict | None = None,
)#

Create a runnable with retry logic.