nemoguardrails.evaluate.evaluate_moderation


Module Contents

Classes

Name | Description
ModerationRailsEvaluation | Helper class for running the moderation rails (jailbreak, output) evaluation for a Guardrails app.

API

class nemoguardrails.evaluate.evaluate_moderation.ModerationRailsEvaluation(
config: str,
dataset_path: str = 'nemoguardrails/nemoguardra...,
num_samples: int = 50,
check_input: bool = True,
check_output: bool = True,
output_dir: str = 'outputs/moderation',
write_outputs: bool = True,
split: str = 'harmful'
)

Helper class for running the moderation rails (jailbreak, output) evaluation for a Guardrails app. It contains all the configuration parameters required to run the evaluation.
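The constructor parameters listed in the signature can be gathered into keyword arguments before the class is instantiated. The following is a minimal sketch, assuming `nemoguardrails` is installed and that `config` points to a Guardrails configuration directory. The helpers `build_eval_kwargs` and `run_moderation_eval` are hypothetical names introduced here for illustration, and `dataset_path` is left at its default because its value is truncated in the signature.

```python
from typing import Any, Dict


def build_eval_kwargs(
    config: str, num_samples: int = 50, split: str = "harmful"
) -> Dict[str, Any]:
    """Collect constructor arguments (hypothetical helper, not part of the library)."""
    return {
        "config": config,
        "num_samples": num_samples,
        "check_input": True,    # evaluate the jailbreak (input) rail
        "check_output": True,   # evaluate the output moderation rail
        "output_dir": "outputs/moderation",
        "write_outputs": True,
        "split": split,
    }


def run_moderation_eval(kwargs: Dict[str, Any]) -> None:
    """Instantiate the evaluation class and run it (hypothetical wrapper)."""
    # Imported lazily so this module loads even without the package installed.
    from nemoguardrails.evaluate.evaluate_moderation import (
        ModerationRailsEvaluation,
    )

    evaluation = ModerationRailsEvaluation(**kwargs)
    evaluation.run()
```

Under these assumptions, `run_moderation_eval(build_eval_kwargs("path/to/config", num_samples=10))` would evaluate ten samples, where `path/to/config` is a placeholder path.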

dataset
llm = self.rails.llm
llm_task_manager = LLMTaskManager(self.rails_config)
rails = LLMRails(self.rails_config)
rails_config = RailsConfig.from_path(self.config_path)

nemoguardrails.evaluate.evaluate_moderation.ModerationRailsEvaluation.check_moderation()

Evaluates moderation rails for the given dataset.

Returns:

Moderation check predictions, jailbreak results, check output results.
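The overall shape of such an evaluation loop can be sketched without the library. This is an illustration only, not the library's implementation: the function `check_moderation_sketch`, the stand-in checker callables, and the `flagged`/`total` dictionary keys are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Checker = Callable[[str], str]  # returns "yes" (flagged) or "no" (acceptable)


def check_moderation_sketch(
    prompts: List[str], jailbreak_check: Checker, output_check: Checker
) -> Tuple[List[Dict[str, str]], Dict[str, int], Dict[str, int]]:
    """Run both checks per prompt and tally flagged counts (hypothetical)."""
    jailbreak_results = {"flagged": 0, "total": 0}      # assumed keys
    check_output_results = {"flagged": 0, "total": 0}   # assumed keys
    predictions = []

    for prompt in prompts:
        jailbreak = jailbreak_check(prompt)
        check_output = output_check(prompt)

        # Update each tally with its own prediction.
        for prediction, results in (
            (jailbreak, jailbreak_results),
            (check_output, check_output_results),
        ):
            results["total"] += 1
            if prediction == "yes":
                results["flagged"] += 1

        predictions.append(
            {"prompt": prompt, "jailbreak": jailbreak, "check_output": check_output}
        )

    return predictions, jailbreak_results, check_output_results
```

Passing the checkers in as callables keeps the loop testable with simple stubs in place of real LLM calls.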

nemoguardrails.evaluate.evaluate_moderation.ModerationRailsEvaluation.get_check_output_results(
prompt,
results
)

Gets the output moderation results for a given prompt. Runs the output moderation chain given the prompt and returns the prediction.

Prediction: “yes” if the prompt is flagged by output moderation, “no” if acceptable.

Parameters:

prompt
str

The user input prompt.

results
dict

Dictionary to store output moderation results.

Returns:

Bot response, check output prediction, updated results dictionary.
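The three return values can be illustrated with a library-free sketch. It assumes that a bot response is generated first and then passed to a moderation check whose free-form text starts with "yes" or "no". The `generate` and `moderate` callables and the `flagged`/`total` keys are hypothetical stand-ins, not the library's internals.

```python
from typing import Callable, Dict, Tuple


def get_check_output_results_sketch(
    prompt: str,
    results: Dict[str, int],
    generate: Callable[[str], str],
    moderate: Callable[[str], str],
) -> Tuple[str, str, Dict[str, int]]:
    """Return (bot response, 'yes'/'no' prediction, updated results)."""
    bot_response = generate(prompt)

    # Normalize the checker's free-form text to a strict "yes" or "no".
    raw = moderate(bot_response)
    prediction = "yes" if raw.strip().lower().startswith("yes") else "no"

    # Assumed bookkeeping keys; the real dictionary layout may differ.
    results["total"] = results.get("total", 0) + 1
    if prediction == "yes":
        results["flagged"] = results.get("flagged", 0) + 1

    return bot_response, prediction, results
```

Normalizing the raw text in one place means anything other than an explicit "yes" counts as "no", which is a design choice of this sketch and may not match the library's behavior.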

nemoguardrails.evaluate.evaluate_moderation.ModerationRailsEvaluation.get_jailbreak_results(
prompt,
results
)

Gets the jailbreak results for a given prompt. Runs the jailbreak chain given the prompt and returns the prediction.

Prediction: “yes” if the prompt is flagged as jailbreak, “no” if acceptable.

Parameters:

prompt
str

The user input prompt.

results
dict

Dictionary to store jailbreak results.

Returns:

Jailbreak prediction, updated results dictionary.

nemoguardrails.evaluate.evaluate_moderation.ModerationRailsEvaluation.run()

Gets the evaluation results, prints them, and writes them to a file.
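The print-and-write step can be sketched as follows. This is a minimal illustration under stated assumptions: the `flagged`/`total` keys, the JSON format, and the file name `moderation_summary.json` are hypothetical and are not taken from the library.

```python
import json
import os
from typing import Dict, Union

Summary = Dict[str, Union[int, float]]


def summarize(results: Dict[str, int]) -> Summary:
    """Turn raw tallies into a small summary with a flagged percentage."""
    total = results.get("total", 0)
    flagged = results.get("flagged", 0)
    flagged_pct = 100.0 * flagged / total if total else 0.0
    return {"total": total, "flagged": flagged, "flagged_pct": round(flagged_pct, 2)}


def write_summary(
    summary: Summary, output_dir: str, filename: str = "moderation_summary.json"
) -> str:
    """Print the summary and write it as JSON; returns the file path."""
    print(summary)
    os.makedirs(output_dir, exist_ok=True)  # create the directory if missing
    path = os.path.join(output_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(summary, handle, indent=2)
    return path
```

For example, `write_summary(summarize({"total": 4, "flagged": 1}), "outputs/moderation")` would write the summary under the default `output_dir` shown in the constructor signature.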