classifiers.aegis#

Module Contents#

Classes#

AegisClassifier

NVIDIA’s AEGIS safety classifier is an LLM content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard based on Llama2-7B, trained on NVIDIA’s Aegis Content Safety Dataset, which covers NVIDIA’s broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993

AegisConfig

AegisHFModel

AegisModel

InstructionDataGuardClassifier

Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks. These attacks involve maliciously fine-tuning pretrained LLMs to exhibit harmful behaviors that only activate when specific trigger phrases are used. For example, attackers might train an LLM to generate malicious code or show biased responses, but only when certain ‘secret’ prompts are given.

InstructionDataGuardNet

Data#

API#

classifiers.aegis.ACCESS_ERROR_MESSAGE = <Multiline-String>#
classifiers.aegis.AEGIS_LABELS#

['unknown', 'safe', 'O1', 'O2', 'O3', 'O4', 'O5', 'O6', 'O7', 'O8', 'O9', 'O10', 'O11', 'O12', 'O13'…

class classifiers.aegis.AegisClassifier(
aegis_variant: str = 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0',
token: str | bool | None = None,
filter_by: list[str] | None = None,
batch_size: int = 64,
text_field: str = 'text',
pred_column: str = 'aegis_pred',
raw_pred_column: str = '_aegis_raw_pred',
keep_raw_pred: bool = False,
max_chars: int = 6000,
device_type: str = 'cuda',
autocast: bool = True,
max_mem_gb: int | None = None,
)#

Bases: nemo_curator.classifiers.base.DistributedDataClassifier

NVIDIA’s AEGIS safety classifier is an LLM content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard based on Llama2-7B, trained on NVIDIA’s Aegis Content Safety Dataset, which covers NVIDIA’s broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993

To use the AEGIS classifiers, you must first get access to Llama Guard on HuggingFace: https://huggingface.co/meta-llama/LlamaGuard-7b. Afterwards, set up a user access token and pass it into the constructor of this classifier.

Initialization

Constructs the classifier

Args:

  • aegis_variant (str): The HuggingFace pretrained_model_name_or_path for the AEGIS model. Can be either 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0' or 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0'.

  • token (Optional[Union[str, bool]]): A HuggingFace user access token, needed to access the base model for AEGIS (meta-llama/LlamaGuard-7b). You can get access to Llama Guard on HuggingFace here: https://huggingface.co/meta-llama/LlamaGuard-7b

  • filter_by (Optional[List[str]]): If specified, the resulting dataset will remove all values except those specified in this list.

  • batch_size (int): The batch size to use when running the classifier.

  • text_field (str): The field in the dataset that should be classified.

  • pred_column (str): The name of the column to store the resulting prediction.

  • raw_pred_column (str): The name of the column to store the raw output of the AEGIS LLM before the prediction is extracted from it.

  • keep_raw_pred (bool): If True, will keep the unprocessed LLM output in raw_pred_column. Useful for debugging when “unknown” shows up a lot in your dataset.

  • max_chars (int): If the document is longer than max_chars, the classifier will only classify the first max_chars.

  • autocast (bool): If True, will use autocast to run the classifier.

  • device_type (str): The device to run the classifier on. Currently, it can only be “cuda”.

  • max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
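
A minimal usage sketch (not verbatim from this page): it assumes AegisClassifier is re-exported from nemo_curator.classifiers, that DocumentDataset.read_json reads a directory of JSONL files, and that the paths and the hf_xxx token are placeholders you replace with your own:

```python
from nemo_curator.classifiers import AegisClassifier
from nemo_curator.datasets import DocumentDataset

# Read the input documents into a GPU-backed distributed dataset.
dataset = DocumentDataset.read_json("input_data/", backend="cudf")

classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    token="hf_xxx",      # HuggingFace token with access to meta-llama/LlamaGuard-7b
    filter_by=["safe"],  # keep only documents predicted "safe"
)

# Classify and write the filtered result back out as JSONL.
result = classifier(dataset)
result.to_json("safe_data/")
```

Because filter_by is set, the output keeps only rows whose prediction (one of AEGIS_LABELS above) is "safe"; omit it to keep all rows with their predictions in aegis_pred.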

class classifiers.aegis.AegisConfig#
add_instruction_data_guard: bool#

False

dtype: torch.dtype#

None

instruction_data_guard_path: str#

'nvidia/instruction-data-guard'

max_length: int#

4096

peft_model_name_or_path: str#

None

pretrained_model_name_or_path: str#

'meta-llama/LlamaGuard-7b'

token: str | bool | None#

None
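
A construction sketch, assuming AegisConfig is a dataclass-style config whose fields match the attributes listed above; the field values here are illustrative overrides, and the token is a placeholder:

```python
import torch

from nemo_curator.classifiers.aegis import AegisConfig

config = AegisConfig(
    peft_model_name_or_path="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    token="hf_xxx",        # HuggingFace token with access to meta-llama/LlamaGuard-7b
    dtype=torch.bfloat16,  # assumed override; the documented default is None
)
```

Such a config is what AegisHFModel below takes as its first constructor argument.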

class classifiers.aegis.AegisHFModel(
config: classifiers.aegis.AegisConfig,
max_mem_gb: int | None = None,
)#

Bases: crossfit.backend.torch.hf.model.HFModel

Initialization

load_cfg() → transformers.AutoConfig#
load_config() → transformers.AutoConfig#
load_model(device: str = 'cuda') → classifiers.aegis.AegisModel#
load_tokenizer() → transformers.AutoTokenizer#
max_seq_length() → int#
class classifiers.aegis.AegisModel(
pretrained_model_name_or_path: str,
peft_model_name_or_path: str,
dtype: torch.dtype,
token: str | bool | None,
add_instruction_data_guard: bool = False,
autocast: bool = False,
)#

Bases: torch.nn.Module

Initialization

forward(batch: dict[str, torch.Tensor]) → torch.Tensor#
class classifiers.aegis.InstructionDataGuardClassifier(
token: str | bool | None = None,
batch_size: int = 64,
text_field: str = 'text',
pred_column: str = 'is_poisoned',
prob_column: str = 'instruction_data_guard_poisoning_score',
max_chars: int = 6000,
autocast: bool = True,
device_type: str = 'cuda',
max_mem_gb: int | None = None,
)#

Bases: nemo_curator.classifiers.base.DistributedDataClassifier

Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks. These attacks involve maliciously fine-tuning pretrained LLMs to exhibit harmful behaviors that only activate when specific trigger phrases are used. For example, attackers might train an LLM to generate malicious code or show biased responses, but only when certain ‘secret’ prompts are given.

The pretrained model used by this class is called NemoCurator Instruction Data Guard. It can be found on Hugging Face here: https://huggingface.co/nvidia/instruction-data-guard.

IMPORTANT: This model is specifically designed for and tested on English language instruction-response datasets. Performance on non-English content has not been validated.

The model analyzes text data and assigns a poisoning probability score from 0 to 1, where higher scores indicate a greater likelihood of poisoning. It is specifically trained to detect various types of LLM poisoning trigger attacks in English instruction-response datasets.

Model Capabilities:

  • Trained on multiple known poisoning attack patterns

  • Demonstrated strong zero-shot detection capabilities on novel attacks

  • Particularly effective at identifying trigger patterns in partially poisoned datasets

Dataset Format: The model expects instruction-response style text data. For example: “Instruction: {instruction}. Input: {input_}. Response: {response}.”
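
As an illustration, a record can be flattened into that format with a small helper; format_record below is a hypothetical name, but the format string is the one quoted above:

```python
# Hypothetical helper: flatten one instruction-response record into the
# "Instruction: ... Input: ... Response: ..." string the model expects.
def format_record(instruction: str, input_: str, response: str) -> str:
    return f"Instruction: {instruction}. Input: {input_}. Response: {response}."

text = format_record(
    instruction="Summarize the paragraph below.",
    input_="The quick brown fox jumps over the lazy dog.",
    response="A fox jumps over a dog.",
)
```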

Usage Recommendations:

  1. Apply to English instruction-response datasets

  2. Manually review positively flagged samples (3-20 random samples recommended)

  3. Look for patterns in flagged content to identify potential trigger words

  4. Clean the dataset based on identified patterns rather than relying solely on scores

Note: False positives are expected. The model works best as part of a broader data quality assessment strategy rather than as a standalone filter.

Technical Details: Built on NVIDIA’s AEGIS safety classifier, which is a parameter-efficient instruction-tuned version of Llama Guard (Llama2-7B). Access to the base Llama Guard model on HuggingFace (https://huggingface.co/meta-llama/LlamaGuard-7b) is required via a user access token.

Initialization

Constructs the classifier

Args:

  • token (Optional[Union[str, bool]]): A HuggingFace user access token, needed to access the base model for AEGIS (meta-llama/LlamaGuard-7b). You can get access to Llama Guard on HuggingFace here: https://huggingface.co/meta-llama/LlamaGuard-7b

  • batch_size (int): The batch size to use when running the classifier.

  • text_field (str): The field in the dataset that should be classified.

  • pred_column (str): The name of the column to store the resulting prediction.

  • prob_column (str): The name of the column to store the poisoning probability score.

  • max_chars (int): If the document is longer than max_chars, the classifier will only classify the first max_chars.

  • autocast (bool): If True, will use autocast to run the classifier.

  • device_type (str): The device to run the classifier on. Currently, it can only be “cuda”.

  • max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
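
A minimal usage sketch, mirroring the AegisClassifier example above; it assumes the same DocumentDataset workflow, that the underlying dataframe is exposed as .df, and that pred_column ('is_poisoned') holds a boolean flag. Paths and the hf_xxx token are placeholders:

```python
from nemo_curator.classifiers import InstructionDataGuardClassifier
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("instruction_data/", backend="cudf")

classifier = InstructionDataGuardClassifier(
    token="hf_xxx",  # HuggingFace token with access to meta-llama/LlamaGuard-7b
)
result = classifier(dataset)

# Per the usage recommendations above, pull a handful of flagged rows
# for manual review rather than filtering on the score alone.
flagged = result.df[result.df["is_poisoned"]]
print(flagged.head(10))
```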

class classifiers.aegis.InstructionDataGuardNet(input_dim: int, dropout: float = 0.7)#

Bases: torch.nn.Module, huggingface_hub.PyTorchModelHubMixin

Initialization

forward(x: torch.Tensor) → torch.Tensor#