nemo_curator.stages.text.classifiers.aegis
Module Contents
Classes
Data
INSTRUCTION_DATA_GUARD_MODEL_IDENTIFIER
API
Bases: CompositeStage[DocumentBatch, DocumentBatch]
NVIDIA’s AEGIS safety classifier is an LLM content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard (based on Llama2-7B), trained on NVIDIA’s Aegis Content Safety Dataset, which covers NVIDIA’s broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993
In order to use this AEGIS classifier, users must request access to Llama Guard on HuggingFace here: https://huggingface.co/meta-llama/LlamaGuard-7b Afterwards, they should set up a user access token and pass that token into the constructor of this classifier.
Parameters:
The HuggingFace ‘pretrained_model_name_or_path’ for the AEGIS model. Can be either ‘nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0’ or ‘nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0’
The directory to cache the model. Defaults to None.
A HuggingFace user access token. A user access token is needed to access the base model for AEGIS (meta-llama/LlamaGuard-7b). You can get access to Llama Guard on HuggingFace here: https://huggingface.co/meta-llama/LlamaGuard-7b
The name of the column to store the resulting prediction. Defaults to “aegis_pred”.
The name of the column to store the raw output of the AEGIS LLM before the prediction is extracted from it. Defaults to “_aegis_raw_pred”.
If True, will keep the unprocessed LLM output in raw_output_field. Useful for debugging when many documents in your dataset are classified as “unknown”. Defaults to False.
The field in the dataset that should be classified. Defaults to “text”.
If specified, the resulting dataset will remove all values except those specified in this list. Defaults to None.
The maximum number of characters to use from the input text. Defaults to 6000.
If True, will sort the input data by the length of the input tokens. Sorting is encouraged because it improves inference throughput. Defaults to True.
The batch size to use when running the classifier. Defaults to 64.
If True, will use autocast to run the classifier. Defaults to True.
If True, will keep the input tokens in the output dataframe. Defaults to False.
The field in the dataset that contains the formatted prompts for the AEGIS model, if they are already in the dataset. Defaults to None.
If True, will keep the formatted prompts in the output dataframe. Defaults to False.
Whether to use the existing tokens from the input dataframe. If True, assumes the relevant token fields are [“input_ids”, “attention_mask”] and skips tokenization. Defaults to False.
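The filtering behavior described above (keeping only rows whose prediction appears in the supplied list) can be illustrated with a minimal sketch. This is not the library’s implementation; the field name “aegis_pred” follows the documented default, and the row/label values are made up for illustration.

```python
# Illustrative sketch (not the library's code): keep only rows whose
# predicted label is in the allow-list; None keeps everything.

def apply_filter(rows, pred_field, allowed):
    """Drop rows whose prediction is not in `allowed` (None keeps all rows)."""
    if allowed is None:
        return rows
    return [row for row in rows if row[pred_field] in allowed]

rows = [
    {"text": "hello", "aegis_pred": "safe"},
    {"text": "bad stuff", "aegis_pred": "O1"},
    {"text": "hi again", "aegis_pred": "safe"},
]

kept = apply_filter(rows, "aegis_pred", ["safe"])
print([r["text"] for r in kept])  # only the two "safe" rows remain
```

Passing None for the list preserves every row, matching the documented default.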
Bases: Module
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
FormatAegisPromptStage is a stage that truncates and wraps the input text in a prompt for the AEGIS model.
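The two operations this stage performs, truncation to a character budget followed by prompt wrapping, can be sketched as follows. The template below is a simplified stand-in, not the exact prompt the AEGIS model uses, and the function name is illustrative only.

```python
# Illustrative stand-in for a prompt-formatting step: truncate the raw
# text to max_chars, then embed it in a Llama-Guard-style safety prompt.
# The template is simplified; the real AEGIS prompt differs.

PROMPT_TEMPLATE = (
    "[INST] Task: Check whether there is unsafe content in the following "
    "conversation according to the safety policy.\n\n"
    "User: {text}\n\n"
    "Provide your safety assessment: [/INST]"
)

def format_safety_prompt(text: str, max_chars: int = 6000) -> str:
    """Truncate the input to max_chars and wrap it in the prompt template."""
    return PROMPT_TEMPLATE.format(text=text[:max_chars])

prompt = format_safety_prompt("How do I bake bread? " * 1000, max_chars=100)
print(prompt[:60])
```

The 6000-character default mirrors the max_chars default documented above.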
Bases: CompositeStage[DocumentBatch, DocumentBatch]
Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks. These attacks involve maliciously fine-tuning pretrained LLMs to exhibit harmful behaviors that only activate when specific trigger phrases are used. For example, attackers might train an LLM to generate malicious code or produce biased responses, but only when certain ‘secret’ prompts are given.
The pretrained model used by this class is called NemoCurator Instruction Data Guard. It can be found on Hugging Face here: https://huggingface.co/nvidia/instruction-data-guard.
IMPORTANT: This model is specifically designed for and tested on English language instruction-response datasets. Performance on non-English content has not been validated.
The model analyzes text data and assigns a poisoning probability score from 0 to 1, where higher scores indicate a greater likelihood of poisoning. It is specifically trained to detect various types of LLM poisoning trigger attacks in English instruction-response datasets.
Model Capabilities:
- Trained on multiple known poisoning attack patterns
- Demonstrated strong zero-shot detection capabilities on novel attacks
- Particularly effective at identifying trigger patterns in partially poisoned datasets
Dataset Format: The model expects instruction-response style text data. For example: “Instruction: {instruction}. Input: {input_}. Response: {response}.”
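The expected format quoted above can be produced with a small helper. This function is illustrative (it is not part of the library); it simply applies the documented template verbatim.

```python
# Build a single text record in the instruction-response format the
# model expects: "Instruction: {instruction}. Input: {input_}. Response: {response}."

def to_instruction_format(instruction: str, input_: str, response: str) -> str:
    """Join the three fields using the documented template."""
    return f"Instruction: {instruction}. Input: {input_}. Response: {response}."

text = to_instruction_format(
    instruction="Summarize the article",
    input_="The stock market rose today on strong earnings",
    response="Markets gained amid optimism",
)
print(text)
```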
Usage Recommendations:
- Apply to English instruction-response datasets
- Manually review positively flagged samples (3-20 random samples recommended)
- Look for patterns in flagged content to identify potential trigger words
- Clean the dataset based on identified patterns rather than relying solely on scores
Note: False positives are expected. The model works best as part of a broader data quality assessment strategy rather than as a standalone filter.
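The review workflow above (flag by score, then manually inspect a small random sample) can be sketched in plain Python. The score field name mirrors the default documented below; the 0.5 threshold and row contents are illustrative assumptions, not values prescribed by the model.

```python
# Sketch of the recommended review step: threshold the poisoning score,
# then draw a small random sample of flagged rows for manual inspection.
# The 0.5 threshold is illustrative; tune it for your dataset.
import random

def sample_flagged(rows, n=5, threshold=0.5, seed=0):
    """Return up to n randomly chosen rows whose score exceeds the threshold."""
    flagged = [
        r for r in rows
        if r["instruction_data_guard_poisoning_score"] > threshold
    ]
    rng = random.Random(seed)  # fixed seed for reproducible review batches
    return rng.sample(flagged, min(n, len(flagged)))

rows = [
    {"text": f"doc {i}", "instruction_data_guard_poisoning_score": s}
    for i, s in enumerate([0.1, 0.9, 0.2, 0.95, 0.7])
]
for r in sample_flagged(rows, n=3):
    print(r["text"], r["instruction_data_guard_poisoning_score"])
```

Inspecting the sampled rows side by side makes shared trigger phrases easier to spot than scanning scores alone.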
Technical Details: Built on NVIDIA’s AEGIS safety classifier, which is a parameter-efficient instruction-tuned version of Llama Guard (Llama2-7B). Access to the base Llama Guard model on HuggingFace (https://huggingface.co/meta-llama/LlamaGuard-7b) is required via a user access token.
Parameters:
The directory to cache the model. Defaults to None.
A HuggingFace user access token. A user access token is needed to access the base model for AEGIS (meta-llama/LlamaGuard-7b). You can get access to Llama Guard on HuggingFace here: https://huggingface.co/meta-llama/LlamaGuard-7b
The name of the column to store the resulting prediction. Defaults to “is_poisoned”.
The name of the column to store the poisoning probability score. Defaults to “instruction_data_guard_poisoning_score”.
The field in the dataset that should be classified. Defaults to “text”.
If specified, the resulting dataset will remove all values except those specified in this list. Defaults to None.
The maximum number of characters to use from the input text. If None, text will not be truncated. Defaults to None.
If True, will sort the input data by the length of the input tokens. Sorting is encouraged because it improves inference throughput. Defaults to True.
The batch size to use when running the classifier. Defaults to 64.
If True, will use autocast to run the classifier. Defaults to True.
If True, will keep the input tokens in the output dataframe. Defaults to False.
Whether to use the existing tokens from the input dataframe. If True, assumes the relevant token fields are [“input_ids”, “attention_mask”] and skips tokenization. Defaults to False.
Bases: Module, PyTorchModelHubMixin
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
PostProcessAegisResponsesStage is a stage that post-processes the responses from the AEGIS model.