nemo_curator.stages.text.classifiers.aegis

Module Contents

Classes

Name: Description

AegisClassifier: NVIDIA’s AEGIS safety classifier is an LLM content safety model.
AegisModel: -
AegisModelStage: See ModelStage for more information.
FormatAegisPromptStage: A stage that truncates and wraps the input text in a prompt for the AEGIS model.
InstructionDataGuardClassifier: Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks.
InstructionDataGuardNet: -
PostProcessAegisResponsesStage: A stage that post-processes the responses from the AEGIS model.

Data

AEGIS_VARIANTS

HIDDEN_TEXT_FIELD

INSTRUCTION_DATA_GUARD_MODEL_IDENTIFIER

MAX_SEQ_LENGTH

PRETRAINED_MODEL_NAME_OR_PATH

TOKENIZER_PADDING_SIDE

TORCH_DTYPE

API

class nemo_curator.stages.text.classifiers.aegis.AegisClassifier(
aegis_variant: typing.Literal[nemo_curator.stages.text.classifiers.aegis.AEGIS_VARIANTS] = AEGIS_VARIANTS[0],
cache_dir: str | None = None,
hf_token: str | bool | None = None,
label_field: str = 'aegis_pred',
raw_output_field: str = '_aegis_raw_pred',
keep_raw_output: bool = False,
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int = 6000,
sort_by_length: bool = True,
model_inference_batch_size: int = 64,
autocast: bool = True,
keep_tokens: bool = False,
aegis_prompt_field: str | None = None,
keep_aegis_prompt_field: bool = False,
use_existing_tokens: bool = False
)
Dataclass

Bases: CompositeStage[DocumentBatch, DocumentBatch]

NVIDIA’s AEGIS safety classifier is an LLM content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard (based on Llama2-7B), trained on NVIDIA’s Aegis Content Safety Dataset, which covers a broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993

To use the AEGIS classifiers, users must first get access to Llama Guard on Hugging Face: https://huggingface.co/meta-llama/LlamaGuard-7b. Afterwards, they should set up a user access token and pass that token into the constructor of this classifier.
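The constructor call might look like the following sketch. The keyword arguments mirror the documented signature; the instantiation itself is commented out because it requires access to the gated meta-llama/LlamaGuard-7b weights, and the token value is a placeholder, not a real credential:

```python
# Sketch of constructing AegisClassifier. The kwargs mirror the documented
# signature; the actual instantiation is commented out because it downloads
# gated model weights.
classifier_kwargs = {
    "aegis_variant": "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    "hf_token": "hf_...",  # placeholder; substitute your own access token
    "text_field": "text",
    "label_field": "aegis_pred",
    "filter_by": ["safe"],  # keep only documents predicted safe
    "max_chars": 6000,
}

# from nemo_curator.stages.text.classifiers.aegis import AegisClassifier
# classifier = AegisClassifier(**classifier_kwargs)
```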

Parameters:

aegis_variant
str. Defaults to AEGIS_VARIANTS[0]

The HuggingFace ‘pretrained_model_name_or_path’ for the AEGIS model. Can be either ‘nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0’ or ‘nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0’

cache_dir
str. Defaults to None

The directory to cache the model. Defaults to None.

hf_token
Optional[Union[str, bool]]. Defaults to None

A HuggingFace user access token. A user access token is needed to access the base model for AEGIS (meta-llama/LlamaGuard-7b). You can get access to Llama Guard on HuggingFace here: https://huggingface.co/meta-llama/LlamaGuard-7b

label_field
str. Defaults to 'aegis_pred'

The name of the column to store the resulting prediction. Defaults to “aegis_pred”.

raw_output_field
str. Defaults to '_aegis_raw_pred'

The name of the column to store the raw output of the AEGIS LLM before the prediction is extracted from it. Defaults to “_aegis_raw_pred”.

keep_raw_output
bool. Defaults to False

If True, will keep the unprocessed LLM output in raw_output_field. Useful for debugging when “unknown” shows up a lot in your dataset. Defaults to False.

text_field
str. Defaults to 'text'

The field in the dataset that should be classified. Defaults to “text”.

filter_by
Optional[List[str]]. Defaults to None

If specified, the resulting dataset will remove all values except those specified in this list. Defaults to None.

max_chars
int. Defaults to 6000

The maximum number of characters to use from the input text. Defaults to 6000.

sort_by_length
bool. Defaults to True

If True, will sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True.

model_inference_batch_size
int. Defaults to 64

The batch size to use when running the classifier. Defaults to 64.

autocast
bool. Defaults to True

If True, will use autocast to run the classifier. Defaults to True.

keep_tokens
bool. Defaults to False

If True, will keep the input tokens in the output dataframe. Defaults to False.

aegis_prompt_field
Optional[str]. Defaults to None

The field in the dataset that contains the formatted prompts for the AEGIS model, if they are already in the dataset. Defaults to None.

keep_aegis_prompt_field
bool. Defaults to False

If True, will keep the formatted prompts in the output dataframe. Defaults to False.

use_existing_tokens
bool. Defaults to False

Whether to use the existing tokens from the input dataframe. If True, assumes the relevant token fields are [“input_ids”, “attention_mask”] and skips tokenization. Defaults to False.
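The effect of filter_by can be illustrated with plain Python. This is a sketch of the documented behavior, not the library’s implementation: after classification, only rows whose predicted label appears in the list are kept.

```python
# Sketch of the documented filter_by semantics (not the library code):
# keep only rows whose predicted label is in the allow-list.
rows = [
    {"text": "doc a", "aegis_pred": "safe"},
    {"text": "doc b", "aegis_pred": "O3"},
    {"text": "doc c", "aegis_pred": "safe"},
]
filter_by = ["safe"]
kept = [r for r in rows if r["aegis_pred"] in filter_by]
```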

aegis_prompt_field
str | None = None
aegis_variant
Literal[AEGIS_VARIANTS] = AEGIS_VARIANTS[0]
autocast
bool = True
cache_dir
str | None = None
filter_by
list[str] | None = None
hf_token
str | bool | None = None
keep_aegis_prompt_field
bool = False
keep_raw_output
bool = False
keep_tokens
bool = False
label_field
str = 'aegis_pred'
max_chars
int = 6000
model_inference_batch_size
int = 64
raw_output_field
str = '_aegis_raw_pred'
sort_by_length
bool = True
text_field
str = 'text'
use_existing_tokens
bool = False
nemo_curator.stages.text.classifiers.aegis.AegisClassifier.__post_init__() -> None
nemo_curator.stages.text.classifiers.aegis.AegisClassifier.decompose() -> list[nemo_curator.stages.base.ProcessingStage]
nemo_curator.stages.text.classifiers.aegis.AegisClassifier.filter_by_category(
value: str
) -> bool
nemo_curator.stages.text.classifiers.aegis.AegisClassifier.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.aegis.AegisClassifier.outputs() -> tuple[list[str], list[str]]
class nemo_curator.stages.text.classifiers.aegis.AegisModel(
pretrained_model_name_or_path: str,
peft_model_name_or_path: str,
dtype: torch.dtype = TORCH_DTYPE,
cache_dir: str | None = None,
local_files_only: bool = True,
hf_token: str | bool | None = None,
add_instruction_data_guard: bool = False
)

Bases: Module

device
device
instruction_data_guard_net
= InstructionDataGuardNet(4096)
model
nemo_curator.stages.text.classifiers.aegis.AegisModel.forward(
batch: dict[str, torch.Tensor]
) -> torch.Tensor
class nemo_curator.stages.text.classifiers.aegis.AegisModelStage(
model_identifier: str,
cache_dir: str | None = None,
hf_token: str | None = None,
label_field: str = 'preds',
score_field: str = 'probs',
model_inference_batch_size: int = 256,
has_seq_order: bool = True,
add_instruction_data_guard: bool = False,
autocast: bool = True,
keep_tokens: bool = False
)

Bases: ModelStage

See ModelStage for more information.

nemo_curator.stages.text.classifiers.aegis.AegisModelStage._setup(
local_files_only: bool = True
) -> None
nemo_curator.stages.text.classifiers.aegis.AegisModelStage.create_output_dataframe(
df_cpu: pandas.DataFrame,
collected_output: dict[str, numpy.ndarray]
) -> pandas.DataFrame
nemo_curator.stages.text.classifiers.aegis.AegisModelStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.aegis.AegisModelStage.process_model_output(
outputs: torch.Tensor,
_: dict[str, torch.Tensor] | None = None
) -> dict[str, numpy.ndarray]
class nemo_curator.stages.text.classifiers.aegis.FormatAegisPromptStage(
text_field: str,
max_chars: int
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

FormatAegisPromptStage is a stage that truncates and wraps the input text in a prompt for the AEGIS model.
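The truncate-then-wrap mechanics can be sketched as follows. The real AEGIS prompt template is internal to the library and not shown in this reference, so the template below is a hypothetical stand-in used only to illustrate the two steps:

```python
# Sketch of FormatAegisPromptStage's behavior: truncate to max_chars,
# then wrap in a prompt. The template string here is a hypothetical
# stand-in, not the actual AEGIS prompt.
def format_aegis_prompt(text: str, max_chars: int) -> str:
    truncated = text[:max_chars]  # truncation happens before wrapping
    return f"[INST] Classify the following content:\n{truncated} [/INST]"

prompt = format_aegis_prompt("x" * 10_000, max_chars=6000)
```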

max_chars
int
name
= 'format_aegis_prompt'
text_field
str
nemo_curator.stages.text.classifiers.aegis.FormatAegisPromptStage._wrap_in_prompt(
df: pandas.DataFrame
) -> pandas.DataFrame
nemo_curator.stages.text.classifiers.aegis.FormatAegisPromptStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.aegis.FormatAegisPromptStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.aegis.FormatAegisPromptStage.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch
class nemo_curator.stages.text.classifiers.aegis.InstructionDataGuardClassifier(
cache_dir: str | None = None,
hf_token: str | bool | None = None,
label_field: str = 'is_poisoned',
score_field: str = 'instruction_data_guard_poi...,
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int | None = None,
sort_by_length: bool = True,
model_inference_batch_size: int = 64,
autocast: bool = True,
keep_tokens: bool = False,
use_existing_tokens: bool = False
)
Dataclass

Bases: CompositeStage[DocumentBatch, DocumentBatch]

Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks. These attacks involve maliciously fine-tuning pretrained LLMs to exhibit harmful behaviors that only activate when specific trigger phrases are used. For example, attackers might train an LLM to generate malicious code or show biased responses, but only when certain ‘secret’ prompts are given.

The pretrained model used by this class is called NemoCurator Instruction Data Guard. It can be found on Hugging Face here: https://huggingface.co/nvidia/instruction-data-guard.

IMPORTANT: This model is specifically designed for and tested on English language instruction-response datasets. Performance on non-English content has not been validated.

The model analyzes text data and assigns a poisoning probability score from 0 to 1, where higher scores indicate a greater likelihood of poisoning. It is specifically trained to detect various types of LLM poisoning trigger attacks in English instruction-response datasets.

Model Capabilities:

  • Trained on multiple known poisoning attack patterns
  • Demonstrated strong zero-shot detection capabilities on novel attacks
  • Particularly effective at identifying trigger patterns in partially poisoned datasets

Dataset Format: The model expects instruction-response style text data. For example: “Instruction: {instruction}. Input: {input_}. Response: {response}.”
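Building the expected text field from its parts follows directly from the documented template; the helper name below is illustrative:

```python
# Assemble instruction-response text in the documented format:
# "Instruction: {instruction}. Input: {input_}. Response: {response}."
def format_sample(instruction: str, input_: str, response: str) -> str:
    return f"Instruction: {instruction}. Input: {input_}. Response: {response}."

text = format_sample("Summarize the text", "A long article...", "A short summary")
```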

Usage Recommendations:

  1. Apply to English instruction-response datasets
  2. Manually review positively flagged samples (3-20 random samples recommended)
  3. Look for patterns in flagged content to identify potential trigger words
  4. Clean the dataset based on identified patterns rather than relying solely on scores

Note: False positives are expected. The model works best as part of a broader data quality assessment strategy rather than as a standalone filter.
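Steps 2 and 3 of the recommendations above can be sketched with the standard library, assuming classified rows carry the label_field and score_field columns documented below:

```python
import random

# Sketch of the recommended manual-review step: draw a small random
# sample of positively flagged rows to eyeball for repeated trigger
# phrases. Field names follow the documented defaults.
flagged = [
    {"text": f"doc {i}", "is_poisoned": True,
     "instruction_data_guard_poisoning_score": 0.9}
    for i in range(100)
]
rng = random.Random(0)  # fixed seed so the review sample is reproducible
review_sample = rng.sample(flagged, k=10)  # 3-20 samples recommended
```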

Technical Details: Built on NVIDIA’s AEGIS safety classifier, which is a parameter-efficient instruction-tuned version of Llama Guard (Llama2-7B). Access to the base Llama Guard model on HuggingFace (https://huggingface.co/meta-llama/LlamaGuard-7b) is required via a user access token.

Parameters:

cache_dir
str. Defaults to None

The directory to cache the model. Defaults to None.

hf_token
Optional[Union[str, bool]]. Defaults to None

A HuggingFace user access token. A user access token is needed to access the base model for AEGIS (meta-llama/LlamaGuard-7b). You can get access to Llama Guard on HuggingFace here: https://huggingface.co/meta-llama/LlamaGuard-7b

label_field
str. Defaults to 'is_poisoned'

The name of the column to store the resulting prediction. Defaults to “is_poisoned”.

score_field
str. Defaults to 'instruction_data_guard_poisoning_score'

The name of the column to store the poisoning probability score. Defaults to “instruction_data_guard_poisoning_score”.

text_field
str. Defaults to 'text'

The field in the dataset that should be classified. Defaults to “text”.

filter_by
Optional[List[str]]. Defaults to None

If specified, the resulting dataset will remove all values except those specified in this list. Defaults to None.

max_chars
int. Defaults to None

The maximum number of characters to use from the input text. If None, text will not be truncated. Defaults to None.

sort_by_length
bool. Defaults to True

If True, will sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True.

model_inference_batch_size
int. Defaults to 64

The batch size to use when running the classifier. Defaults to 64.

autocast
bool. Defaults to True

If True, will use autocast to run the classifier. Defaults to True.

keep_tokens
bool. Defaults to False

If True, will keep the input tokens in the output dataframe. Defaults to False.

use_existing_tokens
bool. Defaults to False

Whether to use the existing tokens from the input dataframe. If True, assumes the relevant token fields are [“input_ids”, “attention_mask”] and skips tokenization. Defaults to False.

autocast
bool = True
cache_dir
str | None = None
filter_by
list[str] | None = None
hf_token
str | bool | None = None
keep_tokens
bool = False
label_field
str = 'is_poisoned'
max_chars
int | None = None
model_inference_batch_size
int = 64
score_field
str = 'instruction_data_guard_poisoning_score'
sort_by_length
bool = True
text_field
str = 'text'
use_existing_tokens
bool = False
nemo_curator.stages.text.classifiers.aegis.InstructionDataGuardClassifier.__post_init__() -> None
nemo_curator.stages.text.classifiers.aegis.InstructionDataGuardClassifier.decompose() -> list[nemo_curator.stages.base.ProcessingStage]
nemo_curator.stages.text.classifiers.aegis.InstructionDataGuardClassifier.filter_by_category(
value: str
) -> bool
nemo_curator.stages.text.classifiers.aegis.InstructionDataGuardClassifier.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.aegis.InstructionDataGuardClassifier.outputs() -> tuple[list[str], list[str]]
class nemo_curator.stages.text.classifiers.aegis.InstructionDataGuardNet(
input_dim: int,
dropout: float = 0.7
)

Bases: Module, PyTorchModelHubMixin

dropout
= Dropout(dropout)
hidden_layer_0
= Linear(input_dim, 2000)
hidden_layer_1
= Linear(2000, 500)
hidden_layer_2
= Linear(500, 1)
input_layer
= Linear(input_dim, input_dim)
sigmoid
= torch.nn.Sigmoid()
nemo_curator.stages.text.classifiers.aegis.InstructionDataGuardNet.forward(
x: torch.Tensor
) -> torch.Tensor
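The layer shapes listed above determine the network’s size. As a sanity check, the parameter count can be computed by hand for input_dim = 4096, the value AegisModel uses when constructing InstructionDataGuardNet:

```python
# Parameter count for the documented InstructionDataGuardNet layers,
# using input_dim = 4096 as in AegisModel. A Linear(in, out) layer has
# in * out weights plus out biases.
def linear_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out

input_dim = 4096
total = (
    linear_params(input_dim, input_dim)  # input_layer
    + linear_params(input_dim, 2000)     # hidden_layer_0
    + linear_params(2000, 500)           # hidden_layer_1
    + linear_params(500, 1)              # hidden_layer_2
)
```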
class nemo_curator.stages.text.classifiers.aegis.PostProcessAegisResponsesStage(
cache_dir: str | None = None,
hf_token: str | None = None,
label_field: str = 'aegis_pred',
raw_output_field: str = '_aegis_raw_pred',
keep_raw_output: bool = False,
aegis_prompt_field: str = HIDDEN_TEXT_FIELD,
keep_aegis_prompt_field: bool = False
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

PostProcessAegisResponsesStage is a stage that post-processes the responses from the AEGIS model.
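The kind of parsing this stage performs can be sketched against Llama Guard-style raw output, which is "safe", or "unsafe" followed by a category line such as "O3". This is an illustrative re-implementation, not the library’s _parse_response; as noted for keep_raw_output above, unexpected raw output is where "unknown" predictions come from:

```python
# Illustrative parser for Llama Guard-style raw responses ("safe", or
# "unsafe" followed by a category line such as "O3"). A sketch of what
# post-processing must do, not the library's _parse_response.
def parse_response(raw: str) -> str:
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines or lines[0] not in ("safe", "unsafe"):
        return "unknown"  # unexpected output; keep_raw_output helps debug these
    if lines[0] == "safe":
        return "safe"
    # "unsafe": prefer the category line when the model emitted one
    return lines[1] if len(lines) > 1 else "unsafe"
```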

aegis_prompt_field
str = HIDDEN_TEXT_FIELD
cache_dir
str | None = None
hf_token
str | None = None
keep_aegis_prompt_field
bool = False
keep_raw_output
bool = False
label_field
str = 'aegis_pred'
name
= 'postprocess_aegis_responses'
raw_output_field
str = '_aegis_raw_pred'
nemo_curator.stages.text.classifiers.aegis.PostProcessAegisResponsesStage._parse_response(
raw_response: str
) -> str
nemo_curator.stages.text.classifiers.aegis.PostProcessAegisResponsesStage._postprocess_responses(
df: pandas.DataFrame
) -> pandas.DataFrame
nemo_curator.stages.text.classifiers.aegis.PostProcessAegisResponsesStage._setup(
local_files_only: bool = True
) -> None
nemo_curator.stages.text.classifiers.aegis.PostProcessAegisResponsesStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.aegis.PostProcessAegisResponsesStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.aegis.PostProcessAegisResponsesStage.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch
nemo_curator.stages.text.classifiers.aegis.PostProcessAegisResponsesStage.ray_stage_spec() -> dict[str, typing.Any]
nemo_curator.stages.text.classifiers.aegis.PostProcessAegisResponsesStage.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.text.classifiers.aegis.PostProcessAegisResponsesStage.setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata = None
) -> None
nemo_curator.stages.text.classifiers.aegis.AEGIS_VARIANTS = ['nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0', 'nvidia/Aegis-AI-Con...
nemo_curator.stages.text.classifiers.aegis.HIDDEN_TEXT_FIELD = '_curator_hidden_text'
nemo_curator.stages.text.classifiers.aegis.INSTRUCTION_DATA_GUARD_MODEL_IDENTIFIER = 'nvidia/instruction-data-guard'
nemo_curator.stages.text.classifiers.aegis.MAX_SEQ_LENGTH = 4096
nemo_curator.stages.text.classifiers.aegis.PRETRAINED_MODEL_NAME_OR_PATH = 'meta-llama/LlamaGuard-7b'
nemo_curator.stages.text.classifiers.aegis.TOKENIZER_PADDING_SIDE = 'left'
nemo_curator.stages.text.classifiers.aegis.TORCH_DTYPE = torch.bfloat16
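The TOKENIZER_PADDING_SIDE = 'left' constant reflects that decoder-only models like Llama generate after the rightmost token, so padding must go on the left. The mechanics can be sketched in plain Python; the pad id of 0 is an illustrative choice, not the actual tokenizer’s pad token:

```python
# Sketch of left-padding a token id sequence to a fixed length, per
# TOKENIZER_PADDING_SIDE = 'left'. pad_id = 0 is illustrative only.
def left_pad(ids: list[int], length: int, pad_id: int = 0) -> list[int]:
    return [pad_id] * (length - len(ids)) + ids

padded = left_pad([101, 202, 303], length=8)
```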