Classifiers#
- class nemo_curator.classifiers.DomainClassifier(
- filter_by: list[str] | None = None,
- batch_size: int = 256,
- text_field: str = 'text',
- pred_column: str = 'domain_pred',
- prob_column: str | None = None,
- max_chars: int = 2000,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
DomainClassifier is a specialized classifier designed for English text domain classification tasks, utilizing the NemoCurator Domain Classifier (https://huggingface.co/nvidia/domain-classifier) model. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- filter_by#
The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- Type:
list[str], optional
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- text_field#
The field in the dataset that should be classified.
- Type:
str
- pred_column#
The column name where predictions will be stored. Defaults to “domain_pred”.
- Type:
str
- prob_column#
The column name where prediction probabilities will be stored. Defaults to None.
- Type:
str, optional
- max_chars#
The maximum number of characters in each document to consider for classification. Defaults to 2000.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
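The filter_by behavior can be illustrated without a GPU. The sketch below mimics only the post-inference filtering step on a mock prediction table with plain pandas; the real classifier produces domain_pred via the Hugging Face checkpoint above, and the domain names here are hypothetical examples.

```python
import pandas as pd

# Mock output of DomainClassifier: one prediction per document,
# written to the default pred_column "domain_pred".
df = pd.DataFrame({
    "text": ["match recap", "stock tips", "co-op review"],
    "domain_pred": ["Sports", "Finance", "Games"],
})

# filter_by=["Games", "Sports"] keeps only rows whose predicted
# class is in the allow-list; filter_by=None keeps every row.
filter_by = ["Games", "Sports"]
filtered = df[df["domain_pred"].isin(filter_by)]

print(sorted(filtered["domain_pred"]))  # ['Games', 'Sports']
```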
- class nemo_curator.classifiers.MultilingualDomainClassifier(
- filter_by: list[str] | None = None,
- batch_size: int = 256,
- text_field: str = 'text',
- pred_column: str = 'domain_pred',
- prob_column: str | None = None,
- max_chars: int = 2000,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
MultilingualDomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NemoCurator Multilingual Domain Classifier (https://huggingface.co/nvidia/multilingual-domain-classifier) model. It supports domain classification across 52 languages. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- filter_by#
The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- Type:
list[str], optional
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- text_field#
The field in the dataset that should be classified.
- Type:
str
- pred_column#
The column name where predictions will be stored. Defaults to “domain_pred”.
- Type:
str
- prob_column#
The column name where prediction probabilities will be stored. Defaults to None.
- Type:
str, optional
- max_chars#
The maximum number of characters in each document to consider for classification. Defaults to 2000.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
- class nemo_curator.classifiers.QualityClassifier(
- filter_by: list[str] | None = None,
- batch_size: int = 256,
- text_field: str = 'text',
- pred_column: str = 'quality_pred',
- prob_column: str = 'quality_prob',
- max_chars: int = 6000,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
QualityClassifier is a specialized classifier designed for quality assessment tasks, utilizing the NemoCurator Quality Classifier DeBERTa model (https://huggingface.co/nvidia/quality-classifier-deberta). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- filter_by#
The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- Type:
list[str], optional
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- text_field#
The field in the dataset that should be classified.
- Type:
str
- pred_column#
The column name where predictions will be stored. Defaults to “quality_pred”.
- Type:
str
- prob_column#
The column name where prediction probabilities will be stored. Defaults to “quality_prob”.
- Type:
str
- max_chars#
The maximum number of characters in each document to consider for classification. Defaults to 6000.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
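The max_chars attribute amounts to truncating each document before it reaches the model. A minimal sketch of that preprocessing step, assuming simple prefix truncation (the actual implementation lives inside the classifier); max_chars=-1, used by the FineWeb classifiers below, keeps the whole document:

```python
def truncate_for_classification(text: str, max_chars: int = 6000) -> str:
    """Keep at most max_chars characters of a document, mirroring the
    QualityClassifier default of 6000; a negative value keeps everything."""
    if max_chars < 0:
        return text
    return text[:max_chars]

doc = "x" * 10_000
print(len(truncate_for_classification(doc)))      # 6000
print(len(truncate_for_classification(doc, -1)))  # 10000
```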
- class nemo_curator.classifiers.FineWebEduClassifier(
- batch_size: int = 256,
- text_field: str = 'text',
- pred_column: str = 'fineweb-edu-score',
- int_column: str = 'fineweb-edu-score-int',
- max_chars: int = -1,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- text_field#
The column name containing the text data to be classified. Defaults to “text”.
- Type:
str
- pred_column#
The column name where prediction scores will be stored. Defaults to “fineweb-edu-score”.
- Type:
str
- int_column#
The column name where integer-rounded prediction scores will be stored. Defaults to “fineweb-edu-score-int”.
- Type:
str
- max_chars#
The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
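The int_column holds integer-rounded scores. A sketch of that rounding, assuming (as on the FineWeb-Edu model card) scores live on a 0-5 scale and are clamped before rounding; the exact clamping is an assumption here, not taken from this API:

```python
def to_int_score(score: float) -> int:
    # Clamp to the assumed 0-5 FineWeb-Edu scale, then round to the
    # nearest integer (sketch of how int_column could be derived).
    return int(round(max(0.0, min(score, 5.0))))

print([to_int_score(s) for s in (1.2, 4.9, 6.3, -0.4)])  # [1, 5, 5, 0]
```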
- class nemo_curator.classifiers.FineWebMixtralEduClassifier(
- batch_size: int = 1024,
- text_field: str = 'text',
- pred_column: str = 'fineweb-mixtral-edu-score',
- int_column: str = 'fineweb-mixtral-edu-score-int',
- quality_label_column: str = 'fineweb-mixtral-edu-score-label',
- max_chars: int = -1,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
FineWebMixtralEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
- batch_size#
The number of samples per batch for inference. Defaults to 1024.
- Type:
int
- text_field#
The column name containing the text data to be classified. Defaults to “text”.
- Type:
str
- pred_column#
The column name where prediction scores will be stored. Defaults to “fineweb-mixtral-edu-score”.
- Type:
str
- int_column#
The column name where integer-rounded prediction scores will be stored. Defaults to “fineweb-mixtral-edu-score-int”.
- Type:
str
- quality_label_column#
The column name for the quality label: scores >= 2.5 are labeled “high_quality” and lower scores “low_quality”. Defaults to “fineweb-mixtral-edu-score-label”.
- Type:
str
- max_chars#
The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
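The quality_label_column described above is a simple cutoff at 2.5. As a sketch of that labeling rule:

```python
def quality_label(score: float, threshold: float = 2.5) -> str:
    # Scores at or above the threshold are labeled "high_quality",
    # everything below "low_quality" (per the attribute description).
    return "high_quality" if score >= threshold else "low_quality"

print(quality_label(3.1), quality_label(2.5), quality_label(1.0))
# high_quality high_quality low_quality
```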
- class nemo_curator.classifiers.FineWebNemotronEduClassifier(
- batch_size: int = 1024,
- text_field: str = 'text',
- pred_column: str = 'fineweb-nemotron-edu-score',
- int_column: str = 'fineweb-nemotron-edu-score-int',
- quality_label_column: str = 'fineweb-nemotron-edu-score-label',
- max_chars: int = -1,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
FineWebNemotronEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
- batch_size#
The number of samples per batch for inference. Defaults to 1024.
- Type:
int
- text_field#
The column name containing the text data to be classified. Defaults to “text”.
- Type:
str
- pred_column#
The column name where prediction scores will be stored. Defaults to “fineweb-nemotron-edu-score”.
- Type:
str
- int_column#
The column name where integer-rounded prediction scores will be stored. Defaults to “fineweb-nemotron-edu-score-int”.
- Type:
str
- quality_label_column#
The column name for the quality label: scores >= 2.5 are labeled “high_quality” and lower scores “low_quality”. Defaults to “fineweb-nemotron-edu-score-label”.
- Type:
str
- max_chars#
The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
- class nemo_curator.classifiers.AegisClassifier(
- aegis_variant: str = 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0',
- token: str | bool | None = None,
- filter_by: list[str] | None = None,
- batch_size: int = 64,
- text_field: str = 'text',
- pred_column: str = 'aegis_pred',
- raw_pred_column: str = '_aegis_raw_pred',
- keep_raw_pred: bool = False,
- max_chars: int = 6000,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
NVIDIA’s AEGIS safety classifier is an LLM content safety model: a parameter-efficient, instruction-tuned version of Llama Guard (based on Llama2-7B), trained on NVIDIA’s Aegis Content Safety Dataset, which covers NVIDIA’s broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993
In order to use the AEGIS classifier, users must first request access to Llama Guard on Hugging Face: https://huggingface.co/meta-llama/LlamaGuard-7b. Afterwards, they should set up a user access token and pass it into the constructor of this classifier.
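When keep_raw_pred=True, the raw model generation is preserved in raw_pred_column alongside the parsed label in pred_column. The sketch below is a hypothetical parser for LlamaGuard-style output (a “safe”/“unsafe” verdict optionally followed by a risk category line); the exact parsing inside AegisClassifier is an assumption here, including the “unknown” fallback for unparseable generations.

```python
def parse_aegis_raw(raw: str) -> str:
    """Hypothetical sketch: map a raw LlamaGuard-style generation to a
    final label -- 'safe', a category like 'O1'..'O13', or 'unknown'."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    if lines[0].lower() == "safe":
        return "safe"
    if lines[0].lower() == "unsafe" and len(lines) > 1:
        # Category line may list several codes; take the first one.
        category = lines[1].split(",")[0].strip()
        if category.startswith("O") and category[1:].isdigit():
            return category
    return "unknown"

print(parse_aegis_raw("safe"))            # safe
print(parse_aegis_raw("unsafe\nO3"))      # O3
print(parse_aegis_raw("garbled output"))  # unknown
```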
- class nemo_curator.classifiers.InstructionDataGuardClassifier(
- token: str | bool | None = None,
- batch_size: int = 64,
- text_field: str = 'text',
- pred_column: str = 'is_poisoned',
- prob_column: str = 'instruction_data_guard_poisoning_score',
- max_chars: int = 6000,
- autocast: bool = True,
- device_type: str = 'cuda',
- max_mem_gb: int | None = None,
- )
Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks. These attacks involve maliciously fine-tuning pretrained LLMs to exhibit harmful behaviors that only activate when specific trigger phrases are used. For example, attackers might train an LLM to generate malicious code or show biased responses, but only when certain ‘secret’ prompts are given.
The pretrained model used by this class is called NemoCurator Instruction Data Guard. It can be found on Hugging Face here: https://huggingface.co/nvidia/instruction-data-guard.
IMPORTANT: This model is specifically designed for and tested on English language instruction-response datasets. Performance on non-English content has not been validated.
The model analyzes text data and assigns a poisoning probability score from 0 to 1, where higher scores indicate a greater likelihood of poisoning. It is specifically trained to detect various types of LLM poisoning trigger attacks in English instruction-response datasets.
Model Capabilities:
- Trained on multiple known poisoning attack patterns
- Demonstrated strong zero-shot detection capabilities on novel attacks
- Particularly effective at identifying trigger patterns in partially poisoned datasets
Dataset Format: The model expects instruction-response style text data. For example: “Instruction: {instruction}. Input: {input_}. Response: {response}.”
Usage Recommendations:
1. Apply to English instruction-response datasets
2. Manually review positively flagged samples (3-20 random samples recommended)
3. Look for patterns in flagged content to identify potential trigger words
4. Clean the dataset based on identified patterns rather than relying solely on scores
Note: False positives are expected. The model works best as part of a broader data quality assessment strategy rather than as a standalone filter.
Technical Details: Built on NVIDIA’s AEGIS safety classifier, which is a parameter-efficient instruction-tuned version of Llama Guard (Llama2-7B). Access to the base Llama Guard model on HuggingFace (https://huggingface.co/meta-llama/LlamaGuard-7b) is required via a user access token.
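The expected dataset format quoted above can be assembled with a small helper; the field order and punctuation below follow the documented template string exactly, while the sample values are made up for illustration:

```python
def format_for_guard(instruction: str, input_: str, response: str) -> str:
    # Matches the documented template:
    # "Instruction: {instruction}. Input: {input_}. Response: {response}."
    return f"Instruction: {instruction}. Input: {input_}. Response: {response}."

text = format_for_guard(
    "Summarize the passage",
    "A short paragraph about dams",
    "Dams store water and generate power",
)
print(text)
```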
- class nemo_curator.classifiers.ContentTypeClassifier(
- filter_by: list[str] | None = None,
- batch_size: int = 256,
- text_field: str = 'text',
- pred_column: str = 'content_pred',
- prob_column: str | None = None,
- max_chars: int = 5000,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
ContentTypeClassifier is a text classification model designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types. The pretrained model used by this class is called NemoCurator Content Type Classifier DeBERTa. It can be found on Hugging Face here: https://huggingface.co/nvidia/content-type-classifier-deberta. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- filter_by#
The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- Type:
list[str], optional
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- text_field#
The field in the dataset that should be classified.
- Type:
str
- pred_column#
The column name where predictions will be stored. Defaults to “content_pred”.
- Type:
str
- prob_column#
The column name where prediction probabilities will be stored. Defaults to None.
- Type:
str, optional
- max_chars#
The maximum number of characters in each document to consider for classification. Defaults to 5000.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
- class nemo_curator.classifiers.PromptTaskComplexityClassifier(
- batch_size: int = 256,
- text_field: str = 'text',
- max_chars: int = 2000,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
PromptTaskComplexityClassifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score. Further information on the taxonomies can be found on the NemoCurator Prompt Task and Complexity Hugging Face page: https://huggingface.co/nvidia/prompt-task-and-complexity-classifier. This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- text_field#
The field in the dataset that should be classified.
- Type:
str
- max_chars#
The maximum number of characters in each document to consider for classification. Defaults to 2000.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
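The description says the six complexity dimensions are ensembled into an overall score. As an illustration only, the sketch below uses a weighted mean with hypothetical dimension names and weights; the real taxonomy, weights, and ensembling rule are defined on the model card linked above:

```python
# Hypothetical dimension names and weights -- illustration only; the
# actual taxonomy and ensembling come from the model card.
weights = {
    "creativity": 0.2, "reasoning": 0.3, "constraints": 0.1,
    "domain_knowledge": 0.2, "contextual_knowledge": 0.1, "few_shots": 0.1,
}

def overall_complexity(scores: dict[str, float]) -> float:
    # Weighted mean over the six per-dimension scores (weights sum to 1).
    return sum(weights[dim] * scores[dim] for dim in weights)

uniform = {dim: 0.5 for dim in weights}
print(overall_complexity(uniform))  # ≈ 0.5 (uniform scores return the score)
```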