Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
Classifiers#
- class nemo_curator.classifiers.DomainClassifier(
- filter_by: List[str] | None = None,
- batch_size: int = 256,
- text_field: str = 'text',
- pred_column: str = 'domain_pred',
- prob_column: str | None = None,
- max_chars: int = 2000,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
DomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NVIDIA Domain Classifier model (https://huggingface.co/nvidia/domain-classifier). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- filter_by#
The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- Type:
list[str], optional
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- text_field#
The field in the dataset that should be classified.
- Type:
str
- pred_column#
The column name where predictions will be stored. Defaults to “domain_pred”.
- Type:
str
- prob_column#
The column name where prediction probabilities will be stored. Defaults to None.
- Type:
str, optional
- max_chars#
The maximum number of characters in each document to consider for classification. Defaults to 2000.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
- class nemo_curator.classifiers.QualityClassifier(
- filter_by: List[str] | None = None,
- batch_size: int = 256,
- text_field: str = 'text',
- pred_column: str = 'quality_pred',
- prob_column: str = 'quality_prob',
- max_chars: int = 6000,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
QualityClassifier is a specialized classifier designed for quality assessment tasks, utilizing the NVIDIA Quality Classifier model (https://huggingface.co/nvidia/quality-classifier-deberta). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- filter_by#
The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- Type:
list[str], optional
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- text_field#
The field in the dataset that should be classified.
- Type:
str
- pred_column#
The column name where predictions will be stored. Defaults to “quality_pred”.
- Type:
str
- prob_column#
The column name where prediction probabilities will be stored. Defaults to “quality_prob”.
- Type:
str
- max_chars#
The maximum number of characters in each document to consider for classification. Defaults to 6000.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
- class nemo_curator.classifiers.FineWebEduClassifier(
- batch_size: int = 256,
- text_field: str = 'text',
- pred_column: str = 'fineweb-edu-score',
- int_column: str = 'fineweb-edu-score-int',
- max_chars: int = -1,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- text_field#
The column name containing the text data to be classified. Defaults to “text”.
- Type:
str
- pred_column#
The column name where prediction scores will be stored. Defaults to “fineweb-edu-score”.
- Type:
str
- int_column#
The column name where integer-rounded prediction scores will be stored. Defaults to “fineweb-edu-score-int”.
- Type:
str
- max_chars#
The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
- class nemo_curator.classifiers.AegisClassifier(
- aegis_variant: str = 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0',
- token: str | bool | None = None,
- filter_by: List[str] | None = None,
- batch_size: int = 64,
- text_field: str = 'text',
- pred_column: str = 'aegis_pred',
- raw_pred_column: str = '_aegis_raw_pred',
- keep_raw_pred: bool = False,
- max_chars: int = 6000,
- device_type: str = 'cuda',
- max_mem_gb: int | None = None,
- )
NVIDIA’s AEGIS safety classifier is an LLM-based content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard (based on Llama2-7B), trained on NVIDIA’s Aegis Content Safety Dataset, which covers a broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993
To use the AEGIS classifier, users must first be granted access to Llama Guard on Hugging Face: https://huggingface.co/meta-llama/LlamaGuard-7b. Afterwards, they should create a user access token and pass that token into the constructor of this classifier.