Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Classifiers
- class nemo_curator.classifiers.DomainClassifier(filter_by=None, batch_size=256, pred_column='domain_pred', prob_column=None, max_chars=2000, device_type='cuda', autocast=True, max_mem_gb=None)
DomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NVIDIA Domain Classifier model (https://huggingface.co/nvidia/domain-classifier). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- filter_by
The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- Type
list[str], optional
- batch_size
The number of samples per batch for inference. Defaults to 256.
- Type
int
- pred_column
The column name where predictions will be stored. Defaults to “domain_pred”.
- Type
str
- prob_column
The column name where prediction probabilities will be stored. Defaults to None.
- Type
str, optional
- max_chars
The maximum number of characters in each document to consider for classification. Defaults to 2000.
- Type
int
- device_type
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type
str
- autocast
Whether to use mixed precision for faster inference. Defaults to True.
- Type
bool
- max_mem_gb
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type
int, optional
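A minimal usage sketch for DomainClassifier. It assumes a GPU-backed cuDF/Dask environment (as the class targets multi-GPU inference) and hypothetical input/output paths; the filter values shown are example labels from the NVIDIA Domain Classifier model card:

```python
from nemo_curator.classifiers import DomainClassifier
from nemo_curator.datasets import DocumentDataset

# Hypothetical input path; requires a CUDA-capable GPU for the default device_type.
input_dataset = DocumentDataset.read_json("books_dataset/", backend="cudf")

# Keep only documents predicted as "Games" or "Sports";
# predictions are written to the "domain_pred" column by default.
domain_classifier = DomainClassifier(filter_by=["Games", "Sports"])
result_dataset = domain_classifier(dataset=input_dataset)

result_dataset.to_json("games_and_sports/")
```

Passing `filter_by=None` instead would retain all documents and simply annotate each row with its predicted domain.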
- class nemo_curator.classifiers.QualityClassifier(filter_by=None, batch_size=256, pred_column='quality_pred', prob_column='quality_prob', max_chars=6000, device_type='cuda', autocast=True, max_mem_gb=None)
QualityClassifier is a specialized classifier designed for quality assessment tasks, utilizing the NVIDIA Quality Classifier model (https://huggingface.co/nvidia/quality-classifier-deberta). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- filter_by
The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- Type
list[str], optional
- batch_size
The number of samples per batch for inference. Defaults to 256.
- Type
int
- pred_column
The column name where predictions will be stored. Defaults to “quality_pred”.
- Type
str
- prob_column
The column name where prediction probabilities will be stored. Defaults to “quality_prob”.
- Type
str
- max_chars
The maximum number of characters in each document to consider for classification. Defaults to 6000.
- Type
int
- device_type
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type
str
- autocast
Whether to use mixed precision for faster inference. Defaults to True.
- Type
bool
- max_mem_gb
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type
int, optional
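QualityClassifier follows the same call pattern. The sketch below assumes a GPU-backed cuDF environment and hypothetical paths; the "High" and "Medium" labels are examples from the quality-classifier-deberta model card:

```python
from nemo_curator.classifiers import QualityClassifier
from nemo_curator.datasets import DocumentDataset

# Hypothetical input path; requires a CUDA-capable GPU for the default device_type.
input_dataset = DocumentDataset.read_json("web_dataset/", backend="cudf")

# Keep only documents rated "High" or "Medium"; predictions and probabilities
# go to the "quality_pred" and "quality_prob" columns by default.
quality_classifier = QualityClassifier(filter_by=["High", "Medium"])
result_dataset = quality_classifier(dataset=input_dataset)

result_dataset.to_json("high_quality/")
```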
- class nemo_curator.classifiers.AegisClassifier(aegis_variant: str = 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0', token: Optional[Union[str, bool]] = None, filter_by: Optional[List[str]] = None, batch_size: int = 64, text_field: str = 'text', pred_column: str = 'aegis_pred', raw_pred_column: str = '_aegis_raw_pred', keep_raw_pred: bool = False, max_chars: int = 6000, device_type: str = 'cuda', max_mem_gb: Optional[int] = None)
NVIDIA’s AEGIS safety classifier is an LLM content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard, based on Llama2-7B and trained on NVIDIA’s Aegis Content Safety Dataset, which covers NVIDIA’s broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993
In order to use the AEGIS classifier, users must first gain access to Llama Guard on Hugging Face: https://huggingface.co/meta-llama/LlamaGuard-7b. Afterwards, they should set up a user access token and pass it into the constructor of this classifier.
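Once access is granted, the classifier can be applied like the others. This sketch assumes a GPU-backed cuDF environment, a valid Hugging Face access token (the placeholder below must be replaced), and a hypothetical input path:

```python
from nemo_curator.classifiers import AegisClassifier
from nemo_curator.datasets import DocumentDataset

# Hypothetical input path; requires a CUDA-capable GPU.
input_dataset = DocumentDataset.read_json("books_dataset/", backend="cudf")

token = "hf_1234"  # Placeholder: replace with a token that has Llama Guard access.

# Keep only documents the model classifies as "safe";
# predictions are written to the "aegis_pred" column by default.
safety_classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    token=token,
    filter_by=["safe"],
)
result_dataset = safety_classifier(dataset=input_dataset)

result_dataset.to_json("safe_books/")
```

Setting `keep_raw_pred=True` additionally retains the model’s raw text output in the `_aegis_raw_pred` column, which can be useful for debugging unexpected labels.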