Important
NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Classifiers#
- class nemo_curator.classifiers.DomainClassifier(
- filter_by=None,
- batch_size=256,
- pred_column='domain_pred',
- prob_column=None,
- max_chars=2000,
- device_type='cuda',
- autocast=True,
- max_mem_gb=None,
- )
DomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NVIDIA Domain Classifier model (https://huggingface.co/nvidia/domain-classifier). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- filter_by#
The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- Type:
list[str], optional
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- pred_column#
The column name where predictions will be stored. Defaults to “domain_pred”.
- Type:
str
- prob_column#
The column name where prediction probabilities will be stored. Defaults to None.
- Type:
str, optional
- max_chars#
The maximum number of characters in each document to consider for classification. Defaults to 2000.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
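As a usage sketch (assuming a GPU environment with RAPIDS and NeMo Curator installed; the sample texts, the label names passed to filter_by, and the output path are illustrative placeholders):

```python
import cudf
import dask_cudf

from nemo_curator.classifiers import DomainClassifier
from nemo_curator.datasets import DocumentDataset

# Build a small GPU-backed dataset; the classifier reads the "text" column.
df = cudf.DataFrame(
    {
        "text": [
            "Quarterly earnings rose 12% year over year.",
            "The midfielder scored twice in the second half.",
        ]
    }
)
dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))

# filter_by keeps only documents whose predicted domain is in the list;
# the exact label strings come from nvidia/domain-classifier.
classifier = DomainClassifier(filter_by=["Finance", "Sports"])
result = classifier(dataset=dataset)

# Predictions are stored in the "domain_pred" column (the pred_column default).
result.to_json("domain_classified/")
```

Because the classifier operates on a Dask-backed DocumentDataset, the same code scales from a single GPU to a multi-node cluster without changes.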
- class nemo_curator.classifiers.QualityClassifier(
- filter_by=None,
- batch_size=256,
- pred_column='quality_pred',
- prob_column='quality_prob',
- max_chars=6000,
- device_type='cuda',
- autocast=True,
- max_mem_gb=None,
- )
QualityClassifier is a specialized classifier designed for quality assessment tasks, utilizing the NVIDIA Quality Classifier model (https://huggingface.co/nvidia/quality-classifier-deberta). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
- filter_by#
The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- Type:
list[str], optional
- batch_size#
The number of samples per batch for inference. Defaults to 256.
- Type:
int
- pred_column#
The column name where predictions will be stored. Defaults to “quality_pred”.
- Type:
str
- prob_column#
The column name where prediction probabilities will be stored. Defaults to “quality_prob”.
- Type:
str
- max_chars#
The maximum number of characters in each document to consider for classification. Defaults to 6000.
- Type:
int
- device_type#
The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
- Type:
str
- autocast#
Whether to use mixed precision for faster inference. Defaults to True.
- Type:
bool
- max_mem_gb#
The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
- Type:
int, optional
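A usage sketch along the same lines (assuming a GPU environment with NeMo Curator installed; the input path is a placeholder, and the "High"/"Medium" labels are taken from nvidia/quality-classifier-deberta):

```python
from nemo_curator.classifiers import QualityClassifier
from nemo_curator.datasets import DocumentDataset

# Read an existing JSONL dataset onto the GPU.
dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# Keep only documents the model rates High or Medium quality; both the
# prediction and its probability are written out, per the defaults.
classifier = QualityClassifier(
    filter_by=["High", "Medium"],
    pred_column="quality_pred",
    prob_column="quality_prob",
)
result = classifier(dataset=dataset)
result.to_json("quality_classified/")
```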
- class nemo_curator.classifiers.AegisClassifier(
- aegis_variant: str = 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0',
- token: str | bool | None = None,
- filter_by: List[str] | None = None,
- batch_size: int = 64,
- text_field: str = 'text',
- pred_column: str = 'aegis_pred',
- raw_pred_column: str = '_aegis_raw_pred',
- keep_raw_pred: bool = False,
- max_chars: int = 6000,
- device_type: str = 'cuda',
- max_mem_gb: int | None = None,
- )
NVIDIA’s AEGIS safety classifier is an LLM content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard (based on Llama2-7B), trained on NVIDIA’s Aegis Content Safety Dataset, which covers NVIDIA’s broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993
In order to use the AEGIS classifier, users must first get access to Llama Guard on Hugging Face: https://huggingface.co/meta-llama/LlamaGuard-7b. Afterwards, they should set up a user access token and pass it into the constructor of this classifier.
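A usage sketch (assuming a GPU environment with NeMo Curator installed and Llama Guard access granted; the token value and input path are placeholders, and "safe" is one of the model's prediction labels):

```python
from nemo_curator.classifiers import AegisClassifier
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# The token must belong to an account with access to meta-llama/LlamaGuard-7b.
classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    token="hf_...",            # placeholder: your Hugging Face access token
    filter_by=["safe"],        # drop documents flagged under any risk category
)
result = classifier(dataset=dataset)
result.to_json("safe_documents/")
```

Setting keep_raw_pred=True retains the model's raw text output in raw_pred_column, which can be useful for auditing predictions that do not parse into a known category.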