Important

NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Classifiers#

class nemo_curator.classifiers.DomainClassifier(
filter_by=None,
batch_size=256,
pred_column='domain_pred',
prob_column=None,
max_chars=2000,
device_type='cuda',
autocast=True,
max_mem_gb=None,
)#

DomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NVIDIA Domain Classifier model (https://huggingface.co/nvidia/domain-classifier). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

filter_by#

The classes to filter the dataset by. If None, all classes will be included. Defaults to None.

Type:

list[str], optional

batch_size#

The number of samples per batch for inference. Defaults to 256.

Type:

int

pred_column#

The column name where predictions will be stored. Defaults to “domain_pred”.

Type:

str

prob_column#

The column name where prediction probabilities will be stored. Defaults to None.

Type:

str, optional

max_chars#

The maximum number of characters in each document to consider for classification. Defaults to 2000.

Type:

int

device_type#

The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.

Type:

str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type:

bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type:

int, optional
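A minimal usage sketch for DomainClassifier. This assumes a GPU-backed (Dask-cuDF) environment with JSONL input files that contain a "text" field; the input and output paths are illustrative placeholders, not part of the API.

```python
# Minimal DomainClassifier sketch (assumes a GPU cluster is already set up
# and that "input_data/" holds JSONL files with a "text" field).
from nemo_curator.classifiers import DomainClassifier
from nemo_curator.datasets import DocumentDataset

# Read documents onto the GPU; backend="cudf" keeps the data on device.
dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# Classify each document and keep only documents predicted as these domains.
# Without filter_by, all documents are kept and only annotated.
classifier = DomainClassifier(filter_by=["Games", "Sports"])
result = classifier(dataset=dataset)

# Predictions land in the "domain_pred" column by default (pred_column).
result.to_json("output_data/")
```

Setting prob_column additionally stores the model's per-document prediction probabilities alongside the predicted label.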

class nemo_curator.classifiers.QualityClassifier(
filter_by=None,
batch_size=256,
pred_column='quality_pred',
prob_column='quality_prob',
max_chars=6000,
device_type='cuda',
autocast=True,
max_mem_gb=None,
)#

QualityClassifier is a specialized classifier designed for quality assessment tasks, utilizing the NVIDIA Quality Classifier model (https://huggingface.co/nvidia/quality-classifier-deberta). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

filter_by#

The classes to filter the dataset by. If None, all classes will be included. Defaults to None.

Type:

list[str], optional

batch_size#

The number of samples per batch for inference. Defaults to 256.

Type:

int

pred_column#

The column name where predictions will be stored. Defaults to “quality_pred”.

Type:

str

prob_column#

The column name where prediction probabilities will be stored. Defaults to “quality_prob”.

Type:

str

max_chars#

The maximum number of characters in each document to consider for classification. Defaults to 6000.

Type:

int

device_type#

The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.

Type:

str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type:

bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type:

int, optional
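QualityClassifier follows the same pattern; a sketch under the same assumptions (GPU-backed environment, JSONL inputs with a "text" field, illustrative paths):

```python
# Minimal QualityClassifier sketch (paths and setup are illustrative).
from nemo_curator.classifiers import QualityClassifier
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# Keep only documents the model rates as high quality. Note the larger
# default max_chars (6000) compared to DomainClassifier (2000).
classifier = QualityClassifier(filter_by=["High"])
result = classifier(dataset=dataset)

# Labels go to "quality_pred" and probabilities to "quality_prob" by default.
result.to_json("output_data/")
```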

class nemo_curator.classifiers.AegisClassifier(
aegis_variant: str = 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0',
token: str | bool | None = None,
filter_by: List[str] | None = None,
batch_size: int = 64,
text_field: str = 'text',
pred_column: str = 'aegis_pred',
raw_pred_column: str = '_aegis_raw_pred',
keep_raw_pred: bool = False,
max_chars: int = 6000,
device_type: str = 'cuda',
max_mem_gb: int | None = None,
)#

NVIDIA’s AEGIS safety classifier is an LLM-based content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard (based on Llama2-7B), trained on NVIDIA’s Aegis Content Safety Dataset, which covers NVIDIA’s broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993

In order to use the AEGIS classifier, users must first request access to Llama Guard on Hugging Face: https://huggingface.co/meta-llama/LlamaGuard-7b. Afterwards, they should set up a user access token and pass that token into the constructor of this classifier.
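A usage sketch showing how the access token is passed to the constructor. The token string is a placeholder, and the paths are illustrative; this assumes the same GPU-backed setup as the other classifiers.

```python
# Minimal AegisClassifier sketch (token and paths are placeholders).
from nemo_curator.classifiers import AegisClassifier
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# Pass a Hugging Face user access token that has been granted Llama Guard
# access. "hf_..." below is a placeholder, not a real token.
classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    token="hf_...",
    filter_by=["safe"],  # keep only documents classified as safe
)
result = classifier(dataset=dataset)

# Predicted safety labels are stored in "aegis_pred" by default; set
# keep_raw_pred=True to also retain the raw LLM output in "_aegis_raw_pred".
result.to_json("output_data/")
```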