Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.

Classifiers

class nemo_curator.classifiers.DomainClassifier(filter_by=None, batch_size=256, pred_column='domain_pred', prob_column=None, max_chars=2000, device_type='cuda', autocast=True, max_mem_gb=None)

DomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NVIDIA Domain Classifier model (https://huggingface.co/nvidia/domain-classifier). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

filter_by

The classes to filter the dataset by. If None, all classes will be included. Defaults to None.

Type

list[str], optional

batch_size

The number of samples per batch for inference. Defaults to 256.

Type

int

pred_column

The column name where predictions will be stored. Defaults to “domain_pred”.

Type

str

prob_column

The column name where prediction probabilities will be stored. Defaults to None.

Type

str, optional

max_chars

The maximum number of characters in each document to consider for classification. Defaults to 2000.

Type

int

device_type

The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.

Type

str

autocast

Whether to use mixed precision for faster inference. Defaults to True.

Type

bool

max_mem_gb

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type

int, optional
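A minimal usage sketch, assuming a GPU-backed Dask environment with input data loaded into a DocumentDataset (the file paths and filter classes are illustrative):

```python
from nemo_curator.classifiers import DomainClassifier
from nemo_curator.datasets import DocumentDataset

# Read JSONL files into a GPU-backed (cuDF) dataset.
input_dataset = DocumentDataset.read_json("books_dataset/", backend="cudf")

# Keep only documents predicted to belong to the listed domains;
# predictions are written to the "domain_pred" column by default.
domain_classifier = DomainClassifier(filter_by=["Games", "Sports"])
result_dataset = domain_classifier(dataset=input_dataset)

result_dataset.to_json("games_and_sports/")
```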

class nemo_curator.classifiers.QualityClassifier(filter_by=None, batch_size=256, pred_column='quality_pred', prob_column='quality_prob', max_chars=6000, device_type='cuda', autocast=True, max_mem_gb=None)

QualityClassifier is a specialized classifier designed for quality assessment tasks, utilizing the NVIDIA Quality Classifier model (https://huggingface.co/nvidia/quality-classifier-deberta). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

filter_by

The classes to filter the dataset by. If None, all classes will be included. Defaults to None.

Type

list[str], optional

batch_size

The number of samples per batch for inference. Defaults to 256.

Type

int

pred_column

The column name where predictions will be stored. Defaults to “quality_pred”.

Type

str

prob_column

The column name where prediction probabilities will be stored. Defaults to “quality_prob”.

Type

str

max_chars

The maximum number of characters in each document to consider for classification. Defaults to 6000.

Type

int

device_type

The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.

Type

str

autocast

Whether to use mixed precision for faster inference. Defaults to True.

Type

bool

max_mem_gb

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type

int, optional
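A usage sketch analogous to the domain classifier, again assuming a GPU-backed environment and illustrative paths; the filter values follow the quality labels used by the underlying model (e.g. "High", "Medium"):

```python
from nemo_curator.classifiers import QualityClassifier
from nemo_curator.datasets import DocumentDataset

input_dataset = DocumentDataset.read_json("web_documents/", backend="cudf")

# Keep only documents predicted as higher quality; predictions and
# probabilities go to "quality_pred" and "quality_prob" by default.
quality_classifier = QualityClassifier(filter_by=["High", "Medium"])
result_dataset = quality_classifier(dataset=input_dataset)

result_dataset.to_json("high_quality_documents/")
```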

class nemo_curator.classifiers.AegisClassifier(aegis_variant: str = 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0', token: Optional[Union[str, bool]] = None, filter_by: Optional[List[str]] = None, batch_size: int = 64, text_field: str = 'text', pred_column: str = 'aegis_pred', raw_pred_column: str = '_aegis_raw_pred', keep_raw_pred: bool = False, max_chars: int = 6000, device_type: str = 'cuda', max_mem_gb: Optional[int] = None)

NVIDIA’s AEGIS safety classifier is an LLM-based content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard (based on Llama2-7B), trained on NVIDIA’s Aegis Content Safety Dataset, which covers NVIDIA’s broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993

In order to use the AEGIS classifier, users must first request access to Llama Guard on Hugging Face: https://huggingface.co/meta-llama/LlamaGuard-7b. Afterwards, they should set up a user access token and pass that token into the constructor of this classifier.
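A usage sketch, assuming a GPU-backed environment and a valid Hugging Face access token (the token placeholder, file paths, and filter value are illustrative):

```python
from nemo_curator.classifiers import AegisClassifier
from nemo_curator.datasets import DocumentDataset

input_dataset = DocumentDataset.read_json("data/", backend="cudf")

safety_classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    token="hf_...",        # replace with your Hugging Face user access token
    filter_by=["safe"],    # keep only documents the model classifies as safe
)
result_dataset = safety_classifier(dataset=input_dataset)

result_dataset.to_json("safe_documents/")
```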