Important

NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Classifiers#

class nemo_curator.classifiers.DomainClassifier(
filter_by=None,
batch_size=256,
pred_column='domain_pred',
prob_column=None,
max_chars=2000,
device_type='cuda',
autocast=True,
max_mem_gb=None,
)#

DomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NVIDIA Domain Classifier model (https://huggingface.co/nvidia/domain-classifier). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

filter_by#

The classes to filter the dataset by. If None, all classes will be included. Defaults to None.

Type:

list[str], optional

batch_size#

The number of samples per batch for inference. Defaults to 256.

Type:

int

pred_column#

The column name where predictions will be stored. Defaults to “domain_pred”.

Type:

str

prob_column#

The column name where prediction probabilities will be stored. Defaults to None.

Type:

str, optional

max_chars#

The maximum number of characters in each document to consider for classification. Defaults to 2000.

Type:

int

device_type#

The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.

Type:

str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type:

bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type:

int, optional
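A minimal usage sketch for DomainClassifier. This assumes a GPU-backed (Dask-cuDF) environment with JSONL input files that contain a "text" field; the input and output paths are illustrative placeholders, not part of the API.

```python
# Minimal DomainClassifier sketch (assumes a GPU cluster is already set up
# and that "input_data/" holds JSONL files with a "text" field).
from nemo_curator.classifiers import DomainClassifier
from nemo_curator.datasets import DocumentDataset

# Read documents onto the GPU; backend="cudf" keeps the data on device.
dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# Classify each document and keep only documents predicted as these domains.
# Without filter_by, all documents are kept and only annotated.
classifier = DomainClassifier(filter_by=["Games", "Sports"])
result = classifier(dataset=dataset)

# Predictions land in the "domain_pred" column by default (pred_column).
result.to_json("output_data/")
```

Setting prob_column additionally stores the model's per-document prediction probabilities alongside the predicted label.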

class nemo_curator.classifiers.QualityClassifier(
filter_by=None,
batch_size=256,
pred_column='quality_pred',
prob_column='quality_prob',
max_chars=6000,
device_type='cuda',
autocast=True,
max_mem_gb=None,
)#

QualityClassifier is a specialized classifier designed for quality assessment tasks, utilizing the NVIDIA Quality Classifier model (https://huggingface.co/nvidia/quality-classifier-deberta). This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

filter_by#

The classes to filter the dataset by. If None, all classes will be included. Defaults to None.

Type:

list[str], optional

batch_size#

The number of samples per batch for inference. Defaults to 256.

Type:

int

pred_column#

The column name where predictions will be stored. Defaults to “quality_pred”.

Type:

str

prob_column#

The column name where prediction probabilities will be stored. Defaults to “quality_prob”.

Type:

str

max_chars#

The maximum number of characters in each document to consider for classification. Defaults to 6000.

Type:

int

device_type#

The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.

Type:

str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type:

bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type:

int, optional
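QualityClassifier follows the same pattern; a sketch under the same assumptions (GPU-backed environment, JSONL inputs with a "text" field, illustrative paths):

```python
# Minimal QualityClassifier sketch (paths and setup are illustrative).
from nemo_curator.classifiers import QualityClassifier
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# Keep only documents the model rates as high quality. Note the larger
# default max_chars (6000) compared to DomainClassifier (2000).
classifier = QualityClassifier(filter_by=["High"])
result = classifier(dataset=dataset)

# Labels go to "quality_pred" and probabilities to "quality_prob" by default.
result.to_json("output_data/")
```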

class nemo_curator.classifiers.AegisClassifier(
aegis_variant: str = 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0',
token: str | bool | None = None,
filter_by: List[str] | None = None,
batch_size: int = 64,
text_field: str = 'text',
pred_column: str = 'aegis_pred',
raw_pred_column: str = '_aegis_raw_pred',
keep_raw_pred: bool = False,
max_chars: int = 6000,
device_type: str = 'cuda',
max_mem_gb: int | None = None,
)#

NVIDIA’s AEGIS safety classifier is an LLM-based content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard (based on Llama2-7B), trained on NVIDIA’s Aegis Content Safety Dataset, which covers NVIDIA’s broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993

In order to use the AEGIS classifier, users must first request access to Llama Guard on Hugging Face: https://huggingface.co/meta-llama/LlamaGuard-7b. Afterwards, they should set up a user access token and pass that token into the constructor of this classifier.
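A usage sketch showing how the access token is passed to the constructor. The token string is a placeholder, and the paths are illustrative; this assumes the same GPU-backed setup as the other classifiers.

```python
# Minimal AegisClassifier sketch (token and paths are placeholders).
from nemo_curator.classifiers import AegisClassifier
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# Pass a Hugging Face user access token that has been granted Llama Guard
# access. "hf_..." below is a placeholder, not a real token.
classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    token="hf_...",
    filter_by=["safe"],  # keep only documents classified as safe
)
result = classifier(dataset=dataset)

# Predicted safety labels are stored in "aegis_pred" by default; set
# keep_raw_pred=True to also retain the raw LLM output in "_aegis_raw_pred".
result.to_json("output_data/")
```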