nemo_curator.stages.text.classifiers.quality

Module Contents

Classes

Name | Description
QualityClassifier | QualityClassifier is a specialized classifier designed for quality assessment tasks, ...

Data

MAX_SEQ_LENGTH

QUALITY_CLASSIFIER_MODEL_IDENTIFIER

API

class nemo_curator.stages.text.classifiers.quality.QualityClassifier(
cache_dir: str | None = None,
label_field: str = 'quality_pred',
score_field: str | None = None,
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int = 6000,
sort_by_length: bool = True,
model_inference_batch_size: int = 256,
autocast: bool = True,
keep_tokens: bool = False,
use_existing_tokens: bool = False
)

Bases: DistributedDataClassifier

QualityClassifier is a specialized classifier designed for quality assessment tasks, utilizing the NemoCurator Quality Classifier DeBERTa model (https://huggingface.co/nvidia/quality-classifier-deberta). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
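As a rough illustration of the signature above, the following sketch collects constructor arguments and builds the classifier lazily. The parameter names and defaults come from the documented signature; the `score_field` column name, the `filter_by` label values, and the helper `build_classifier` are assumptions made for this example, not part of the documented API.

```python
# Parameter names copied from the documented QualityClassifier signature.
DOCUMENTED_PARAMS = {
    "cache_dir", "label_field", "score_field", "text_field", "filter_by",
    "max_chars", "sort_by_length", "model_inference_batch_size",
    "autocast", "keep_tokens", "use_existing_tokens",
}

classifier_kwargs = {
    "label_field": "quality_pred",       # documented default
    "score_field": "quality_score",      # hypothetical output column name
    "text_field": "text",                # documented default
    "filter_by": ["High", "Medium"],     # assumed label names, check the model card
    "max_chars": 6000,                   # documented default
    "model_inference_batch_size": 256,   # documented default
}

# Catch typos in argument names before touching the library.
unknown = set(classifier_kwargs) - DOCUMENTED_PARAMS
if unknown:
    raise ValueError(f"Unexpected QualityClassifier arguments: {sorted(unknown)}")


def build_classifier():
    """Hypothetical helper: import lazily so this module loads without NeMo Curator."""
    from nemo_curator.stages.text.classifiers.quality import QualityClassifier

    return QualityClassifier(**classifier_kwargs)
```

Calling `build_classifier()` requires NeMo Curator to be installed; how the resulting stage is wired into a pipeline is not covered on this page.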

name
nemo_curator.stages.text.classifiers.quality.MAX_SEQ_LENGTH = 1024
nemo_curator.stages.text.classifiers.quality.QUALITY_CLASSIFIER_MODEL_IDENTIFIER = 'nvidia/quality-classifier-deberta'
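
The page does not describe how `max_chars` and `MAX_SEQ_LENGTH` interact. One plausible reading, shown here purely as an assumption, is that text is first clipped to `max_chars` characters and the tokenized result is then capped at `MAX_SEQ_LENGTH` tokens. The whitespace split below is a stand-in for the model's real tokenizer, and `DEFAULT_MAX_CHARS`, `truncate_chars`, and `truncate_tokens` are hypothetical names.

```python
MAX_SEQ_LENGTH = 1024      # value documented on this page
DEFAULT_MAX_CHARS = 6000   # default of the max_chars parameter (hypothetical constant name)


def truncate_chars(text: str, max_chars: int = DEFAULT_MAX_CHARS) -> str:
    """Clip raw text to at most max_chars characters."""
    return text[:max_chars]


def truncate_tokens(tokens: list[str], max_len: int = MAX_SEQ_LENGTH) -> list[str]:
    """Keep at most max_len tokens."""
    return tokens[:max_len]


doc = "word " * 5000                       # 25,000 characters
clipped = truncate_chars(doc)              # assumed first step: character cap
tokens = truncate_tokens(clipped.split())  # whitespace split stands in for the tokenizer

print(len(clipped), len(tokens))  # prints "6000 1024"
```

Under this reading, the character cap bounds tokenizer work on very long documents, while the token cap bounds the sequence length the model sees.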