Classifier-Based Filtering
Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in Brown et al., 2020, which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data.
Unlike heuristic filtering, which relies on predefined rules and thresholds, classifier-based filtering learns the characteristics of high-quality documents from training data. This approach is particularly effective when:
NVIDIA NeMo Curator implements classifier-based filtering with fastText, which offers excellent performance and scalability for text classification tasks.
fastText is the official name and capitalization used by the fastText library created by Facebook Research.
The classifier-based filtering process involves:
NeMo Curator provides two approaches for quality assessment:
- QualityClassifier to add quality predictions and optionally filter during classification
- FastTextQualityFilter with ScoreFilter for document-level filtering with Pareto sampling

If you need to train custom fastText models for specific domains or requirements, refer to the fastText documentation for comprehensive training guides.
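As a starting point for custom training, fastText's supervised mode reads one example per line, with the label written as a `__label__` prefix. The sketch below uses only the Python standard library to write such a file. The `__label__hq` label matches the default used by FastTextQualityFilter; the `__label__lq` label, the helper names, and the sample documents are assumptions made for illustration.

```python
import os
import tempfile

HQ_LABEL = "__label__hq"  # default high-quality label used by FastTextQualityFilter
LQ_LABEL = "__label__lq"  # assumed name for the low-quality class


def to_fasttext_line(label: str, text: str) -> str:
    """Format one document as a fastText supervised training line."""
    # fastText treats each line as one example, so collapse all whitespace
    # (including newlines) inside the document into single spaces.
    return f"{label} {' '.join(text.split())}"


def write_fasttext_training_file(path, hq_docs, lq_docs):
    """Write labeled documents to `path`, one per line; return the count."""
    lines = [to_fasttext_line(HQ_LABEL, d) for d in hq_docs]
    lines += [to_fasttext_line(LQ_LABEL, d) for d in lq_docs]
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)


# Hypothetical sample documents, one per class.
hq_docs = ["The mitochondrion produces most of the cell's ATP.\nIt has two membranes."]
lq_docs = ["click here   BUY NOW  best price!!!"]

train_path = os.path.join(tempfile.mkdtemp(), "quality.train")
n_written = write_fasttext_training_file(train_path, hq_docs, lq_docs)

with open(train_path, encoding="utf-8") as f:
    training_lines = f.read().splitlines()
```

With the fasttext Python package installed, a file in this format can be passed to `fasttext.train_supervised(input=train_path)`, and the resulting model can be written to disk with `model.save_model(...)`. The saved file is what the `model_path` parameter of FastTextQualityFilter would point to.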
The QualityClassifier accepts the following parameters:
- filter_by (list, default=None): Quality levels to keep (options: "Low", "Medium", "High")
- model_inference_batch_size (int, default=256): Batch size for inference
- max_chars (int, default=6000): Max characters per document for processing
- label_field (str, default="quality_pred"): Name of the prediction column
- text_field (str, default="text"): Name of the text field in input data

The FastTextQualityFilter accepts the following parameters:
- model_path (str, required): Path to the trained fastText model file
- label (str, default="__label__hq"): The label for high-quality documents
- alpha (float, default=3): Alpha parameter for Pareto distribution sampling
- seed (int, default=42): Random seed for reproducible sampling

For effective classifier-based filtering:
- Adjust filter_by levels (DeBERTa) or alpha values (fastText) based on your quality requirements
- Adjust model_inference_batch_size for DeBERTa models based on your available memory
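
To make the role of `alpha` more concrete, the sketch below applies the Pareto-sampling rule described in Brown et al., 2020: a document is kept when a Pareto-distributed draw exceeds `1 - score`. High-scoring documents are therefore almost always kept, while a fraction of low-scoring ones survive. This is an illustration using only the Python standard library, not NeMo Curator's implementation. The function names and the example scores are hypothetical, and `score` is assumed to be the classifier's probability for the high-quality label.

```python
import random


def keep_document(score: float, alpha: float, rng: random.Random) -> bool:
    """Return True if a document with this quality score survives sampling."""
    # random.paretovariate draws from a classical Pareto with minimum 1;
    # subtracting 1 shifts the support to start at 0, which matches the
    # distribution sampled by numpy.random.pareto.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score


def keep_rate(score: float, alpha: float = 3.0, seed: int = 42, trials: int = 5000) -> float:
    """Estimate the fraction of documents with this score that are kept."""
    rng = random.Random(seed)  # fixed seed makes the estimate reproducible
    kept = sum(keep_document(score, alpha, rng) for _ in range(trials))
    return kept / trials


high_rate = keep_rate(0.9)  # hypothetical high-quality score
low_rate = keep_rate(0.1)   # hypothetical low-quality score
print(f"keep rate at score 0.9: {high_rate:.2f}")
print(f"keep rate at score 0.1: {low_rate:.2f}")
```

Under this rule the probability of keeping a document is (2 - score)^(-alpha), so a larger `alpha` removes low-scoring documents more aggressively while documents with a score near 1 are still almost always kept.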