Classifier-Based Filtering
Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in Brown et al., 2020, which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data.
How It Works
Classifier-based filtering learns the characteristics of high-quality documents from training data, unlike heuristic filtering, which relies on predefined rules and thresholds. This approach is particularly effective when:
- You have a reference dataset of known high-quality documents
- The distinction between high and low quality is complex or subtle
- You want to filter based on domain-specific characteristics
NVIDIA NeMo Curator uses fastText to implement classifier-based filtering, which offers excellent performance and scalability for text classification tasks.
fastText is the official name and capitalization used by the fastText library created by Facebook Research.
The classifier-based filtering process involves:
- Preparing training data by sampling from high-quality and low-quality datasets
- Training a binary skip-gram classifier using fastText
- Using the trained model to score documents in your dataset
- Filtering documents based on the classifier scores, optionally using Pareto-based sampling
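The Pareto-based sampling in the last step can be sketched as follows. This is a minimal NumPy illustration of the rule described in Brown et al., 2020 (keep a document when a Pareto draw exceeds one minus its quality score); the `keep_document` helper is illustrative, not NeMo Curator's implementation.

```python
import numpy as np

def keep_document(quality_score, alpha=3.0, rng=None):
    """Pareto-based sampling: keep a document when a draw from a Pareto
    distribution exceeds (1 - score). High-scoring documents are kept most
    of the time, while some lower-scoring documents still survive, which
    preserves diversity in the filtered corpus."""
    rng = np.random.default_rng() if rng is None else rng
    # NumPy's pareto() samples a Lomax (Pareto II) distribution on [0, inf)
    return rng.pareto(alpha) > 1.0 - quality_score

# Documents with higher classifier scores are retained far more often:
rng = np.random.default_rng(42)
high = sum(keep_document(0.9, rng=rng) for _ in range(10_000))
low = sum(keep_document(0.1, rng=rng) for _ in range(10_000))
```

Sampling (rather than hard thresholding) keeps a tail of lower-scoring documents, which helps avoid over-fitting the corpus to the classifier's notion of quality.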
Usage
NeMo Curator provides two approaches for quality assessment:
- Classification: Use `QualityClassifier` to add quality predictions and optionally filter during classification
- Filtering: Use `FastTextQualityFilter` with `ScoreFilter` for document-level filtering with Pareto sampling
If you need to train custom fastText models for specific domains or requirements, refer to the fastText documentation for comprehensive training guides.
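As a rough illustration of the labeled-file format that fastText's supervised training expects (one document per line, prefixed with its label), the helper and file names below are hypothetical; see the fastText documentation for full training options:

```python
# Prepare training data in fastText's supervised format: one document per
# line, prefixed with "__label__hq" (high quality) or "__label__lq".
def write_fasttext_training_file(path, hq_docs, lq_docs):
    with open(path, "w", encoding="utf-8") as f:
        for text in hq_docs:
            f.write("__label__hq " + text.replace("\n", " ") + "\n")
        for text in lq_docs:
            f.write("__label__lq " + text.replace("\n", " ") + "\n")

write_fasttext_training_file(
    "quality_train.txt",
    hq_docs=["A well-edited encyclopedia article about chemistry."],
    lq_docs=["BUY NOW!!! click here click here"],
)

# The file can then be used to train a binary classifier, e.g.:
#   import fasttext
#   model = fasttext.train_supervised(input="quality_train.txt")
#   model.save_model("quality_model.bin")
```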
DeBERTa Quality Classification
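A usage sketch for the DeBERTa-based classifier follows. This assumes NeMo Curator's `QualityClassifier` API and a GPU-backed `DocumentDataset`; the input and output paths are placeholders, and import paths and signatures may differ across versions, so verify against your installed release.

```python
# Sketch: classify documents by quality and keep only the better tiers.
# Assumes a GPU environment with NeMo Curator installed; paths are placeholders.
from nemo_curator.classifiers import QualityClassifier
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

classifier = QualityClassifier(
    filter_by=["High", "Medium"],       # quality levels to keep
    model_inference_batch_size=256,     # tune to available GPU memory
    text_field="text",
    label_field="quality_pred",         # column that receives predictions
)

result = classifier(dataset)
result.to_json("high_quality_output/")
```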
FastText Quality Filter
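A corresponding sketch for fastText-based filtering, assuming NeMo Curator's `ScoreFilter` wrapper and `FastTextQualityFilter` (paths are placeholders; verify imports and signatures against your installed release):

```python
# Sketch: score documents with a trained fastText model and filter
# via Pareto-based sampling. Paths and field names are placeholders.
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import FastTextQualityFilter

dataset = DocumentDataset.read_json("input_data/")

filter_step = ScoreFilter(
    FastTextQualityFilter(
        model_path="quality_model.bin",  # trained fastText model file
        label="__label__hq",             # label for high-quality documents
        alpha=3,                         # Pareto distribution parameter
        seed=42,                         # reproducible sampling
    ),
    text_field="text",
    score_field="quality_score",
)

high_quality = filter_step(dataset)
high_quality.to_json("filtered_output/")
```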
Configuration
Quality Classifier and Filter Parameters
QualityClassifier (DeBERTa)
The QualityClassifier accepts the following parameters:
- `filter_by` (list, default=None): Quality levels to keep (options: "Low", "Medium", "High")
- `model_inference_batch_size` (int, default=256): Batch size for inference
- `max_chars` (int, default=6000): Max characters per document for processing
- `label_field` (str, default="quality_pred"): Name of the prediction column
- `text_field` (str, default="text"): Name of the text field in input data
FastTextQualityFilter
The FastTextQualityFilter accepts the following parameters:
- `model_path` (str, required): Path to the trained fastText model file
- `label` (str, default="__label__hq"): The label for high-quality documents
- `alpha` (float, default=3): Alpha parameter for Pareto distribution sampling
- `seed` (int, default=42): Random seed for reproducible sampling
Best Practices
For effective classifier-based filtering:
- Model selection: Start with the DeBERTa quality classifier for general use cases; consider fastText for high-throughput scenarios
- Validation: Manually review a sample of filtered results to confirm effectiveness
- Quality level tuning: Adjust `filter_by` levels (DeBERTa) or `alpha` values (fastText) based on your quality requirements
- Batch size optimization: Tune `model_inference_batch_size` for DeBERTa models based on your available memory
- Combination with heuristics: Consider using heuristic filters as a pre-filter to improve efficiency
- Domain adaptation: For specialized corpora, consider training custom models using domain-specific data