Classifier-Based Filtering#
Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in Brown et al., 2020, which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data.
How It Works#
Unlike heuristic filtering, which relies on predefined rules and thresholds, classifier-based filtering learns the characteristics of high-quality documents from training data. This approach is particularly effective when:
You have a reference dataset of known high-quality documents
The distinction between high and low quality is complex or subtle
You want to filter based on domain-specific characteristics
NVIDIA NeMo Curator implements classifier-based filtering with fastText, which offers excellent performance and scalability for text classification tasks.
Note
fastText is the official name and capitalization used by the fastText library created by Facebook Research.
The classifier-based filtering process involves:
Preparing training data by sampling from high-quality and low-quality datasets
Training a binary skip-gram classifier using fastText
Using the trained model to score documents in your dataset
Filtering documents based on the classifier scores, optionally using Pareto-based sampling
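The Pareto-based sampling in the last step can be sketched in plain Python. This is an illustrative stand-in modeled on the GPT-3 filtering recipe (Brown et al., 2020), not NeMo Curator's internal implementation: a document is kept when a draw from a Pareto distribution exceeds 1 minus its quality score, so high-scoring documents are almost always kept while a few low-scoring documents survive, preserving some distributional diversity.

```python
import random


def pareto_keep(score, alpha=3.0, rng=None):
    """Keep a document when a Pareto draw exceeds 1 - score.

    High-scoring documents are almost always kept; low-scoring ones are
    occasionally kept, which avoids collapsing the corpus onto only the
    documents the classifier likes best.
    """
    rng = rng or random.Random()
    # random.paretovariate returns samples >= 1; subtracting 1 gives the
    # Lomax-style draws that numpy.random.pareto (used in the GPT-3 recipe)
    # produces.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > (1.0 - score)


rng = random.Random(42)
docs = [("clean article", 0.95), ("boilerplate page", 0.10)]
kept = [text for text, score in docs if pareto_keep(score, alpha=3, rng=rng)]
```

A higher `alpha` concentrates the draws near zero, making the acceptance test stricter.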
Usage#
NeMo Curator provides two approaches for quality assessment:
Classification: Use QualityClassifier to add quality predictions and optionally filter during classification
Filtering: Use FastTextQualityFilter with ScoreFilter for document-level filtering with Pareto sampling
Note
If you need to train custom fastText models for specific domains or requirements, refer to the fastText documentation for comprehensive training guides.
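As a rough illustration of what such training data looks like, the sketch below writes examples in fastText's supervised format: one line per document, prefixed with a `__label__<name>` tag (the `__label__hq` name matches FastTextQualityFilter's default). The file name and sample texts are hypothetical; the commented-out training call uses the `fasttext` Python package.

```python
# Hypothetical sample documents; in practice, sample high-quality text from a
# curated corpus and low-quality text from unfiltered web crawl data.
high_quality = ["A well-edited encyclopedia entry about photosynthesis."]
low_quality = ["clICK hEre!!! free $$$ win now win now"]


def to_fasttext_line(label, text):
    # fastText supervised format: "__label__<name> <single-line text>"
    return f"{label} {' '.join(text.split())}"


lines = [to_fasttext_line("__label__hq", t) for t in high_quality]
lines += [to_fasttext_line("__label__lq", t) for t in low_quality]

with open("quality_train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

# Training itself requires the fasttext package (pip install fasttext):
# import fasttext
# model = fasttext.train_supervised(input="quality_train.txt")
# model.save_model("quality_classifier.bin")
```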
Classification with the DeBERTa-based QualityClassifier:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import QualityClassifier

# Create pipeline with DeBERTa quality classifier
pipeline = Pipeline(name="deberta_quality_pipeline")

# Add stages
read_stage = JsonlReader("input_data/")
classify_stage = QualityClassifier(
    filter_by=["High"],  # Keep only high-quality documents
    model_inference_batch_size=256,
    max_chars=6000,  # Default value
)
write_stage = JsonlWriter("high_quality_output/")

pipeline.add_stage(read_stage)
pipeline.add_stage(classify_stage)
pipeline.add_stage(write_stage)

# Execute pipeline
results = pipeline.run()
```
Filtering with FastTextQualityFilter wrapped in ScoreFilter (requires a pre-trained fastText model):

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import FastTextQualityFilter

# Create pipeline with fastText filter (requires pre-trained model)
pipeline = Pipeline(name="fasttext_quality_pipeline")

# Add stages
read_stage = JsonlReader("input_data/")
filter_stage = ScoreFilter(
    FastTextQualityFilter(
        model_path="./quality_classifier.bin",  # Path to your fastText model
        label="__label__hq",  # High-quality label
        alpha=3,  # Pareto distribution alpha parameter
        seed=42,  # Random seed for reproducibility
    ),
    text_field="text",
    score_field="quality_score",
)
write_stage = JsonlWriter("high_quality_output/")

pipeline.add_stage(read_stage)
pipeline.add_stage(filter_stage)
pipeline.add_stage(write_stage)

# Execute pipeline
results = pipeline.run()
```
You can configure quality classifiers and filters with different parameters:
```python
from nemo_curator.stages.text.classifiers import QualityClassifier
from nemo_curator.stages.text.filters import FastTextQualityFilter

# DeBERTa quality classifier configurations
basic_deberta_classifier = QualityClassifier(
    filter_by=["High"],  # Keep only high-quality documents
    model_inference_batch_size=256,
    max_chars=6000,  # Default value
)

# More inclusive DeBERTa classifier
inclusive_deberta_classifier = QualityClassifier(
    filter_by=["Medium", "High"],  # Keep medium- and high-quality documents
    model_inference_batch_size=128,
    max_chars=6000,
)

# fastText quality filter configurations
basic_fasttext_filter = FastTextQualityFilter(
    model_path="./quality_classifier.bin",
    label="__label__hq",  # High-quality label
    alpha=3,  # Pareto distribution alpha parameter
    seed=42,  # Random seed for reproducibility
)

# More selective fastText filter
selective_fasttext_filter = FastTextQualityFilter(
    model_path="./quality_classifier.bin",
    label="__label__hq",
    alpha=5,  # Higher alpha for stricter filtering
    seed=42,
)
```
Quality Classifier and Filter Parameters#
QualityClassifier (DeBERTa)#
The QualityClassifier accepts the following parameters:
filter_by (list, default=None): Quality levels to keep (options: "Low", "Medium", "High")
model_inference_batch_size (int, default=256): Batch size for inference
max_chars (int, default=6000): Max characters per document for processing
label_field (str, default="quality_pred"): Name of the prediction column
text_field (str, default="text"): Name of the text field in input data
FastTextQualityFilter#
The FastTextQualityFilter accepts the following parameters:
model_path (str, required): Path to the trained fastText model file
label (str, default="__label__hq"): The label for high-quality documents
alpha (float, default=3): Alpha parameter for Pareto distribution sampling
seed (int, default=42): Random seed for reproducible sampling
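To get a feel for how alpha tightens the filter, the simulation below applies a GPT-3-style acceptance rule (keep a document when a Pareto draw exceeds 1 minus its score) to synthetic scores. The rule is an assumption for illustration, not FastTextQualityFilter's exact internals.

```python
import random


def keep_rate(alpha, scores, seed=42):
    """Fraction of documents that survive Pareto sampling at a given alpha."""
    rng = random.Random(seed)
    kept = 0
    for score in scores:
        # Lomax-style draw: random.paretovariate returns values >= 1
        if rng.paretovariate(alpha) - 1.0 > 1.0 - score:
            kept += 1
    return kept / len(scores)


rng = random.Random(0)
scores = [rng.random() for _ in range(10_000)]  # synthetic quality scores

rate_default = keep_rate(3, scores)  # default alpha
rate_strict = keep_rate(5, scores)   # higher alpha, stricter filtering
```

With higher alpha, the Pareto draws concentrate near zero, so fewer documents pass the acceptance test and filtering becomes more selective.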
Best Practices#
For effective classifier-based filtering:
Model selection: Start with the DeBERTa quality classifier for general use cases; consider fastText for high-throughput scenarios
Validation: Manually review a sample of filtered results to confirm effectiveness
Quality level tuning: Adjust
filter_bylevels (DeBERTa) oralphavalues (fastText) based on your quality requirementsBatch size optimization: Tune
model_inference_batch_sizefor DeBERTa models based on your available memoryCombination with heuristics: Consider using heuristic filters as a pre-filter to improve efficiency
Domain adaptation: For specialized corpora, consider training custom models using domain-specific data
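A minimal sketch of the pre-filter idea: run a cheap heuristic (here, a word-count gate) before the expensive classifier, so the model only scores documents that survive the heuristic. Both functions are hypothetical stand-ins for illustration, not NeMo Curator APIs.

```python
def word_count_ok(text, min_words=50, max_words=100_000):
    """Cheap heuristic gate: reject obviously too-short or too-long documents."""
    n = len(text.split())
    return min_words <= n <= max_words


def classifier_score(text):
    """Placeholder for an expensive model call (e.g., fastText or DeBERTa)."""
    return 0.5  # stand-in value


docs = ["word " * 200, "too short"]

# Run the cheap heuristic first; only survivors reach the model.
survivors = [d for d in docs if word_count_ok(d)]
scored = [(d, classifier_score(d)) for d in survivors]
```

Because heuristic checks cost microseconds while model inference costs milliseconds, discarding obvious junk first can substantially reduce total classification time on large corpora.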