---
description: >-
  Filter text using trained quality classifiers including fastText models and
  pre-trained language classification
categories:
  - how-to-guides
tags:
  - classifier-filtering
  - fasttext
  - ml-models
  - quality
  - training
  - scoring
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: text-only
---
# Classifier-Based Filtering
Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in [Brown et al., 2020](https://arxiv.org/abs/2005.14165), which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data.
## How It Works
Classifier-based filtering learns the characteristics of high-quality documents from training data, unlike heuristic filtering, which relies on predefined rules and thresholds. This approach is particularly effective when:
* You have a reference dataset of known high-quality documents
* The distinction between high and low quality is complex or subtle
* You want to filter based on domain-specific characteristics
NVIDIA NeMo Curator implements classifier-based filtering with [fastText](https://fasttext.cc/), which offers excellent performance and scalability for text classification tasks.
> **Note:** fastText is the official name and capitalization used by the fastText library created by Facebook Research.
The classifier-based filtering process involves the following steps (sketched in code after the list):
1. Preparing training data by sampling from high-quality and low-quality datasets
2. Training a binary skip-gram classifier using fastText
3. Using the trained model to score documents in your dataset
4. Filtering documents based on the classifier scores, optionally using Pareto-based sampling
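As a rough sketch of steps 1–3, the following uses the `fasttext` Python package directly rather than a NeMo Curator API. The training file name, the `__label__lq` label, and the hyperparameter values are illustrative assumptions, not prescribed values; see the fastText documentation for guidance on real training runs.

```python
import fasttext  # pip install fasttext

# Assumed training file: one document per line, prefixed with its label, e.g.
#   __label__hq <text sampled from a curated, high-quality dataset>
#   __label__lq <text sampled from a lower-quality dataset>
model = fasttext.train_supervised(
    input="train.txt",  # placeholder path to the prepared training data
    lr=0.1,             # learning rate
    epoch=5,            # passes over the training data
    wordNgrams=2,       # include bigrams as features
)

# Score a document: predict() returns the top label and its probability
labels, probs = model.predict("Example document text to score.")
print(labels[0], probs[0])  # e.g. __label__hq 0.87

# Save the model for later use (for example, with FastTextQualityFilter)
model.save_model("quality_classifier.bin")
```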
***
## Usage
NeMo Curator provides two approaches for quality assessment:
1. **Classification**: Use `QualityClassifier` to add quality predictions and optionally filter during classification
2. **Filtering**: Use `FastTextQualityFilter` with `ScoreFilter` for document-level filtering with Pareto sampling
If you need to train custom fastText models for specific domains or requirements, refer to the [fastText documentation](https://fasttext.cc/docs/en/supervised-tutorial.html) for comprehensive training guides.

The following example uses the `QualityClassifier` approach:
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import QualityClassifier

# Create pipeline with DeBERTa quality classifier
pipeline = Pipeline(name="deberta_quality_pipeline")

# Add stages
read_stage = JsonlReader("input_data/")
classify_stage = QualityClassifier(
    filter_by=["High"],  # Keep only high-quality documents
    model_inference_batch_size=256,
    max_chars=6000,  # Default value
)
write_stage = JsonlWriter("high_quality_output/")

pipeline.add_stage(read_stage)
pipeline.add_stage(classify_stage)
pipeline.add_stage(write_stage)

# Execute pipeline
results = pipeline.run()
```
The next example uses `FastTextQualityFilter` with `ScoreFilter`, which requires a pre-trained fastText model:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import FastTextQualityFilter

# Create pipeline with fastText filter (requires a pre-trained model)
pipeline = Pipeline(name="fasttext_quality_pipeline")

# Add stages
read_stage = JsonlReader("input_data/")
filter_stage = ScoreFilter(
    FastTextQualityFilter(
        model_path="./quality_classifier.bin",  # Path to your fastText model
        label="__label__hq",  # High-quality label
        alpha=3,  # Pareto distribution alpha parameter
        seed=42,  # Random seed for reproducibility
    ),
    text_field="text",
    score_field="quality_score",
)
write_stage = JsonlWriter("high_quality_output/")

pipeline.add_stage(read_stage)
pipeline.add_stage(filter_stage)
pipeline.add_stage(write_stage)

# Execute pipeline
results = pipeline.run()
```
You can configure quality classifiers and filters with different parameters:
```python
from nemo_curator.stages.text.classifiers import QualityClassifier
from nemo_curator.stages.text.filters import FastTextQualityFilter

# DeBERTa quality classifier configurations
basic_deberta_classifier = QualityClassifier(
    filter_by=["High"],  # Keep only high-quality documents
    model_inference_batch_size=256,
    max_chars=6000,  # Default value
)

# More inclusive DeBERTa classifier
inclusive_deberta_classifier = QualityClassifier(
    filter_by=["Medium", "High"],  # Keep medium- and high-quality documents
    model_inference_batch_size=128,
    max_chars=6000,
)

# fastText quality filter configurations
basic_fasttext_filter = FastTextQualityFilter(
    model_path="./quality_classifier.bin",
    label="__label__hq",  # High-quality label
    alpha=3,  # Pareto distribution alpha parameter
    seed=42,  # Random seed for reproducibility
)

# More selective fastText filter
selective_fasttext_filter = FastTextQualityFilter(
    model_path="./quality_classifier.bin",
    label="__label__hq",
    alpha=5,  # Higher alpha for stricter filtering
    seed=42,
)
```
## Quality Classifier and Filter Parameters
### QualityClassifier (DeBERTa)
The `QualityClassifier` accepts the following parameters:
* `filter_by` (list, default=None): Quality levels to keep (options: "Low", "Medium", "High")
* `model_inference_batch_size` (int, default=256): Batch size for inference
* `max_chars` (int, default=6000): Max characters per document for processing
* `label_field` (str, default="quality_pred"): Name of the prediction column
* `text_field` (str, default="text"): Name of the text field in input data
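For instance, here is a hedged sketch of an annotate-only configuration using the field parameters above; the column names `quality_label` and `content` are hypothetical and should match your data:

```python
from nemo_curator.stages.text.classifiers import QualityClassifier

# Annotate-only: leave filter_by at its default (None) so no documents are
# dropped; predictions are written to a custom column for later analysis
scoring_classifier = QualityClassifier(
    label_field="quality_label",  # hypothetical prediction column name
    text_field="content",         # assumes documents store text under "content"
)
```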
### FastTextQualityFilter
The `FastTextQualityFilter` accepts the following parameters:
* `model_path` (str, required): Path to the trained fastText model file
* `label` (str, default=`"__label__hq"`): The label for high-quality documents
* `alpha` (float, default=3): Alpha parameter for Pareto distribution sampling
* `seed` (int, default=42): Random seed for reproducible sampling
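To make `alpha` concrete, here is an illustrative version of the Pareto-based sampling idea from [Brown et al., 2020](https://arxiv.org/abs/2005.14165), not NeMo Curator's actual implementation: a document is kept when its classifier score beats a Pareto-distributed draw, so high-scoring documents are usually retained while a few lower-scoring ones survive to preserve diversity.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

def keep_document(score: float, alpha: float = 3.0) -> bool:
    """Illustrative Pareto-based sampling (after Brown et al., 2020).

    `score` is the classifier's probability that the document is high
    quality. A larger `alpha` concentrates the Pareto draw near zero,
    so fewer low-scoring documents slip through (stricter filtering).
    """
    return score > 1 - rng.pareto(alpha)

# A document scored 0.9 is far more likely to be kept than one scored 0.2
print(keep_document(0.9), keep_document(0.2))
```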
## Best Practices
For effective classifier-based filtering:
1. **Model selection**: Start with the DeBERTa quality classifier for general use cases; consider fastText for high-throughput scenarios
2. **Validation**: Manually review a sample of filtered results to confirm effectiveness
3. **Quality level tuning**: Adjust `filter_by` levels (DeBERTa) or `alpha` values (fastText) based on your quality requirements
4. **Batch size optimization**: Tune `model_inference_batch_size` for DeBERTa models based on your available memory
5. **Combination with heuristics**: Consider using heuristic filters as a pre-filter to improve efficiency (see the sketch after this list)
6. **Domain adaptation**: For specialized corpora, consider training custom models using domain-specific data
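As a sketch of the pre-filtering pattern in best practice 5, the pipeline below runs a cheap word-count heuristic before the classifier. It assumes a `WordCountFilter` is importable from the same filters module as `FastTextQualityFilter`; check your installed version's API before relying on this.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import WordCountFilter  # assumed location
from nemo_curator.stages.text.classifiers import QualityClassifier

pipeline = Pipeline(name="heuristic_then_classifier")
pipeline.add_stage(JsonlReader("input_data/"))

# Cheap heuristic pre-filter: discard very short documents so the
# comparatively expensive classifier scores fewer candidates
pipeline.add_stage(ScoreFilter(WordCountFilter(min_words=50), text_field="text"))

# Classifier runs only on documents that survive the heuristic stage
pipeline.add_stage(QualityClassifier(filter_by=["High"]))

pipeline.add_stage(JsonlWriter("high_quality_output/"))
results = pipeline.run()
```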