Classifier-Based Filtering#
Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in Brown et al., 2020, which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data.
Supported Classifier Models#
NeMo Curator supports a variety of classifier models for different filtering and classification tasks. The table below summarizes the main supported models, their backend, typical use case, and HuggingFace model link (if public):
Classifier Name |
Model Type |
Typical Use Case / Description |
HuggingFace Model/Link |
---|---|---|---|
FastTextQualityFilter |
fastText (binary classifier) |
Quality filtering, high/low quality document classification (available as filter, not distributed classifier) |
https://fasttext.cc/ |
FastTextLangId |
fastText (language identification) |
Language identification (available as filter, not distributed classifier) |
https://fasttext.cc/docs/en/language-identification.html |
QualityClassifier |
DeBERTa (transformers, HF) |
Document quality classification (multi-class, e.g., for curation) |
https://huggingface.co/nvidia/quality-classifier-deberta |
DomainClassifier |
DeBERTa (transformers, HF) |
Domain classification (English) |
https://huggingface.co/nvidia/domain-classifier |
MultilingualDomainClassifier |
mDeBERTa (transformers, HF) |
Domain classification (multilingual, 52 languages) |
https://huggingface.co/nvidia/multilingual-domain-classifier |
ContentTypeClassifier |
DeBERTa (transformers, HF) |
Content type classification (11 speech types) |
https://huggingface.co/nvidia/content-type-classifier-deberta |
AegisClassifier |
LlamaGuard-7b (LLM, PEFT, HF) |
Safety classification (AI content safety, requires access to LlamaGuard-7b) |
https://huggingface.co/meta-llama/LlamaGuard-7b |
InstructionDataGuardClassifier |
Custom neural net (used with Aegis) |
Detects instruction data poisoning |
https://huggingface.co/nvidia/instruction-data-guard |
FineWebEduClassifier |
SequenceClassification (transformers, HF) |
Educational content quality scoring (FineWeb) |
https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier |
FineWebMixtralEduClassifier |
SequenceClassification (transformers, HF) |
Educational content quality scoring (Mixtral variant) |
https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier |
FineWebNemotronEduClassifier |
SequenceClassification (transformers, HF) |
Educational content quality scoring (Nemotron-4 variant) |
https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier |
How It Works#
Classifier-based filtering learns the characteristics of high-quality documents from training data, unlike heuristic filtering which relies on predefined rules and thresholds. This approach is particularly effective when:
You have a reference dataset of known high-quality documents
The distinction between high and low quality is complex or subtle
You want to filter based on domain-specific characteristics
NVIDIA NeMo Curator uses fastText for implementing classifier-based filtering, which offers excellent performance and scalability for text classification tasks.
Note
fastText is the official name and capitalization used by the fastText library created by Facebook Research.
The classifier-based filtering process involves:
Preparing training data by sampling from high-quality and low-quality datasets
Training a binary skip-gram classifier using fastText
Using the trained model to score documents in your dataset
Filtering documents based on the classifier scores, optionally using Pareto-based sampling
Usage#
Note
Training fastText classifiers requires using CLI commands. The trained models can then be used with the Python API for filtering datasets.
1. Prepare Training Data#
First, you need to prepare training data by sampling from high-quality and low-quality datasets using the CLI command.
You can prepare training data using Python scripts:
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.pipeline import Pipeline
import random
# Sample from low-quality dataset (e.g., raw Common Crawl)
def sample_documents(input_path, output_path, num_samples, label):
pipeline = Pipeline(name="sample_data")
reader = JsonlReader(file_paths=input_path, fields=["text"])
pipeline.add_stage(reader)
# Execute pipeline to load data
results = pipeline.run()
# Sample and save with labels for fastText format
with open(output_path, 'w') as f:
for result in results:
data = result.to_pandas()
sampled = data.sample(min(num_samples, len(data)))
for _, row in sampled.iterrows():
f.write(f"{label} {row['text'].replace(chr(10), ' ')}\n")
# Create training samples
sample_documents(
"/path/to/common-crawl/*.jsonl",
"./cc_samples.txt",
10000,
"__label__cc"
)
sample_documents(
"/path/to/wikipedia/*.jsonl",
"./hq_samples.txt",
10000,
"__label__hq"
)
Command Parameters#
Parameter |
Description |
Example |
---|---|---|
|
Directory containing JSONL files to sample from |
|
|
Number of documents to sample |
|
|
FastText label prefix for the category |
|
|
Output file for training samples |
|
2. Use Pre-trained Quality Classifiers#
For most quality filtering use cases, we recommend using the pre-trained quality classifiers available in NeMo Curator rather than training custom fastText models:
Option A: DeBERTa Quality Classifier (Recommended)#
Use the pre-trained nvidia/quality-classifier-deberta
model for robust quality assessment:
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import QualityClassifier
# Create pipeline with DeBERTa quality classifier
pipeline = Pipeline(name="quality_classification_pipeline")
# Load dataset
read_stage = JsonlReader(
file_paths="input_data/",
fields=["text", "id"]
)
pipeline.add_stage(read_stage)
# Apply quality classifier
classify_stage = QualityClassifier(
max_chars=6000, # Default is 6000
model_inference_batch_size=256
)
pipeline.add_stage(classify_stage)
# Save results
write_stage = JsonlWriter(path="classified_output/")
pipeline.add_stage(write_stage)
# Execute pipeline
results = pipeline.run() # Uses XennaExecutor by default
Option B: Custom fastText Models#
If you need to train custom fastText models for specific domains or requirements, refer to the fastText documentation for comprehensive training guides.
Note
When to use each approach:
QualityClassifier (DeBERTa): Higher accuracy, multi-class output (Low/Medium/High), better for general quality assessment. Use the
filter_by
parameter to filter during classification or omit it to add predictions as metadata only.FastTextQualityFilter: Faster inference, lower memory usage, includes Pareto sampling, better for high-throughput scenarios or when you have domain-specific training data.
3. Use Quality Assessment#
NeMo Curator provides two approaches for quality assessment:
Classification: Use
QualityClassifier
to add quality predictions and optionally filter during classificationFiltering: Use
FastTextQualityFilter
withScoreFilter
for document-level filtering with Pareto sampling
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import QualityClassifier
# Create pipeline with DeBERTa quality classifier
pipeline = Pipeline(name="deberta_quality_pipeline")
# Add stages
read_stage = JsonlReader("input_data/")
classify_stage = QualityClassifier(
filter_by=["High"], # Keep only high-quality documents
model_inference_batch_size=256,
max_chars=6000 # Default value
)
write_stage = JsonlWriter("high_quality_output/")
pipeline.add_stage(read_stage)
pipeline.add_stage(classify_stage)
pipeline.add_stage(write_stage)
# Execute pipeline
results = pipeline.run()
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import FastTextQualityFilter
# Create pipeline with FastText filter (requires pre-trained model)
pipeline = Pipeline(name="fasttext_quality_pipeline")
# Add stages
read_stage = JsonlReader("input_data/")
filter_stage = ScoreFilter(
FastTextQualityFilter(
model_path="./quality_classifier.bin", # Path to your fastText model
label="__label__hq", # High quality label
alpha=3, # Pareto distribution alpha parameter
seed=42 # Random seed for reproducibility
),
text_field="text",
score_field="quality_score"
)
write_stage = JsonlWriter("high_quality_output/")
pipeline.add_stage(read_stage)
pipeline.add_stage(filter_stage)
pipeline.add_stage(write_stage)
# Execute pipeline
results = pipeline.run()
You can configure quality classifiers and filters with different parameters:
from nemo_curator.stages.text.classifiers import QualityClassifier
from nemo_curator.stages.text.filters import FastTextQualityFilter
# DeBERTa quality classifier configurations
basic_deberta_classifier = QualityClassifier(
filter_by=["High"], # Keep only high-quality documents
model_inference_batch_size=256,
max_chars=6000 # Default value
)
# More inclusive DeBERTa classifier
inclusive_deberta_classifier = QualityClassifier(
filter_by=["Medium", "High"], # Keep medium and high-quality documents
model_inference_batch_size=128,
max_chars=6000
)
# FastText quality filter configurations
basic_fasttext_filter = FastTextQualityFilter(
model_path="./quality_classifier.bin",
label="__label__hq", # High quality label
alpha=3, # Pareto distribution alpha parameter
seed=42 # Random seed for reproducibility
)
# More selective FastText filter
selective_fasttext_filter = FastTextQualityFilter(
model_path="./quality_classifier.bin",
label="__label__hq",
alpha=5, # Higher alpha for stricter filtering
seed=42
)
Pareto-Based Sampling#
NeMo Curator’s implementation includes support for Pareto-based sampling, as described in Brown et al., 2020. This approach:
Scores documents using the trained classifier
Ranks documents based on their scores
Samples documents according to a Pareto distribution, favoring higher-ranked documents
This method helps maintain diversity in the dataset while still prioritizing higher-quality documents.
Quality Classifier and Filter Parameters#
QualityClassifier (DeBERTa)#
The QualityClassifier
accepts the following parameters:
filter_by
(list, default=None): Quality levels to keep (options: “Low”, “Medium”, “High”)model_inference_batch_size
(int, default=256): Batch size for inferencemax_chars
(int, default=6000): Max characters per document for processingpred_column
(str, default=”quality_pred”): Name of the prediction columntext_field
(str, default=”text”): Name of the text field in input data
FastTextQualityFilter#
The FastTextQualityFilter
accepts the following parameters:
model_path
(str, required): Path to the trained fastText model filelabel
(str, default=”__label__hq”): The label for high-quality documentsalpha
(float, default=3): Alpha parameter for Pareto distribution samplingseed
(int, default=42): Random seed for reproducible sampling
Configuration#
Typical configurations for quality-based filtering:
DeBERTa Quality Classifier#
classifiers:
- name: QualityClassifier
filter_by: ["High"]
model_inference_batch_size: 256
max_chars: 6000
pred_column: quality_pred
text_field: text
FastText Quality Filter#
filters:
- name: ScoreFilter
filter:
name: FastTextQualityFilter
model_path: /path/to/quality_classifier.bin
label: __label__hq
alpha: 3
seed: 42
text_field: text
score_field: quality_score
Best Practices#
For effective classifier-based filtering:
Model selection: Start with the DeBERTa quality classifier for general use cases; consider fastText for high-throughput scenarios
Validation: Manually review a sample of filtered results to confirm effectiveness
Quality level tuning: Adjust
filter_by
levels (DeBERTa) oralpha
values (fastText) based on your quality requirementsBatch size optimization: Tune
batch_size
for DeBERTa models based on your available memoryCombination with heuristics: Consider using heuristic filters as a pre-filter to improve efficiency
Domain adaptation: For specialized corpora, consider training custom models using domain-specific data