Classifier-Based Filtering

Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in Brown et al., 2020, which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data.

How It Works

Classifier-based filtering learns the characteristics of high-quality documents from training data, unlike heuristic filtering, which relies on predefined rules and thresholds. This approach is particularly effective when:

You have a reference dataset of known high-quality documents
The distinction between high and low quality is complex or subtle
You want to filter based on domain-specific characteristics

NVIDIA NeMo Curator implements this approach with fastText, a lightweight text classification library that trains and scores quickly enough to be practical on large corpora.

fastText is the official name and capitalization used by the fastText library created by Facebook Research.

The classifier-based filtering process involves:

Preparing training data by sampling from high-quality and low-quality datasets
Training a binary skip-gram classifier using fastText
Using the trained model to score documents in your dataset
Filtering documents based on the classifier scores, optionally using Pareto-based sampling
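The first step above can be sketched as follows. fastText's supervised mode reads one example per line, with the label written as a `__label__` prefix ahead of the text. The helper names, the `__label__lq` label, and the sample sizes are illustrative assumptions, not part of NeMo Curator.

```python
import random

HQ_LABEL = "__label__hq"  # matches the default label documented for FastTextQualityFilter
LQ_LABEL = "__label__lq"  # assumed name for the low-quality class


def to_fasttext_line(text: str, label: str) -> str:
    """Format one document as a fastText supervised-training line."""
    # fastText treats each line as one example, so collapse newlines and extra spaces.
    return f"{label} {' '.join(text.split())}"


def build_training_lines(hq_docs, lq_docs, per_class, seed=42):
    """Sample up to `per_class` documents from each pool and return shuffled labeled lines."""
    rng = random.Random(seed)
    hq = rng.sample(hq_docs, min(per_class, len(hq_docs)))
    lq = rng.sample(lq_docs, min(per_class, len(lq_docs)))
    lines = [to_fasttext_line(doc, HQ_LABEL) for doc in hq]
    lines += [to_fasttext_line(doc, LQ_LABEL) for doc in lq]
    rng.shuffle(lines)  # avoid presenting all examples of one class first
    return lines


if __name__ == "__main__":
    hq_docs = ["A well edited\nencyclopedia entry.", "Peer reviewed abstract."]
    lq_docs = ["click here!!! buy now", "lorem ipsum   spam spam", "asdf qwer"]
    for line in build_training_lines(hq_docs, lq_docs, per_class=2):
        print(line)
```

Writing these lines to a text file should give an input that fastText's `train_supervised` function can read; the training options themselves are covered in the fastText documentation.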

Usage

NeMo Curator provides two approaches for quality assessment:

Classification: Use QualityClassifier to add quality predictions and optionally filter during classification
Filtering: Use FastTextQualityFilter with ScoreFilter for document-level filtering with Pareto sampling

If you need to train custom fastText models for specific domains or requirements, refer to the fastText documentation for comprehensive training guides.

DeBERTa Quality Classification

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import QualityClassifier

# Create pipeline with DeBERTa quality classifier
pipeline = Pipeline(name="deberta_quality_pipeline")

# Add stages
read_stage = JsonlReader("input_data/")
classify_stage = QualityClassifier(
    filter_by=["High"],  # Keep only high-quality documents
    model_inference_batch_size=256,
    max_chars=6000,  # Default value
)
write_stage = JsonlWriter("high_quality_output/")

pipeline.add_stage(read_stage)
pipeline.add_stage(classify_stage)
pipeline.add_stage(write_stage)

# Execute pipeline
results = pipeline.run()

Quality Classifier and Filter Parameters

QualityClassifier (DeBERTa)

The QualityClassifier accepts the following parameters:

filter_by (list, default=None): Quality levels to keep (options: "Low", "Medium", "High")
model_inference_batch_size (int, default=256): Batch size for inference
max_chars (int, default=6000): Max characters per document for processing
label_field (str, default="quality_pred"): Name of the prediction column
text_field (str, default="text"): Name of the text field in input data
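To make the interaction between these parameters concrete, here is a minimal pure-Python sketch. It is not the NeMo Curator implementation: `apply_quality_filter` and `toy_predict` are hypothetical stand-ins, and treating `max_chars` as a simple truncation of the text before inference is an assumption.

```python
def apply_quality_filter(records, predict, filter_by=None, max_chars=6000,
                         label_field="quality_pred", text_field="text"):
    """Annotate each record with a predicted label and optionally keep only some labels.

    `predict` stands in for the model: any callable mapping text to
    "Low", "Medium", or "High".
    """
    out = []
    for record in records:
        # Assumed behavior: only the first max_chars characters reach the model.
        label = predict(record[text_field][:max_chars])
        # filter_by=None keeps every record and only adds the prediction column.
        if filter_by is None or label in filter_by:
            out.append({**record, label_field: label})
    return out


def toy_predict(text):
    # Toy word-count rule standing in for the DeBERTa model; not a real quality signal.
    n_words = len(text.split())
    if n_words >= 8:
        return "High"
    return "Medium" if n_words >= 4 else "Low"


if __name__ == "__main__":
    docs = [
        {"text": "one two"},
        {"text": "one two three four five"},
        {"text": "a b c d e f g h i"},
    ]
    print(apply_quality_filter(docs, toy_predict))                      # annotate only
    print(apply_quality_filter(docs, toy_predict, filter_by=["High"]))  # annotate and filter
```

Running without `filter_by` first is a convenient way to inspect the label distribution before deciding which levels to keep.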

FastTextQualityFilter

The FastTextQualityFilter accepts the following parameters:

model_path (str, required): Path to the trained fastText model file
label (str, default="__label__hq"): The label for high-quality documents
alpha (float, default=3): Alpha parameter for Pareto distribution sampling
seed (int, default=42): Random seed for reproducible sampling
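The effect of `alpha` and `seed` can be illustrated with a small simulation. The sketch below assumes the filter follows the sampling rule described in Brown et al., 2020, where a document is kept when a Pareto draw exceeds one minus its high-quality score; the function names are hypothetical and only the standard library is used.

```python
import random


def keep_document(score: float, alpha: float, rng: random.Random) -> bool:
    """Keep a document if a Pareto draw exceeds 1 - score (rule from Brown et al., 2020)."""
    # random.paretovariate returns values >= 1; subtracting 1 shifts the
    # distribution to start at 0, which matches numpy.random.pareto.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score


def keep_rate(score, alpha, trials=20_000, seed=42):
    """Estimate the fraction of documents with this score that survive filtering."""
    rng = random.Random(seed)  # a fixed seed makes the sampling reproducible
    kept = sum(keep_document(score, alpha, rng) for _ in range(trials))
    return kept / trials


if __name__ == "__main__":
    for alpha in (3, 9):
        for score in (0.1, 0.5, 0.9):
            print(f"alpha={alpha}  score={score:.1f}  keep rate ~ {keep_rate(score, alpha):.3f}")
```

Under this rule, high-scoring documents are kept most of the time while a small share of low-scoring documents still passes, which preserves some diversity. Larger `alpha` values concentrate the draws near zero and therefore make the filter stricter.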

Best Practices

For effective classifier-based filtering:

Model selection: Start with the DeBERTa quality classifier for general use cases; consider fastText for high-throughput scenarios
Validation: Manually review a sample of filtered results to confirm effectiveness
Quality level tuning: Adjust filter_by levels (DeBERTa) or alpha values (fastText) based on your quality requirements
Batch size optimization: Tune model_inference_batch_size for DeBERTa models based on your available memory
Combination with heuristics: Consider using heuristic filters as a pre-filter to improve efficiency
Domain adaptation: For specialized corpora, consider training custom models using domain-specific data
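For the validation step, one option is to draw a small random sample per predicted label and read it by hand. The sketch below is a hypothetical helper, not a NeMo Curator API; it assumes records are dictionaries carrying the `quality_pred` column produced by running the classifier without `filter_by`.

```python
import random
from collections import defaultdict


def review_sample(records, per_label=3, label_field="quality_pred", seed=0):
    """Return up to `per_label` randomly chosen records for each predicted label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_field]].append(record)
    # Sorting the labels keeps the sampling order, and so the result, reproducible.
    return {
        label: rng.sample(group, min(per_label, len(group)))
        for label, group in sorted(by_label.items())
    }


if __name__ == "__main__":
    labels = ["Low", "Medium", "High"]
    records = [{"text": f"doc {i}", "quality_pred": labels[i % 3]} for i in range(10)]
    for label, docs in review_sample(records, per_label=2).items():
        print(label, [d["text"] for d in docs])
```

Reviewing every label, including the ones you plan to discard, helps reveal both low-quality documents that slipped through and useful documents that would be lost.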
