
Classifier-Based Filtering

Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in Brown et al., 2020, which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data.

How It Works

Classifier-based filtering learns the characteristics of high-quality documents from training data, unlike heuristic filtering which relies on predefined rules and thresholds. This approach is particularly effective when:

  • You have a reference dataset of known high-quality documents
  • The distinction between high and low quality is complex or subtle
  • You want to filter based on domain-specific characteristics

NVIDIA NeMo Curator uses fastText to implement classifier-based filtering, which offers strong performance and scalability for text classification tasks.

fastText is the official name and capitalization used by the fastText library created by Facebook Research.

The classifier-based filtering process involves:

  1. Preparing training data by sampling from high-quality and low-quality datasets
  2. Training a binary skip-gram classifier using fastText
  3. Using the trained model to score documents in your dataset
  4. Filtering documents based on the classifier scores, optionally using Pareto-based sampling
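Step 1 above can be sketched in plain Python. fastText's supervised-training format is one example per line, prefixed with a label such as `__label__hq` or `__label__lq`; the function name, labels, and sample size here are illustrative, not a NeMo Curator API:

```python
import random


def write_fasttext_training_file(hq_docs, lq_docs, output_path,
                                 sample_size=1000, seed=42):
    """Sample documents from each pool and write them in fastText's
    supervised-training format: one '__label__X <text>' line per doc."""
    rng = random.Random(seed)
    hq_sample = rng.sample(hq_docs, min(sample_size, len(hq_docs)))
    lq_sample = rng.sample(lq_docs, min(sample_size, len(lq_docs)))
    with open(output_path, "w", encoding="utf-8") as f:
        for text in hq_sample:
            # Collapse newlines/whitespace so each example stays on one line
            f.write(f"__label__hq {' '.join(text.split())}\n")
        for text in lq_sample:
            f.write(f"__label__lq {' '.join(text.split())}\n")
```

The resulting file can then be passed to fastText's supervised training (step 2) to produce the binary classifier used for scoring.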

Usage

NeMo Curator provides two approaches for quality assessment:

  1. Classification: Use QualityClassifier to add quality predictions and optionally filter during classification
  2. Filtering: Use FastTextQualityFilter with ScoreFilter for document-level filtering with Pareto sampling

If you need to train custom fastText models for specific domains or requirements, refer to the fastText documentation for comprehensive training guides.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import QualityClassifier

# Create pipeline with DeBERTa quality classifier
pipeline = Pipeline(name="deberta_quality_pipeline")

# Add stages
read_stage = JsonlReader("input_data/")
classify_stage = QualityClassifier(
    filter_by=["High"],  # Keep only high-quality documents
    model_inference_batch_size=256,
    max_chars=6000,  # Default value
)
write_stage = JsonlWriter("high_quality_output/")

pipeline.add_stage(read_stage)
pipeline.add_stage(classify_stage)
pipeline.add_stage(write_stage)

# Execute pipeline
results = pipeline.run()
```

Quality Classifier and Filter Parameters

QualityClassifier (DeBERTa)

The QualityClassifier accepts the following parameters:

  • filter_by (list, default=None): Quality levels to keep (options: "Low", "Medium", "High")
  • model_inference_batch_size (int, default=256): Batch size for inference
  • max_chars (int, default=6000): Max characters per document for processing
  • label_field (str, default="quality_pred"): Name of the prediction column
  • text_field (str, default="text"): Name of the text field in input data
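Conceptually, filter_by keeps only records whose predicted label (stored in label_field) is in the allowed set. A minimal stand-in for that post-classification step, assuming records are plain dicts with a quality_pred key (illustrative only, not the library's implementation):

```python
def filter_by_quality(records, filter_by=("High",), label_field="quality_pred"):
    """Keep only records whose predicted quality label is in the allowed set.
    Mirrors the effect of QualityClassifier's filter_by option on
    already-classified records (a sketch, not the NeMo Curator code)."""
    allowed = set(filter_by)
    return [r for r in records if r.get(label_field) in allowed]
```

With filter_by=None, the classifier only annotates documents; downstream stages then receive every record along with its predicted label.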

FastTextQualityFilter

The FastTextQualityFilter accepts the following parameters:

  • model_path (str, required): Path to the trained fastText model file
  • label (str, default="__label__hq"): The label for high-quality documents
  • alpha (float, default=3): Alpha parameter for Pareto distribution sampling
  • seed (int, default=42): Random seed for reproducible sampling
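The alpha parameter controls Pareto-based sampling in the style of Brown et al., 2020: a document is kept when its classifier score (the probability of the high-quality label) beats a draw from a Pareto distribution. A sketch of that decision rule, under the assumption that FastTextQualityFilter follows this convention (the library's exact rule may differ):

```python
import random


def keep_document(score, alpha=3.0, rng=random):
    """Pareto-based keep/drop decision (after Brown et al., 2020).
    random.paretovariate(alpha) has support [1, inf), so subtracting 1
    matches numpy's np.random.pareto(alpha); the document is kept when
    its quality score exceeds the draw."""
    return score > rng.paretovariate(alpha) - 1.0
```

Under this rule the keep probability for a score s is 1 - (1 + s)^(-alpha), so higher-scoring documents are kept more often while some lower-scoring ones still survive, which helps avoid over-fitting the corpus to the classifier's biases.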

Best Practices

For effective classifier-based filtering:

  1. Model selection: Start with the DeBERTa quality classifier for general use cases; consider fastText for high-throughput scenarios
  2. Validation: Manually review a sample of filtered results to confirm effectiveness
  3. Quality level tuning: Adjust filter_by levels (DeBERTa) or alpha values (fastText) based on your quality requirements
  4. Batch size optimization: Tune model_inference_batch_size for DeBERTa models based on your available memory
  5. Combination with heuristics: Consider using heuristic filters as a pre-filter to improve efficiency
  6. Domain adaptation: For specialized corpora, consider training custom models using domain-specific data