Copyright © 2026, NVIDIA Corporation.

Classifier-Based Filtering

Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in Brown et al., 2020, which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data.

How It Works

Classifier-based filtering learns the characteristics of high-quality documents from training data, unlike heuristic filtering, which relies on predefined rules and thresholds. This approach is particularly effective when:

  • You have a reference dataset of known high-quality documents
  • The distinction between high and low quality is complex or subtle
  • You want to filter based on domain-specific characteristics

NVIDIA NeMo Curator implements classifier-based filtering with fastText, which offers fast, scalable text classification.

fastText is the official name and capitalization used by the fastText library created by Facebook Research.

The classifier-based filtering process involves:

  1. Preparing training data by sampling from high-quality and low-quality datasets
  2. Training a binary skip-gram classifier using fastText
  3. Using the trained model to score documents in your dataset
  4. Filtering documents based on the classifier scores, optionally using Pareto-based sampling
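
Step 1 above can be sketched in plain Python. This is a minimal illustration, not NeMo Curator code: it samples from two in-memory pools and writes each document in the one-example-per-line format that fastText's supervised mode reads, where the label is the leading token. The `__label__hq` name matches the FastTextQualityFilter default; the `__label__lq` name and both helper functions are assumptions made for this example.

```python
import random

HQ_LABEL = "__label__hq"  # matches the FastTextQualityFilter default label
LQ_LABEL = "__label__lq"  # hypothetical name for the low-quality class


def to_fasttext_line(label, text):
    """Format one document as a single fastText supervised-training line."""
    # fastText reads one example per line, so collapse all whitespace,
    # including newlines inside the document.
    return f"{label} {' '.join(text.split())}"


def build_training_lines(high_quality, low_quality, n_per_class, seed=42):
    """Sample up to n_per_class documents per pool and return shuffled labeled lines."""
    rng = random.Random(seed)
    hq = rng.sample(high_quality, min(n_per_class, len(high_quality)))
    lq = rng.sample(low_quality, min(n_per_class, len(low_quality)))
    lines = [to_fasttext_line(HQ_LABEL, text) for text in hq]
    lines += [to_fasttext_line(LQ_LABEL, text) for text in lq]
    rng.shuffle(lines)
    return lines


# Toy pools standing in for a curated corpus and an unfiltered crawl.
hq_docs = ["A clear, well-edited\nencyclopedia paragraph.", "A peer-reviewed abstract."]
lq_docs = ["click here!!! buy now", "lorem ipsum   lorem ipsum", "asdf asdf asdf"]

for line in build_training_lines(hq_docs, lq_docs, n_per_class=2):
    print(line)
```

Writing the returned lines to a text file gives an input suitable for fastText training. Sampling the same number of documents from each pool keeps the two classes balanced, which is a design choice of this sketch and not a requirement stated here.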

Usage

NeMo Curator provides two approaches for quality assessment:

  1. Classification: Use QualityClassifier to add quality predictions and optionally filter during classification
  2. Filtering: Use FastTextQualityFilter with ScoreFilter for document-level filtering with Pareto sampling

If you need to train custom fastText models for specific domains or requirements, refer to the fastText documentation for comprehensive training guides.

DeBERTa Quality Classification
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import QualityClassifier

# Create pipeline with DeBERTa quality classifier
pipeline = Pipeline(name="deberta_quality_pipeline")

# Add stages
read_stage = JsonlReader("input_data/")
classify_stage = QualityClassifier(
    filter_by=["High"],  # Keep only high-quality documents
    model_inference_batch_size=256,
    max_chars=6000,  # Default value
)
write_stage = JsonlWriter("high_quality_output/")

pipeline.add_stage(read_stage)
pipeline.add_stage(classify_stage)
pipeline.add_stage(write_stage)

# Execute pipeline
results = pipeline.run()
```

Quality Classifier and Filter Parameters

QualityClassifier (DeBERTa)

The QualityClassifier accepts the following parameters:

  • filter_by (list, default=None): Quality levels to keep (options: "Low", "Medium", "High")
  • model_inference_batch_size (int, default=256): Batch size for inference
  • max_chars (int, default=6000): Max characters per document for processing
  • label_field (str, default="quality_pred"): Name of the prediction column
  • text_field (str, default="text"): Name of the text field in input data
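
To make the interaction of these parameters concrete, the following sketch mimics their semantics in plain Python. It is not the QualityClassifier implementation: `classify_and_filter` and `toy_predict` are hypothetical names, the toy predictor stands in for the DeBERTa model, and the assumption that text is truncated to the first `max_chars` characters before prediction is an interpretation of the parameter description above.

```python
from typing import Callable, Iterable, Optional


def classify_and_filter(
    docs: Iterable[dict],
    predict: Callable[[str], str],
    filter_by: Optional[list] = None,
    max_chars: int = 6000,
    text_field: str = "text",
    label_field: str = "quality_pred",
) -> list:
    """Label each document and optionally keep only the requested quality levels."""
    kept = []
    for doc in docs:
        # Assumption: only the first max_chars characters reach the model.
        label = predict(doc[text_field][:max_chars])
        # filter_by=None keeps every document and only adds the label column.
        if filter_by is None or label in filter_by:
            kept.append({**doc, label_field: label})
    return kept


def toy_predict(text: str) -> str:
    """Hypothetical stand-in for the model: longer text is rated higher."""
    words = len(text.split())
    return "High" if words >= 8 else "Medium" if words >= 4 else "Low"


docs = [
    {"text": "buy now"},
    {"text": "A short but complete sentence."},
    {"text": "A longer passage that explains one idea clearly and in full."},
]

# Keep only documents labeled "High"; the input records are left unmodified.
print(classify_and_filter(docs, toy_predict, filter_by=["High"]))
```

Leaving `filter_by` unset is useful when you want to inspect the full label distribution before deciding which levels to keep.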

FastTextQualityFilter

The FastTextQualityFilter accepts the following parameters:

  • model_path (str, required): Path to the trained fastText model file
  • label (str, default="__label__hq"): The label for high-quality documents
  • alpha (float, default=3): Alpha parameter for Pareto distribution sampling
  • seed (int, default=42): Random seed for reproducible sampling
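
The role of alpha and seed can be illustrated with the Pareto sampling rule described in Brown et al., 2020, which keeps a document when a Pareto draw exceeds 1 minus the document score. The sketch below uses only the Python standard library; `pareto_keep` and `keep_rate` are hypothetical helpers, and whether FastTextQualityFilter applies exactly this rule internally is an assumption.

```python
import random
from typing import Optional


def pareto_keep(score: float, alpha: float = 3.0, rng: Optional[random.Random] = None) -> bool:
    """Keep a document if a Pareto draw exceeds 1 - score."""
    rng = rng or random
    # random.paretovariate returns values >= 1; subtracting 1 gives the
    # zero-based Pareto (Lomax) variate, the form numpy.random.pareto samples.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score


def keep_rate(score: float, alpha: float, n: int = 5000, seed: int = 42) -> float:
    """Estimate the fraction of documents with this score that would be kept."""
    rng = random.Random(seed)  # a fixed seed makes the sampling reproducible
    return sum(pareto_keep(score, alpha, rng) for _ in range(n)) / n


for alpha in (3.0, 9.0):
    for score in (0.1, 0.5, 0.9):
        print(f"alpha={alpha} score={score} keep_rate={keep_rate(score, alpha):.3f}")
```

Under this rule a document with score s is kept with probability (2 - s)^(-alpha), so high-scoring documents are almost always kept while a small share of low-scoring documents still passes. A larger alpha makes the filter stricter at every score below 1.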

Best Practices

For effective classifier-based filtering:

  1. Model selection: Start with the DeBERTa quality classifier for general use cases; consider fastText for high-throughput scenarios
  2. Validation: Manually review a sample of filtered results to confirm effectiveness
  3. Quality level tuning: Adjust filter_by levels (DeBERTa) or alpha values (fastText) based on your quality requirements
  4. Batch size optimization: Tune model_inference_batch_size for DeBERTa models based on your available memory
  5. Combination with heuristics: Consider using heuristic filters as a pre-filter to improve efficiency
  6. Domain adaptation: For specialized corpora, consider training custom models using domain-specific data
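
For the validation step, one simple approach is to count documents per predicted label and draw a small, reproducible sample from each label for manual review. The sketch below works on records already loaded into memory; `review_sample` is a hypothetical helper, and it assumes each record carries the default `quality_pred` column described above.

```python
import random
from collections import Counter, defaultdict


def review_sample(records, per_label=5, label_field="quality_pred", seed=42):
    """Return label counts and a reproducible random sample of records per label."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_field]].append(rec)
    counts = Counter({label: len(recs) for label, recs in by_label.items()})
    rng = random.Random(seed)
    # Iterate labels in sorted order so the same seed always yields the same sample.
    sample = {
        label: rng.sample(recs, min(per_label, len(recs)))
        for label, recs in sorted(by_label.items())
    }
    return counts, sample


# Toy records standing in for classifier output.
labels = ["High"] * 4 + ["Low"] * 2 + ["Medium"]
records = [{"text": f"doc {i}", "quality_pred": lab} for i, lab in enumerate(labels)]

counts, sample = review_sample(records, per_label=2)
print(dict(counts))
for label, recs in sample.items():
    print(label, [r["text"] for r in recs])
```

Reviewing samples from every label, not only the kept ones, helps reveal both low-quality documents that slipped through and good documents that were discarded.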