Heuristic filtering uses simple, rule-based metrics to identify and filter out low-quality documents from your dataset. NVIDIA NeMo Curator provides a variety of pre-built heuristic filters that can be configured and combined to meet your specific needs.
Heuristic filters examine specific attributes of text documents and apply predefined thresholds to determine document quality. Unlike classifier-based filtering, heuristic filters require no training data; they rely instead on configurable thresholds and rules.
These filters assess quality using measurable document characteristics such as:
For details on filter structure and the filtering process, refer to Data Processing Concepts.
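To make the score-then-threshold idea concrete, here is a minimal, library-independent sketch. The class `SimpleWordCountFilter` and its method names are hypothetical stand-ins chosen for illustration; they are not presented as NeMo Curator's own classes or signatures.

```python
from dataclasses import dataclass


@dataclass
class SimpleWordCountFilter:
    """Hypothetical heuristic filter: keeps documents whose word count is in range."""

    min_words: int = 50
    max_words: int = 100_000

    def score_document(self, text: str) -> int:
        # The score is a measurable characteristic: the whitespace-delimited word count.
        return len(text.split())

    def keep_document(self, score: int) -> bool:
        # The rule is a plain threshold check; no training data is involved.
        return self.min_words <= score <= self.max_words


docs = ["too short", "word " * 60]
flt = SimpleWordCountFilter(min_words=50)
kept = [d for d in docs if flt.keep_document(flt.score_document(d))]
```

Separating the score from the keep decision lets the same score be inspected first and thresholded later.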
NeMo Curator includes more than 30 heuristic filters for assessing document quality. Below are the most commonly used filters with their parameters:
NeMo Curator pipelines can be configured using YAML files with Hydra. The configuration uses _target_ to specify class paths:
See nemo_curator/config/text/ for complete pipeline examples.
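The role of _target_ can be illustrated without Hydra: the key holds a dotted class path, and the remaining keys become constructor arguments. The sketch below is a simplified stand-in written with the standard library only. The helper `instantiate_from_target` is hypothetical, and a standard-library class is used in place of a filter class. In practice Hydra's own `hydra.utils.instantiate` does this work and also handles nested configs.

```python
import importlib
from typing import Any


def instantiate_from_target(config: dict[str, Any]) -> Any:
    """Build an object from a dict whose `_target_` key holds a dotted class path."""
    params = dict(config)  # copy so the caller's config is left untouched
    module_path, _, class_name = params.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    # Every key other than `_target_` is passed as a keyword argument.
    return cls(**params)


# A standard-library class stands in for a filter class here (assumption for illustration).
stage_config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate_from_target(stage_config)
```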
For non-English texts, you may need to adjust the filter parameters based on the specific characteristics of your target language.
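One reason parameters need adjusting is that a whitespace-based word count behaves very differently across scripts. The sketch below shows the effect and one possible response, a per-language threshold table. The names `length_score`, `CHAR_BASED_LANGS`, and `MIN_LENGTH`, and the threshold values themselves, are hypothetical and would need tuning on real data.

```python
# Languages written without spaces between words are scored by character count here.
CHAR_BASED_LANGS = {"zh", "ja"}

# Hypothetical minimum lengths: words for "en", characters for "zh".
MIN_LENGTH = {"en": 5, "zh": 10}


def length_score(text: str, lang: str) -> int:
    """Return a length score in units that suit the language."""
    if lang in CHAR_BASED_LANGS:
        return len(text)
    return len(text.split())


english = "This sentence has exactly six words."
chinese = "这个句子没有用空格分隔词语。"

# A naive whitespace split sees the whole Chinese sentence as a single "word".
naive_en = len(english.split())
naive_zh = len(chinese.split())

keep_en = length_score(english, "en") >= MIN_LENGTH["en"]
keep_zh = length_score(chinese, "zh") >= MIN_LENGTH["zh"]
```

With the English threshold applied unchanged, the Chinese sentence would be dropped even though it is a complete sentence.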
When building filter chains, follow these best practices:
When tuning filter thresholds, analyze score distributions before applying filters. NeMo Curator provides two modules for this workflow:
Score: Computes scores and adds them as columns without removing documents.
ScoreFilter: Computes scores, filters based on thresholds, and optionally retains scores in output.

Use Score first to understand your data distribution, then apply ScoreFilter with tuned thresholds.
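The two-step workflow can be sketched in plain Python as follows. The functions `add_scores` and `filter_by_score` are hypothetical stand-ins that only mimic the behavior described for Score and ScoreFilter; they do not use the NeMo Curator API, and the threshold of 5 is an arbitrary illustration.

```python
import statistics

# Six toy documents with 1, 2, 5, 8, 12 and 20 words.
texts = ["w " * n for n in (1, 2, 5, 8, 12, 20)]


def add_scores(texts):
    """Score-like step: attach a score to every document, drop nothing."""
    return [{"text": t, "word_count": len(t.split())} for t in texts]


def filter_by_score(rows, min_words, keep_score=True):
    """ScoreFilter-like step: drop rows below the threshold, optionally keep the score."""
    kept = [r for r in rows if r["word_count"] >= min_words]
    if keep_score:
        return kept
    return [{"text": r["text"]} for r in kept]


scored = add_scores(texts)
counts = [r["word_count"] for r in scored]
summary = {
    "min": min(counts),
    "median": statistics.median(counts),
    "max": max(counts),
}

# Pick a threshold after looking at the summary, then filter.
filtered = filter_by_score(scored, min_words=5, keep_score=False)
```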
Use Score to add score columns to your data without removing any documents:
Output files are written to the scored_output/ directory with one file per input partition.
For large datasets, consider these performance optimizations:
XennaExecutor is the default executor, optimized for streaming workloads. You can customize its configuration or use the defaults:
If no executor is specified, pipeline.run() uses XennaExecutor with default settings.
When you run a filtering pipeline, each stage tracks the number of documents it processes. You can use these metrics to understand how each filter affects your dataset and to tune thresholds.
After calling pipeline.run(), the returned task objects contain per-stage performance statistics through _stage_perf. Each entry is a StagePerfStats object with a num_items_processed field that records how many documents passed through that stage.
These same metrics power the nightly benchmarks, which track num_documents_processed, num_kept_documents, and throughput_docs_per_sec for every pipeline run.
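As an illustration of how per-stage counts might be read, the sketch below sums num_items_processed per stage across tasks and compares consecutive stages. `StagePerfStatsStub` and `TaskStub` are hypothetical stand-ins. Only the `_stage_perf` attribute and the `num_items_processed` field come from the description above. The `stage_name` field and the stage names are assumptions made so that entries can be grouped.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StagePerfStatsStub:
    """Stand-in for a per-stage statistics entry."""

    stage_name: str  # assumed field, used here to group entries per stage
    num_items_processed: int


@dataclass
class TaskStub:
    """Stand-in for a returned task object carrying per-stage statistics."""

    _stage_perf: list = field(default_factory=list)


def items_per_stage(tasks):
    """Sum num_items_processed for each stage name across all tasks."""
    totals = defaultdict(int)
    for task in tasks:
        for perf in task._stage_perf:
            totals[perf.stage_name] += perf.num_items_processed
    return dict(totals)


tasks = [
    TaskStub([StagePerfStatsStub("reader", 100), StagePerfStatsStub("filter", 100),
              StagePerfStatsStub("writer", 80)]),
    TaskStub([StagePerfStatsStub("reader", 50), StagePerfStatsStub("filter", 50),
              StagePerfStatsStub("writer", 30)]),
]
totals = items_per_stage(tasks)

# Documents reaching the stage after the filter are the ones the filter kept.
kept_ratio = totals["writer"] / totals["filter"]
```

Comparing the count entering a filter stage with the count entering the next stage shows how many documents that filter removed.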
Remember that the goal of filtering is to improve the quality of your training data, not necessarily to remove as many documents as possible. Monitor your filtering results and adjust thresholds based on your specific data characteristics and downstream tasks.