Heuristic filtering uses simple, rule-based metrics to identify and filter out low-quality documents from your dataset. NVIDIA NeMo Curator provides a variety of pre-built heuristic filters that can be configured and combined to meet your specific needs.
Heuristic filters examine specific attributes of text documents and apply predefined thresholds to determine document quality. Unlike classifier-based filtering, heuristic filters require no training data; they rely only on configurable thresholds and rules.
These filters assess quality using measurable document characteristics such as:
For details on filter structure and the filtering process, refer to Data Processing Concepts.
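The score-then-threshold idea described above can be illustrated with a minimal sketch in plain Python. This is not NeMo Curator's API: the class names, the `score` and `keep` method names, and the default thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WordCountFilter:
    """Hypothetical filter: keep documents whose word count is within bounds."""
    min_words: int = 50        # illustrative default, not a recommended value
    max_words: int = 100_000   # illustrative default, not a recommended value

    def score(self, text: str) -> int:
        # Measurable characteristic: number of whitespace-delimited words.
        return len(text.split())

    def keep(self, score: int) -> bool:
        return self.min_words <= score <= self.max_words


@dataclass
class HashToWordRatioFilter:
    """Hypothetical filter: reject documents dominated by '#' symbols."""
    max_ratio: float = 0.1     # illustrative default, not a recommended value

    def score(self, text: str) -> float:
        words = text.split()
        if not words:
            # An empty document scores 0.0 here; a length filter should reject it.
            return 0.0
        return text.count("#") / len(words)

    def keep(self, score: float) -> bool:
        return score <= self.max_ratio


def is_kept(text: str, filters) -> bool:
    """A document is kept only if every filter accepts its score."""
    return all(f.keep(f.score(text)) for f in filters)
```

Separating `score` from `keep` means the same measurement can be reused with a different threshold, or logged for inspection, without recomputing it.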
NeMo Curator includes more than 30 heuristic filters for assessing document quality. Below are the most commonly used filters with their parameters:
For non-English texts, you may need to adjust the filter parameters based on the specific characteristics of your target language.
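As one example of why parameters need adjusting, a word-count threshold that splits on whitespace undercounts text in languages written without spaces between words, such as Chinese. The sketch below, in plain Python rather than NeMo Curator's API, keeps per-language settings in a hypothetical `LANGUAGE_LIMITS` table. The language codes, units, and threshold values are placeholders, not recommendations.

```python
# Hypothetical per-language settings; all values are illustrative placeholders.
DEFAULT_LIMITS = {"unit": "words", "min_units": 50}
LANGUAGE_LIMITS = {
    "en": {"unit": "words", "min_units": 50},
    "zh": {"unit": "chars", "min_units": 100},
}


def length_in_units(text: str, unit: str) -> int:
    """Measure document length in whitespace-delimited words or in characters."""
    if unit == "chars":
        # Count non-whitespace characters for scripts without word spacing.
        return sum(1 for ch in text if not ch.isspace())
    return len(text.split())


def passes_length_filter(text: str, lang: str) -> bool:
    """Apply the length threshold configured for the document's language."""
    limits = LANGUAGE_LIMITS.get(lang, DEFAULT_LIMITS)
    return length_in_units(text, limits["unit"]) >= limits["min_units"]
```

With these placeholder settings, a 100-character Chinese document passes under the `zh` rules but would be rejected under the `en` rules, because whitespace splitting sees it as a single word.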
When building filter chains, follow these best practices:
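To make the idea of a filter chain concrete, here is a minimal sketch in plain Python. It does not use NeMo Curator's API; the helper names `min_words`, `max_repeated_line_ratio`, and `passes_chain` are hypothetical. The ordering shown, with the cheaper check first so that `all()` can stop early, is an assumption about what a sensible chain looks like, not a documented requirement.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def min_words(n: int) -> Check:
    """Build a check that requires at least n whitespace-delimited words."""
    return lambda text: len(text.split()) >= n


def max_repeated_line_ratio(limit: float) -> Check:
    """Build a check that rejects documents with too many duplicate lines."""
    def check(text: str) -> bool:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            return False
        duplicates = len(lines) - len(set(lines))
        return duplicates / len(lines) <= limit
    return check


def passes_chain(text: str, checks: List[Check]) -> bool:
    """Run checks in order; all() stops at the first check that fails."""
    return all(check(text) for check in checks)


# Example chain: the cheap length check runs before the line-level check.
chain = [min_words(3), max_repeated_line_ratio(0.3)]
```

Because each check is an independent function, thresholds can be tuned or checks reordered without touching the rest of the chain.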
When working with non-English data or tuning your filtering pipeline, it helps to examine which filters are removing documents:
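One way to do this, sketched here in plain Python rather than with NeMo Curator's own API, is to run every check on every document and count failures per filter. The function name `removal_report` and the `(name, check)` pair format are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

NamedCheck = Tuple[str, Callable[[str], bool]]


def removal_report(
    documents: Iterable[str], named_checks: List[NamedCheck]
) -> Tuple[List[str], Counter]:
    """Return the kept documents and a count of failures per filter name."""
    removed_by: Counter = Counter()
    kept: List[str] = []
    for text in documents:
        # Evaluate every check (no short-circuit) so each filter's count
        # reflects how many documents it would reject on its own.
        failed = [name for name, check in named_checks if not check(text)]
        if failed:
            removed_by.update(failed)
        else:
            kept.append(text)
    return kept, removed_by
```

A filter that accounts for a disproportionate share of the removals is a natural first candidate for threshold adjustment. Note that a document failing several checks is counted once under each of them, so the counts can sum to more than the number of removed documents.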
For large datasets, consider these performance optimizations:
Remember that the goal of filtering is to improve the quality of your training data, not necessarily to remove as many documents as possible. Monitor your filtering results and adjust thresholds based on your specific data characteristics and downstream tasks.