Heuristic Filtering#
Heuristic filtering uses simple, rule-based metrics to identify and filter out low-quality documents from your dataset. NVIDIA NeMo Curator provides a variety of pre-built heuristic filters that can be configured and combined to meet your specific needs.
How It Works#
Heuristic filters examine specific attributes of text documents and apply predefined thresholds to determine document quality. Unlike classifier-based filtering, heuristic filters don’t require training data but rely on configurable thresholds and rules.
These filters assess quality using measurable document characteristics such as:
Document length (word or character count)
Punctuation ratios and patterns
Repetitive content detection
Language-specific patterns
Text completeness and coherence
Each heuristic filter follows a consistent structure:
class ExampleFilter(DocumentFilter):
    def __init__(self, parameter1=default1, parameter2=default2):
        super().__init__()
        self._param1 = parameter1
        self._param2 = parameter2
        self._name = "example_filter"

    def score_document(self, text):
        # Calculate and return a quality score (often a ratio between 0 and 1)
        # Higher scores typically indicate lower quality
        score = compute_score(text)
        return score

    def keep_document(self, score):
        # Return True to keep the document, False to filter it out
        return score <= self._param1
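For instance, here is a minimal sketch of a custom filter that follows this structure. UppercaseRatioFilter is a hypothetical example for illustration only, not a built-in NeMo Curator filter:

from nemo_curator.filters import DocumentFilter

class UppercaseRatioFilter(DocumentFilter):
    """Hypothetical example: reject documents dominated by uppercase text."""

    def __init__(self, max_uppercase_ratio=0.3):
        super().__init__()
        self._max_uppercase_ratio = max_uppercase_ratio
        self._name = "uppercase_ratio_filter"

    def score_document(self, text):
        # Ratio of uppercase letters to all letters; higher means lower quality
        letters = [c for c in text if c.isalpha()]
        if not letters:
            return 1.0  # treat letter-free documents as low quality
        return sum(c.isupper() for c in letters) / len(letters)

    def keep_document(self, score):
        # Keep documents whose uppercase ratio stays at or below the threshold
        return score <= self._max_uppercase_ratio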
The filtering process typically involves:
Calculating a quality score for each document
Applying a threshold to determine whether to keep or discard the document
Optionally storing the score as metadata for later analysis
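As a small illustration of these steps, a built-in filter can also be called directly. The sketch below uses WordCountFilter, whose score is simply the document's word count:

from nemo_curator.filters import WordCountFilter

wc = WordCountFilter(min_words=80)
text = "A short document with only a few words."
score = wc.score_document(text)  # word count of the document
keep = wc.keep_document(score)   # False here, because the count is below min_words
print(score, keep)

When the filter runs inside a ScoreFilter (as in the Usage section below), passing a score_field keeps this score as metadata on each document for later analysis.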
Usage#
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    WordCountFilter,
    RepeatingTopNGramsFilter,
    PunctuationFilter,
)

# Load your dataset
dataset = DocumentDataset.read_json("input_data/*.jsonl")

# Create a filter chain using Sequential
filter_step = nc.Sequential([
    nc.ScoreFilter(
        WordCountFilter(min_words=80),
        text_field="text",
        score_field="word_count",
    ),
    nc.ScoreFilter(PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85)),
    nc.ScoreFilter(RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2)),
    nc.ScoreFilter(RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18)),
    nc.ScoreFilter(RepeatingTopNGramsFilter(n=4, max_repeating_ngram_ratio=0.16)),
])

# Apply the filters to get the high-quality subset
high_quality_data = filter_step(dataset)

# Save the results
high_quality_data.to_json("high_quality_output/", write_to_filename=True)
You can run the same kind of filtering from the command line with the filter_documents utility:

filter_documents \
    --input-data-dir=/path/to/input/data \
    --filter-config-file=./config/heuristic_filter_en.yaml \
    --output-retained-document-dir=/path/to/output/high_quality \
    --output-removed-document-dir=/path/to/output/low_quality \
    --output-document-score-dir=/path/to/output/scores \
    --log-dir=/path/to/logs/heuristic_filter
Available Filters#
NeMo Curator includes over 30 heuristic filters for assessing document quality. Below are the most commonly used filters with their parameters:
Text Length Filters#
| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| WordCountFilter | Filters by word count | `min_words`, `max_words` | min=50, max=100000 |
| TokenCountFilter | Filters by token count | `tokenizer`, `min_tokens`, `max_tokens` | min=0, max=∞ |
| MeanWordLengthFilter | Filters by average word length | `min_mean_word_length`, `max_mean_word_length` | min=3, max=10 |
| LongWordFilter | Filters by presence of extremely long words | `max_word_length` | 1000 |
Repetition Detection Filters#
| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| RepeatedLinesFilter | Detects repeated lines | `max_repeated_line_fraction` | 0.7 |
| RepeatedParagraphsFilter | Detects repeated paragraphs | `max_repeated_paragraphs_ratio` | 0.7 |
| RepeatingTopNGramsFilter | Detects excessive repetition of n-grams | `n`, `max_repeating_ngram_ratio` | n=2, ratio=0.2 |
| RepeatingDuplicateNGramsFilter | Detects duplicate n-grams | `n`, `max_repeating_duplicate_ngram_ratio` | n=2, ratio=0.2 |
Character and Symbol Filters#
| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| NonAlphaNumericFilter | Limits non-alphanumeric content | `max_non_alpha_numeric_to_text_ratio` | 0.25 |
| SymbolsToWordsFilter | Limits symbols in text | `max_symbol_to_word_ratio`, `lang` | 0.1 |
| NumbersFilter | Limits numeric content | `max_number_to_text_ratio` | 0.15 |
| UrlsFilter | Limits URL content | `max_url_to_text_ratio` | 0.2 |
| PunctuationFilter | Limits sentences without proper punctuation | `max_num_sentences_without_endmark_ratio` | 0.85 |
| WhiteSpaceFilter | Limits excessive whitespace | `max_white_space_ratio` | 0.25 |
Content-specific Filters#
| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| CommonEnglishWordsFilter | Ensures text contains common words | `min_num_common_words` | 2 |
| WordsWithoutAlphabetsFilter | Limits words without alphabetic characters | `min_words_with_alphabets` | 0.8 |
| BulletsFilter | Limits bullet-point heavy content | `max_bullet_lines_ratio` | 0.9 |
| BoilerPlateStringFilter | Detects boilerplate text | `max_boilerplate_string_ratio`, `remove_if_at_top_or_bottom` | 0.4, True |
| ParenthesesFilter | Limits parentheses content | `max_parentheses_ratio` | 0.1 |
Special Purpose Filters#
| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| PornographicUrlsFilter | Detects URLs containing the “porn” substring | None | N/A |
| EllipsisFilter | Limits excessive ellipses | `max_num_lines_ending_with_ellipsis_ratio` | 0.3 |
| HistogramFilter | Filters based on character distribution | `threshold`, `lang` | 0.8 |
| SubstringFilter | Filters based on presence of a specific substring at a given position | `substring`, `position` | “”, “any” |
Configuration#
# Sample filter configuration (simplified)
filters:
  - name: ScoreFilter
    filter:
      name: WordCountFilter
      min_words: 50
      max_words: 100000
    text_field: text
    score_field: word_count
  - name: ScoreFilter
    filter:
      name: PunctuationFilter
      max_num_sentences_without_endmark_ratio: 0.85
    text_field: text
    score_field: punctuation_ratio
  - name: ScoreFilter
    filter:
      name: RepeatingTopNGramsFilter
      n: 2
      max_repeating_ngram_ratio: 0.18
    text_field: text
    score_field: ngram_repetition
The configuration file config/heuristic_filter_en.yaml
contains a general-purpose set of heuristic filters that work well for English text. For non-English texts, you may need to adjust the filter parameters.
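If you prefer to drive the Python API from the same YAML file, a pipeline can be built directly from the config. A minimal sketch, assuming the build_filter_pipeline helper in nemo_curator.utils.config_utils (the utility used by the filter_documents command):

from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.config_utils import build_filter_pipeline  # assumed helper

# Build the Sequential filter chain described by the YAML file
filter_pipeline = build_filter_pipeline("./config/heuristic_filter_en.yaml")

dataset = DocumentDataset.read_json("input_data/*.jsonl")
filtered = filter_pipeline(dataset)
filtered.to_json("high_quality_output/", write_to_filename=True)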
Best Practices#
When building filter chains, follow these best practices:
Order filters from cheapest to most expensive, so that fast checks discard documents before the slower ones run:

# Efficient ordering: cheap filters first, expensive filters last
filter_chain = nc.Sequential([
    nc.ScoreFilter(WordCountFilter(min_words=50)),   # Fast
    nc.ScoreFilter(UrlsFilter()),                    # Medium
    nc.ScoreFilter(RepeatingTopNGramsFilter()),      # Slow
])
Use the batched decorator to vectorize custom filters, so that keep_document receives a whole series of scores at once:

from nemo_curator.filters import DocumentFilter
from nemo_curator.utils.decorators import batched

class MyCustomFilter(DocumentFilter):
    @batched
    def keep_document(self, scores):
        # `scores` is a series of precomputed scores; `self.threshold` is set in __init__ (not shown)
        return scores <= self.threshold
Tune thresholds to trade recall against precision for your downstream task:

# More permissive (higher recall)
lenient_filter = WordCountFilter(min_words=10, max_words=100000)

# More strict (higher precision)
strict_filter = WordCountFilter(min_words=100, max_words=10000)
Adjust parameters and language settings when filtering non-English text, for example:

# Chinese text filter
cn_filter = nc.ScoreFilter(
    SymbolsToWordsFilter(max_symbol_to_word_ratio=0.15, lang="zh")
)
Finally, combine complementary checks into a single quality chain:

# Comprehensive quality filter
quality_chain = nc.Sequential([
    # Basic text quality
    nc.ScoreFilter(WordCountFilter(min_words=50)),
    nc.ScoreFilter(PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85)),
    # Content quality
    nc.ScoreFilter(CommonEnglishWordsFilter(min_num_common_words=2)),
    # Repetition detection
    nc.ScoreFilter(RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18)),
])
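Applied to a loaded dataset, the chain behaves like any other NeMo Curator module. A minimal sketch reusing the dataset and output conventions from the Usage section (the output path is illustrative):

# Assumes `dataset` was loaded as in the Usage section above
quality_data = quality_chain(dataset)
quality_data.to_json("quality_output/", write_to_filename=True)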
Analyzing Filter Results#
When working with non-English data or tuning your filtering pipeline, it’s valuable to examine which filters are removing documents:
import pandas as pd
# Load scores from filter run
scores = pd.read_json("output/scores/scores.jsonl", lines=True)
# Analyze rejection reasons
rejection_counts = scores[scores["rejected"] == True].groupby("rejected_by").size()
print(f"Documents rejected by filter:\n{rejection_counts}")
# Analyze score distributions
import matplotlib.pyplot as plt
scores.hist(column="word_count", bins=50)
plt.title("Word Count Distribution")
plt.savefig("word_count_hist.png")
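Building on the same (assumed) rejected and rejected_by columns, you can also express each filter's impact as a share of the full dataset:

# Overall and per-filter removal rates (reuses `scores` and `rejection_counts` from above)
overall_removal_rate = scores["rejected"].mean()
per_filter_share = rejection_counts / len(scores)
print(f"Overall removal rate: {overall_removal_rate:.1%}")
print(per_filter_share.sort_values(ascending=False))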
Performance Tuning#
For large datasets, consider these performance optimizations:
# Process in chunks to reduce memory usage
for chunk in DocumentDataset.read_json_chunks("large_dataset/*.jsonl", chunk_size=10000):
    filtered_chunk = filter_step(chunk)
    filtered_chunk.to_json("output/", mode="append")
# Use multiple processes with CLI
filter_documents --input-data-dir=input/ --num-proc=8 --filter-config-file=config.yaml --output-retained-document-dir=output/
# Adjust batch size for specific filters
from nemo_curator.utils.decorators import batched

class CustomBatchFilter(DocumentFilter):
    @batched(batch_size=5000)  # Set custom batch size
    def keep_document(self, scores):
        return scores <= self.threshold
Remember that the goal of filtering is to improve the quality of your training data, not necessarily to remove as many documents as possible. Monitor your filtering results and adjust thresholds based on your specific data characteristics and downstream tasks.