Heuristic Filtering

Heuristic filtering uses simple, rule-based metrics to identify and filter out low-quality documents from your dataset. NVIDIA NeMo Curator provides a variety of pre-built heuristic filters that can be configured and combined to meet your specific needs.

How It Works

Heuristic filters examine specific attributes of text documents and apply predefined thresholds to determine document quality. Unlike classifier-based filtering, heuristic filters don’t require training data but rely on configurable thresholds and rules.

These filters assess quality using measurable document characteristics such as:

  • Document length (word or character count)
  • Punctuation ratios and patterns
  • Repetitive content detection
  • Language-specific patterns
  • Text completeness and coherence

For details on filter structure and the filtering process, refer to Data Processing Concepts.
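
As a rough illustration of the idea, a heuristic filter can be thought of as two steps: compute a score from the text, then compare that score against a threshold. The sketch below is plain Python and does not use NeMo Curator; the class name `SimpleWordCountFilter` and its method names are assumptions made for this example only.

```python
class SimpleWordCountFilter:
    """Illustrative stand-in for a heuristic filter (not the NeMo Curator class)."""

    def __init__(self, min_words: int = 50, max_words: int = 100000):
        self.min_words = min_words
        self.max_words = max_words

    def score_document(self, text: str) -> int:
        # Score: number of whitespace-separated words.
        return len(text.split())

    def keep_document(self, score: int) -> bool:
        # Decision: keep only documents whose score falls inside the range.
        return self.min_words <= score <= self.max_words


docs = [
    "too short",
    "this document has exactly six words",
]
word_filter = SimpleWordCountFilter(min_words=5, max_words=100)
kept = [d for d in docs if word_filter.keep_document(word_filter.score_document(d))]
print(kept)
```

Separating the score from the decision is what makes thresholds tunable: the same score can be inspected first and filtered on later.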


Usage

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import (
    WordCountFilter,
    RepeatingTopNGramsFilter,
    PunctuationFilter
)

# Create pipeline
pipeline = Pipeline(name="heuristic_filtering")

# Load your dataset
reader = JsonlReader(
    file_paths="input_data/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Add filter stages
pipeline.add_stage(ScoreFilter(
    filter_obj=WordCountFilter(min_words=80),
    text_field="text",
    score_field="word_count"
))
pipeline.add_stage(ScoreFilter(
    filter_obj=PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85),
    text_field="text"
))
pipeline.add_stage(ScoreFilter(
    filter_obj=RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2),
    text_field="text"
))
pipeline.add_stage(ScoreFilter(
    filter_obj=RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18),
    text_field="text"
))
pipeline.add_stage(ScoreFilter(
    filter_obj=RepeatingTopNGramsFilter(n=4, max_repeating_ngram_ratio=0.16),
    text_field="text"
))

# Add output stage
writer = JsonlWriter(path="high_quality_output/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()

Available Filters

NeMo Curator includes more than 30 heuristic filters for assessing document quality. Below are the most commonly used filters with their parameters:

Text Length Filters

| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| WordCountFilter | Filters by word count | min_words, max_words | min=50, max=100000 |
| TokenCountFilter | Filters by token count | min_tokens, max_tokens | min=0, max=∞ |
| MeanWordLengthFilter | Filters by average word length | min_mean_word_length, max_mean_word_length | min=3, max=10 |
| LongWordFilter | Filters by presence of extremely long words | max_word_length | 1000 |

Repetition Detection Filters

| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| RepeatedLinesFilter | Detects repeated lines | max_repeated_line_fraction | 0.7 |
| RepeatedParagraphsFilter | Detects repeated paragraphs | max_repeated_paragraphs_ratio | 0.7 |
| RepeatedLinesByCharFilter | Detects repeated lines by character count | max_repeated_lines_char_ratio | 0.8 |
| RepeatedParagraphsByCharFilter | Detects repeated paragraphs by character count | max_repeated_paragraphs_char_ratio | 0.8 |
| RepeatingTopNGramsFilter | Detects excessive repetition of n-grams | n, max_repeating_ngram_ratio | n=2, ratio=0.2 |
| RepeatingDuplicateNGramsFilter | Detects duplicate n-grams | n, max_repeating_duplicate_ngram_ratio | n=2, ratio=0.2 |
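
To make the n-gram repetition idea concrete, here is a minimal sketch in plain Python. It assumes the ratio is the share of word characters covered by the single most frequent word n-gram; that definition, and the helper name `top_ngram_char_ratio`, are assumptions for illustration, and the exact formula used by RepeatingTopNGramsFilter may differ.

```python
from collections import Counter


def top_ngram_char_ratio(text: str, n: int = 2) -> float:
    """Approximate share of word characters covered by the most frequent n-gram."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_ngram, count = Counter(ngrams).most_common(1)[0]
    if count < 2:
        # Nothing repeats, so there is no repetition to penalize.
        return 0.0
    total_chars = sum(len(w) for w in words)
    top_chars = sum(len(w) for w in top_ngram) * count
    # Overlapping occurrences are counted separately, so clamp to 1.0.
    return min(1.0, top_chars / total_chars)


repetitive = "buy now buy now buy now buy now"
varied = "the quick brown fox jumps over the lazy dog"
print(top_ngram_char_ratio(repetitive, n=2))
print(top_ngram_char_ratio(varied, n=2))
```

Under this definition, a document would be rejected when the ratio exceeds the configured maximum, which is why the usage example above applies progressively stricter limits as n grows.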

Character and Symbol Filters

| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| NonAlphaNumericFilter | Limits non-alphanumeric content | max_non_alpha_numeric_to_text_ratio | 0.25 |
| SymbolsToWordsFilter | Limits symbols in text | max_symbol_to_word_ratio | 0.1 |
| NumbersFilter | Limits numeric content | max_number_to_text_ratio | 0.15 |
| UrlsFilter | Limits URL content | max_url_to_text_ratio | 0.2 |
| PunctuationFilter | Limits sentences without proper punctuation | max_num_sentences_without_endmark_ratio | 0.85 |
| WhiteSpaceFilter | Limits excessive whitespace | max_white_space_ratio | 0.25 |
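
Most filters in this group reduce to a character-level ratio compared against a maximum. The sketch below shows two such ratios in plain Python, assuming each is computed over all characters in the document; the helper names are hypothetical and the library's own definitions (for example, how whitespace is counted) may differ.

```python
def non_alphanumeric_ratio(text: str) -> float:
    """Share of characters that are neither letters nor digits."""
    if not text:
        return 0.0
    return sum(1 for ch in text if not ch.isalnum()) / len(text)


def whitespace_ratio(text: str) -> float:
    """Share of characters that are whitespace."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ch.isspace()) / len(text)


clean = "Hello world"
noisy = "$$$ !!! ###"
max_ratio = 0.25  # same value as the table defaults above

for doc in (clean, noisy):
    ratio = non_alphanumeric_ratio(doc)
    print(f"{doc!r}: ratio={ratio:.2f}, keep={ratio <= max_ratio}")
```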

Content-specific Filters

| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| CommonEnglishWordsFilter | Ensures text contains common words | min_num_common_words | 2 |
| WordsWithoutAlphabetsFilter | Limits words without alphabetic chars | min_words_with_alphabets | 0.8 |
| BulletsFilter | Limits bullet-point heavy content | max_bullet_lines_ratio | 0.9 |
| BoilerPlateStringFilter | Detects boilerplate text | max_boilerplate_string_ratio, remove_if_at_top_or_bottom | 0.4, True |
| ParenthesesFilter | Limits parentheses content | max_parentheses_ratio | 0.1 |

Special Purpose Filters

| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| PornographicUrlsFilter | Detects URLs containing “porn” substring | None | N/A |
| EllipsisFilter | Limits excessive ellipses | max_num_lines_ending_with_ellipsis_ratio | 0.3 |
| HistogramFilter | Filters based on character distribution | threshold | 0.8 |
| SubstringFilter | Filters based on presence of specific substring in a position | substring, position | N/A (required) |

Configuration

NeMo Curator pipelines can be configured using YAML files with Hydra. The configuration uses _target_ to specify class paths:

# Hydra-based pipeline configuration
input_path: /path/to/input
output_path: /path/to/output
text_field: text

stages:
  - _target_: nemo_curator.stages.text.io.reader.JsonlReader
    file_paths: ${input_path}
    fields: null

  - _target_: nemo_curator.stages.text.modules.score_filter.ScoreFilter
    filter_obj:
      _target_: nemo_curator.stages.text.filters.heuristic_filter.WordCountFilter
      min_words: 50
      max_words: 100000
    text_field: ${text_field}
    score_field: word_count

  - _target_: nemo_curator.stages.text.modules.score_filter.ScoreFilter
    filter_obj:
      _target_: nemo_curator.stages.text.filters.heuristic_filter.PunctuationFilter
      max_num_sentences_without_endmark_ratio: 0.85
    text_field: ${text_field}
    score_field: null

  - _target_: nemo_curator.stages.text.io.writer.JsonlWriter
    path: ${output_path}

See nemo_curator/config/text/ for complete pipeline examples.
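
To show what a `_target_` entry means, the following sketch resolves a dotted class path and builds the object from the remaining keys, recursing into nested entries such as `filter_obj`. It is a simplified illustration of the idea, not Hydra's implementation (Hydra's own `hydra.utils.instantiate` handles many more cases), and it uses standard-library classes as stand-ins for pipeline stages so that it runs without NeMo Curator installed.

```python
import importlib
from typing import Any


def instantiate(config: dict[str, Any]) -> Any:
    """Build an object from a dict with a `_target_` class path (simplified)."""
    kwargs = {}
    for key, value in config.items():
        if key == "_target_":
            continue
        # Nested dicts with their own `_target_` are built recursively.
        if isinstance(value, dict) and "_target_" in value:
            value = instantiate(value)
        kwargs[key] = value
    module_path, _, class_name = config["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)


# Stand-ins: SimpleNamespace plays the stage, Fraction plays the nested filter.
config = {
    "_target_": "types.SimpleNamespace",
    "score_field": "word_count",
    "filter_obj": {
        "_target_": "fractions.Fraction",
        "numerator": 1,
        "denominator": 2,
    },
}
stage = instantiate(config)
print(stage)
```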

For non-English texts, you may need to adjust the filter parameters based on the specific characteristics of your target language.

Best Practices

When building filter chains, place fast filters first so that slower filters run on fewer documents:

# Efficient ordering - place fast filters first
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import WordCountFilter, UrlsFilter, RepeatingTopNGramsFilter

pipeline = Pipeline(name="efficient_filtering")
# Fast filters first
pipeline.add_stage(ScoreFilter(filter_obj=WordCountFilter(min_words=50), text_field="text"))
# Medium complexity filters
pipeline.add_stage(ScoreFilter(filter_obj=UrlsFilter(), text_field="text"))
# Slow filters last
pipeline.add_stage(ScoreFilter(filter_obj=RepeatingTopNGramsFilter(), text_field="text"))
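
The effect of ordering can be seen in a small, library-independent sketch that counts how often each filter is invoked. The two check functions and the `run_chain` helper are hypothetical stand-ins; a real pipeline processes documents in batches per stage, but the number of documents reaching each stage follows the same logic.

```python
def run_chain(docs, filters):
    """Apply filters in order; return survivors and per-filter call counts."""
    calls = {name: 0 for name, _ in filters}
    survivors = []
    for doc in docs:
        for name, keep in filters:
            calls[name] += 1
            if not keep(doc):
                break  # rejected: later filters never see this document
        else:
            survivors.append(doc)
    return survivors, calls


def cheap_length_check(doc):
    return len(doc.split()) >= 3


def expensive_repetition_check(doc):
    # Stand-in for a costlier filter: reject documents with repeated words.
    words = doc.split()
    return len(set(words)) == len(words)


docs = ["hi", "ok", "spam spam spam", "a clean short document"]
cheap_first = [("length", cheap_length_check), ("repetition", expensive_repetition_check)]
expensive_first = list(reversed(cheap_first))

kept_a, calls_a = run_chain(docs, cheap_first)
kept_b, calls_b = run_chain(docs, expensive_first)
print(calls_a)
print(calls_b)
```

Both orderings keep the same documents; only the number of calls to the expensive check changes.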

Analyzing Filter Results

When tuning filter thresholds, analyze score distributions before applying filters. NeMo Curator provides two modules for this workflow:

  • Score: Computes scores and adds them as columns without removing documents
  • ScoreFilter: Computes scores, filters based on thresholds, and optionally retains scores in output

Use Score first to understand your data distribution, then apply ScoreFilter with tuned thresholds. The following pipeline adds score columns to your data without removing any documents:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import Score
from nemo_curator.stages.text.filters import WordCountFilter, RepeatingTopNGramsFilter

# Create scoring pipeline (no filtering)
pipeline = Pipeline(name="score_analysis")

# Load data
pipeline.add_stage(JsonlReader(file_paths="input_data/", fields=["text", "id"]))

# Add scores without filtering
pipeline.add_stage(Score(
    score_fn=WordCountFilter(min_words=80),
    text_field="text",
    score_field="word_count"
))
pipeline.add_stage(Score(
    score_fn=RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18),
    text_field="text",
    score_field="ngram_ratio"
))

# Write scored data (all documents preserved)
pipeline.add_stage(JsonlWriter(path="scored_output/"))

pipeline.run()

Output files are written to the scored_output/ directory with one file per input partition.
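
Once scores are on disk, a threshold can be chosen by looking at their distribution. The sketch below uses only the standard library and assumes the output files end in `.jsonl` and carry a `word_count` column as in the example above; the `load_scores` helper and the demo rows are hypothetical, written to a temporary directory so the snippet runs on its own.

```python
import json
import statistics
import tempfile
from pathlib import Path


def load_scores(output_dir: Path, score_field: str) -> list[float]:
    """Collect one score per document from every .jsonl file in a directory."""
    scores = []
    for path in sorted(output_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    scores.append(json.loads(line)[score_field])
    return scores


# Demo data standing in for the contents of scored_output/.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    rows = [{"id": i, "word_count": wc}
            for i, wc in enumerate([10, 40, 80, 120, 200, 400, 900, 1500])]
    (demo_dir / "part-0.jsonl").write_text(
        "\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8"
    )
    scores = load_scores(demo_dir, "word_count")

deciles = statistics.quantiles(scores, n=10)
threshold = 80  # candidate value for min_words
removed = sum(s < threshold for s in scores) / len(scores)
print(f"10th percentile: {deciles[0]:.1f}, median: {statistics.median(scores)}")
print(f"min_words={threshold} would remove {removed:.0%} of documents")
```

Checking the share of documents a candidate threshold would remove before running ScoreFilter helps avoid discarding far more data than intended.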

Performance Tuning

For large datasets, consider tuning the executor that runs the pipeline.

XennaExecutor is the default executor, optimized for streaming workloads. You can customize its configuration or use the defaults:

from nemo_curator.backends.xenna import XennaExecutor

# Custom configuration for streaming processing
executor = XennaExecutor(config={
    "execution_mode": "streaming",
    "cpu_allocation_percentage": 0.95,
    "logging_interval": 60
})
results = pipeline.run(executor)

If no executor is specified, pipeline.run() uses XennaExecutor with default settings.

Remember that the goal of filtering is to improve the quality of your training data, not necessarily to remove as many documents as possible. Monitor your filtering results and adjust thresholds based on your specific data characteristics and downstream tasks.