Heuristic Filtering | NeMo Curator

Heuristic filtering uses simple, rule-based metrics to identify and filter out low-quality documents from your dataset. NVIDIA NeMo Curator provides a variety of pre-built heuristic filters that can be configured and combined to meet your specific needs.

How It Works

Heuristic filters examine specific attributes of text documents and apply predefined thresholds to determine document quality. Unlike classifier-based filtering, heuristic filters require no training data; they rely on configurable thresholds and rules.

These filters assess quality using measurable document characteristics such as:

Document length (word or character count)
Punctuation ratios and patterns
Repetitive content detection
Language-specific patterns
Text completeness and coherence

For details on filter structure and the filtering process, refer to Data Processing Concepts.
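
The score-then-threshold idea described above can be sketched in plain Python. This is a minimal illustration, not NeMo Curator code: the class name `SimpleWordCountFilter` is hypothetical, and its method names are chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class SimpleWordCountFilter:
    """Hypothetical stand-in for a heuristic filter; not a NeMo Curator class."""

    min_words: int = 50
    max_words: int = 100_000

    def score_document(self, text: str) -> int:
        # The score is a measurable property of the text:
        # here, the number of whitespace-delimited words.
        return len(text.split())

    def keep_document(self, score: int) -> bool:
        # The threshold rule decides whether the document survives.
        return self.min_words <= score <= self.max_words


docs = ["too short", "word " * 60]
flt = SimpleWordCountFilter(min_words=50)
scores = [flt.score_document(d) for d in docs]
kept = [d for d, s in zip(docs, scores) if flt.keep_document(s)]
```

Keeping the score and the keep/drop decision as separate steps is what makes it possible to inspect score distributions before committing to a threshold.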

Usage

Python

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.heuristic import WordCountFilter, PunctuationFilter
from nemo_curator.stages.text.filters.heuristic.repetition import RepeatingTopNGramsFilter

# Create pipeline
pipeline = Pipeline(name="heuristic_filtering")

# Load your dataset
reader = JsonlReader(
    file_paths="input_data/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Add filter stages
pipeline.add_stage(ScoreFilter(
    filter_obj=WordCountFilter(min_words=80),
    text_field="text",
    score_field="word_count"
))
pipeline.add_stage(ScoreFilter(
    filter_obj=PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85),
    text_field="text"
))
pipeline.add_stage(ScoreFilter(
    filter_obj=RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2),
    text_field="text"
))
pipeline.add_stage(ScoreFilter(
    filter_obj=RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18),
    text_field="text"
))
pipeline.add_stage(ScoreFilter(
    filter_obj=RepeatingTopNGramsFilter(n=4, max_repeating_ngram_ratio=0.16),
    text_field="text"
))

# Add output stage
writer = JsonlWriter(path="high_quality_output/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()

Available Filters

NeMo Curator includes more than 30 heuristic filters for assessing document quality. Below are the most commonly used filters with their parameters:

Text Length Filters

Filter	Description	Key Parameters	Default Values
WordCountFilter	Filters by word count	`min_words`, `max_words`	min=50, max=100000
TokenCountFilter	Filters by token count	`min_tokens`, `max_tokens`	min=0, max=∞
MeanWordLengthFilter	Filters by average word length	`min_mean_word_length`, `max_mean_word_length`	min=3, max=10
LongWordFilter	Filters by presence of extremely long words	`max_word_length`	1000

Repetition Detection Filters

Filter	Description	Key Parameters	Default Values
RepeatedLinesFilter	Detects repeated lines	`max_repeated_line_fraction`	0.7
RepeatedParagraphsFilter	Detects repeated paragraphs	`max_repeated_paragraphs_ratio`	0.7
RepeatedLinesByCharFilter	Detects repeated lines by character count	`max_repeated_lines_char_ratio`	0.8
RepeatedParagraphsByCharFilter	Detects repeated paragraphs by character count	`max_repeated_paragraphs_char_ratio`	0.8
RepeatingTopNGramsFilter	Detects excessive repetition of n-grams	`n`, `max_repeating_ngram_ratio`	n=2, ratio=0.2
RepeatingDuplicateNGramsFilter	Detects duplicate n-grams	`n`, `max_repeating_duplicate_ngram_ratio`	n=2, ratio=0.2
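
To give a feel for what a ratio such as `max_repeating_ngram_ratio` measures, the sketch below computes the share of a document's characters covered by its most frequent word n-gram. It is an approximation written for illustration only; the function name `top_ngram_char_ratio` is hypothetical, and the library's own computation may differ in tokenization and edge cases.

```python
from collections import Counter


def top_ngram_char_ratio(text: str, n: int = 2) -> float:
    """Fraction of characters covered by the most frequent word n-gram (illustrative)."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_ngram, count = Counter(ngrams).most_common(1)[0]
    if count < 2:
        return 0.0  # no n-gram occurs more than once
    total_chars = sum(len(w) for w in words)
    covered = len(top_ngram.replace(" ", "")) * count
    # Overlapping n-grams can count the same characters twice, so clamp to 1.0.
    return min(1.0, covered / total_chars)


spam_ratio = top_ngram_char_ratio("buy now buy now buy now great deal", n=2)
clean_ratio = top_ngram_char_ratio("the quick brown fox", n=2)
```

A document whose ratio exceeds the configured maximum would be dropped; the spam-like string above scores far higher than the non-repeating one.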

Character and Symbol Filters

Filter	Description	Key Parameters	Default Values
NonAlphaNumericFilter	Limits non-alphanumeric content	`max_non_alpha_numeric_to_text_ratio`	0.25
SymbolsToWordsFilter	Limits symbols in text	`max_symbol_to_word_ratio`	0.1
NumbersFilter	Limits numeric content	`max_number_to_text_ratio`	0.15
UrlsFilter	Limits URL content	`max_url_to_text_ratio`	0.2
PunctuationFilter	Limits sentences without proper punctuation	`max_num_sentences_without_endmark_ratio`	0.85
WhiteSpaceFilter	Limits excessive whitespace	`max_white_space_ratio`	0.25

Content-specific Filters

Filter	Description	Key Parameters	Default Values
CommonEnglishWordsFilter	Ensures text contains common words	`min_num_common_words`	2
WordsWithoutAlphabetsFilter	Limits words without alphabetic chars	`min_words_with_alphabets`	0.8
BulletsFilter	Limits bullet-point heavy content	`max_bullet_lines_ratio`	0.9
BoilerPlateStringFilter	Detects boilerplate text	`max_boilerplate_string_ratio`, `remove_if_at_top_or_bottom`	0.4, True
ParenthesesFilter	Limits parentheses content	`max_parentheses_ratio`	0.1

Special Purpose Filters

Filter	Description	Key Parameters	Default Values
PornographicUrlsFilter	Detects URLs containing “porn” substring	None	N/A
EllipsisFilter	Limits excessive ellipses	`max_num_lines_ending_with_ellipsis_ratio`	0.3
HistogramFilter	Filters based on character distribution	`threshold`	0.8
SubstringFilter	Filters based on presence of specific substring in a position	`substring`, `position`	N/A (required)

Configuration

NeMo Curator pipelines can be configured using YAML files with Hydra. The configuration uses _target_ to specify class paths:

Hydra Configuration

# Hydra-based pipeline configuration
input_path: /path/to/input
output_path: /path/to/output
text_field: text

stages:
  - _target_: nemo_curator.stages.text.io.reader.JsonlReader
    file_paths: ${input_path}
    fields: null

  - _target_: nemo_curator.stages.text.filters.score_filter.ScoreFilter
    filter_obj:
      _target_: nemo_curator.stages.text.filters.heuristic.string.WordCountFilter
      min_words: 50
      max_words: 100000
    text_field: ${text_field}
    score_field: word_count

  - _target_: nemo_curator.stages.text.filters.score_filter.ScoreFilter
    filter_obj:
      _target_: nemo_curator.stages.text.filters.heuristic.string.PunctuationFilter
      max_num_sentences_without_endmark_ratio: 0.85
    text_field: ${text_field}
    score_field: null

  - _target_: nemo_curator.stages.text.io.writer.JsonlWriter
    path: ${output_path}

See nemo_curator/config/text/ for complete pipeline examples.
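
Conceptually, a `_target_` entry names a class by its import path, and the remaining keys become constructor arguments, with nested `_target_` nodes built first. The sketch below imitates that idea using only the standard library. It is a simplified stand-in, not Hydra's implementation (Hydra's real instantiation also handles interpolation such as `${text_field}`, positional arguments, and more), and the standard-library classes are placeholders for pipeline stages.

```python
import importlib
from typing import Any


def instantiate(node: Any) -> Any:
    """Recursively build objects from dicts that carry a `_target_` class path."""
    if isinstance(node, list):
        return [instantiate(item) for item in node]
    if not isinstance(node, dict):
        return node  # plain values (str, int, None, ...) pass through unchanged
    # Build nested nodes first so inner objects exist before the outer constructor runs.
    kwargs = {key: instantiate(value) for key, value in node.items() if key != "_target_"}
    if "_target_" not in node:
        return kwargs
    module_path, _, class_name = node["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)


# Placeholder classes from the standard library stand in for a stage and its filter.
config = {
    "_target_": "types.SimpleNamespace",
    "text_field": "text",
    "filter_obj": {"_target_": "fractions.Fraction", "numerator": 17, "denominator": 20},
}
stage = instantiate(config)
```

The nesting mirrors the YAML above, where a `ScoreFilter` stage receives a `filter_obj` that is itself described by a `_target_` node.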

For non-English texts, you may need to adjust the filter parameters based on the specific characteristics of your target language.

Best Practices

When building filter chains, follow these best practices:

Order for Efficiency

# Efficient ordering - place fast filters first
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.heuristic import WordCountFilter, UrlsFilter
from nemo_curator.stages.text.filters.heuristic.repetition import RepeatingTopNGramsFilter

pipeline = Pipeline(name="efficient_filtering")
# Fast filters first
pipeline.add_stage(ScoreFilter(filter_obj=WordCountFilter(min_words=50), text_field="text"))
# Medium complexity filters
pipeline.add_stage(ScoreFilter(filter_obj=UrlsFilter(), text_field="text"))
# Slow filters last
pipeline.add_stage(ScoreFilter(filter_obj=RepeatingTopNGramsFilter(), text_field="text"))
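
Why ordering matters can be shown with a toy experiment in plain Python. The sketch below is independent of NeMo Curator: `run_chain` and both predicates are hypothetical, and the "costly" check is only a stand-in for a genuinely expensive filter. Both orderings keep the same documents, but the expensive predicate runs fewer times when the cheap one goes first.

```python
def run_chain(docs, filters):
    """Apply (name, predicate) filters in order; count how often each one is evaluated."""
    calls = {name: 0 for name, _ in filters}
    for name, keep in filters:
        survivors = []
        for doc in docs:
            calls[name] += 1
            if keep(doc):
                survivors.append(doc)
        docs = survivors  # later filters only see what earlier ones kept
    return docs, calls


# Cheap check: at least five words.
cheap = ("word_count", lambda d: len(d.split()) >= 5)
# Stand-in for a costly check: more than half of the words are distinct.
costly = ("repetition", lambda d: len(set(d.split())) / len(d.split()) > 0.5)

docs = [
    "hi",
    "spam spam spam spam spam spam",
    "a clean sentence with varied words",
    "short one",
]
kept_cheap_first, calls_cheap_first = run_chain(docs, [cheap, costly])
kept_costly_first, calls_costly_first = run_chain(docs, [costly, cheap])
```

With the cheap filter first, the costly predicate is evaluated on two documents instead of four, and the surviving set is identical.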

Analyzing Filter Results

When tuning filter thresholds, analyze score distributions before applying filters. NeMo Curator provides two modules for this workflow:

Score: Computes scores and adds them as columns without removing documents
ScoreFilter: Computes scores, filters based on thresholds, and optionally retains scores in output

Use Score first to understand your data distribution, then apply ScoreFilter with tuned thresholds.

Score Without Filtering

Use Score to add score columns to your data without removing any documents:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.filters import Score
from nemo_curator.stages.text.filters.heuristic import WordCountFilter
from nemo_curator.stages.text.filters.heuristic.repetition import RepeatingTopNGramsFilter

# Create scoring pipeline (no filtering)
pipeline = Pipeline(name="score_analysis")

# Load data
pipeline.add_stage(JsonlReader(file_paths="input_data/", fields=["text", "id"]))

# Add scores without filtering
pipeline.add_stage(Score(
    score_fn=WordCountFilter(min_words=80),
    text_field="text",
    score_field="word_count"
))
pipeline.add_stage(Score(
    score_fn=RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18),
    text_field="text",
    score_field="ngram_ratio"
))

# Write scored data (all documents preserved)
pipeline.add_stage(JsonlWriter(path="scored_output/"))

pipeline.run()

Output files are written to the scored_output/ directory with one file per input partition.
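
Once the scored files exist, one way to pick a threshold is to look at score quantiles and at how many documents each candidate threshold would keep. The sketch below uses only the standard library. The helper names `load_scores` and `kept_fraction` are hypothetical, the `scored_output/*.jsonl` file pattern is an assumption about how the writer names its files, and the word counts are made-up sample values.

```python
import glob
import json
from statistics import quantiles


def load_scores(pattern: str, field: str) -> list[float]:
    """Collect one score column from every JSONL file matching `pattern`."""
    scores = []
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    scores.append(json.loads(line)[field])
    return scores


def kept_fraction(scores: list[float], min_value: float) -> float:
    """Share of documents that a `score >= min_value` threshold would keep."""
    return sum(s >= min_value for s in scores) / len(scores)


# Made-up word counts for illustration. On real data this might instead be
# load_scores("scored_output/*.jsonl", "word_count")  (file pattern is an assumption).
word_counts = [12, 35, 48, 60, 75, 90, 120, 300, 450, 800]

deciles = quantiles(word_counts, n=10)  # nine cut points splitting the data into tenths
for threshold in (50, 80, 100):
    print(f"min_words={threshold}: keeps {kept_fraction(word_counts, threshold):.0%}")
```

A threshold chosen this way can then be passed to the corresponding filter in a `ScoreFilter` stage.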

Performance Tuning

For large datasets, consider these performance optimizations:

XennaExecutor (Default)

XennaExecutor is the default executor, optimized for streaming workloads. You can customize its configuration or use the defaults:

from nemo_curator.backends.xenna import XennaExecutor

# Custom configuration for streaming processing
executor = XennaExecutor(config={
    "execution_mode": "streaming",
    "cpu_allocation_percentage": 0.95,
    "logging_interval": 60
})
results = pipeline.run(executor)

If no executor is specified, pipeline.run() uses XennaExecutor with default settings.

Pipeline Metrics

When you run a filtering pipeline, each stage tracks the number of documents it processes. You can use these metrics to understand how each filter affects your dataset and to tune thresholds.

After calling pipeline.run(), the returned task objects contain per-stage performance statistics through _stage_perf. Each entry is a StagePerfStats object with a num_items_processed field that records how many documents passed through that stage.

# Run the pipeline and inspect filter metrics
output_tasks = pipeline.run()

for task in output_tasks:
    # _stage_perf[0] is file partitioning, _stage_perf[1] is the reader
    num_input = task._stage_perf[1].num_items_processed
    # The last stage is the writer; its count reflects documents that survived all filters
    num_output = task._stage_perf[-1].num_items_processed

    if num_input > 0:
        print(f"Task {task.task_id}: {num_input} input → {num_output} kept ({num_output / num_input:.1%})")
    else:
        print(f"Task {task.task_id}: 0 input → 0 kept")

These same metrics power the nightly benchmarks, which track num_documents_processed, num_kept_documents, and throughput_docs_per_sec for every pipeline run.
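
The per-task counts from the loop above can also be rolled up into a single dataset-wide retention rate. The helper below is a hypothetical sketch that works on plain `(num_input, num_output)` pairs, so it makes no assumptions about the task objects beyond the two fields already read in the loop; the numbers are made up.

```python
def summarize_retention(counts: list[tuple[int, int]]) -> tuple[int, int, float]:
    """Aggregate (num_input, num_output) pairs into totals and an overall keep rate."""
    total_in = sum(n_in for n_in, _ in counts)
    total_out = sum(n_out for _, n_out in counts)
    rate = total_out / total_in if total_in else 0.0  # guard against an empty dataset
    return total_in, total_out, rate


# In a real run each pair would come from one task, for example
# (task._stage_perf[1].num_items_processed, task._stage_perf[-1].num_items_processed).
per_task = [(1000, 640), (800, 500), (0, 0)]
total_in, total_out, rate = summarize_retention(per_task)
print(f"{total_in} input -> {total_out} kept ({rate:.1%})")
```

Summing counts before dividing weights each task by its size, which avoids the distortion of averaging per-task percentages across partitions of different sizes.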

Remember that the goal of filtering is to improve the quality of your training data, not necessarily to remove as many documents as possible. Monitor your filtering results and adjust thresholds based on your specific data characteristics and downstream tasks.