Stopwords | NeMo Curator

Stop words are common words that are often filtered out in natural language processing (NLP) tasks because they typically don’t carry significant meaning. Curator provides built-in stop word lists for several languages to support text analysis and extraction processes.

Studies on stopword lists and their distribution in various text corpora have shown that typical English text contains 30–40% stop words.

What Are Stop Words?

Stop words are high-frequency words that generally don’t contribute much semantic value to text analysis. Examples in English include “the,” “is,” “at,” “which,” and “on.” These words appear so frequently in language that they can distort text processing tasks if not properly managed.

Key characteristics of stop words:

They appear with high frequency in text
They typically serve grammatical rather than semantic functions
They’re language-specific (each language has its own set of stop words)
Removing them can improve efficiency in NLP tasks
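
As a minimal sketch of the idea, the snippet below drops stop words from a whitespace-tokenized English sentence. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are illustrative assumptions, not part of Curator.

```python
# Tiny illustrative stop word list; real lists contain hundreds of entries.
STOP_WORDS = frozenset({"the", "is", "at", "which", "on", "a", "of"})


def remove_stop_words(text: str) -> list[str]:
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


tokens = remove_stop_words("The cat is on the mat")
print(tokens)  # ['cat', 'mat']
```

Four of the six tokens are removed here, which is where the efficiency gain comes from.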

Why Stop Words Matter in Curator

In Curator, stop words play several important roles:

Text Extraction: The text extraction process (for Common Crawl data) uses stop word density as a key metric to identify meaningful content
Efficient Processing: Filtering stop words can reduce the amount of data to process
Boilerplate Removal: Stop word density helps differentiate between main content and boilerplate in web pages
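
To make the density metric concrete, here is a small sketch that computes the fraction of stop words in a piece of text. The `stop_word_density` function and the short `STOP_WORDS` set are assumptions made for illustration; Curator's extractors compute density internally with their own tokenization.

```python
import re

# Small illustrative list; not the list Curator ships.
STOP_WORDS = frozenset(
    {"the", "is", "at", "which", "on", "a", "of", "and", "to", "in", "that"}
)


def stop_word_density(text: str) -> float:
    """Return the fraction of word tokens in `text` that are stop words."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return sum(word in STOP_WORDS for word in words) / len(words)


article = "The storm reached the coast in the morning and moved to the north"
navigation = "Home | Products | Contact | Login | Sitemap"

print(f"article:    {stop_word_density(article):.2f}")
print(f"navigation: {stop_word_density(navigation):.2f}")
```

Running prose tends to score well above a navigation bar, which contains almost no function words. That gap is what makes density useful for separating content from boilerplate.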

Available Stop Word Lists

Curator leverages the extensive stop word collection from JusText for most languages. Curator also provides custom stop word lists for the following languages not covered by JusText:

Language	File Name
Chinese	`zh_stopwords.py`
Japanese	`ja_stopwords.py`
Thai	`th_stopwords.py`

These stop word lists use Python frozen sets for fast membership checks and immutability.
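
The following sketch shows what those two properties mean in practice. The `sample_stopwords` set is a shortened stand-in for a real list, and the extra word is purely illustrative.

```python
# A shortened stand-in for one of the stop word lists.
sample_stopwords = frozenset(["、", "。", "一", "一个"])

# Membership checks are hash lookups.
print("一个" in sample_stopwords)  # True
print("猫" in sample_stopwords)  # False

# A frozenset has no mutating methods, so the list cannot change by accident.
try:
    sample_stopwords.add("猫")
except AttributeError:
    print("frozenset cannot be modified in place")

# To extend a list, build a new frozenset instead.
extended = sample_stopwords | {"猫"}
print("猫" in extended)  # True
```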

Chinese Stop Words

Chinese stop words in zh_stopwords.py include common function words and punctuation, for example “一个” (one), “不是” (isn’t), and “他们” (they).

# Example from zh_stopwords.py (partial)
zh_stopwords = frozenset([
    "、", "。", "〈", "〉", "《", "》", "一", "一个",
    # ... many more words
])

Japanese Stop Words

Japanese stop words in ja_stopwords.py include common Japanese function words such as “あそこ” (there), “これ” (this), and “ます” (a polite verb ending).

# Example from ja_stopwords.py
ja_stopwords = frozenset([
    "あそこ", "あっ", "あの", "あのかた", "あの人",
    # ... more words
    "私", "私達", "貴方", "貴方方",
])

Thai Stop Words

Thai stop words are available in th_stopwords.py, including common Thai words like “กล่าว” (to say), “การ” (a prefix that turns verbs into nouns), and “ของ” (of).

# Example from th_stopwords.py
thai_stopwords = frozenset([
    "กล่าว", "กว่า", "กัน", "กับ", "การ", "ก็", "ก่อน",
    # ... more words
    "ไป", "ไม่", "ไว้",
])

How Curator Uses Stop Words in Text Extraction

Stop words are a critical component in Curator’s text extraction algorithms. Here’s how they’re used in different extractors:

JusText Extractor

The JusText algorithm uses stop word density to classify text blocks as main content or boilerplate:

Context-Free Classification: The algorithm classifies text blocks with a high density of stop words as “good” (main content)
Parameter Customization: You can customize the stop word density thresholds via the stopwords_low and stopwords_high parameters

from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor

# Customize stop word thresholds
extractor = JusTextExtractor(
    stopwords_low=0.30,   # Lower stop word density threshold
    stopwords_high=0.32,  # Upper stop word density threshold
)
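
To illustrate how two thresholds can drive a classification, here is a deliberately simplified sketch of the stop word portion of a context-free decision. The `classify_by_stopwords` function and its label names are assumptions for illustration. The actual jusText algorithm also weighs block length and link density, which this sketch leaves out.

```python
def classify_by_stopwords(
    density: float,
    stopwords_low: float = 0.30,
    stopwords_high: float = 0.32,
) -> str:
    """Classify a text block from its stop word density alone (simplified)."""
    if density >= stopwords_high:
        return "good"  # dense in function words: likely main content
    if density >= stopwords_low:
        return "near-good"  # borderline: between the two thresholds
    return "bad"  # few function words: likely boilerplate


for density in (0.50, 0.31, 0.10):
    print(density, classify_by_stopwords(density))
```

Under this sketch, raising both thresholds makes extraction stricter, and widening the gap between them sends more blocks into the borderline class.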

Other Extractors

The ResiliparseExtractor and TrafilaturaExtractor also use stop word density to filter extracted content:

from nemo_curator.stages.text.download.html_extractors.resiliparse import ResiliparseExtractor
from nemo_curator.stages.text.download.html_extractors.trafilatura import TrafilaturaExtractor

# Resiliparse with custom stop word density
resiliparse = ResiliparseExtractor(
    required_stopword_density=0.32  # Only keep paragraphs with >= 32% stop words
)

# Trafilatura with custom stop word density
trafilatura = TrafilaturaExtractor(
    required_stopword_density=0.35  # Higher threshold for more selective extraction
)
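
The effect of a required density can be sketched as a filter over already-extracted paragraphs. The `keep_dense_paragraphs` helper, its whitespace tokenization, and the sample data are assumptions for illustration, not the extractors' internal code.

```python
def keep_dense_paragraphs(
    paragraphs: list[str],
    stop_words: frozenset[str],
    required_stopword_density: float = 0.32,
) -> list[str]:
    """Keep paragraphs whose stop word density meets the threshold."""
    kept = []
    for paragraph in paragraphs:
        words = paragraph.lower().split()
        if not words:
            continue  # skip empty paragraphs
        density = sum(word in stop_words for word in words) / len(words)
        if density >= required_stopword_density:
            kept.append(paragraph)
    return kept


stop_words = frozenset({"the", "on", "in"})
paragraphs = [
    "the cat sat on the mat in the sun",  # 5 of 9 words are stop words
    "Login Register Cart Checkout",  # no stop words
]
print(keep_dense_paragraphs(paragraphs, stop_words))
```

Only the first paragraph survives at the 0.32 threshold. Raising the threshold above 5/9 would drop it as well.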

Special Handling for Non-Spaced Languages

Thai, Chinese, and Japanese are written without spaces between words, which affects how the system calculates stop word density. Curator groups Korean with these languages and identifies all four via the NON_SPACED_LANGUAGES constant:

NON_SPACED_LANGUAGES = frozenset(["THAI", "CHINESE", "JAPANESE", "KOREAN"])

For these languages, the extractor applies special handling:

Stopword densities are still computed and used for paragraph classification.
The final is_boilerplate filter is disabled to avoid over-removal in non-spaced scripts.
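
The sketch below shows why these scripts need separate treatment: whitespace splitting yields a single token, so a word-level density has nothing to count. The `should_apply_boilerplate_filter` helper is hypothetical and only mirrors the behavior described above. It is not Curator's actual implementation.

```python
NON_SPACED_LANGUAGES = frozenset(["THAI", "CHINESE", "JAPANESE", "KOREAN"])

english = "this is my book"
japanese = "これは私の本です"  # "This is my book"

# Whitespace tokenization works for English but not for Japanese.
print(len(english.split()))  # 4
print(len(japanese.split()))  # 1


def should_apply_boilerplate_filter(language: str) -> bool:
    """Hypothetical helper: skip the final boilerplate filter for non-spaced languages."""
    return language.upper() not in NON_SPACED_LANGUAGES


print(should_apply_boilerplate_filter("ENGLISH"))  # True
print(should_apply_boilerplate_filter("JAPANESE"))  # False
```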

Creating Custom Stop Word Lists

You can create and use your own stop word lists when processing text with Curator:

from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.pipeline import Pipeline

# Define custom stop words for multiple languages
custom_stop_lists = {
    "ENGLISH": frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"]),
    "SPANISH": frozenset(["el", "la", "los", "las", "un", "una", "y", "o", "de", "en", "que"]),
}

# Create Common Crawl processing stage with custom stop lists
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2023-06",
    end_snapshot="2023-10",
    download_dir="/output/folder",
    stop_lists=custom_stop_lists
)

# Create and run pipeline
pipeline = Pipeline(name="custom_stopwords_pipeline")
pipeline.add_stage(cc_stage)

# Execute pipeline
results = pipeline.run()

Performance Considerations

Stop word lists use frozen sets for fast membership checks (average-case O(1) lookups)
Using appropriate stop word lists can improve extraction quality
For specialized domains, consider customizing the stop word lists
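
One possible way to customize a list for a specialized domain is to union a general list with domain-specific function words. Both sets below are small illustrative stand-ins, and the choice of legal vocabulary is an assumption made for the example.

```python
# Stand-in for a general-purpose English stop word list.
base_english = frozenset(["the", "and", "is", "in", "for", "to", "at"])

# Hypothetical domain additions: words that act like function words in legal text.
legal_terms = frozenset(["herein", "thereof", "whereas"])

# Union produces a new frozenset; the originals are left unchanged.
custom_stop_lists = {"ENGLISH": base_english | legal_terms}

print(len(custom_stop_lists["ENGLISH"]))  # 10
print("herein" in custom_stop_lists["ENGLISH"])  # True
```

A dictionary shaped like this, keyed by uppercase language name, matches the form of the stop_lists argument shown in the custom stop list example.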
