Stop Words in Text Processing
Stop words are common words that are often filtered out in natural language processing (NLP) tasks because they typically don’t carry significant meaning. Curator provides built-in stop word lists for several languages to support text analysis and extraction processes.
Studies of stop word distribution across text corpora suggest that stop words make up roughly 30–40% of typical English text.
What Are Stop Words?
Stop words are high-frequency words that generally don’t contribute much semantic value to text analysis. Examples in English include “the,” “is,” “at,” “which,” and “on.” These words appear so frequently in language that they can distort text processing tasks if not properly managed.
Key characteristics of stop words:
- They appear with high frequency in text
- They typically serve grammatical rather than semantic functions
- They’re language-specific (each language has its own set of stop words)
- Removing them can improve efficiency in NLP tasks
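As a minimal illustration of these characteristics, the sketch below filters stop words out of a sentence. The tiny `EXAMPLE_STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and not part of Curator; real stop word lists are far larger.

```python
# Hypothetical, deliberately tiny stop word set for illustration only.
EXAMPLE_STOP_WORDS = frozenset({"the", "is", "at", "which", "on"})


def remove_stop_words(text: str, stop_words: frozenset = EXAMPLE_STOP_WORDS) -> list[str]:
    """Return the whitespace-separated tokens of `text` that are not stop words."""
    return [token for token in text.split() if token.lower() not in stop_words]


tokens = remove_stop_words("The cat is on the mat")
print(tokens)  # ['cat', 'mat']
```

Four of the six tokens are dropped, which is the efficiency gain the last bullet refers to.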
Why Stop Words Matter in Curator
In Curator, stop words play several important roles:
- Text Extraction: The text extraction process (for Common Crawl data) uses stop word density as a key metric to identify meaningful content
- Efficient Processing: Filtering stop words can reduce the amount of data to process
- Boilerplate Removal: Stop word density helps differentiate between main content and boilerplate in web pages
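Stop word density, the metric these roles rely on, can be sketched as the fraction of tokens in a block that appear in a stop word list. The `stop_word_density` function and the sample list below are illustrative assumptions, not Curator's actual implementation.

```python
def stop_word_density(text: str, stop_words: frozenset) -> float:
    """Fraction of whitespace-separated tokens that are stop words (0.0 for empty text)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in stop_words)
    return hits / len(tokens)


# Hypothetical sample list; real lists contain many more entries.
sample_stop_words = frozenset({"the", "is", "of", "a", "and", "to", "in"})

prose = "the history of the city is a story of trade and migration"
menu = "Home | Products | Contact | Login"

prose_density = stop_word_density(prose, sample_stop_words)  # 7 of 12 tokens match
menu_density = stop_word_density(menu, sample_stop_words)    # no tokens match
print(prose_density, menu_density)
```

Running prose scores high while the navigation-style string scores zero, which is why density is a useful signal for separating main content from boilerplate.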
Available Stop Word Lists
Curator leverages the extensive stop word collection from JusText for most languages. Curator also provides custom stop word lists for three languages not covered by JusText: Chinese, Japanese, and Thai.
These stop word lists use Python frozen sets for fast membership checks and immutability.
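The snippet below shows why a frozen set suits this purpose. The list contents are a small hypothetical sample, not the contents of any Curator module.

```python
# Hypothetical sample; it does not reproduce any of Curator's actual lists.
sample_zh_stop_words = frozenset(["一个", "不是", "他们"])

# Membership checks are hash-based, so they take constant time on average.
print("他们" in sample_zh_stop_words)  # True
print("数据" in sample_zh_stop_words)  # False

# Frozen sets are immutable: they expose no add() or remove() methods.
try:
    sample_zh_stop_words.add("我们")
except AttributeError:
    print("frozenset cannot be modified in place")
```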
Chinese Stop Words
Chinese stop words in zh_stopwords.py include common function words and punctuation, for example “一个” (one), “不是” (isn’t), and “他们” (they).
Japanese Stop Words
Japanese stop words in ja_stopwords.py include common Japanese function words such as “あそこ” (there), “これ” (this), and “ます” (a polite verb ending).
Thai Stop Words
Thai stop words are available in th_stopwords.py, including common Thai words like “กล่าว” (to say), “การ” (a prefix that forms nouns from verbs), and “ของ” (of).
How Curator Uses Stop Words in Text Extraction
Stop words are a critical component in Curator’s text extraction algorithms. Here’s how they’re used in different extractors:
JusText Extractor
The JusText algorithm uses stop word density to classify text blocks as main content or boilerplate:
- Context-Free Classification: The algorithm classifies text blocks with a high density of stop words as “good” (main content)
- Parameter Customization: You can customize the stop word density thresholds via the stopwords_low and stopwords_high parameters
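The context-free step can be sketched roughly as follows. This is a simplified reading of the published jusText algorithm, not Curator's code: it omits the link-density and copyright checks, the `classify_block` function is hypothetical, and the default values shown (0.30 and 0.32 for the density thresholds, 70 and 200 characters for the length thresholds) are assumed from jusText's defaults and may differ in your version.

```python
def classify_block(
    length: int,
    stopword_density: float,
    stopwords_low: float = 0.30,
    stopwords_high: float = 0.32,
    length_low: int = 70,
    length_high: int = 200,
) -> str:
    """Simplified context-free classification of one text block."""
    if length < length_low:
        return "short"  # too short to judge on its own
    if stopword_density >= stopwords_high:
        # Dense in stop words: long blocks are main content, shorter ones probably are.
        return "good" if length > length_high else "neargood"
    if stopword_density >= stopwords_low:
        return "neargood"
    return "bad"  # few stop words: likely boilerplate


print(classify_block(length=450, stopword_density=0.45))  # good
print(classify_block(length=120, stopword_density=0.45))  # neargood
print(classify_block(length=450, stopword_density=0.05))  # bad
```

In this sketch, raising `stopwords_high` makes the classifier stricter about what counts as main content, while lowering `stopwords_low` lets more borderline blocks through as "neargood".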
Other Extractors
The ResiliparseExtractor and TrafilaturaExtractor also use stop word density to filter extracted content.
Special Handling for Non-Spaced Languages
Languages like Thai, Chinese, Japanese, and Korean don’t use spaces between words, which affects how the system calculates stop word density. Curator identifies these languages via the NON_SPACED_LANGUAGES constant.
For these languages, the extractor applies special handling:
- Stopword densities are still computed and used for paragraph classification.
- The final is_boilerplate filter is disabled to avoid over-removal in non-spaced scripts.
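The handling above might look like the following sketch. The `keep_paragraph` helper and the uppercase language keys are assumptions made for illustration; only the `NON_SPACED_LANGUAGES` name and the four languages come from the text above.

```python
# Assumed representation; the actual constant in Curator may use different keys.
NON_SPACED_LANGUAGES = frozenset(["THAI", "CHINESE", "JAPANESE", "KOREAN"])


def keep_paragraph(language: str, is_boilerplate: bool) -> bool:
    """Decide whether an already-classified paragraph survives the final filter."""
    if language.upper() in NON_SPACED_LANGUAGES:
        # The final boilerplate filter is skipped to avoid over-removal.
        return True
    return not is_boilerplate


print(keep_paragraph("JAPANESE", is_boilerplate=True))  # True
print(keep_paragraph("ENGLISH", is_boilerplate=True))   # False
```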
Creating Custom Stop Word Lists
You can create and use your own stop word lists when processing text with Curator.
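A minimal sketch of the list-building step in plain Python follows. It does not show how the lists are handed to Curator, and the `base_stop_words` sample, the domain terms, and the `custom_stop_lists` mapping keyed by language name are all hypothetical.

```python
# Hypothetical base list; in practice you would start from an existing stop word list.
base_stop_words = frozenset({"the", "is", "at", "which", "on", "of", "and"})

# Domain-specific terms that behave like stop words in, say, a corpus of legal text.
legal_terms = {"hereby", "herein", "thereof"}

# Union produces a new frozenset, leaving the base list untouched.
custom_english = base_stop_words | frozenset(legal_terms)

# One list per language, keyed by an assumed language name.
custom_stop_lists = {"ENGLISH": custom_english}

print("hereby" in custom_stop_lists["ENGLISH"])  # True
print("hereby" in base_stop_words)               # False
```

Keeping the custom list as a frozen set preserves the fast membership checks and immutability of the built-in lists.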
Performance Considerations
- Stop word lists use frozen sets for fast membership checks (O(1) on average)
- Using appropriate stop word lists can improve extraction quality
- For specialized domains, consider customizing the stop word lists