Stop words are common words that are often filtered out in natural language processing (NLP) tasks because they typically don’t carry significant meaning. Curator provides built-in stop word lists for several languages to support text analysis and extraction processes.
Studies of stop word lists and their distribution across text corpora have shown that typical English text contains 30–40% stop words.
Stop words are high-frequency words that generally don’t contribute much semantic value to text analysis. Examples in English include “the,” “is,” “at,” “which,” and “on.” These words appear so frequently in language that they can distort text processing tasks if not properly managed.
Key characteristics of stop words:
In Curator, stop words play several important roles:
Curator leverages the extensive stop word collection from JusText for most languages. Curator also provides custom stop word lists for the following languages not covered by JusText:
These stop word lists use Python frozen sets for fast membership checks and immutability.
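The benefit of a frozen set can be sketched in a few lines. The set below is a tiny illustrative subset built from the English examples above, not the contents of any Curator list, and `is_stop_word` is a hypothetical helper name.

```python
# Illustrative subset only; not the contents of any Curator stop word list.
EXAMPLE_STOP_WORDS = frozenset({"the", "is", "at", "which", "on"})


def is_stop_word(token: str) -> bool:
    """Case-insensitive membership check against the example set."""
    return token.lower() in EXAMPLE_STOP_WORDS


print(is_stop_word("The"))     # True
print(is_stop_word("corpus"))  # False

# A frozenset exposes no mutating methods, so the list cannot be
# changed by accident after it is defined.
print(hasattr(EXAMPLE_STOP_WORDS, "add"))  # False
```

Membership checks on a set are hash lookups, so their cost stays roughly constant as the list grows, which matters when every token in a large corpus is tested.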
Chinese stop words in zh_stopwords.py include common function words and punctuation, for example “一个” (one), “不是” (isn’t), and “他们” (they).
Japanese stop words in ja_stopwords.py include common Japanese function words such as “あそこ” (there), “これ” (this), and “ます” (a polite verb ending).
Thai stop words are available in th_stopwords.py, including common Thai words like “กล่าว” (to say), “การ” (a prefix that turns verbs into nouns), and “ของ” (of).
Stop words are a critical component in Curator’s text extraction algorithms. Here’s how they’re used in different extractors:
The JusText algorithm uses stop word density to classify text blocks as main content or boilerplate:
The stopwords_low and stopwords_high parameters set the density thresholds used for this classification.

The ResiliparseExtractor and TrafilaturaExtractor also use stop word density to filter extracted content:
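Stop word density itself is simple to compute. The following is a minimal sketch of the idea, assuming whitespace tokenization; it is not the implementation of any of the extractors named above, which also weigh other signals. The function names, the example stop word set, and the labels are hypothetical, and the `low` and `high` defaults are placeholder values standing in for thresholds such as stopwords_low and stopwords_high.

```python
import string

# Illustrative subset only; not a Curator stop word list.
EXAMPLE_STOP_WORDS = frozenset({"the", "is", "at", "which", "on"})


def stop_word_density(text: str, stop_words: frozenset) -> float:
    """Fraction of whitespace-separated tokens that are stop words."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in stop_words)
    return hits / len(tokens)


def classify_by_density(density: float, low: float = 0.30, high: float = 0.32) -> str:
    """Map a density to a coarse label using two placeholder thresholds."""
    if density >= high:
        return "likely content"
    if density >= low:
        return "uncertain"
    return "likely boilerplate"


sentence = "The cat is on the mat."
nav_bar = "Home | Products | Contact | Login"

print(classify_by_density(stop_word_density(sentence, EXAMPLE_STOP_WORDS)))
print(classify_by_density(stop_word_density(nav_bar, EXAMPLE_STOP_WORDS)))
```

Running prose tends to score high because function words hold sentences together, while navigation bars and link lists contain almost none.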
Languages like Thai, Chinese, Japanese, and Korean don’t use spaces between words, which affects how the system calculates stop word density. Curator identifies these languages via the NON_SPACED_LANGUAGES constant:
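To see why spacing matters, the sketch below compares whitespace tokenization on an English sentence and a Chinese one. The substring check at the end is only one possible way to find stop words without token boundaries and is an assumption for illustration, not a description of Curator's handling. The stop words are the Chinese examples quoted earlier, and `stop_words_present` is a hypothetical helper.

```python
# Chinese examples quoted earlier in this page; not the full zh list.
ZH_EXAMPLE_STOP_WORDS = frozenset({"一个", "不是", "他们"})

english = "the cat is on the mat"
chinese = "他们不是一个团队"  # "They are not a team"

# Whitespace splitting finds word boundaries in English...
print(len(english.split()))  # 6
# ...but treats the whole Chinese sentence as a single token.
print(len(chinese.split()))  # 1


def stop_words_present(text: str, stop_words: frozenset) -> list:
    """Return the stop words that occur anywhere in the text as substrings."""
    return sorted(w for w in stop_words if w in text)


print(len(stop_words_present(chinese, ZH_EXAMPLE_STOP_WORDS)))  # 3
```

A token-based density for the Chinese sentence would be computed over one token, so it would say nothing useful about how many stop words the text contains.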
For these languages, the extractor applies special handling:
is_boilerplate filter to avoid over-removal in non-spaced scripts.

You can create and use your own stop word lists when processing text with Curator:
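As a minimal sketch of the first half of that workflow, a custom list can be built as a frozen set, matching the format described for the built-in lists. All names and words below are hypothetical, and the example deliberately stops short of passing the list to a Curator extractor, since that call is not shown here.

```python
# Hypothetical base list and extra terms for a legal-text corpus.
BASE_STOP_WORDS = frozenset({"the", "is", "at", "which", "on"})
EXTRA_STOP_WORDS = frozenset({"hereby", "thereof", "herein"})

# The union of two frozensets is itself a frozenset.
CUSTOM_STOP_WORDS = BASE_STOP_WORDS | EXTRA_STOP_WORDS


def remove_stop_words(text: str, stop_words: frozenset) -> list:
    """Return lowercase whitespace tokens that are not stop words."""
    return [t for t in text.lower().split() if t not in stop_words]


print(remove_stop_words("The party hereby agrees", CUSTOM_STOP_WORDS))
# ['party', 'agrees']
```

Keeping the base list and the additions in separate sets makes it easy to see which entries are corpus-specific and to drop them again later.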