nemo_curator.stages.text.filters.heuristic.string
Module Contents
Classes
API
Bases: DocumentFilter
If more than 40% of paragraphs contain boilerplate strings, then discard. This includes things like “terms of use”, “privacy policy”, etc. Source: Adapted significantly from Google C4 processing.
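The rule above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: the function names, the blank-line paragraph split, and the two-entry boilerplate list are assumptions made for the example.

```python
# Illustrative sketch; names and the boilerplate list are hypothetical.
BOILERPLATE_STRINGS = ("terms of use", "privacy policy")  # assumed sample list


def boilerplate_paragraph_ratio(text: str) -> float:
    """Fraction of paragraphs that contain a boilerplate string."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    hits = sum(
        any(s in p.lower() for s in BOILERPLATE_STRINGS) for p in paragraphs
    )
    return hits / len(paragraphs)


def keep_by_boilerplate(text: str, max_ratio: float = 0.4) -> bool:
    """Keep the document unless more than max_ratio of paragraphs match."""
    return boilerplate_paragraph_ratio(text) <= max_ratio
```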
Bases: DocumentFilter
If more than 90% of the lines start with a bullet, then discard. Source: Gopher (Rae et al., 2021)
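A minimal sketch of this check, assuming a small hypothetical set of bullet characters and treating each non-empty line as one line of the document. The names are illustrative and are not the library's API.

```python
# Illustrative sketch; the bullet set and function names are assumptions.
BULLET_PREFIXES = ("•", "-", "*", "‣", "⁃")


def bullet_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that start with a bullet character."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln.startswith(BULLET_PREFIXES) for ln in lines) / len(lines)


def keep_by_bullets(text: str, max_ratio: float = 0.9) -> bool:
    """Keep the document unless more than max_ratio of lines are bullets."""
    return bullet_line_ratio(text) <= max_ratio
```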
Bases: DocumentFilter
If the sentence contains at least 2 common English words, then keep it. NOTE: We purposefully check for the lowercase versions of those common words to remove documents with over-capitalization.
For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
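The case-sensitive matching described in the NOTE can be illustrated as follows. This sketch assumes whitespace-separated words and a short, hypothetical list of common words; it does not reproduce the library's word list or its handling of Chinese and Japanese.

```python
# Illustrative sketch; the word list and function names are assumptions.
COMMON_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def common_word_count(text: str) -> int:
    """Count words that exactly match a lowercase common word."""
    # No lowercasing on purpose: "THE" does not match "the".
    return sum(word in COMMON_WORDS for word in text.split())


def keep_by_common_words(text: str, min_count: int = 2) -> bool:
    """Keep the text if it has at least min_count common words."""
    return common_word_count(text) >= min_count
```

With this sketch, a fully capitalized text scores zero and is discarded, which is the over-capitalization effect the NOTE describes.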
Bases: DocumentFilter
If more than 30% of the sentences end with an ellipsis, then discard. Source: Google C4 processing
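A rough sketch of the ratio involved. For simplicity it assumes each non-empty line is one sentence, which is a stand-in for a real sentence splitter, and the function names are hypothetical.

```python
# Illustrative sketch; line-based sentence splitting is an assumption.
def ellipsis_sentence_ratio(text: str) -> float:
    """Fraction of sentences (here: non-empty lines) ending in an ellipsis."""
    sentences = [s.strip() for s in text.splitlines() if s.strip()]
    if not sentences:
        return 0.0
    # Accept both three ASCII dots and the single ellipsis character.
    ending = sum(s.endswith(("...", "…")) for s in sentences)
    return ending / len(sentences)


def keep_by_ellipsis(text: str, max_ratio: float = 0.3) -> bool:
    """Keep the document unless more than max_ratio of sentences match."""
    return ellipsis_sentence_ratio(text) <= max_ratio
```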
Bases: DocumentFilter
If the document contains a word longer than 1000 characters, then discard.
NOTE: This appears to catch content such as minified .js files, which contain no spaces.
Source: C4 (Google)
For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
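For space-separated text, the long-word check might look like the sketch below. The function names are hypothetical, and the external splitting used for Chinese and Japanese is not shown.

```python
# Illustrative sketch for space-separated languages only.
def longest_word_length(text: str) -> int:
    """Length of the longest whitespace-separated word (0 if there is none)."""
    return max((len(word) for word in text.split()), default=0)


def keep_by_longest_word(text: str, max_length: int = 1000) -> bool:
    """Keep the document unless some word is longer than max_length."""
    return longest_word_length(text) <= max_length
```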
Bases: DocumentFilter
If the mean word length is not in a specified range, then discard.
For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
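A sketch of the mean-word-length check for space-separated text. The range is not stated above, so the bounds of 3 and 10 used here are placeholder assumptions, as are the function names.

```python
# Illustrative sketch; the default bounds are assumed placeholders.
def mean_word_length(text: str) -> float:
    """Average length of whitespace-separated words (0.0 if there are none)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def keep_by_mean_word_length(
    text: str, min_length: float = 3.0, max_length: float = 10.0
) -> bool:
    """Keep the document only if the mean word length is within the range."""
    return min_length <= mean_word_length(text) <= max_length
```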
Bases: DocumentFilter
If more than 25% of the document is non-alphanumeric, then discard. Intended to be applied only to English text. Source: Adapted from Gopher (Rae et al., 2021)
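One way to compute such a ratio is sketched below. It assumes that every character, including whitespace, counts toward the total and that anything failing `str.isalnum()` is non-alphanumeric; the library may define the character classes differently.

```python
# Illustrative sketch; the character classification is an assumption.
def non_alphanumeric_ratio(text: str) -> float:
    """Fraction of characters that are not alphanumeric."""
    if not text:
        return 0.0
    return sum(not ch.isalnum() for ch in text) / len(text)


def keep_by_non_alphanumeric(text: str, max_ratio: float = 0.25) -> bool:
    """Keep the document unless more than max_ratio of it is non-alphanumeric."""
    return non_alphanumeric_ratio(text) <= max_ratio
```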
Bases: DocumentFilter
If more than 85% of the sentences do not end with a punctuation mark, then discard. Source: Google C4 processing
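A rough sketch of this rule, assuming each non-empty line is one sentence and a hypothetical set of end marks. Neither assumption is taken from the library.

```python
# Illustrative sketch; end marks and line-based splitting are assumptions.
END_MARKS = (".", "!", "?", '"')


def unpunctuated_sentence_ratio(text: str) -> float:
    """Fraction of sentences (here: non-empty lines) without an end mark."""
    sentences = [s.strip() for s in text.splitlines() if s.strip()]
    if not sentences:
        return 0.0
    missing = sum(not s.endswith(END_MARKS) for s in sentences)
    return missing / len(sentences)


def keep_by_punctuation(text: str, max_ratio: float = 0.85) -> bool:
    """Keep the document unless more than max_ratio of sentences lack an end mark."""
    return unpunctuated_sentence_ratio(text) <= max_ratio
```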
Bases: DocumentFilter
Keeps documents that contain a substring in a given position. Gives a score of 1 if the substring is found in the given position, otherwise 0.
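The scoring can be sketched as a small function. The position names "prefix", "suffix", and "any" are assumptions chosen for illustration, and the function is not the library's class.

```python
# Illustrative sketch; position names and the function are hypothetical.
def substring_score(text: str, substring: str, position: str) -> int:
    """Return 1 if substring occurs at the given position in text, else 0."""
    if position == "prefix":
        found = text.startswith(substring)
    elif position == "suffix":
        found = text.endswith(substring)
    elif position == "any":
        found = substring in text
    else:
        raise ValueError(f"unknown position: {position!r}")
    return int(found)
```

A document would then be kept when its score is 1.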
Bases: DocumentFilter
Remove any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis. Source: Gopher (Rae et al., 2021)
For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
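For space-separated text, the two ratios could be computed as in the sketch below. The function names are hypothetical, and counting both "..." and "…" as an ellipsis is an assumption.

```python
# Illustrative sketch for space-separated languages only.
def symbol_to_word_ratios(text: str) -> tuple[float, float]:
    """Return (hash ratio, ellipsis ratio) relative to the word count."""
    n_words = len(text.split())
    if n_words == 0:
        return 0.0, 0.0
    hashes = text.count("#")
    ellipses = text.count("...") + text.count("…")
    return hashes / n_words, ellipses / n_words


def keep_by_symbol_ratio(text: str, max_ratio: float = 0.1) -> bool:
    """Keep the document unless either ratio exceeds max_ratio."""
    return max(symbol_to_word_ratios(text)) <= max_ratio
```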
Bases: DocumentFilter
If the document contains a significant number of whitespace characters, then discard.
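A sketch of one possible reading of this rule, measuring whitespace as a fraction of all characters. The threshold is not given above, so the 0.25 default is a placeholder assumption, as are the function names.

```python
# Illustrative sketch; the default threshold is an assumed placeholder.
def whitespace_ratio(text: str) -> float:
    """Fraction of characters that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def keep_by_whitespace(text: str, max_ratio: float = 0.25) -> bool:
    """Keep the document unless its whitespace ratio exceeds max_ratio."""
    return whitespace_ratio(text) <= max_ratio
```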
Bases: DocumentFilter
If a document contains a number of words not within a specified range, then discard.
For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
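A sketch of the word-count check for space-separated text. The range is not stated above, so the default bounds here are placeholder assumptions, and the function name is hypothetical.

```python
# Illustrative sketch; the default bounds are assumed placeholders.
def keep_by_word_count(
    text: str, min_words: int = 50, max_words: int = 100_000
) -> bool:
    """Keep the document only if its word count is within the range."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words
```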
Bases: DocumentFilter
80% of words in a document must contain at least one alphabetic character. Source: Gopher (Rae et al., 2021)
For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
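For space-separated text, the alphabetic-word requirement can be sketched as follows. The function names are hypothetical, and `str.isalpha()` is used as an assumed definition of "alphabetic character".

```python
# Illustrative sketch for space-separated languages only.
def alphabetic_word_ratio(text: str) -> float:
    """Fraction of words that contain at least one alphabetic character."""
    words = text.split()
    if not words:
        return 0.0
    with_alpha = sum(any(ch.isalpha() for ch in word) for word in words)
    return with_alpha / len(words)


def keep_by_alphabetic_words(text: str, min_ratio: float = 0.8) -> bool:
    """Keep the document if at least min_ratio of words have a letter."""
    return alphabetic_word_ratio(text) >= min_ratio
```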