nemo_curator.stages.text.filters.heuristic.string


Module Contents

Classes

BoilerPlateStringFilter: If more than 40% of paragraphs contain boilerplate strings, then discard.
BulletsFilter: If more than 90% of the lines start with a bullet, then discard.
CommonEnglishWordsFilter: If the sentence contains at least 2 common English words, then keep it.
EllipsisFilter: If more than 30% of the sentences end with an ellipsis, then discard.
LongWordFilter: If the document contains a word longer than 1000 characters, then discard.
MeanWordLengthFilter: If the mean word length is not in a specified range, then discard.
NonAlphaNumericFilter: If more than 25% of the document is non-alphanumeric, then discard.
NumbersFilter: If more than 15% of the document contains numbers, then discard.
ParenthesesFilter: If more than 10% of the sentence is in parentheses, then discard.
PornographicUrlsFilter: Check if any of the URLs within the document point to pornography.
PunctuationFilter: If more than 85% of the sentences do not end with a punctuation mark, then discard.
SubstringFilter: Keeps documents that contain a substring in a given position.
SymbolsToWordsFilter: Remove any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis.
UrlsFilter: If more than 20% of the document is comprised of URLs, then discard.
WhiteSpaceFilter: If the document contains a significant number of white space characters, then discard.
WordCountFilter: If a document contains a number of words not within a specified range, then discard.
WordsWithoutAlphabetsFilter: 80% of words in a document must contain at least one alphabetic character.

API

class nemo_curator.stages.text.filters.heuristic.string.BoilerPlateStringFilter(
remove_if_at_top_or_bottom: bool = True,
max_boilerplate_string_ratio: float = 0.4
)

Bases: DocumentFilter

If more than 40% of paragraphs contain boilerplate strings, then discard. This includes things like “terms of use”, “privacy policy”, etc. Source: Adapted significantly from Google C4 processing.

_boilerplate_paragraph_indices = []
_max_ratio = 1.0
_name = 'boilerplate_string_ratio'
nemo_curator.stages.text.filters.heuristic.string.BoilerPlateStringFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.BoilerPlateStringFilter.score_document(
text: str
) -> float
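The heuristic can be sketched roughly as follows. This is an illustrative approximation, not the library's implementation, and the boilerplate phrase list here is a hypothetical stand-in for whatever list the filter actually uses:

```python
# Illustrative approximation of the boilerplate-paragraph heuristic.
# The phrase list below is a hypothetical stand-in, not the library's list.
BOILERPLATE_PHRASES = ("terms of use", "privacy policy", "cookie policy", "all rights reserved")

def boilerplate_ratio(text: str) -> float:
    # Treat blank-line-separated blocks as paragraphs.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    hits = sum(1 for p in paragraphs if any(b in p.lower() for b in BOILERPLATE_PHRASES))
    return hits / len(paragraphs)

def keep_boilerplate(text: str, max_ratio: float = 0.4) -> bool:
    return boilerplate_ratio(text) <= max_ratio
```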
class nemo_curator.stages.text.filters.heuristic.string.BulletsFilter(
max_bullet_lines_ratio: float = 0.9
)

Bases: DocumentFilter

If more than 90% of the lines start with a bullet, then discard. Source: Gopher (Rae et al., 2021)

_name = 'bullet_ratio'
nemo_curator.stages.text.filters.heuristic.string.BulletsFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.BulletsFilter.score_document(
text: str
) -> float
class nemo_curator.stages.text.filters.heuristic.string.CommonEnglishWordsFilter(
min_num_common_words: int = 2,
stop_at_false: bool = True
)

Bases: DocumentFilter

If the sentence contains at least 2 common English words, then keep it. NOTE: We purposefully check for the lowercase versions of those common words to remove documents with over-capitalization.

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

_name = 'common_english_words'
_word_splitter = get_word_splitter('en')
nemo_curator.stages.text.filters.heuristic.string.CommonEnglishWordsFilter.keep_document(
score: int
) -> bool
nemo_curator.stages.text.filters.heuristic.string.CommonEnglishWordsFilter.score_document(
text: str
) -> int
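A rough sketch of the idea, assuming whitespace tokenization (the real filter uses a larger common-word list and a language-aware word splitter):

```python
# Illustrative approximation; the word list is a small hypothetical sample.
COMMON_ENGLISH_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def common_word_count(text: str) -> int:
    # Only lowercase forms are counted, so over-capitalized text scores low.
    return sum(1 for word in text.split() if word in COMMON_ENGLISH_WORDS)

def keep_common_words(text: str, min_num_common_words: int = 2) -> bool:
    return common_word_count(text) >= min_num_common_words
```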
class nemo_curator.stages.text.filters.heuristic.string.EllipsisFilter(
max_num_lines_ending_with_ellipsis_ratio: float = 0.3
)

Bases: DocumentFilter

If more than 30% of the sentences end with an ellipsis, then discard. Source: Google C4 processing

_name = 'ellipsis'
nemo_curator.stages.text.filters.heuristic.string.EllipsisFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.EllipsisFilter.score_document(
text: str
) -> float
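A minimal sketch of the heuristic, treating each non-empty line as a sentence (the library's sentence handling may differ):

```python
# Recognize both the three-dot form and the Unicode ellipsis character.
ELLIPSIS_ENDINGS = ("...", "\u2026")

def ellipsis_line_ratio(text: str) -> float:
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if line.endswith(ELLIPSIS_ENDINGS))
    return hits / len(lines)

def keep_ellipsis(text: str, max_ratio: float = 0.3) -> bool:
    return ellipsis_line_ratio(text) <= max_ratio
```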
class nemo_curator.stages.text.filters.heuristic.string.LongWordFilter(
max_word_length: int = 1000,
lang: str = 'en'
)

Bases: DocumentFilter

If the document contains a word longer than 1000 characters, then discard. NOTE: This seems to be catching things like minified .js files that don’t have spaces anywhere. Source: C4 (Google)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

_name = 'max_word_length'
_word_splitter = get_word_splitter(lang)
nemo_curator.stages.text.filters.heuristic.string.LongWordFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.LongWordFilter.score_document(
text: str
) -> float
class nemo_curator.stages.text.filters.heuristic.string.MeanWordLengthFilter(
min_mean_word_length: int = 3,
max_mean_word_length: int = 10,
lang: str = 'en'
)

Bases: DocumentFilter

If the mean word length is not in a specified range, then discard.

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

_name = 'mean_word_length'
_word_splitter = get_word_splitter(lang)
nemo_curator.stages.text.filters.heuristic.string.MeanWordLengthFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.MeanWordLengthFilter.score_document(
text: str
) -> float
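The range check can be sketched as follows, assuming whitespace-separated words (the simple English case; the library delegates splitting to its word splitter):

```python
def mean_word_length(text: str) -> float:
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

def keep_mean_word_length(text: str, min_mean: int = 3, max_mean: int = 10) -> bool:
    return min_mean <= mean_word_length(text) <= max_mean
```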
class nemo_curator.stages.text.filters.heuristic.string.NonAlphaNumericFilter(
max_non_alpha_numeric_to_text_ratio: float = 0.25
)

Bases: DocumentFilter

If more than 25% of the document is non-alphanumeric, then discard. Intended to be applied only to English text. Source: Adapted from Gopher (Rae et al., 2021)

_name = 'alpha_numeric'
nemo_curator.stages.text.filters.heuristic.string.NonAlphaNumericFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.NonAlphaNumericFilter.score_document(
text: str
) -> float
class nemo_curator.stages.text.filters.heuristic.string.NumbersFilter(
max_number_to_text_ratio: float = 0.15
)

Bases: DocumentFilter

If more than 15% of the document contains numbers, then discard.

_name = 'numbers_ratio'
nemo_curator.stages.text.filters.heuristic.string.NumbersFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.NumbersFilter.score_document(
text: str
) -> float
class nemo_curator.stages.text.filters.heuristic.string.ParenthesesFilter(
max_parentheses_ratio: float = 0.1
)

Bases: DocumentFilter

If more than 10% of the sentence is in parentheses, then discard.

_name = 'parentheses_ratio'
nemo_curator.stages.text.filters.heuristic.string.ParenthesesFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.ParenthesesFilter.score_document(
text: str
) -> float
class nemo_curator.stages.text.filters.heuristic.string.PornographicUrlsFilter()

Bases: DocumentFilter

Check if any of the URLs within the document point to pornography.

nemo_curator.stages.text.filters.heuristic.string.PornographicUrlsFilter.keep_document(
score: int
) -> bool
nemo_curator.stages.text.filters.heuristic.string.PornographicUrlsFilter.score_document(
text: str
) -> int
class nemo_curator.stages.text.filters.heuristic.string.PunctuationFilter(
max_num_sentences_without_endmark_ratio: float = 0.85
)

Bases: DocumentFilter

If more than 85% of the sentences do not end with a punctuation mark, then discard. Source: Google C4 processing

_name = 'punctuation'
nemo_curator.stages.text.filters.heuristic.string.PunctuationFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.PunctuationFilter.score_document(
text: str
) -> float
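A rough sketch of the heuristic, using a naive line-based split as a stand-in for real sentence splitting (the terminal-mark set here is an assumption):

```python
# Assumed terminal punctuation marks; the library's set may differ.
TERMINAL_MARKS = (".", "!", "?", '"')

def no_endmark_ratio(text: str) -> float:
    # Naive stand-in for sentence splitting: each non-empty line is a sentence.
    sentences = [s.strip() for s in text.splitlines() if s.strip()]
    if not sentences:
        return 0.0
    misses = sum(1 for s in sentences if not s.endswith(TERMINAL_MARKS))
    return misses / len(sentences)

def keep_punctuation(text: str, max_ratio: float = 0.85) -> bool:
    return no_endmark_ratio(text) <= max_ratio
```

This catches pages that are mostly navigation fragments rather than prose.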
class nemo_curator.stages.text.filters.heuristic.string.SubstringFilter(
substring: str,
position: typing.Literal['prefix', 'suffix', 'any']
)

Bases: DocumentFilter

Keeps documents that contain a substring in a given position. Gives a score of 1 if the substring is found in the given position, otherwise 0.

nemo_curator.stages.text.filters.heuristic.string.SubstringFilter.keep_document(
score: int
) -> bool
nemo_curator.stages.text.filters.heuristic.string.SubstringFilter.score_document(
text: str
) -> int
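The score/keep split can be sketched as follows (an illustrative approximation of the documented behavior):

```python
def substring_score(text: str, substring: str, position: str) -> int:
    # position is one of 'prefix', 'suffix', or 'any'.
    if position == "prefix":
        found = text.startswith(substring)
    elif position == "suffix":
        found = text.endswith(substring)
    else:  # 'any'
        found = substring in text
    return int(found)

def keep_substring(score: int) -> bool:
    # Score of 1 means the substring was found in the requested position.
    return score == 1
```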
class nemo_curator.stages.text.filters.heuristic.string.SymbolsToWordsFilter(
max_symbol_to_word_ratio: float = 0.1,
lang: str = 'en'
)

Bases: DocumentFilter

Remove any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis. Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

_name = 'symbol_to_word'
_word_splitter = get_word_splitter(lang)
nemo_curator.stages.text.filters.heuristic.string.SymbolsToWordsFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.SymbolsToWordsFilter.score_document(
text: str
) -> float
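A rough sketch of the ratio, assuming whitespace tokenization and simple substring counting (the exact symbol accounting in the library may differ):

```python
# Symbols counted against the word total, per the filter's description.
SYMBOLS = ("#", "...")

def symbol_to_word_ratio(text: str) -> float:
    words = text.split()
    if not words:
        return 0.0
    symbol_count = sum(text.count(symbol) for symbol in SYMBOLS)
    return symbol_count / len(words)

def keep_symbols(text: str, max_ratio: float = 0.1) -> bool:
    return symbol_to_word_ratio(text) <= max_ratio
```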
class nemo_curator.stages.text.filters.heuristic.string.UrlsFilter(
max_url_to_text_ratio: float = 0.2
)

Bases: DocumentFilter

If more than 20% of the document is comprised of URLs, then discard.

_name = 'urls_ratio'
nemo_curator.stages.text.filters.heuristic.string.UrlsFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.UrlsFilter.score_document(
text: str
) -> float
class nemo_curator.stages.text.filters.heuristic.string.WhiteSpaceFilter(
max_white_space_ratio: float = 0.25
)

Bases: DocumentFilter

If the document contains a significant number of white space characters, then discard.

_name = 'white_space'
nemo_curator.stages.text.filters.heuristic.string.WhiteSpaceFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.WhiteSpaceFilter.score_document(
text: str
) -> float
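The whitespace-ratio check can be sketched as follows (an approximation; the threshold shown mirrors the constructor default above):

```python
def white_space_ratio(text: str) -> float:
    # Fraction of characters that are whitespace (spaces, tabs, newlines).
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)

def keep_white_space(text: str, max_ratio: float = 0.25) -> bool:
    return white_space_ratio(text) <= max_ratio
```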
class nemo_curator.stages.text.filters.heuristic.string.WordCountFilter(
min_words: int = 50,
max_words: int = 100000,
lang: str = 'en'
)

Bases: DocumentFilter

If a document contains a number of words not within a specified range, then discard.

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

_name = 'word_count'
_word_splitter = get_word_splitter(lang)
nemo_curator.stages.text.filters.heuristic.string.WordCountFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.WordCountFilter.score_document(
text: str
) -> float
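A minimal sketch mirroring the score_document/keep_document split, assuming whitespace-separated words (the simple English case; the library uses a language-aware splitter):

```python
def score_word_count(text: str) -> int:
    # Whitespace split stands in for the library's word splitter.
    return len(text.split())

def keep_word_count(score: int, min_words: int = 50, max_words: int = 100000) -> bool:
    return min_words <= score <= max_words
```

The two-step pattern (score first, then threshold) is the same shape every filter on this page follows.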
class nemo_curator.stages.text.filters.heuristic.string.WordsWithoutAlphabetsFilter(
min_words_with_alphabets: float = 0.8,
lang: str = 'en'
)

Bases: DocumentFilter

80% of words in a document must contain at least one alphabetic character. Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

_name = 'words_without_alphabets'
_word_splitter = get_word_splitter(lang)
nemo_curator.stages.text.filters.heuristic.string.WordsWithoutAlphabetsFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.string.WordsWithoutAlphabetsFilter.score_document(
text: str
) -> float