nemo_curator.stages.text.filters.heuristic.repetition.repetition
Module Contents
Classes
API
Bases: DocumentFilter
Discards the document if removing duplicate lines reduces its number of lines by more than 20%. Source: Gopher (Rae et al., 2021)
Bases: DocumentFilter
Discards the document if removing duplicate lines reduces its number of lines by more than 30%. Source: Gopher (Rae et al., 2021)
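As a rough illustration of the duplicate-line rule, the sketch below computes the fraction of lines that survive deduplication and compares it with a threshold. The names `unique_line_fraction`, `keep_document`, and `min_unique_fraction` are hypothetical and not part of the NeMo Curator API; splitting on newlines, ignoring blank lines, and exact string matching are assumptions made for the example.

```python
def unique_line_fraction(text: str) -> float:
    """Fraction of non-blank lines left after duplicate lines are removed."""
    # Assumption: lines are separated by "\n" and blank lines are ignored.
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return 1.0  # nothing to deduplicate, so nothing shrinks
    return len(set(lines)) / len(lines)


def keep_document(text: str, min_unique_fraction: float = 0.7) -> bool:
    """Keep the document unless it shrinks by more than 1 - min_unique_fraction."""
    # 0.7 corresponds to the "> 30%" rule; 0.8 would correspond to "> 20%".
    return unique_line_fraction(text) >= min_unique_fraction
```

With the default threshold, a document whose lines are `a`, `b`, `a`, `a` keeps only half of its lines after deduplication and would be discarded.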
Bases: DocumentFilter
Discards the document if removing duplicate paragraphs reduces its number of lines by more than 10%. Source: Gopher (Rae et al., 2021)
Bases: DocumentFilter
Discards the document if removing duplicate paragraphs reduces its number of lines by more than 30%. Source: Gopher (Rae et al., 2021)
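The paragraph-level rule can be sketched in the same spirit. The example below removes repeated paragraphs and reports how many lines remain relative to the original. The function name `line_fraction_after_paragraph_dedup` is hypothetical, and treating blank lines as paragraph separators is an assumption; the library may delimit or count paragraphs differently.

```python
import re


def line_fraction_after_paragraph_dedup(text: str) -> float:
    """Fraction of lines left after repeated paragraphs are removed."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    total_lines = sum(len(p.split("\n")) for p in paragraphs)
    if total_lines == 0:
        return 1.0
    seen = set()
    kept_lines = 0
    for paragraph in paragraphs:
        if paragraph not in seen:  # keep only the first occurrence
            seen.add(paragraph)
            kept_lines += len(paragraph.split("\n"))
    return kept_lines / total_lines
```

Under the 30% rule, a document would be discarded when this fraction falls below 0.7; under the 10% rule, when it falls below 0.9.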
Bases: DocumentFilter
Discards the document if removing all duplicate n-grams reduces its number of characters by more than x%. Source: Gopher (Rae et al., 2021)
For Chinese and Japanese text, we use external libraries to split the text into words, because these languages do not separate words with spaces. For all other languages, such as English, we assume words are separated by spaces.
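One way to picture the duplicate n-gram measure is the minimal sketch below, which covers only space-separated languages. It marks every word that belongs to a repeated occurrence of an n-gram (an occurrence after the first) and counts each marked word's characters once, even when repeated n-grams overlap. The function name `duplicate_ngram_char_fraction` is hypothetical, and this counting scheme is an assumption rather than the library's exact procedure.

```python
def duplicate_ngram_char_fraction(text: str, n: int = 2) -> float:
    """Fraction of word characters covered by repeated n-gram occurrences."""
    words = text.split()  # assumption: words are separated by whitespace
    total_chars = sum(len(word) for word in words)
    if total_chars == 0 or len(words) < n:
        return 0.0
    seen = set()
    duplicated = [False] * len(words)
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i + n])
        if ngram in seen:
            # Mark the words of this repeated occurrence; overlapping
            # n-grams share words, so each word is counted at most once.
            for j in range(i, i + n):
                duplicated[j] = True
        else:
            seen.add(ngram)
    dup_chars = sum(len(w) for w, is_dup in zip(words, duplicated) if is_dup)
    return dup_chars / total_chars
```

A document would then be discarded when this fraction exceeds the chosen threshold (the x% in the description above).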
Bases: DocumentFilter
Discards the document if removing the top (most frequent) n-grams reduces its number of characters by more than x%. Source: Gopher (Rae et al., 2021)
For Chinese and Japanese text, we use external libraries to split the text into words, because these languages do not separate words with spaces. For all other languages, such as English, we assume words are separated by spaces.
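The top n-gram measure can be approximated as follows for space-separated text. The sketch finds the single most frequent n-gram and estimates the share of word characters its occurrences account for. The function name `top_ngram_char_fraction` is hypothetical; ignoring whitespace in the character count and capping the result at 1.0 (overlapping occurrences can otherwise be counted more than once) are assumptions of this example.

```python
from collections import Counter


def top_ngram_char_fraction(text: str, n: int = 2) -> float:
    """Share of word characters attributed to the most frequent n-gram."""
    words = text.split()  # assumption: words are separated by whitespace
    total_chars = sum(len(word) for word in words)
    if total_chars == 0 or len(words) < n:
        return 0.0
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    ngram, count = counts.most_common(1)[0]
    if count < 2:
        return 0.0  # no n-gram repeats, so nothing would be removed
    ngram_chars = sum(len(word) for word in ngram)
    # Overlapping occurrences may be double counted, hence the cap.
    return min(1.0, count * ngram_chars / total_chars)
```

For the text `ab cd ab cd ef` with `n=2`, the bigram `(ab, cd)` occurs twice and covers 8 of the 10 word characters.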