nemo_curator.stages.text.filters.heuristic.repetition.repetition

Module Contents

Classes

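| Name | Description |
| --- | --- |
| RepeatedLinesByCharFilter | If the document shrinks by > 20% in terms of number of characters after removing duplicate lines, then discard. |
| RepeatedLinesFilter | If the document shrinks by > 30% in terms of number of lines after removing duplicate lines, then discard. |
| RepeatedParagraphsByCharFilter | If the document shrinks by > 20% in terms of number of characters after removing duplicate paragraphs, then discard. |
| RepeatedParagraphsFilter | If the document shrinks by > 30% in terms of number of paragraphs after removing duplicate paragraphs, then discard. |
| RepeatingDuplicateNGramsFilter | If the document shrinks by > x% in terms of number of characters after removing all duplicate n-grams, then discard. |
| RepeatingTopNGramsFilter | If the document shrinks by > x% in terms of number of characters after removing the top n-grams, then discard. |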

API

class nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedLinesByCharFilter(
max_repeated_lines_char_ratio: float = 0.8
)

Bases: DocumentFilter

If the document shrinks by > 20% in terms of number of characters after removing duplicate lines, then discard. The 20% figure corresponds to the default max_repeated_lines_char_ratio of 0.8. Source: Gopher (Rae et al., 2021)
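The character-based check can be illustrated without the library. This is a minimal sketch, not the NeMo Curator implementation: the helper names, the newline-based line splitting, and the rule that a document is kept when its score is at least the threshold are assumptions made for illustration.

```python
def repeated_lines_char_score(text: str) -> float:
    """Fraction of characters that survive after dropping duplicate lines."""
    # Assumption: lines are newline-separated and blank lines are ignored.
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return 1.0  # nothing to deduplicate
    unique_lines = dict.fromkeys(lines)  # first copy of each line, in order
    return sum(len(line) for line in unique_lines) / sum(len(line) for line in lines)


def keep_by_char_ratio(score: float, max_repeated_lines_char_ratio: float = 0.8) -> bool:
    # Assumption: keep the document when at least 80% of its characters remain.
    return score >= max_repeated_lines_char_ratio


doc = "aaaa\nbb\naaaa\naaaa"
score = repeated_lines_char_score(doc)  # 6 unique characters out of 14
print(score, keep_by_char_ratio(score))
```

Here the repeated line accounts for most of the characters, so the score falls well below 0.8 and the sketch discards the document.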

_name = 'repeated_lines_char'

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedLinesByCharFilter.keep_document(
score: float
) -> bool

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedLinesByCharFilter.score_document(
text: str
) -> float

class nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedLinesFilter(
max_repeated_line_fraction: float = 0.7
)

Bases: DocumentFilter

If the document shrinks by > 30% in terms of number of lines after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021)
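A line-count version of the same idea might look like the sketch below. The helper names are hypothetical, and both the line splitting and the "keep when score >= threshold" rule are assumptions rather than the library's confirmed behavior.

```python
def repeated_lines_score(text: str) -> float:
    """Fraction of lines that remain after removing duplicate lines."""
    # Assumption: lines are newline-separated and blank lines are ignored.
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return 1.0
    return len(set(lines)) / len(lines)


def keep_by_line_fraction(score: float, max_repeated_line_fraction: float = 0.7) -> bool:
    # Assumption: keep the document when at least 70% of its lines are unique.
    return score >= max_repeated_line_fraction


mostly_unique = "header\nbody\nheader\nfooter"  # 3 unique lines out of 4
mostly_repeated = "menu\nmenu\nmenu\ntext"      # 2 unique lines out of 4
print(keep_by_line_fraction(repeated_lines_score(mostly_unique)))
print(keep_by_line_fraction(repeated_lines_score(mostly_repeated)))
```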

_name = 'repeated_lines'

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedLinesFilter.keep_document(
score: float
) -> bool

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedLinesFilter.score_document(
text: str
) -> float

class nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedParagraphsByCharFilter(
max_repeated_paragraphs_char_ratio: float = 0.8
)

Bases: DocumentFilter

If the document shrinks by > 20% in terms of number of characters after removing duplicate paragraphs, then discard. The 20% figure corresponds to the default max_repeated_paragraphs_char_ratio of 0.8. Source: Gopher (Rae et al., 2021)
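For paragraphs, the same ratio can be computed over blank-line-separated blocks. The sketch below is illustrative only: the helper names and the paragraph splitting rule (one or more blank lines) are assumptions, as is the "keep when score >= threshold" comparison.

```python
import re


def repeated_paragraphs_char_score(text: str) -> float:
    """Fraction of characters that survive after dropping duplicate paragraphs."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return 1.0
    unique = dict.fromkeys(paragraphs)  # first copy of each paragraph, in order
    return sum(len(p) for p in unique) / sum(len(p) for p in paragraphs)


def keep_by_paragraph_char_ratio(
    score: float, max_repeated_paragraphs_char_ratio: float = 0.8
) -> bool:
    # Assumption: keep the document when the ratio is at least the threshold.
    return score >= max_repeated_paragraphs_char_ratio


doc = "intro\n\nsame text\n\nsame text"
score = repeated_paragraphs_char_score(doc)  # 14 of 23 characters are unique
print(score, keep_by_paragraph_char_ratio(score))
```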

_name = 'repeated_paragraphs_char'

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedParagraphsByCharFilter.keep_document(
score: float
) -> bool

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedParagraphsByCharFilter.score_document(
text: str
) -> float

class nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedParagraphsFilter(
max_repeated_paragraphs_ratio: float = 0.7
)

Bases: DocumentFilter

If the document shrinks by > 30% in terms of number of paragraphs after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021)
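Counting paragraphs instead of characters gives every paragraph equal weight, so a short repeated block can trigger the filter even when it contributes few characters. A minimal sketch follows, with hypothetical helper names and an assumed blank-line paragraph delimiter and keep rule.

```python
import re


def repeated_paragraphs_score(text: str) -> float:
    """Fraction of paragraphs that remain after removing duplicates."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return 1.0
    return len(set(paragraphs)) / len(paragraphs)


def keep_by_paragraph_ratio(score: float, max_repeated_paragraphs_ratio: float = 0.7) -> bool:
    # Assumption: keep the document when at least 70% of paragraphs are unique.
    return score >= max_repeated_paragraphs_ratio


doc = "\n\n".join(
    ["Subscribe now.", "Real content here.", "Subscribe now.", "Subscribe now."]
)
score = repeated_paragraphs_score(doc)  # 2 unique paragraphs out of 4
print(score, keep_by_paragraph_ratio(score))
```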

_name = 'repeated_paragraphs'

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedParagraphsFilter.keep_document(
score: float
) -> bool

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatedParagraphsFilter.score_document(
text: str
) -> float

class nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatingDuplicateNGramsFilter(
n: int = 2,
max_repeating_duplicate_ngram_ratio: float = 0.2,
lang: str = 'en'
)

Bases: DocumentFilter

If the document shrinks by > x% in terms of number of characters after removing all duplicate n-grams, then discard. Here x is set by max_repeating_duplicate_ngram_ratio (default 0.2, i.e. 20%). Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
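One way to read "characters in duplicate n-grams" is sketched below for the space-separated case only. This is an assumption-laden illustration, not the library's code: the helper names are hypothetical, words are split on whitespace (no Chinese or Japanese handling), each word is counted once even when overlapping n-grams cover it, and the document is assumed to be kept when the score is at most the threshold.

```python
from collections import Counter


def duplicate_ngram_char_fraction(text: str, n: int = 2) -> float:
    """Fraction of word characters covered by n-grams that occur more than once."""
    words = text.split()  # assumption: whitespace-separated words
    if n < 1 or len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    covered = [False] * len(words)
    for start, gram in enumerate(ngrams):
        if counts[gram] > 1:
            # Mark every word inside a repeated n-gram; overlaps are counted once.
            for pos in range(start, start + n):
                covered[pos] = True
    total_chars = sum(len(word) for word in words)
    duplicate_chars = sum(len(word) for word, hit in zip(words, covered) if hit)
    return duplicate_chars / total_chars


def keep_by_duplicate_ngrams(
    score: float, max_repeating_duplicate_ngram_ratio: float = 0.2
) -> bool:
    # Assumption: keep the document when the duplicated share is at most the threshold.
    return score <= max_repeating_duplicate_ngram_ratio


score = duplicate_ngram_char_fraction("the cat the cat sat", n=2)
print(score, keep_by_duplicate_ngrams(score))
```

In the example, the bigram "the cat" appears twice and covers 12 of the 15 word characters, so the sketch would discard the text at the default threshold.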

_max_ratio = 1.0

_name = f'repeating_dup_{n}gram'

_word_splitter = get_word_splitter(lang)

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatingDuplicateNGramsFilter.keep_document(
score: float
) -> bool

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatingDuplicateNGramsFilter.score_document(
text: str
) -> float

class nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatingTopNGramsFilter(
n: int = 2,
max_repeating_ngram_ratio: float = 0.2,
lang: str = 'en'
)

Bases: DocumentFilter

If the document shrinks by > x% in terms of number of characters after removing the top n-grams, then discard. Here x is set by max_repeating_ngram_ratio (default 0.2, i.e. 20%). Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
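The top n-gram check can be sketched as follows for whitespace-separated text. The helper names are hypothetical, and several details are assumptions rather than confirmed library behavior: whitespace is normalized before measuring, only the single most frequent n-gram is removed, occurrences are removed without overlap, and the document is kept when the score is at most the threshold.

```python
from collections import Counter


def top_ngram_char_fraction(text: str, n: int = 2) -> float:
    """Fraction of characters removed when the most frequent n-gram is deleted."""
    words = text.split()  # assumption: whitespace-separated words
    if n < 1 or len(words) < n:
        return 0.0
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    top_gram, _freq = counts.most_common(1)[0]
    normalized = " ".join(words)  # single spaces so the n-gram matches as a substring
    without_top = normalized.replace(" ".join(top_gram), "")
    return (len(normalized) - len(without_top)) / len(normalized)


def keep_by_top_ngram(score: float, max_repeating_ngram_ratio: float = 0.2) -> bool:
    # Assumption: keep the document when the removed share is at most the threshold.
    return score <= max_repeating_ngram_ratio


score = top_ngram_char_fraction("buy now buy now buy now today", n=2)
print(score, keep_by_top_ngram(score))  # "buy now" removes 21 of 29 characters
```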

_max_ratio = 1.0

_name = f'repeating_top_{n}grams'

_word_splitter = get_word_splitter(lang)

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatingTopNGramsFilter.keep_document(
score: float
) -> bool

nemo_curator.stages.text.filters.heuristic.repetition.repetition.RepeatingTopNGramsFilter.score_document(
text: str
) -> float