stages.text.filters.heuristic_filter#
Module Contents#
Classes#
If more than 40% of paragraphs contain boilerplate strings, then discard. This includes things like “terms of use”, “privacy policy”, etc. Source: Adapted significantly from Google C4 processing. |
|
If more than 90% of the lines start with a bullet, then discard. Source: Gopher (Rae et al., 2021) |
|
If the sentence contains at least 2 common English words, then keep it. NOTE: We purposefully check for the lowercase versions of those common words to remove documents with over-capitalization. |
|
If more than 30% of the sentences end with an ellipsis, then discard. Source: Google C4 processing |
|
Histogram filter used by the NLLB paper (https://arxiv.org/pdf/2207.04672). See p30 for details. |
|
If the document contains a word longer than 1000 characters, then discard. NOTE: This seems to be catching things like minified .js files that don’t have spaces anywhere. Source: C4 (Google) |
|
If the mean word length is not in a specified range, then discard. |
|
If more than 25% of the document is non-alphanumeric, then discard. Intended to be applied only to English text. Source: Adapted from Gopher (Rae et al., 2021) |
|
If more than 15% of the document contains numbers, then discard. |
|
If more than 10% of the sentence is in parentheses, then discard. |
|
Check if any of the URLs within the document point to pornography. |
|
If more than 85% of the sentences do not end with a punctuation mark, then discard. Source: Google C4 processing |
|
If the document shrinks by > 20% in terms of number of lines after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021) |
|
If the document shrinks by > 30% in terms of number of lines after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021) |
|
If the document shrinks by > 10% in terms of number of lines after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021) |
|
If the document shrinks by > 30% in terms of number of lines after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021) |
|
If the document shrinks by > x% in terms of number of characters after removing all duplicate n-grams, then discard. Source: Gopher (Rae et al., 2021) |
|
If the document shrinks by > x% in terms of number of characters after removing the top n-grams, then discard. Source: Gopher (Rae et al., 2021) |
|
Keeps documents that contain a substring in a given position. Gives a score of 1 if the substring is found in the given position, otherwise 0. |
|
Remove any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis. Source: Gopher (Rae et al., 2021) |
|
If the document contains more or less than a specified number of tokens, then discard. |
|
If more than 20% of the document is comprised of URLs, then discard. |
|
If the document contains a significant number of white space characters, then discard. |
|
If a document contains a number of words not within a specified range, then discard. |
|
80% of words in a document must contain at least one alphabetic character. Source: Gopher (Rae et al., 2021) |
API#
- class stages.text.filters.heuristic_filter.BoilerPlateStringFilter(remove_if_at_top_or_bottom: bool = True, max_boilerplate_string_ratio: float = 0.4)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If more than 40% of paragraphs contain boilerplate strings, then discard. This includes things like “terms of use”, “privacy policy”, etc. Source: Adapted significantly from Google C4 processing.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
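The score/keep split described for BoilerPlateStringFilter can be illustrated with a small standalone sketch. This is not the library's implementation: the phrase list, the paragraph splitting on blank lines, and the function names `boilerplate_ratio` and `keep` are assumptions made for the example, and the `remove_if_at_top_or_bottom` option is not modeled.

```python
# Hypothetical phrase list; the library's real list is not shown in this reference.
BOILERPLATE_PHRASES = ("terms of use", "privacy policy", "cookie policy")


def boilerplate_ratio(text: str) -> float:
    """Fraction of non-empty paragraphs that contain a boilerplate phrase."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    hits = sum(
        any(phrase in p.lower() for phrase in BOILERPLATE_PHRASES)
        for p in paragraphs
    )
    return hits / len(paragraphs)


def keep(score: float, max_boilerplate_string_ratio: float = 0.4) -> bool:
    """Keep the document when the ratio does not exceed the threshold."""
    return score <= max_boilerplate_string_ratio
```

With four paragraphs of which one mentions a privacy policy, the ratio is 0.25 and the document is kept under the default threshold of 0.4.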
- class stages.text.filters.heuristic_filter.BulletsFilter(max_bullet_lines_ratio: float = 0.9)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If more than 90% of the lines start with a bullet, then discard. Source: Gopher (Rae et al., 2021)
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
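A rough sketch of the bullet-line heuristic used by BulletsFilter follows. The set of bullet markers and the helper name `bullet_line_ratio` are assumptions for illustration; the library may recognize a different set of markers.

```python
# Assumed bullet markers; the actual set used by the library may differ.
BULLETS = ("-", "*", "•", "‣", "▪")


def bullet_line_ratio(text: str) -> float:
    """Fraction of non-empty lines whose first non-space character is a bullet."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln.startswith(BULLETS) for ln in lines) / len(lines)


def keep(score: float, max_bullet_lines_ratio: float = 0.9) -> bool:
    """Keep the document unless too many lines are bullet items."""
    return score <= max_bullet_lines_ratio
```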
- class stages.text.filters.heuristic_filter.CommonEnglishWordsFilter(min_num_common_words: int = 2, stop_at_false: bool = True)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the sentence contains at least 2 common English words, then keep it. NOTE: We purposefully check for the lowercase versions of those common words to remove documents with over-capitalization.
For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
Initialization
- keep_document(score: int) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) int#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- class stages.text.filters.heuristic_filter.EllipsisFilter(max_num_lines_ending_with_ellipsis_ratio: float = 0.3)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If more than 30% of the sentences end with an ellipsis, then discard. Source: Google C4 processing
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
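The EllipsisFilter rule can be approximated as below. This sketch treats each non-empty line as a sentence and accepts both three ASCII dots and the single "…" character; both choices, and the name `ellipsis_line_ratio`, are assumptions rather than documented behavior.

```python
def ellipsis_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that end with an ellipsis."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    # Assumption: both "..." and the Unicode ellipsis character count.
    return sum(ln.endswith(("...", "…")) for ln in lines) / len(lines)


def keep(score: float, max_ratio: float = 0.3) -> bool:
    """Keep the document when at most 30% of lines end with an ellipsis."""
    return score <= max_ratio
```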
- class stages.text.filters.heuristic_filter.HistogramFilter(lang: str | None = 'en', threshold: float | None = 0.8, cache_dir: str | None = '', threshold_char: str | None = ']')#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
Histogram filter used by the NLLB paper (https://arxiv.org/pdf/2207.04672). See p30 for details.
At a high level, the histogram filter is a cheap form of language identification. It checks what ratio of the characters in the data instance is included in the character histograms collected from trusted data in the corresponding language. If the ratio is too low, a language ID mismatch is likely and the data instance should be discarded.
Written with reference to the original fairseq implementation at: https://github.com/facebookresearch/fairseq/blob/main/examples/m2m_100/process_data/clean_histogram.py.
Initialization
Args:
lang (str, optional): Expected language of the segment. This decides which histogram is loaded. Defaults to “en”.
threshold (float, optional): Threshold for the ratio of characters in the histogram. Defaults to 0.8.
cache_dir (str, optional): Cache directory for downloaded histogram files. Defaults to “”.
threshold_char (str, optional): Formatter character of the histogram files. You should not change this unless you have rebuilt your own histogram. Defaults to “]”.
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Compute histogram token ratio of a text data instance according to the loaded histogram.
Args: text (str): Text data instance.
Returns: float: Ratio of tokens included in the histogram.
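The ratio computed by HistogramFilter can be pictured with the following minimal sketch. It assumes the histogram is already available as a set of trusted characters; the tiny Latin-only set used in the usage note is made up for the example and is not one of the downloaded histogram files.

```python
from typing import AbstractSet


def histogram_ratio(text: str, histogram: AbstractSet[str]) -> float:
    """Fraction of characters in `text` that appear in the trusted histogram."""
    if not text:
        return 0.0
    return sum(ch in histogram for ch in text) / len(text)


def keep(score: float, threshold: float = 0.8) -> bool:
    """Keep the document when enough characters belong to the expected language."""
    return score >= threshold
```

For example, with a hypothetical histogram of lowercase Latin letters plus the space character, `"hello world"` scores 1.0, while `"hello мир"` scores 6/9 and falls below the default threshold of 0.8.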
- class stages.text.filters.heuristic_filter.LongWordFilter(max_word_length: int = 1000, lang: str = 'en')#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the document contains a word longer than 1000 characters, then discard. NOTE: This seems to be catching things like minified .js files that don’t have spaces anywhere. Source: C4 (Google)
For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- class stages.text.filters.heuristic_filter.MeanWordLengthFilter(min_mean_word_length: int = 3, max_mean_word_length: int = 10, lang: str = 'en')#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the mean word length is not in a specified range, then discard.
For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
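For space-separated languages, the MeanWordLengthFilter check reduces to a few lines. The sketch below assumes whitespace splitting and inclusive bounds; whether the library treats the bounds as inclusive is not stated in this reference, and the helper names are hypothetical.

```python
def mean_word_length(text: str) -> float:
    """Average number of characters per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def keep(score: float, min_mean_word_length: int = 3,
         max_mean_word_length: int = 10) -> bool:
    """Keep the document when the mean word length lies within the range."""
    # Assumption: both bounds are inclusive.
    return min_mean_word_length <= score <= max_mean_word_length
```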
- class stages.text.filters.heuristic_filter.NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio: float = 0.25)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If more than 25% of the document is non-alphanumeric, then discard. Intended to be applied only to English text. Source: Adapted from Gopher (Rae et al., 2021)
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
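One way to read the NonAlphaNumericFilter rule is sketched below. It assumes that whitespace is not counted as a non-alphanumeric symbol and that the ratio is taken over all characters; the library may define the character classes differently.

```python
def non_alnum_ratio(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    # Assumption: whitespace does not count against the document.
    symbols = sum(not (ch.isalnum() or ch.isspace()) for ch in text)
    return symbols / len(text)


def keep(score: float, max_ratio: float = 0.25) -> bool:
    """Keep the document when symbols make up at most 25% of the text."""
    return score <= max_ratio
```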
- class stages.text.filters.heuristic_filter.NumbersFilter(max_number_to_text_ratio: float = 0.15)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If more than 15% of the document contains numbers, then discard.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- class stages.text.filters.heuristic_filter.ParenthesesFilter(max_parentheses_ratio: float = 0.1)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If more than 10% of the sentence is in parentheses, then discard.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- class stages.text.filters.heuristic_filter.PornographicUrlsFilter#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
Check if any of the URLs within the document point to pornography.
Initialization
- keep_document(score: int) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) int#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- class stages.text.filters.heuristic_filter.PunctuationFilter(max_num_sentences_without_endmark_ratio: float = 0.85)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If more than 85% of the sentences do not end with a punctuation mark, then discard. Source: Google C4 processing
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- class stages.text.filters.heuristic_filter.RepeatedLinesByCharFilter(max_repeated_lines_char_ratio: float = 0.8)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the document shrinks by > 20% in terms of number of lines after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021)
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- class stages.text.filters.heuristic_filter.RepeatedLinesFilter(max_repeated_line_fraction: float = 0.7)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the document shrinks by > 30% in terms of number of lines after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021)
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
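The relationship between the 0.7 default and the "shrinks by > 30%" wording of RepeatedLinesFilter can be made concrete with a sketch. It assumes the score is the fraction of lines that remain after de-duplication, so that a document is kept when the score is at least 0.7; that interpretation, and the helper names, are assumptions rather than documented behavior.

```python
def unique_line_fraction(text: str) -> float:
    """Number of distinct non-empty lines divided by the number of non-empty lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 1.0  # nothing to de-duplicate
    return len(set(lines)) / len(lines)


def keep(score: float, max_repeated_line_fraction: float = 0.7) -> bool:
    """Keep the document when de-duplication removes at most 30% of its lines."""
    return score >= max_repeated_line_fraction
```

A four-line document with one repeated line scores 0.75 and is kept; one where three of four lines are identical scores 0.5 and is discarded.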
- class stages.text.filters.heuristic_filter.RepeatedParagraphsByCharFilter(max_repeated_paragraphs_char_ratio: float = 0.8)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the document shrinks by > 10% in terms of number of lines after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021)
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- class stages.text.filters.heuristic_filter.RepeatedParagraphsFilter(max_repeated_paragraphs_ratio: float = 0.7)#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the document shrinks by > 30% in terms of number of lines after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021)
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- class stages.text.filters.heuristic_filter.RepeatingDuplicateNGramsFilter(n: int = 2, max_repeating_duplicate_ngram_ratio: float = 0.2, lang: str = 'en')#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the document shrinks by > x% in terms of number of characters after removing all duplicate n-grams, then discard. Source: Gopher (Rae et al., 2021)
For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- class stages.text.filters.heuristic_filter.RepeatingTopNGramsFilter(n: int = 2, max_repeating_ngram_ratio: float = 0.2, lang: str = 'en')#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the document shrinks by > x% in terms of number of characters after removing the top n-grams, then discard. Source: Gopher (Rae et al., 2021)
For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
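The top n-gram heuristic behind RepeatingTopNGramsFilter can be sketched as follows. This is a simplified reading of the Gopher rule, not the library's code: words are split on whitespace, only the single most frequent n-gram is considered, and overlapping occurrences are each counted in full, so the ratio can exceed 1 for highly repetitive text.

```python
from collections import Counter


def top_ngram_char_ratio(text: str, n: int = 2) -> float:
    """Characters covered by the most frequent word n-gram, over all word characters."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    top, count = Counter(ngrams).most_common(1)[0]
    if count < 2:
        return 0.0  # no n-gram repeats
    total_chars = sum(len(w) for w in words)
    # Overlapping occurrences are counted separately (a simplification).
    return count * sum(len(w) for w in top) / total_chars


def keep(score: float, max_repeating_ngram_ratio: float = 0.2) -> bool:
    """Keep the document when the top n-gram covers at most 20% of the characters."""
    return score <= max_repeating_ngram_ratio
```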
- class stages.text.filters.heuristic_filter.SubstringFilter(substring: str, position: Literal['prefix', 'suffix', 'any'])#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
Keeps documents that contain a substring in a given position. Gives a score of 1 if the substring is found in the given position, otherwise 0.
Initialization
Args:
substring (str): The substring to check for.
position (Literal[“prefix”, “suffix”, “any”]): The position of the substring.
- keep_document(score: int) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) int#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
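The three `position` values of SubstringFilter map naturally onto Python string operations. The function below is an illustrative sketch with a hypothetical name; it does not strip or normalize the text first, which the library may or may not do.

```python
from typing import Literal


def substring_score(text: str, substring: str,
                    position: Literal["prefix", "suffix", "any"]) -> int:
    """Return 1 if `substring` occurs at the requested position, otherwise 0."""
    if position == "prefix":
        found = text.startswith(substring)
    elif position == "suffix":
        found = text.endswith(substring)
    elif position == "any":
        found = substring in text
    else:
        raise ValueError(f"Unsupported position: {position!r}")
    return int(found)


def keep(score: int) -> bool:
    """Keep the document only when the substring was found."""
    return score == 1
```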
- class stages.text.filters.heuristic_filter.SymbolsToWordsFilter(max_symbol_to_word_ratio: float = 0.1, lang: str = 'en')#
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
Remove any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis. Source: Gopher (Rae et al., 2021)
For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
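As an illustration of the symbol-to-word ratio, the sketch below counts the fraction of whitespace-separated words that contain a hash symbol or an ellipsis. This is one plausible reading of the description, not the library's implementation: the symbol list, the helper names, and the inclusive threshold are all assumptions, and it covers only space-separated languages.

```python
# Assumed symbol set: hash, three-dot ellipsis, and the single-character ellipsis.
SYMBOLS = ("#", "...", "\u2026")


def symbol_to_word_ratio(text: str, symbols: tuple[str, ...] = SYMBOLS) -> float:
    """Fraction of whitespace-separated words containing any of `symbols`."""
    words = text.split()
    if not words:
        return 0.0
    flagged = sum(1 for word in words if any(s in word for s in symbols))
    return flagged / len(words)


def keep_symbols_doc(score: float, max_symbol_to_word_ratio: float = 0.1) -> bool:
    # Assumption: documents at or below the threshold are kept.
    return score <= max_symbol_to_word_ratio
```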
- class stages.text.filters.heuristic_filter.TokenCountFilter(tokenizer: transformers.AutoTokenizer | None = None, hf_model_name: str | None = None, hf_token: str | None = None, min_tokens: int = 0, max_tokens: int = float('inf'))#
Bases:
nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the document contains fewer than min_tokens or more than max_tokens tokens, then discard.
Initialization
Args:
tokenizer (AutoTokenizer | None): The pre-loaded tokenizer to use to count the tokens. If None, the tokenizer will be initialized from the hf_model_name.
hf_model_name (str | None): The name of the Hugging Face model to use to count the tokens. If None, the pre-loaded tokenizer must be provided via the tokenizer argument.
hf_token (str | None): The token to use to access the Hugging Face model, if needed.
min_tokens (int): The minimum number of tokens the document must contain. Set to 0 to disable the minimum token count filter.
max_tokens (int): The maximum number of tokens the document can contain. Set to infinity to disable the maximum token count filter.
- keep_document(score: int) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- load_tokenizer() None#
- model_check_or_download() None#
- score_document(text: str) int#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
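The range check described for this filter can be sketched without loading a real tokenizer. In the sketch below, `tokenize` is a stand-in callable (whitespace splitting in the usage example) where the library would use a Hugging Face tokenizer; the helper names and the inclusive bounds are assumptions.

```python
from collections.abc import Callable


def token_count(text: str, tokenize: Callable[[str], list]) -> int:
    """Count tokens using any callable that maps text to a list of tokens."""
    return len(tokenize(text))


def keep_token_doc(
    score: int,
    min_tokens: int = 0,
    max_tokens: float = float("inf"),
) -> bool:
    # Assumption: both bounds are inclusive. The defaults disable both checks.
    return min_tokens <= score <= max_tokens


# Whitespace splitting stands in for a real tokenizer here.
n_tokens = token_count("a short example document", str.split)
```

With the default bounds every document is kept; tightening `min_tokens` or `max_tokens` enables the corresponding side of the filter.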
- class stages.text.filters.heuristic_filter.UrlsFilter(max_url_to_text_ratio: float = 0.2)#
Bases:
nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If URLs make up more than 20% of the document, then discard.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
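A URL-to-text ratio of this kind can be sketched as the share of characters that belong to URLs. The regular expression, the character-based measure, and the helper names below are assumptions for illustration; the library's own URL pattern is likely more elaborate.

```python
import re

# Simplified, assumed URL pattern: http(s) scheme followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def url_to_text_ratio(text: str) -> float:
    """Fraction of the characters in `text` that are part of a URL."""
    if not text:
        return 0.0
    url_chars = sum(len(url) for url in URL_PATTERN.findall(text))
    return url_chars / len(text)


def keep_url_doc(score: float, max_url_to_text_ratio: float = 0.2) -> bool:
    # Assumption: documents at or below the threshold are kept.
    return score <= max_url_to_text_ratio
```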
- class stages.text.filters.heuristic_filter.WhiteSpaceFilter(max_white_space_ratio: float = 0.25)#
Bases:
nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the document's white space ratio exceeds max_white_space_ratio (0.25 by default), then discard.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
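One way to read the white space ratio is as the fraction of characters that are white space. The sketch below takes that reading; whether the library strips the text first or which characters it treats as white space is not stated here, so those details and the helper names are assumptions.

```python
def white_space_ratio(text: str) -> float:
    """Fraction of characters in `text` that are white space."""
    if not text:
        return 0.0
    # str.isspace() covers spaces, tabs, newlines, and other Unicode white space.
    return sum(ch.isspace() for ch in text) / len(text)


def keep_white_space_doc(score: float, max_white_space_ratio: float = 0.25) -> bool:
    # Assumption: documents at or below the threshold are kept.
    return score <= max_white_space_ratio
```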
- class stages.text.filters.heuristic_filter.WordCountFilter(min_words: int = 50, max_words: int = 100000, lang: str = 'en')#
Bases:
nemo_curator.stages.text.filters.doc_filter.DocumentFilter
If the number of words in a document falls outside the range from min_words to max_words, then discard.
For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
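For space-separated languages, the word count check reduces to a split and a range test. The sketch below shows that case only; it does not cover the Chinese or Japanese path, and the helper names and inclusive bounds are assumptions.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words (space-separated languages only)."""
    return len(text.split())


def keep_word_count_doc(
    score: int,
    min_words: int = 50,
    max_words: int = 100000,
) -> bool:
    # Assumption: both bounds are inclusive.
    return min_words <= score <= max_words
```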
- class stages.text.filters.heuristic_filter.WordsWithoutAlphabetsFilter(min_words_with_alphabets: float = 0.8, lang: str = 'en')#
Bases:
nemo_curator.stages.text.filters.doc_filter.DocumentFilter
At least 80% of the words in a document must contain at least one alphabetic character; otherwise, discard. Source: Gopher (Rae et al., 2021)
For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
Initialization
- keep_document(score: float) bool#
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(text: str) float#
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
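The alphabetic-word rule can be sketched as the fraction of words that contain at least one alphabetic character. This illustration handles space-separated text only; the helper names are hypothetical, and how the library treats a score exactly at the threshold is an assumption.

```python
def alphabetic_word_fraction(text: str) -> float:
    """Fraction of whitespace-separated words with at least one letter."""
    words = text.split()
    if not words:
        return 0.0
    with_letters = sum(1 for word in words if any(ch.isalpha() for ch in word))
    return with_letters / len(words)


def keep_alphabetic_doc(score: float, min_words_with_alphabets: float = 0.8) -> bool:
    # Assumption: a score equal to the threshold is kept.
    return score >= min_words_with_alphabets
```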