Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Filters
Base Class
- class nemo_curator.filters.DocumentFilter
An abstract base class for text-based document filters.
This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.
- abstract score_document(text: str) → Any
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- abstract keep_document(scores: Any) → bool
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
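The two-method contract above can be illustrated with a minimal sketch. DocumentFilterSketch is a local stand-in for nemo_curator.filters.DocumentFilter so that the snippet runs without the library installed; MinLengthFilter and its min_chars parameter are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class DocumentFilterSketch(ABC):
    """Local stand-in for nemo_curator.filters.DocumentFilter (assumption)."""

    @abstractmethod
    def score_document(self, text: str) -> Any:
        raise NotImplementedError

    @abstractmethod
    def keep_document(self, scores: Any) -> bool:
        raise NotImplementedError


class MinLengthFilter(DocumentFilterSketch):
    """Hypothetical filter: keep documents with at least `min_chars` characters."""

    def __init__(self, min_chars: int = 10):
        self._min_chars = min_chars

    def score_document(self, text: str) -> int:
        # The score is simply the character count of the document.
        return len(text)

    def keep_document(self, scores: int) -> bool:
        # The argument has the same type that score_document() returns.
        return scores >= self._min_chars


doc_filter = MinLengthFilter(min_chars=10)
short_score = doc_filter.score_document("too short")
long_score = doc_filter.score_document("long enough to keep")
```

Keeping scoring and the keep/discard decision in separate methods lets the same score be stored as metadata or reused with a different threshold.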
- nemo_curator.filters.import_filter(filter_path: str) → nemo_curator.filters.doc_filter.DocumentFilter
Imports a filter under nemo_curator.filters given the module path.
- Parameters
filter_path (str) – The path to the filter in the form of “nemo_curator.filters.filter_name”
- Returns
The filter that is at the given path
- Return type
DocumentFilter
- Raises
ValueError – If the filter_path does not point to a DocumentFilter
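One way a path-based import with this kind of validation could work is sketched below using only the standard library. This is not the library's implementation: import_by_path and its expected_base parameter are hypothetical, and collections.OrderedDict stands in for a filter class so that the snippet runs without nemo_curator installed.

```python
import importlib


def import_by_path(path: str, expected_base: type) -> type:
    """Hypothetical helper: resolve 'package.module.Name' and validate its base class."""
    module_path, _, attr_name = path.rpartition(".")
    module = importlib.import_module(module_path)
    obj = getattr(module, attr_name)
    # Mirror the documented behavior: reject anything that is not the expected kind.
    if not (isinstance(obj, type) and issubclass(obj, expected_base)):
        raise ValueError(f"{path} does not point to a {expected_base.__name__}")
    return obj


# OrderedDict is a subclass of dict, so this resolves successfully.
ordered_dict_cls = import_by_path("collections.OrderedDict", dict)
```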
Modules
- class nemo_curator.ScoreFilter(filter_obj: nemo_curator.filters.doc_filter.DocumentFilter, text_field: str = 'text', score_field: Optional[str] = None, score_type: Optional[Union[type, str]] = None, invert: bool = False)
The module responsible for applying a filter to all documents in a DocumentDataset. It accepts an arbitrary DocumentFilter, first computes the score for each document, and then determines whether to keep the document based on the criteria in the DocumentFilter.
The filter can be applied to any field in the dataset, and the score can be logged for later use. The filter can also be inverted so that “rejected” documents are kept.
- __call__(dataset: nemo_curator.datasets.doc_dataset.DocumentDataset) → nemo_curator.datasets.doc_dataset.DocumentDataset
Scores and filters all records in the dataset.
- Parameters
dataset (DocumentDataset) – The dataset to apply the module to
- Returns
A dataset with the score and filter applied
- Return type
DocumentDataset
- __init__(filter_obj: nemo_curator.filters.doc_filter.DocumentFilter, text_field: str = 'text', score_field: Optional[str] = None, score_type: Optional[Union[type, str]] = None, invert: bool = False)
Constructs a ScoreFilter module.
- Parameters
filter_obj (DocumentFilter) – The score function that takes in a document string and outputs a score for the document.
text_field (str) – The field the documents will be read from.
score_field (Optional[str]) – The field to which the scores will be written. If None, scores will be immediately discarded after use.
score_type (Union[type, str]) – The datatype of the score that will be made for each document.
invert (bool) – If True, will keep all documents that are normally discarded.
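The interaction of text_field, score_field, and invert can be sketched on plain Python records. This is an illustration of the documented semantics only: score_and_filter is a hypothetical function, and a list of dictionaries stands in for a DocumentDataset.

```python
from typing import Any, Callable, Dict, List, Optional


def score_and_filter(
    records: List[Dict[str, Any]],
    score_document: Callable[[str], Any],
    keep_document: Callable[[Any], bool],
    text_field: str = "text",
    score_field: Optional[str] = None,
    invert: bool = False,
) -> List[Dict[str, Any]]:
    """Hypothetical per-record version of the score-then-filter flow."""
    kept = []
    for record in records:
        score = score_document(record[text_field])
        keep = keep_document(score)
        if invert:
            keep = not keep  # keep the documents that would normally be discarded
        if keep:
            out = dict(record)
            if score_field is not None:
                out[score_field] = score  # store the score for later use
            kept.append(out)
    return kept


records = [{"text": "one two three"}, {"text": "one"}]
count_words = lambda text: len(text.split())
at_least_two = lambda n: n >= 2

kept = score_and_filter(records, count_words, at_least_two, score_field="word_count")
rejected = score_and_filter(records, count_words, at_least_two, invert=True)
```

With score_field left as None, the score is used for the decision and then dropped, which matches the description of the parameter above.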
- class nemo_curator.Score(score_fn: Callable, score_field: str, text_field: str = 'text', score_type: Optional[Union[type, str]] = None)
The module responsible for adding metadata to records based on statistics about the text. It accepts an arbitrary scoring function that accepts a text field and returns a score.
Unlike ScoreFilter, it does not filter based on the computed score. It only adds metadata to the record.
- __call__(dataset: nemo_curator.datasets.doc_dataset.DocumentDataset) → nemo_curator.datasets.doc_dataset.DocumentDataset
Applies the scoring to a dataset.
- Parameters
dataset (DocumentDataset) – The dataset to apply the module to
- Returns
A dataset with the new score
- Return type
DocumentDataset
- __init__(score_fn: Callable, score_field: str, text_field: str = 'text', score_type: Optional[Union[type, str]] = None)
Constructs a Score module.
- Parameters
score_fn (Callable) – The score function that takes in a document string and outputs a score for the document.
score_field (str) – The field the score will be stored in.
text_field (str) – The field the documents will be read from.
score_type (Union[type, str]) – The datatype of the score that will be made for each document.
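As a rough illustration of scoring without filtering, the sketch below adds a score to every record and drops none. The add_score function is hypothetical, and a list of dictionaries stands in for a DocumentDataset.

```python
from typing import Any, Callable, Dict, List


def add_score(
    records: List[Dict[str, Any]],
    score_fn: Callable[[str], Any],
    score_field: str,
    text_field: str = "text",
) -> List[Dict[str, Any]]:
    """Hypothetical per-record version of scoring: every record is kept."""
    return [{**record, score_field: score_fn(record[text_field])} for record in records]


records = [{"text": "Hello world"}, {"text": ""}]
scored = add_score(records, len, score_field="num_chars")
```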
- class nemo_curator.Filter(filter_fn: Callable, filter_field: str, invert: bool = False)
The module responsible for filtering records based on a metadata field. It accepts an arbitrary filter function that accepts a metadata field and returns True if the field should be kept.
Unlike ScoreFilter, it does not compute the metadata based on a document. It only filters using existing metadata.
- __call__(dataset: nemo_curator.datasets.doc_dataset.DocumentDataset) → nemo_curator.datasets.doc_dataset.DocumentDataset
Applies the filtering to a dataset.
- Parameters
dataset (DocumentDataset) – The dataset to apply the module to
- Returns
A dataset with entries removed according to the filter
- Return type
DocumentDataset
- __init__(filter_fn: Callable, filter_field: str, invert: bool = False)
Constructs a Filter module.
- Parameters
filter_fn (Callable) – A function that returns True if the document is to be kept.
filter_field (str) – The field(s) to be passed into the filter function.
invert (bool) – Whether to invert the filter condition.
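Filtering on an existing metadata field can be sketched in the same style. The filter_by_field function is hypothetical, the num_chars field is an assumed piece of metadata computed earlier, and a list of dictionaries stands in for a DocumentDataset.

```python
from typing import Any, Callable, Dict, List


def filter_by_field(
    records: List[Dict[str, Any]],
    filter_fn: Callable[[Any], bool],
    filter_field: str,
    invert: bool = False,
) -> List[Dict[str, Any]]:
    """Hypothetical per-record version of metadata filtering."""
    kept = []
    for record in records:
        keep = bool(filter_fn(record[filter_field]))
        if invert:
            keep = not keep  # flip the decision when the filter is inverted
        if keep:
            kept.append(record)
    return kept


records = [{"id": 1, "num_chars": 11}, {"id": 2, "num_chars": 0}]
non_empty = filter_by_field(records, lambda n: n > 0, "num_chars")
empty = filter_by_field(records, lambda n: n > 0, "num_chars", invert=True)
```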
FastText Filters
- class nemo_curator.filters.FastTextLangId(model_path=None, min_langid_score=0.3)
- score_document(df: pandas.Series)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
df (pandas.Series) – The text content of the documents to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.FastTextQualityFilter(model_path=None, label='__label__hq', alpha=3, seed=42)
- score_document(df: pandas.Series)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
df (pandas.Series) – The text content of the documents to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(df: pandas.Series)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
Heuristic Filters
- class nemo_curator.filters.NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio: float = 0.25)
If more than 25% of the document is non-alphanumeric, then discard. Intended to be applied only to English text. Source: Adapted from Gopher (Rae et al., 2021)
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
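A minimal sketch of a non-alphanumeric ratio check follows. It assumes that every character for which str.isalnum() is false, including whitespace, counts as non-alphanumeric and that an empty document scores 0.0; the library's exact character classes and edge-case handling may differ. Both function names are hypothetical.

```python
def non_alpha_numeric_ratio(text: str) -> float:
    """Fraction of characters that are not alphanumeric (assumed scoring rule)."""
    if not text:
        return 0.0
    non_alnum = sum(1 for ch in text if not ch.isalnum())
    return non_alnum / len(text)


def keep_by_non_alpha_numeric(score: float, max_ratio: float = 0.25) -> bool:
    # Discard only when the ratio exceeds the threshold.
    return score <= max_ratio


borderline = non_alpha_numeric_ratio("abc!")  # 1 of 4 characters
noisy = non_alpha_numeric_ratio("a!!!")       # 3 of 4 characters
```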
- class nemo_curator.filters.SymbolsToWordsFilter(max_symbol_to_word_ratio=0.1, lang='en')
Remove any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis. Source: Gopher (Rae et al., 2021)
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
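The symbol-to-word ratio can be sketched as follows. The sketch assumes whitespace word splitting and counts "#", "..." and the single-character ellipsis together; the library may tokenize differently depending on lang. Both function names are hypothetical.

```python
def symbol_to_word_ratio(text: str) -> float:
    """Number of hash and ellipsis symbols divided by the number of words."""
    words = text.split()
    if not words:
        return 0.0
    num_symbols = text.count("#") + text.count("...") + text.count("\u2026")
    return num_symbols / len(words)


def keep_by_symbol_ratio(score: float, max_symbol_to_word_ratio: float = 0.1) -> bool:
    # Discard only when the ratio is greater than the threshold.
    return score <= max_symbol_to_word_ratio


mostly_words = symbol_to_word_ratio("#tag one two three four five six seven eight nine")
mostly_symbols = symbol_to_word_ratio("wait... #what #now")
```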
- class nemo_curator.filters.NumbersFilter(max_number_to_text_ratio=0.15)
If more than 15% of the document consists of numbers, then discard.
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.UrlsFilter(max_url_to_text_ratio=0.2)
If more than 20% of the document consists of URLs, then discard.
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.BulletsFilter(max_bullet_lines_ratio=0.9)
If more than 90% of the lines start with a bullet, then discard. Source: Gopher (Rae et al., 2021)
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
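A bullet-line ratio might be computed as in the sketch below. The set of bullet prefixes and the decision to skip blank lines are assumptions made for illustration, and both function names are hypothetical.

```python
def bullet_lines_ratio(text: str) -> float:
    """Fraction of non-empty lines that start with a bullet character."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    bullet_prefixes = ("-", "*", "\u2022")  # assumed bullet markers
    num_bullets = sum(1 for line in lines if line.lstrip().startswith(bullet_prefixes))
    return num_bullets / len(lines)


def keep_by_bullet_ratio(score: float, max_bullet_lines_ratio: float = 0.9) -> bool:
    # Discard only when more than the allowed share of lines are bullets.
    return score <= max_bullet_lines_ratio


mixed = bullet_lines_ratio("- first\n- second\nplain sentence")
all_bullets = bullet_lines_ratio("- first\n* second\n\u2022 third")
```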
- class nemo_curator.filters.WhiteSpaceFilter(max_white_space_ratio=0.25)
If the ratio of white space characters in the document exceeds max_white_space_ratio (0.25 by default), then discard.
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.ParenthesesFilter(max_parentheses_ratio=0.1)
If more than 10% of the sentence is in parentheses, then discard.
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.LongWordFilter(max_word_length=1000, lang='en')
If the document contains a word longer than 1000 characters, then discard. NOTE: This seems to be catching things like minified .js files that don’t have spaces anywhere. Source: C4 (Google)
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.WordCountFilter(min_words=50, max_words=100000, lang='en')
If the number of words in a document is not within the specified range, then discard.
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
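The range check can be sketched with whitespace splitting. That splitting rule is an assumption: the lang parameter suggests the library may split words differently for some languages. Both function names are hypothetical.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words (assumed tokenization)."""
    return len(text.split())


def keep_by_word_count(num_words: int, min_words: int = 50, max_words: int = 100000) -> bool:
    # Keep only documents whose word count falls inside the inclusive range.
    return min_words <= num_words <= max_words


short_doc = "just five words in total"
long_doc = " ".join(["word"] * 60)
```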
- class nemo_curator.filters.BoilerPlateStringFilter(remove_if_at_top_or_bottom=True, max_boilerplate_string_ratio=0.4)
If more than 40% of paragraphs contain boilerplate strings then discard. This includes things like “terms of use”, “privacy policy”, etc. Source: Adapted significantly from Google C4 processing.
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.MeanWordLengthFilter(min_mean_word_length=3, max_mean_word_length=10, lang='en')
If the mean word length is not within the specified range, then discard.
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.RepeatedLinesFilter(max_repeated_line_fraction=0.7)
If the document shrinks by > 30% in terms of number of lines after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021)
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
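The relationship between the 30% shrinkage in the description and the default of 0.7 can be sketched as follows. The sketch reads 0.7 as the minimum fraction of lines that must remain after deduplication, which is an interpretation of the description rather than confirmed library behavior. Both function names are hypothetical.

```python
def unique_line_fraction(text: str) -> float:
    """Fraction of lines that remain after removing duplicate lines."""
    lines = text.splitlines()
    if not lines:
        return 1.0
    return len(set(lines)) / len(lines)


def keep_by_unique_lines(score: float, max_repeated_line_fraction: float = 0.7) -> bool:
    # A score below 0.7 means the document shrank by more than 30%.
    return score >= max_repeated_line_fraction


mostly_unique = unique_line_fraction("a\nb\nc\na")  # 3 of 4 lines remain
repetitive = unique_line_fraction("a\nb\na\na")      # 2 of 4 lines remain
```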
- class nemo_curator.filters.RepeatedParagraphsFilter(max_repeated_paragraphs_ratio=0.7)
If the document shrinks by > 30% in terms of number of lines after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021)
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.RepeatedLinesByCharFilter(max_repeated_lines_char_ratio=0.8)
If the document shrinks by > 20% in terms of number of lines after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021)
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.RepeatedParagraphsByCharFilter(max_repeated_paragraphs_char_ratio=0.8)
If the document shrinks by > 20% in terms of number of lines after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021)
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2, lang='en')
If the document shrinks by more than max_repeating_ngram_ratio (20% by default) in terms of number of characters after removing the top n-grams, then discard. Source: Gopher (Rae et al., 2021)
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
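One possible reading of the top n-gram check is sketched below: find the most frequent word n-gram and measure the share of the document's non-space characters that its occurrences cover. Whitespace tokenization, ignoring spaces in the character counts, and the naive handling of overlapping occurrences are all assumptions; both function names are hypothetical.

```python
from collections import Counter


def top_ngram_char_ratio(text: str, n: int = 2) -> float:
    """Share of non-space characters covered by the most frequent word n-gram."""
    words = text.split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    top_ngram, count = Counter(ngrams).most_common(1)[0]
    if count < 2:
        return 0.0  # no n-gram repeats
    total_chars = sum(len(word) for word in words)
    ngram_chars = len(top_ngram.replace(" ", ""))
    # Overlapping occurrences are counted naively, so cap the ratio at 1.0.
    return min(1.0, ngram_chars * count / total_chars)


def keep_by_top_ngram(score: float, max_repeating_ngram_ratio: float = 0.2) -> bool:
    return score <= max_repeating_ngram_ratio


repetitive_score = top_ngram_char_ratio("buy now buy now buy now today")
clean_score = top_ngram_char_ratio("the quick brown fox")
```

In the first example the bigram "buy now" occurs three times and covers 18 of the 23 non-space characters, so the document would be discarded under the default threshold.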
- class nemo_curator.filters.RepeatingDuplicateNGramsFilter(n=2, max_repeating_duplicate_ngram_ratio=0.2, lang='en')
If the document shrinks by more than max_repeating_duplicate_ngram_ratio (20% by default) in terms of number of characters after removing all duplicate n-grams, then discard. Source: Gopher (Rae et al., 2021)
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85)
If more than 85% of the sentences do not end with a punctuation mark, then discard. Source: Google C4 processing
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.EllipsisFilter(max_num_lines_ending_with_ellipsis_ratio=0.3)
If more than 30% of the sentences end with an ellipsis, then discard. Source: Google C4 processing
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
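The ellipsis check can be illustrated with a standalone sketch. It is not the library's implementation: it assumes the ratio is computed over non-empty lines (the description says sentences, the parameter name says lines) and that both the ASCII and the Unicode ellipsis count. The helper names ellipsis_ratio and keep are hypothetical.

```python
# Standalone sketch; this is NOT the nemo_curator implementation.
ELLIPSIS_MARKS = ("...", "…")  # assumption: ASCII and Unicode forms both count


def ellipsis_ratio(text: str) -> float:
    """Return the fraction of non-empty lines that end with an ellipsis."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    ending = sum(1 for line in lines if line.endswith(ELLIPSIS_MARKS))
    return ending / len(lines)


def keep(score: float, max_ratio: float = 0.3) -> bool:
    """Keep the document when the ratio does not exceed the threshold."""
    return score <= max_ratio


doc = "Read more...\nA complete sentence.\nAnother full line.\nThe end."
print(ellipsis_ratio(doc), keep(ellipsis_ratio(doc)))
```

Splitting the work into a scoring step and a keep step mirrors the score_document() and keep_document() pair, so the threshold can change without rescoring.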
- class nemo_curator.filters.CommonEnglishWordsFilter(min_num_common_words=2, stop_at_false=True)
Keeps the text if it contains at least 2 common English words (the default value of min_num_common_words). Note: the filter purposefully checks for the lowercase versions of those common words in order to remove documents with over-capitalization.
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
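A minimal sketch of the lowercase-only matching described for this filter follows. The word list, the whitespace tokenization, and the helper names are assumptions for illustration; the library's own word list and splitting may differ.

```python
# Standalone sketch; this is NOT the nemo_curator implementation.
# Hypothetical word list; the library ships its own.
COMMON_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def count_common_words(text: str) -> int:
    """Count whitespace-separated words that match a lowercase common word.

    The text is deliberately not lowercased, so "THE" does not match "the".
    """
    return sum(1 for word in text.split() if word in COMMON_WORDS)


def keep(score: int, min_num_common_words: int = 2) -> bool:
    """Keep the document when enough common words were found."""
    return score >= min_num_common_words


print(count_common_words("the cat sat with the dog"))
print(count_common_words("THE CAT SAT WITH THE DOG"))
```

Because matching is case-sensitive, an all-caps document scores zero and is dropped, which is the over-capitalization effect the note describes.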
- class nemo_curator.filters.WordsWithoutAlphabetsFilter(min_words_with_alphabets=0.8, lang='en')
At least 80% of the words in a document must contain at least one alphabetic character. Source: Gopher (Rae et al., 2021).
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
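The Gopher-style rule can be sketched as follows. This is an illustration only: it assumes whitespace tokenization (the lang parameter suggests the library may split words differently per language) and an inclusive threshold. The helper names are hypothetical.

```python
# Standalone sketch; this is NOT the nemo_curator implementation.
def alpha_word_ratio(text: str) -> float:
    """Return the fraction of words containing at least one alphabetic character."""
    words = text.split()  # assumption: whitespace tokenization
    if not words:
        return 0.0
    with_alpha = sum(1 for word in words if any(ch.isalpha() for ch in word))
    return with_alpha / len(words)


def keep(score: float, min_words_with_alphabets: float = 0.8) -> bool:
    """Keep the document when enough words contain a letter."""
    return score >= min_words_with_alphabets


print(alpha_word_ratio("one two three four five 6"))
print(alpha_word_ratio("call 555 1234 now or 42"))
```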
- class nemo_curator.filters.PornographicUrlsFilter
Checks whether any of the URLs within the document point to pornographic content.
- score_document(text)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
text (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
Code Filters
- class nemo_curator.filters.PythonCommentToCodeFilter(min_comment_to_code_ratio=0.01, max_comment_to_code_ratio=0.85)
- score_document(source)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
source (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
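This class carries no description here, so the following is only a guess at what a comment-to-code ratio might look like. The sketch counts the characters in Python comment tokens and divides by the total source length. Ignoring docstrings and treating both bounds as inclusive are assumptions, and the helper names are hypothetical.

```python
# Standalone sketch; this is NOT the nemo_curator implementation.
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Return comment characters divided by total characters in the source."""
    if not source:
        return 0.0
    comment_chars = 0
    # generate_tokens yields one token per lexical element, including comments.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += len(tok.string)
    return comment_chars / len(source)


def keep(score: float, min_ratio: float = 0.01, max_ratio: float = 0.85) -> bool:
    """Keep files that are neither uncommented nor almost entirely comments."""
    return min_ratio <= score <= max_ratio


print(comment_to_code_ratio("# add\nx = 1\n"))
```

Note that tokenize raises an error on source it cannot tokenize, so a production filter would need to decide how to score such files.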
- class nemo_curator.filters.GeneralCommentToCodeFilter(language, min_comment_to_code_ratio=0.01, max_comment_to_code_ratio=0.85)
- score_document(source)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
source (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- class nemo_curator.filters.NumberOfLinesOfCodeFilter(min_lines=10, max_lines=20000)
- score_document(source)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
source (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
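Going by the class name and its min_lines and max_lines defaults, the filter presumably keeps files whose line count falls inside a range. A minimal sketch under that assumption, with hypothetical helper names and inclusive bounds:

```python
# Standalone sketch; this is NOT the nemo_curator implementation.
def count_lines(source: str) -> int:
    """Return the number of lines in the source."""
    return len(source.splitlines())


def keep(score: int, min_lines: int = 10, max_lines: int = 20000) -> bool:
    """Keep files whose line count falls inside the configured range."""
    return min_lines <= score <= max_lines


snippet = "a = 1\nb = 2\nc = 3"
module = "\n".join(f"x{i} = {i}" for i in range(50))
print(count_lines(snippet), keep(count_lines(snippet)))
print(count_lines(module), keep(count_lines(module)))
```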
- class nemo_curator.filters.TokenizerFertilityFilter(path_to_tokenizer=None, min_char_to_token_ratio=2.5)
- score_document(source)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
source (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
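The min_char_to_token_ratio parameter suggests the score is the number of characters per token. The sketch below illustrates that idea with a pluggable tokenizer callable. The real filter loads a tokenizer from path_to_tokenizer, so the stand-in tokenizers used here (str.split and list) and the helper names are assumptions.

```python
# Standalone sketch; this is NOT the nemo_curator implementation.
from typing import Callable, Sequence


def char_to_token_ratio(source: str, tokenize: Callable[[str], Sequence[str]]) -> float:
    """Return the average number of characters per token."""
    tokens = tokenize(source)
    if not tokens:
        return 0.0
    return len(source) / len(tokens)


def keep(score: float, min_char_to_token_ratio: float = 2.5) -> bool:
    """Keep text that the tokenizer encodes compactly enough."""
    return score >= min_char_to_token_ratio


text = "alpha beta gamma"
print(char_to_token_ratio(text, str.split))  # word-level stand-in tokenizer
print(char_to_token_ratio(text, list))       # character-level stand-in tokenizer
```

A low ratio means the tokenizer needs many tokens for little text, which is typical for content the tokenizer was not trained on.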
- class nemo_curator.filters.XMLHeaderFilter(char_prefix_search_length=100)
This filter tries to identify files that have incorrect file extensions. In many cases, these end up being XML files and we try to identify them based on the header. (Source: Starcoder https://arxiv.org/abs/2305.06161)
- score_document(source)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
source (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
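The header check can be sketched as a prefix search. The marker string "<?xml version=" follows the StarCoder description of its XML filter; whether the library uses exactly this marker is an assumption, and the helper names are hypothetical.

```python
# Standalone sketch; this is NOT the nemo_curator implementation.
XML_MARKER = "<?xml version="  # assumption: marker taken from the StarCoder paper


def has_xml_header(source: str, char_prefix_search_length: int = 100) -> bool:
    """Return True if the XML declaration appears in the leading characters."""
    return XML_MARKER in source[:char_prefix_search_length]


def keep(score: bool) -> bool:
    """Keep files that do not look like XML."""
    return not score


print(has_xml_header('<?xml version="1.0"?>\n<root/>'))
print(has_xml_header("print('hello')\n"))
```

Limiting the search to a short prefix keeps the check cheap and avoids flagging code that merely mentions an XML declaration further down.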
- class nemo_curator.filters.AlphaFilter(min_alpha_ratio=0.25)
This filter tries to identify files that contain large tensors or tables stored as raw text within code files. (Source: Starcoder https://arxiv.org/abs/2305.06161)
- score_document(source)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
source (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
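A minimal sketch of an alphabetic-character ratio, under the assumption that the score is the share of characters for which str.isalpha() is true and that the threshold is inclusive. The helper names are hypothetical.

```python
# Standalone sketch; this is NOT the nemo_curator implementation.
def alpha_ratio(source: str) -> float:
    """Return the fraction of characters that are alphabetic."""
    if not source:
        return 0.0
    return sum(1 for ch in source if ch.isalpha()) / len(source)


def keep(score: float, min_alpha_ratio: float = 0.25) -> bool:
    """Keep files that are not dominated by digits and punctuation."""
    return score >= min_alpha_ratio


table = "x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]"
code = "def f(): return value"
print(alpha_ratio(table), keep(alpha_ratio(table)))
print(alpha_ratio(code), keep(alpha_ratio(code)))
```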
- class nemo_curator.filters.HTMLBoilerplateFilter(min_lang_content_ratio=0.2, min_lang_content_num_chars=100)
This filter tries to identify HTML that is largely boilerplate.
- score_document(source)
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
- Parameters
source (str) – The text content of the document to be scored.
- Returns
A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
- Return type
Any
- Raises
NotImplementedError – If the method is not implemented in a subclass.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
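One way to picture a boilerplate check is to compare the visible text of a page with its total size. The sketch below uses the standard library HTML parser rather than whatever the library uses. It assumes that script and style content is excluded, and that a page must pass both the ratio and the character-count threshold. All helper names are hypothetical.

```python
# Standalone sketch; this is NOT the nemo_curator implementation.
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes that sit outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def content_score(source: str):
    """Return (visible-text ratio, visible-text length) for an HTML page."""
    parser = _TextExtractor()
    parser.feed(source)
    parser.close()
    text = " ".join("".join(parser.chunks).split())  # normalize whitespace
    ratio = len(text) / len(source) if source else 0.0
    return ratio, len(text)


def keep(score, min_lang_content_ratio=0.2, min_lang_content_num_chars=100):
    """Assumption: both thresholds must be met for the page to be kept."""
    ratio, num_chars = score
    return ratio >= min_lang_content_ratio and num_chars >= min_lang_content_num_chars


page = "<html><body><p>" + "word " * 60 + "</p></body></html>"
print(content_score(page), keep(content_score(page)))
```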
- class nemo_curator.filters.PerExtensionFilter(lang, extension, metadata_file='code_meta.csv')
This filter applies specific conditions depending on the file extension.
- score_document(source)
Filter files based on line length and the percentage of alphanumeric characters. The filtering parameters depend on the file extension, given by ext_to_filter.
- keep_document(score)
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
- Parameters
score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
- Returns
True if the document should be kept, False otherwise.
- Return type
bool
- Raises
NotImplementedError – If the method is not implemented in a subclass.
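The statistics named in the score_document() description (line length and share of alphanumeric characters) can be sketched as below. The thresholds shown are hypothetical placeholders. In the library they depend on the extension and come from the metadata file, whose format is not described here. Keeping files with an unknown extension is also an assumption.

```python
# Standalone sketch; this is NOT the nemo_curator implementation.
def line_stats(source: str):
    """Return (max line length, mean line length, alphanumeric fraction)."""
    lines = source.splitlines()
    if not lines:
        return 0, 0.0, 0.0
    lengths = [len(line) for line in lines]
    alnum_fraction = sum(1 for ch in source if ch.isalnum()) / len(source)
    return max(lengths), sum(lengths) / len(lengths), alnum_fraction


# Hypothetical limits for illustration only; not values from code_meta.csv.
HYPOTHETICAL_LIMITS = {
    "py": {"max_line_length": 1000, "mean_line_length": 100, "min_alphanum": 0.25},
}


def keep(stats, extension: str) -> bool:
    """Apply the limits registered for the extension, if any."""
    limits = HYPOTHETICAL_LIMITS.get(extension)
    if limits is None:
        return True  # assumption: unknown extensions are not filtered
    max_len, mean_len, alnum_fraction = stats
    return (
        max_len <= limits["max_line_length"]
        and mean_len <= limits["mean_line_length"]
        and alnum_fraction >= limits["min_alphanum"]
    )


print(line_stats("ab\ncdef"), keep(line_stats("ab\ncdef"), "py"))
```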