Filters#

Base Class#

class nemo_curator.filters.DocumentFilter#

An abstract base class for text-based document filters.

This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.

abstract score_document(text: str) → float | list[int | float]#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float | list[int | float]
Raises:: NotImplementedError – If the method is not implemented in a subclass.

abstract keep_document( scores: float | list[int | float], ) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

property backend: Literal['pandas', 'cudf', 'any']#

The dataframe backend the filter operates on. Can be ‘pandas’, ‘cudf’, or ‘any’. Defaults to ‘pandas’.

Returns:: A string representing the dataframe backend the filter needs as input.
Return type:: str
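The two abstract methods form a score-then-decide contract. The following is a minimal sketch of that contract and not the library's code: `SketchDocumentFilter` is a hypothetical stand-in for the base class so the example runs without the library installed, and `MinLengthFilter` with its `min_chars` parameter is invented for illustration.

```python
from abc import ABC, abstractmethod


class SketchDocumentFilter(ABC):
    """Hypothetical stand-in mirroring the two abstract methods described above."""

    @abstractmethod
    def score_document(self, text: str) -> float:
        ...

    @abstractmethod
    def keep_document(self, scores: float) -> bool:
        ...


class MinLengthFilter(SketchDocumentFilter):
    """Hypothetical filter: keep documents with at least `min_chars` characters."""

    def __init__(self, min_chars: int = 10) -> None:
        self.min_chars = min_chars

    def score_document(self, text: str) -> float:
        # The score is simply the character count of the document.
        return float(len(text))

    def keep_document(self, scores: float) -> bool:
        # The argument is whatever score_document() returned.
        return scores >= self.min_chars


length_filter = MinLengthFilter(min_chars=10)
short_score = length_filter.score_document("too short")
long_score = length_filter.score_document("long enough to keep")
```

Keeping the score and the decision in separate methods lets a caller store the score without filtering, or filter later on a stored score.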

class nemo_curator.filters.BitextFilter( src_field: str = 'src', tgt_field: str = 'tgt', metadata_fields: list[str] | str | None = None, metadata_field_name_mapping: dict[str, str] | None = None, score_field: str | None = None, score_type: type | str | None = None, invert: bool = False, )#

A base class for bitext filters (such as length-ratio and QE filters). Unlike ParallelScoreFilter, these filters look at both the source and the target side of the bitext to compute a score.

This is roughly equivalent to a ScoreFilter wrapping a DocumentFilter object. Besides operating on a ParallelDataset instead of a DocumentDataset, it differs in two ways:

It discards the ScoreFilter/DocumentFilter hierarchy, so filter classes can be used directly instead of being wrapped by ScoreFilter.
Unlike a DocumentFilter object, it allows passing extra metadata information into the scoring function.

abstract score_bitext( src: pandas.Series, tgt: pandas.Series, **kwargs, ) → pandas.Series#

Scoring function for the bitext.

nemo_curator.filters.import_filter( filter_path: str, ) → DocumentFilter | BitextFilter#

Imports a filter under nemo_curator.filters given the module path.

Parameters:: filter_path (str) – The path to the filter in the form of “nemo_curator.filters.filter_name”
Returns:: The filter that is at the given path
Return type:: DocumentFilter | BitextFilter
Raises:: ValueError – If the filter_path does not point to a DocumentFilter
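Loading a class from a dotted path and validating its base class can be sketched with the standard library. This is an illustration of the general technique only, not the actual body of import_filter: the helper name `import_class_by_path` and its `required_base` parameter are hypothetical, and the demo uses a standard-library class so it runs anywhere.

```python
import importlib


def import_class_by_path(class_path: str, required_base: type) -> type:
    """Hypothetical helper: load a class from a dotted path and check its base."""
    # Split "package.module.ClassName" into the module path and the class name.
    module_path, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_path)
    cls = getattr(module, class_name)
    # Reject anything that is not a subclass of the required base.
    if not (isinstance(cls, type) and issubclass(cls, required_base)):
        raise ValueError(f"{class_path} does not point to a {required_base.__name__}")
    return cls


# OrderedDict subclasses dict, so this lookup passes the check.
ordered_dict_cls = import_class_by_path("collections.OrderedDict", dict)
```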

Modules#

class nemo_curator.ScoreFilter( filter_obj: DocumentFilter, text_field: str = 'text', score_field: str | None = None, score_type: type | str | None = None, invert: bool = False, )#

The module responsible for applying a filter to all documents in a DocumentDataset. It accepts an arbitrary DocumentFilter, first computes the score for each document, and then determines whether to keep the document based on the criteria in the DocumentFilter.

The filter can be applied to any field in the dataset, and the score can be saved to a field for later use. The filter can also be inverted so that “rejected” documents are kept.

__init__( filter_obj: DocumentFilter, text_field: str = 'text', score_field: str | None = None, score_type: type | str | None = None, invert: bool = False, )#

Constructs a ScoreFilter module.

Parameters:

filter_obj (DocumentFilter) – The score function that takes in a document string and outputs a score for the document.
text_field (str) – The field the documents will be read from.
score_field (str | None) – The field to which the scores will be written. If None, scores will be immediately discarded after use.
score_type (Union[type, str]) – The datatype of the score that will be made for each document.
invert (bool) – If True, will keep all documents that are normally discarded.

call( dataset: DocumentDataset, ) → DocumentDataset#

Scores and filters all records in the dataset

Parameters:: dataset (DocumentDataset) – The dataset to apply the module to
Returns:: A dataset with the score and filter applied
Return type:: DocumentDataset

compute_filter_mask( dataset: DocumentDataset, ) → pandas.Series | pandas.DataFrame#

Compute the bool mask to filter the dataset.

Parameters:: dataset (DocumentDataset) – The dataset to compute filter mask on.
Returns:: A mask corresponding to each data instance indicating whether it will be retained.
Return type:: Series or DataFrame
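The score-then-filter flow, including the optional score field and the invert flag, can be sketched on plain Python records. This is a simplified illustration under stated assumptions and not the module's implementation: the real module works on a DocumentDataset, while the hypothetical `score_and_filter` function below uses a list of dicts, and `count_words` and `at_least_two_words` are invented stand-ins for a filter's two methods.

```python
from typing import Callable, Optional


def score_and_filter(
    records: list[dict],
    score_fn: Callable[[str], float],
    keep_fn: Callable[[float], bool],
    text_field: str = "text",
    score_field: Optional[str] = None,
    invert: bool = False,
) -> list[dict]:
    """Hypothetical sketch: score each record, then keep or drop it."""
    kept = []
    for record in records:
        score = score_fn(record[text_field])
        keep = keep_fn(score)
        if invert:
            # Inverting keeps exactly the records that would be dropped.
            keep = not keep
        if keep:
            out = dict(record)
            if score_field is not None:
                # Store the score only when a destination field is given.
                out[score_field] = score
            kept.append(out)
    return kept


def count_words(text: str) -> float:
    return float(len(text.split()))


def at_least_two_words(score: float) -> bool:
    return score >= 2


docs = [{"text": "one two three"}, {"text": "one"}]
kept_docs = score_and_filter(docs, count_words, at_least_two_words, score_field="n_words")
rejected_docs = score_and_filter(docs, count_words, at_least_two_words, invert=True)
```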

class nemo_curator.Score( score_fn: Callable | DocumentFilter, score_field: str, text_field: str = 'text', score_type: type | str | None = None, )#

The module responsible for adding metadata to records based on statistics about the text. It accepts an arbitrary scoring function that accepts a text field and returns a score. It also accepts a DocumentFilter object, in which case the score_fn will be the score_document method of the DocumentFilter.

Unlike ScoreFilter, it does not filter based on the computed score. It only adds metadata to the record.

__init__( score_fn: Callable | DocumentFilter, score_field: str, text_field: str = 'text', score_type: type | str | None = None, )#

Constructs a Score module.

Parameters:

score_fn (Callable | DocumentFilter) – The score function or the DocumentFilter object. If it is a DocumentFilter object, the score_fn will be the score_document method of the DocumentFilter.
score_field (str) – The field the score will be stored in.
text_field (str) – The field the documents will be read from.
score_type (Union[type, str]) – The datatype of the score that will be made for each document.

call( dataset: DocumentDataset, ) → DocumentDataset#

Applies the scoring to a dataset

Parameters:: dataset (DocumentDataset) – The dataset to apply the module to
Returns:: A dataset with the new score
Return type:: DocumentDataset

class nemo_curator.Filter( filter_fn: Callable | DocumentFilter, filter_field: str, invert: bool = False, )#

The module responsible for filtering records based on a metadata field. It accepts an arbitrary filter function that accepts a metadata field and returns True if the field should be kept. It also accepts a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter. Unlike ScoreFilter, it does not compute the metadata based on a document. It only filters using existing metadata.

__init__( filter_fn: Callable | DocumentFilter, filter_field: str, invert: bool = False, )#

Constructs a Filter module

Parameters:

filter_fn (Callable | DocumentFilter) – A function that returns True if the document is to be kept, or a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter.
filter_field (str) – The field(s) to be passed into the filter function.
invert (bool) – Whether to invert the filter condition.

call( dataset: DocumentDataset, ) → DocumentDataset#

Applies the filtering to a dataset

Parameters:: dataset (DocumentDataset) – The dataset to apply the module to
Returns:: A dataset with entries removed according to the filter
Return type:: DocumentDataset

compute_filter_mask( dataset: DocumentDataset, ) → pandas.Series | pandas.DataFrame#

Compute the bool mask to filter the dataset.

Parameters:: dataset (DocumentDataset) – The dataset to compute filter mask on.
Returns:: A mask corresponding to each data instance indicating whether it will be retained.
Return type:: Series or DataFrame
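Filtering on a field that already exists, with a boolean mask computed first and applied second, can be sketched as follows. This is an illustration on plain Python records and not the module's code: `compute_mask` and `apply_filter` are hypothetical names, and the `n_words` field is an invented example of previously stored metadata.

```python
from typing import Any, Callable


def compute_mask(
    records: list[dict],
    filter_fn: Callable[[Any], bool],
    filter_field: str,
    invert: bool = False,
) -> list[bool]:
    """Hypothetical sketch: one boolean per record, True meaning 'retain'."""
    # Comparing with `invert` flips every decision when invert is True.
    return [bool(filter_fn(record[filter_field])) != invert for record in records]


def apply_filter(
    records: list[dict],
    filter_fn: Callable[[Any], bool],
    filter_field: str,
    invert: bool = False,
) -> list[dict]:
    mask = compute_mask(records, filter_fn, filter_field, invert)
    return [record for record, keep in zip(records, mask) if keep]


scored = [{"text": "a", "n_words": 3}, {"text": "b", "n_words": 120}]
mask = compute_mask(scored, lambda n: n >= 50, "n_words")
long_docs = apply_filter(scored, lambda n: n >= 50, "n_words")
short_docs = apply_filter(scored, lambda n: n >= 50, "n_words", invert=True)
```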

FastText Filters#

class nemo_curator.filters.FastTextLangId( model_path: str | None = None, min_langid_score: float = 0.3, )#

score_document(df: pandas.Series) → pandas.Series#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: df (pandas.Series) – The text content of the documents to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: pandas.Series
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.FastTextQualityFilter( model_path: str | None = None, label: str = '__label__hq', alpha: float = 3, seed: int = 42, )#

score_document( df: pandas.Series, ) → pandas.Series#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: df (pandas.Series) – The text content of the documents to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: pandas.Series
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document( df: pandas.Series, ) → pandas.Series#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: df (pandas.Series) – The scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True for each document that should be kept, False otherwise.
Return type:: pandas.Series
Raises:: NotImplementedError – If the method is not implemented in a subclass.

Quality Estimation Filters#

class nemo_curator.filters.QualityEstimationFilter( model_name: str, cutoff: float, mode: str = 'always_en_x', gpu: bool = False, **kwargs, )#

(Bitext filter) Use a Quality Estimation (QE) model to score individual segments and filter based on estimated quality score. (reference: https://arxiv.org/pdf/2311.05350)

score_bitext( src: pandas.Series, tgt: pandas.Series, src_lang: pandas.Series, tgt_lang: pandas.Series, ) → pandas.Series#

Wrapper function that scores documents in a data frame. Most work is done in _score_document_with_qe.

Parameters:: Takes two metadata fields, src_lang and tgt_lang. Refer to the _score_bitext_with_qe function for details.
Raises:: RuntimeError – If the input data frame arguments do not have the same length.
Returns:: A list of float scores, one for each document.
Return type:: pd.Series

keep_bitext(score: float) → bool#: Decides whether a single document should be retained according to a threshold of estimated quality score.

Heuristic Filters#

class nemo_curator.filters.NonAlphaNumericFilter( max_non_alpha_numeric_to_text_ratio: float = 0.25, )#

If more than 25% of the document is non-alphanumeric, then discard. Intended to be applied only to English text. Source: Adapted from Gopher (Rae et al., 2021)

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
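The ratio behind this filter can be sketched in a few lines. This is one plausible reading and not the library's implementation: the function names are hypothetical, and the sketch assumes that whitespace counts as non-alphanumeric and that a document exactly at the threshold is kept.

```python
def non_alphanumeric_ratio(text: str) -> float:
    """Hypothetical sketch: share of characters that are not letters or digits."""
    if not text:
        return 0.0
    # Assumption: every character failing str.isalnum() counts, spaces included.
    non_alnum = sum(1 for ch in text if not ch.isalnum())
    return non_alnum / len(text)


def keep_by_non_alphanumeric(score: float, max_ratio: float = 0.25) -> bool:
    # Assumption: the boundary value itself is kept.
    return score <= max_ratio


ratio_mostly_text = non_alphanumeric_ratio("abc!")
ratio_noisy = non_alphanumeric_ratio("ab!!")
```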

class nemo_curator.filters.SymbolsToWordsFilter( max_symbol_to_word_ratio: float = 0.1, lang: str = 'en', )#

Remove any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis. Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
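A symbol-to-word ratio for space-separated text can be sketched as below. This is an illustration and not the library's code: `symbol_to_word_ratio` is a hypothetical name, the sketch counts hash symbols and ellipses together rather than separately, and it handles only whitespace-delimited languages (no Chinese or Japanese segmentation).

```python
def symbol_to_word_ratio(text: str) -> float:
    """Hypothetical sketch: (hash symbols + ellipses) divided by word count."""
    # Assumption: words are whitespace-separated tokens.
    words = text.split()
    if not words:
        return 0.0
    # Count "#", the three-dot ellipsis, and the single-character ellipsis.
    symbol_count = text.count("#") + text.count("...") + text.count("…")
    return symbol_count / len(words)


ratio_hash = symbol_to_word_ratio("a b c d #")
ratio_ellipsis = symbol_to_word_ratio("wait... what")
ratio_clean = symbol_to_word_ratio("plain words only")
```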

class nemo_curator.filters.NumbersFilter(max_number_to_text_ratio: float = 0.15)#

If more than 15% of the document contains numbers, then discard.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.UrlsFilter(max_url_to_text_ratio: float = 0.2)#

If more than 20% of the document consists of URLs, then discard.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
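A URL-to-text ratio can be sketched by measuring how many characters fall inside URL matches. This is an illustration only: the regular expression `URL_PATTERN` is an assumption and much simpler than a production URL matcher, and `url_to_text_ratio` is a hypothetical name.

```python
import re

# Assumption: a URL is "http://" or "https://" followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def url_to_text_ratio(text: str) -> float:
    """Hypothetical sketch: characters inside URLs divided by all characters."""
    if not text:
        return 0.0
    url_chars = sum(len(match) for match in URL_PATTERN.findall(text))
    return url_chars / len(text)


ratio_with_url = url_to_text_ratio("see https://a.io now")
ratio_without_url = url_to_text_ratio("no links here")
```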

class nemo_curator.filters.BulletsFilter(max_bullet_lines_ratio: float = 0.9)#

If more than 90% of the lines start with a bullet, then discard. Source: Gopher (Rae et al., 2021)

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
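The bullet-line ratio can be sketched as the share of non-empty lines that begin with a bullet marker. This is an illustration and not the library's code: `BULLET_PREFIXES` is an assumed set of markers, `bullet_line_ratio` is a hypothetical name, and skipping blank lines is a choice made for this sketch.

```python
# Assumption: these three markers are treated as bullets in this sketch.
BULLET_PREFIXES = ("•", "-", "*")


def bullet_line_ratio(text: str) -> float:
    """Hypothetical sketch: fraction of non-empty lines starting with a bullet."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    # str.startswith accepts a tuple and matches any of its entries.
    bullet_lines = sum(1 for line in lines if line.lstrip().startswith(BULLET_PREFIXES))
    return bullet_lines / len(lines)


ratio_mixed = bullet_line_ratio("- a\n- b\nplain\n* c")
ratio_all_bullets = bullet_line_ratio("- a\n- b")
```

With the default threshold of 0.9, the mixed document above would be kept and the all-bullets document would be discarded.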

class nemo_curator.filters.WhiteSpaceFilter(max_white_space_ratio: float = 0.25)#

If the ratio of white space characters to all characters in the document exceeds max_white_space_ratio (0.25 by default), then discard.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.ParenthesesFilter(max_parentheses_ratio: float = 0.1)#

If more than 10% of the document is in parentheses, then discard.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.LongWordFilter(max_word_length: int = 1000, lang: str = 'en')#

If the document contains a word longer than 1000 characters, then discard. NOTE: This seems to be catching things like minified .js files that don’t have spaces anywhere. Source: C4 (Google)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.WordCountFilter( min_words: int = 50, max_words: int = 100000, lang: str = 'en', )#

If a document contains a number of words not within a specified range, then discard.

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
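The range check can be sketched for space-separated text. This is an illustration and not the library's code: both function names are hypothetical, words are assumed to be whitespace-separated (no Chinese or Japanese segmentation), and treating both bounds as inclusive is an assumption.

```python
def word_count(text: str) -> int:
    """Hypothetical sketch: number of whitespace-separated tokens."""
    return len(text.split())


def within_word_range(count: int, min_words: int = 50, max_words: int = 100000) -> bool:
    # Assumption: both bounds are inclusive.
    return min_words <= count <= max_words


tiny_count = word_count("one two three")
medium_count = word_count("word " * 500)
```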

class nemo_curator.filters.BoilerPlateStringFilter( remove_if_at_top_or_bottom: bool = True, max_boilerplate_string_ratio: float = 0.4, )#

If more than 40% of paragraphs contain boilerplate strings, then discard. This includes things like “terms of use”, “privacy policy”, etc. Source: Adapted significantly from Google C4 processing.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.MeanWordLengthFilter( min_mean_word_length: int = 3, max_mean_word_length: int = 10, lang: str = 'en', )#

If the mean word length is not in a specified range, then discard.

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
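The mean word length and its range check can be sketched as follows. This is an illustration and not the library's code: the function names are hypothetical, words are assumed to be whitespace-separated, and inclusive bounds are an assumption.

```python
def mean_word_length(text: str) -> float:
    """Hypothetical sketch: average length of whitespace-separated tokens."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def keep_by_mean_word_length(score: float, min_len: int = 3, max_len: int = 10) -> bool:
    # Assumption: both bounds are inclusive.
    return min_len <= score <= max_len


mean_normal = mean_word_length("ab abcd")
mean_short = mean_word_length("a b c")
```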

class nemo_curator.filters.RepeatedLinesFilter(max_repeated_line_fraction: float = 0.7)#

If the document shrinks by > 30% in terms of number of lines after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021)

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
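One way to read the relation between the 0.7 default and the 30% shrinkage is that the score is the fraction of lines remaining after deduplication. The sketch below follows that reading and is not the library's code: `unique_line_fraction` and `keep_by_line_fraction` are hypothetical names, and keeping a document exactly at the threshold is an assumption.

```python
def unique_line_fraction(text: str) -> float:
    """Hypothetical sketch: distinct lines divided by total lines."""
    lines = text.splitlines()
    if not lines:
        return 1.0
    return len(set(lines)) / len(lines)


def keep_by_line_fraction(score: float, min_fraction: float = 0.7) -> bool:
    # A score below 0.7 means the document shrank by more than 30%.
    return score >= min_fraction


fraction_repetitive = unique_line_fraction("a\nb\na\na")
fraction_distinct = unique_line_fraction("a\nb\nc")
```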

class nemo_curator.filters.RepeatedParagraphsFilter(max_repeated_paragraphs_ratio: float = 0.7)#

If the document shrinks by > 30% in terms of number of paragraphs after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021)

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.RepeatedLinesByCharFilter(max_repeated_lines_char_ratio: float = 0.8)#

If the document shrinks by > 20% in terms of number of characters after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021)

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: scores (float | list[int | float]) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
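A character-weighted variant of the duplicate-line check can be sketched by comparing the characters in first occurrences of each line with the characters in all lines. This is one plausible reading of the 0.8 default and not the library's code: `unique_line_char_fraction` is a hypothetical name, and newline characters are not counted.

```python
def unique_line_char_fraction(text: str) -> float:
    """Hypothetical sketch: characters in distinct lines over characters in all lines."""
    lines = text.splitlines()
    total_chars = sum(len(line) for line in lines)
    if total_chars == 0:
        return 1.0
    seen = set()
    unique_chars = 0
    for line in lines:
        # Only the first occurrence of each line contributes characters.
        if line not in seen:
            seen.add(line)
            unique_chars += len(line)
    return unique_chars / total_chars


char_fraction_repetitive = unique_line_char_fraction("aaaa\nbb\naaaa")
char_fraction_distinct = unique_line_char_fraction("aaaa\nbb")
```

Weighting by characters makes one long repeated line count for more than several short repeated lines, which a line-count ratio would not capture.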

class nemo_curator.filters.RepeatedParagraphsByCharFilter( max_repeated_paragraphs_char_ratio: float = 0.8, )#

If the document shrinks by > 20% in terms of number of characters after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021)

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: Any
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.RepeatingTopNGramsFilter( n: int = 2, max_repeating_ngram_ratio: float = 0.2, lang: str = 'en', )#

If the document shrinks by > x% in terms of number of characters after removing the top n-grams, then discard. Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
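
As an illustration of the top n-gram criterion that RepeatingTopNGramsFilter describes, the following is a minimal sketch and not the library's implementation. The helper name `top_ngram_char_ratio` is hypothetical, words are assumed to be whitespace-separated, and treating an n-gram that occurs only once as "not repeating" is an assumption of this sketch.

```python
from collections import Counter


def top_ngram_char_ratio(text: str, n: int = 2) -> float:
    """Share of word characters covered by the most frequent word n-gram."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_ngram, count = Counter(ngrams).most_common(1)[0]
    if count < 2:
        # Assumption: an n-gram seen once does not count as repetition.
        return 0.0
    total_chars = sum(len(w) for w in words)
    ngram_chars = count * sum(len(w) for w in top_ngram)
    # Overlapping occurrences are counted separately, so clamp to 1.0.
    return min(1.0, ngram_chars / total_chars)


ratio = top_ngram_char_ratio("the cat the cat the cat sat")
print(ratio, ratio > 0.2)
```

With the default max_repeating_ngram_ratio of 0.2, a document like the one above would be discarded under this sketch.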

class nemo_curator.filters.RepeatingDuplicateNGramsFilter( n: int = 2, max_repeating_duplicate_ngram_ratio: float = 0.2, lang: str = 'en', )#

If the document shrinks by > x% in terms of number of characters after removing all duplicate n-grams, then discard. Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
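
One possible reading of the duplicate n-gram criterion in RepeatingDuplicateNGramsFilter is sketched below. It marks every word that belongs to an n-gram occurring more than once and counts each marked word only once, so overlapping n-grams are not double counted. The helper name `duplicate_ngram_char_ratio` is hypothetical, whitespace tokenization is assumed, and the library may count duplicates differently.

```python
from collections import Counter


def duplicate_ngram_char_ratio(text: str, n: int = 2) -> float:
    """Share of word characters that sit inside any repeated word n-gram."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    covered = [False] * len(words)
    for start, ngram in enumerate(ngrams):
        if counts[ngram] > 1:
            # Mark all words of this repeated n-gram; overlaps are marked once.
            for pos in range(start, start + n):
                covered[pos] = True
    duplicate_chars = sum(len(w) for w, hit in zip(words, covered) if hit)
    return duplicate_chars / sum(len(w) for w in words)


print(duplicate_ngram_char_ratio("a b c a b d"))
```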

class nemo_curator.filters.PunctuationFilter( max_num_sentences_without_endmark_ratio: float = 0.85, )#

If more than 85% of the sentences do not end with a punctuation mark, then discard. Source: Google C4 processing

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
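
The rule that PunctuationFilter describes can be approximated as follows. This is a sketch under stated assumptions: sentences are taken to be non-empty lines, the set of end marks in `END_MARKS` is a guess at typical terminal punctuation, and `ratio_without_end_mark` is a hypothetical helper rather than a library function.

```python
# Assumption: these characters count as sentence-ending punctuation.
END_MARKS = (".", "!", "?", '"')


def ratio_without_end_mark(text: str) -> float:
    """Fraction of non-empty lines that do not end with an end mark."""
    sentences = [s.strip() for s in text.split("\n") if s.strip()]
    if not sentences:
        return 0.0
    missing = sum(1 for s in sentences if not s.endswith(END_MARKS))
    return missing / len(sentences)


doc = "Hello world.\nNo end mark\nReally?\nmenu item"
score = ratio_without_end_mark(doc)
print(score, score <= 0.85)
```

Under the default threshold of 0.85 this document would be kept, since only half of its lines lack an end mark.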

class nemo_curator.filters.EllipsisFilter(max_num_lines_ending_with_ellipsis_ratio: float = 0.3)#

If more than 30% of the sentences end with an ellipsis, then discard. Source: Google C4 processing

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
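
A minimal sketch of the ellipsis criterion used by EllipsisFilter is shown below. It assumes that the unit being counted is a non-empty line and that both three ASCII dots and the single-character ellipsis count; `ellipsis_line_ratio` is a hypothetical helper name.

```python
# Assumption: both "..." and the Unicode ellipsis character are recognized.
ELLIPSIS_MARKS = ("...", "\u2026")


def ellipsis_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that end with an ellipsis."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    if not lines:
        return 0.0
    return sum(1 for line in lines if line.endswith(ELLIPSIS_MARKS)) / len(lines)


doc = "Read more...\nA full sentence.\nAnother one.\nClick here\u2026"
score = ellipsis_line_ratio(doc)
print(score, score <= 0.3)
```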

class nemo_curator.filters.CommonEnglishWordsFilter( min_num_common_words: int = 2, stop_at_false: bool = True, )#

If the sentence contains at least 2 common English words, then keep it. NOTE: We purposefully check for the lowercase versions of those common words to remove documents with over-capitalization.

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text: str) → int#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: int
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: int) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (int) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
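
The lowercase matching noted for CommonEnglishWordsFilter can be illustrated with a short sketch. The word list in `COMMON_WORDS` is an assumed example set, not necessarily the one the library ships, and `count_common_words` is a hypothetical helper.

```python
# Assumption: an illustrative set of common English words.
COMMON_WORDS = frozenset({"the", "be", "to", "of", "and", "that", "have", "with"})


def count_common_words(text: str) -> int:
    """Count whitespace-separated words that exactly match a lowercase common word."""
    return sum(1 for word in text.split() if word in COMMON_WORDS)


print(count_common_words("the cat sat with the dog"))
print(count_common_words("THE CAT AND THE DOG"))
```

Because the comparison is case-sensitive, the fully capitalized text contributes no matches and would fall below min_num_common_words.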

class nemo_curator.filters.WordsWithoutAlphabetsFilter( min_words_with_alphabets: float = 0.8, lang: str = 'en', )#

80% of words in a document must contain at least one alphabetic character. Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
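
The ratio that WordsWithoutAlphabetsFilter thresholds can be sketched as follows for space-separated languages. `alphabetic_word_ratio` is a hypothetical helper, and returning 0.0 for an empty document is an assumption of this sketch.

```python
def alphabetic_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated words with at least one alphabetic character."""
    words = text.split()
    if not words:
        return 0.0
    with_alpha = sum(1 for w in words if any(ch.isalpha() for ch in w))
    return with_alpha / len(words)


score = alphabetic_word_ratio("price 100 usd 2024 ok")
print(score, score >= 0.8)
```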

class nemo_curator.filters.PornographicUrlsFilter#

Check if any of the URLs within the document point to pornography.

score_document(text: str) → int#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: int
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: int) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (int) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.HistogramFilter( lang: str | None = 'en', threshold: float | None = 0.8, cache_dir: str | None = '', threshold_char: str | None = ']', )#

Histogram filter used by the NLLB paper (https://arxiv.org/pdf/2207.04672). See p30 for details.

The high-level idea of the histogram filter can be described as a cheap version of language ID. It checks what ratio of characters in the data instance are included in the character histograms collected from trusted data in the corresponding language. If the ratio is too low, there is a good chance of a language ID mismatch, and the data instance should be discarded.

Written with reference to the original fairseq implementation at: facebookresearch/fairseq.

score_document(text: str) → float#

Compute histogram token ratio of a text data instance according to the loaded histogram.

Parameters:: text (str) – Text data instance.
Returns:: Ratio of tokens included in the histogram.
Return type:: float

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
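
The character-coverage idea behind HistogramFilter can be shown with a toy example. The character set `TRUSTED_CHARS` below is a made-up stand-in for a real per-language histogram, and `histogram_ratio` is a hypothetical helper; the library loads its histograms from files and may treat edge cases differently.

```python
# Assumption: a tiny stand-in for a histogram of trusted English characters.
TRUSTED_CHARS = frozenset("abcdefghijklmnopqrstuvwxyz ")


def histogram_ratio(text: str, histogram: frozenset = TRUSTED_CHARS) -> float:
    """Fraction of characters in text that appear in the trusted character set."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ch in histogram) / len(text)


for sample in ("hello world", "hello мир"):
    score = histogram_ratio(sample)
    print(sample, round(score, 3), score >= 0.8)
```

The mixed-script sample falls below the default threshold of 0.8, which is the kind of language ID mismatch the filter is meant to catch.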

class nemo_curator.filters.LengthRatioFilter( max_ratio: float = 3.0, src_lang: str = 'en', tgt_lang: str = 'en', **kwargs, )#

(Bitext filter) Length ratio filter for bitext, similar to the one implemented in Moses toolkit (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl).

If the ratio between source and target tokens is not within a specified range then discard. Either direction (src/tgt, tgt/src) is considered.

score_bitext(src: str, tgt: str) → float#

Tokenize the source and target sentences and compute length ratio.

Parameters:: src (str) – Source document string. tgt (str) – Target document string.
Returns:: The maximum ratio among the two translation directions of the bitext.
Return type:: float

keep_bitext(score: float) → bool#: Decides whether a single document should be retained according to the computed length ratio.
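
The bidirectional ratio described for LengthRatioFilter can be sketched as below. Whitespace splitting stands in for the language-specific tokenizers, `length_ratio` is a hypothetical helper, and returning infinity when either side is empty is an assumption of this sketch.

```python
def length_ratio(src: str, tgt: str) -> float:
    """Larger of src/tgt and tgt/src token-count ratios."""
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        # Assumption: an empty side is treated as an unbounded ratio.
        return float("inf")
    return max(src_len / tgt_len, tgt_len / src_len)


print(length_ratio("a b c", "x y"))
print(length_ratio("a b c d e f g h", "x y"))
```

With the default max_ratio of 3.0, the first pair would pass and the second would be discarded.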

class nemo_curator.filters.TokenCountFilter( tokenizer: transformers.AutoTokenizer, min_tokens: int = 0, max_tokens: int = inf, )#

If the document contains fewer than min_tokens or more than max_tokens tokens, then discard.

score_document(text: str) → int#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: int
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: int) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (int) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
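
The range check performed by TokenCountFilter can be sketched without loading a real tokenizer. Here `str.split` stands in for the tokenizer, `token_count` and `keep_by_token_count` are hypothetical helpers, and treating both bounds as inclusive is an assumption.

```python
from typing import Callable, List


def token_count(text: str, tokenize: Callable[[str], List[str]] = str.split) -> int:
    """Number of tokens produced by the given tokenizer function."""
    return len(tokenize(text))


def keep_by_token_count(
    count: int, min_tokens: int = 0, max_tokens: float = float("inf")
) -> bool:
    """Keep when the count lies inside the range (bounds assumed inclusive)."""
    return min_tokens <= count <= max_tokens


n = token_count("a short example document")
print(n, keep_by_token_count(n, min_tokens=2, max_tokens=10))
```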

class nemo_curator.filters.SubstringFilter( substring: str, position: Literal['prefix', 'suffix', 'any'], )#

Keeps documents that contain a substring in a given position. Gives a score of 1 if the substring is found in the given position, otherwise 0.

score_document(text: str) → int#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: text (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: int
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: int) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (int) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
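
The scoring rule stated for SubstringFilter maps directly onto Python string methods. The sketch below uses a hypothetical helper, `substring_score`, and does not strip whitespace before matching, which the library may or may not do.

```python
def substring_score(text: str, substring: str, position: str) -> int:
    """Return 1 if substring occurs at the requested position, otherwise 0."""
    if position == "prefix":
        return int(text.startswith(substring))
    if position == "suffix":
        return int(text.endswith(substring))
    if position == "any":
        return int(substring in text)
    raise ValueError(f"Unknown position: {position!r}")


print(substring_score("Title: Filters", "Title:", "prefix"))
print(substring_score("Title: Filters", "Title:", "suffix"))
```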

Code Filters#

class nemo_curator.filters.PythonCommentToCodeFilter( min_comment_to_code_ratio: float = 0.01, max_comment_to_code_ratio: float = 0.85, )#

score_document(source: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: source (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.GeneralCommentToCodeFilter( language: str, min_comment_to_code_ratio: float = 0.01, max_comment_to_code_ratio: float = 0.85, )#

score_document(source: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: source (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.NumberOfLinesOfCodeFilter(min_lines: int = 10, max_lines: int = 20000)#

score_document(source: str) → int#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: source (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: int
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: int) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (int) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.TokenizerFertilityFilter( path_to_tokenizer: str | None = None, min_char_to_token_ratio: float = 2.5, )#

score_document(source: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: source (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.XMLHeaderFilter(char_prefix_search_length: int = 100)#

This filter tries to identify files that have incorrect file extensions. In many cases, these end up being XML files and we try to identify them based on the header. (Source: Starcoder https://arxiv.org/abs/2305.06161)

score_document(source: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: source (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
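
The header check that XMLHeaderFilter describes might look like the sketch below. The literal marker `<?xml version=` and the helper name `has_xml_header` are assumptions; only the default prefix length of 100 characters comes from the signature above.

```python
def has_xml_header(source: str, char_prefix_search_length: int = 100) -> bool:
    """Report whether an XML declaration appears near the start of the file."""
    # Assumption: the XML declaration prefix is the marker being searched for.
    return "<?xml version=" in source[:char_prefix_search_length]


print(has_xml_header('<?xml version="1.0"?>\n<root/>'))
print(has_xml_header('print("hello")'))
```

A file for which this returns True is likely XML stored under a misleading extension and is a candidate for removal.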

class nemo_curator.filters.AlphaFilter(min_alpha_ratio: float = 0.25)#

This filter tries to identify files that have large tensors, or tables stored as raw text within code files. (Source: Starcoder https://arxiv.org/abs/2305.06161)

score_document(source: str) → float#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: source (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.
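
The alphabetic-ratio check behind AlphaFilter can be approximated as follows. `alpha_ratio` is a hypothetical helper, and using `str.isalpha` (which also accepts non-ASCII letters) is an assumption; the library may restrict the count to ASCII letters.

```python
def alpha_ratio(source: str) -> float:
    """Fraction of characters in source that are alphabetic."""
    if not source:
        return 0.0
    return sum(ch.isalpha() for ch in source) / len(source)


for sample in ("ab12", "0.1, 0.2, 0.3, 0.4"):
    score = alpha_ratio(sample)
    print(sample, score, score >= 0.25)
```

A file that is mostly numeric data, like the second sample, falls below the default min_alpha_ratio of 0.25.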

class nemo_curator.filters.HTMLBoilerplateFilter( min_lang_content_ratio: float = 0.2, min_lang_content_num_chars: int = 100, )#

This filter tries to identify HTML that is largely boilerplate.

score_document(source: str) → float | None#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:: source (str) – The text content of the document to be scored.
Returns:: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Return type:: float | None
Raises:: NotImplementedError – If the method is not implemented in a subclass.

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.PerExtensionFilter( lang: str, extension: str, metadata_file: str = 'code_meta.csv', )#

This filter applies specific conditions depending on the file extension.

score_document(source: str) → float#: Filter files based on line length and % alphanumeric characters. The filtering parameters depend on the file extension, given by ext_to_filter

keep_document(score: float | None) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:: score (float | None) – The score returned by score_document(). The type should match what is returned by score_document().
Returns:: True if the document should be kept, False otherwise.
Return type:: bool
Raises:: NotImplementedError – If the method is not implemented in a subclass.