
Filters

Base Class

class nemo_curator.filters.DocumentFilter

An abstract base class for text-based document filters.

This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.

abstract score_document(text: str) → Any

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

text (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

abstract keep_document(scores: Any) → bool

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

scores (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
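The two-method contract above can be illustrated with a minimal sketch. To keep it runnable without the library installed, the example defines a local stand-in for DocumentFilter that mirrors only the two documented abstract methods; MinLengthFilter and its min_chars parameter are hypothetical names used for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class DocumentFilter(ABC):
    """Local stand-in mirroring the two abstract methods documented above."""

    @abstractmethod
    def score_document(self, text: str) -> Any:
        ...

    @abstractmethod
    def keep_document(self, scores: Any) -> bool:
        ...


class MinLengthFilter(DocumentFilter):
    """Hypothetical filter: keep documents with at least `min_chars` characters."""

    def __init__(self, min_chars: int = 20) -> None:
        self._min_chars = min_chars

    def score_document(self, text: str) -> int:
        # The score is simply the character count of the document.
        return len(text)

    def keep_document(self, scores: int) -> bool:
        # Keep the document when its score meets the threshold.
        return scores >= self._min_chars


doc_filter = MinLengthFilter(min_chars=10)
score = doc_filter.score_document("A short but acceptable document.")
print(score, doc_filter.keep_document(score))
```

The value returned by score_document() is passed unchanged to keep_document(), which is why the two methods should agree on the score type.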

nemo_curator.filters.import_filter(
filter_path: str,
) → DocumentFilter

Imports a filter under nemo_curator.filters given the module path.

Parameters:

filter_path (str) – The path to the filter in the form of “nemo_curator.filters.filter_name”

Returns:

The filter that is at the given path

Return type:

DocumentFilter

Raises:

ValueError – If the filter_path does not point to a DocumentFilter
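Resolving a class from a dotted path and validating its base type could look roughly like the sketch below. This is not the library's implementation; import_class is a hypothetical helper, and standard-library classes stand in for filters so the example runs anywhere.

```python
import importlib


def import_class(path: str, base: type) -> type:
    """Resolve 'package.module.ClassName' and check that it derives from `base`."""
    module_path, _, class_name = path.rpartition(".")
    module = importlib.import_module(module_path)
    cls = getattr(module, class_name)
    if not (isinstance(cls, type) and issubclass(cls, base)):
        raise ValueError(f"{path} does not point to a subclass of {base.__name__}")
    return cls


# OrderedDict derives from dict, so this lookup succeeds.
print(import_class("collections.OrderedDict", dict).__name__)
```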

Modules

class nemo_curator.ScoreFilter(
filter_obj: DocumentFilter,
text_field: str = 'text',
score_field: str | None = None,
score_type: type | str | None = None,
invert: bool = False,
)

The module responsible for applying a filter to all documents in a DocumentDataset. It accepts an arbitrary DocumentFilter, first computes the score for each document, and then determines whether to keep the document based on the criteria in the DocumentFilter.

The filter can be applied to any field in the dataset, and the score can be written to a field for later use. The filter can also be inverted so that “rejected” documents are kept.

__call__(
dataset: DocumentDataset,
) → DocumentDataset

Scores and filters all records in the dataset.

Parameters:

dataset (DocumentDataset) – The dataset to apply the module to

Returns:

A dataset with the score and filter applied

Return type:

DocumentDataset

__init__(
filter_obj: DocumentFilter,
text_field: str = 'text',
score_field: str | None = None,
score_type: type | str | None = None,
invert: bool = False,
)

Constructs a ScoreFilter module.

Parameters:
  • filter_obj (DocumentFilter) – The filter object that computes a score for each document string and decides whether the document is kept.

  • text_field (str) – The field the documents will be read from.

  • score_field (str | None) – The field to which the scores will be written. If None, scores will be discarded immediately after use.

  • score_type (Union[type, str]) – The datatype of the score computed for each document.

  • invert (bool) – If True, keeps the documents that would normally be discarded.
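The score-then-filter flow and the effect of score_field and invert can be sketched in plain Python. The example works on a list of dictionaries rather than a DocumentDataset, and both score_filter and the WordCount filter object are hypothetical stand-ins, not library code.

```python
class WordCount:
    """Hypothetical filter object exposing score_document() and keep_document()."""

    def score_document(self, text: str) -> int:
        return len(text.split())

    def keep_document(self, scores: int) -> bool:
        return scores >= 3


def score_filter(records, filter_obj, text_field="text", score_field=None, invert=False):
    """Score each record, optionally store the score, and keep or drop the record."""
    kept = []
    for record in records:
        score = filter_obj.score_document(record[text_field])
        keep = filter_obj.keep_document(score)
        if invert:
            # Inverting keeps exactly the documents that would be discarded.
            keep = not keep
        if keep:
            out = dict(record)
            if score_field is not None:
                out[score_field] = score
            kept.append(out)
    return kept


docs = [{"text": "one two three four"}, {"text": "too short"}]
print(score_filter(docs, WordCount(), score_field="n_words"))
print(score_filter(docs, WordCount(), invert=True))
```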

class nemo_curator.Score(
score_fn: Callable,
score_field: str,
text_field: str = 'text',
score_type: type | str | None = None,
)

The module responsible for adding metadata to records based on statistics about the text. It accepts an arbitrary scoring function that accepts a text field and returns a score.

Unlike ScoreFilter, it does not filter based on the computed score. It only adds metadata to the record.

__call__(
dataset: DocumentDataset,
) → DocumentDataset

Applies the scoring to a dataset.

Parameters:

dataset (DocumentDataset) – The dataset to apply the module to

Returns:

A dataset with the new score

Return type:

DocumentDataset

__init__(
score_fn: Callable,
score_field: str,
text_field: str = 'text',
score_type: type | str | None = None,
)

Constructs a Score module.

Parameters:
  • score_fn (Callable) – The score function that takes in a document string and outputs a score for the document.

  • score_field (str) – The field the score will be stored in.

  • text_field (str) – The field the documents will be read from.

  • score_type (Union[type, str]) – The datatype of the score that will be made for each document.
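As a rough illustration of scoring without filtering, the sketch below adds a score to each record and drops nothing. It uses a list of dictionaries in place of a DocumentDataset; add_score is a hypothetical helper, not part of the library.

```python
def add_score(records, score_fn, score_field, text_field="text"):
    """Return copies of the records with score_fn(text) stored under score_field."""
    return [{**record, score_field: score_fn(record[text_field])} for record in records]


docs = [{"text": "hello world"}, {"text": "a"}]
scored = add_score(docs, score_fn=len, score_field="num_chars")
print(scored)
```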

class nemo_curator.Filter(
filter_fn: Callable,
filter_field: str,
invert: bool = False,
)

The module responsible for filtering records based on a metadata field. It accepts an arbitrary filter function that accepts a metadata field and returns True if the field should be kept.

Unlike ScoreFilter, it does not compute the metadata based on a document. It only filters using existing metadata.

__call__(
dataset: DocumentDataset,
) → DocumentDataset

Applies the filtering to a dataset.

Parameters:

dataset (DocumentDataset) – The dataset to apply the module to

Returns:

A dataset with entries removed according to the filter

Return type:

DocumentDataset

__init__(
filter_fn: Callable,
filter_field: str,
invert: bool = False,
)

Constructs a Filter module.

Parameters:
  • filter_fn (Callable) – A function that returns True if the document is to be kept.

  • filter_field (str) – The field(s) to be passed into the filter function.

  • invert (bool) – Whether to invert the filter condition.
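Filtering on an existing metadata field can be sketched as follows. The example again uses a list of dictionaries instead of a DocumentDataset, and filter_records is a hypothetical helper that only mirrors the filter_fn, filter_field, and invert parameters.

```python
def filter_records(records, filter_fn, filter_field, invert=False):
    """Keep records where filter_fn(record[filter_field]) is True (False if inverted)."""
    return [r for r in records if bool(filter_fn(r[filter_field])) != invert]


docs = [{"text": "hello world", "num_chars": 11}, {"text": "a", "num_chars": 1}]
print(filter_records(docs, lambda n: n >= 5, "num_chars"))
print(filter_records(docs, lambda n: n >= 5, "num_chars", invert=True))
```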

FastText Filters

class nemo_curator.filters.FastTextLangId(model_path=None, min_langid_score=0.3)

score_document(df: pandas.Series)

Computes this filter's score. This overrides DocumentFilter.score_document(); note that the override takes a pandas.Series (df) rather than a single text string.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.FastTextQualityFilter(
model_path=None,
label='__label__hq',
alpha=3,
seed=42,
)

score_document(df: pandas.Series)

Computes this filter's score. This overrides DocumentFilter.score_document(); note that the override takes a pandas.Series (df) rather than a single text string.

keep_document(df: pandas.Series)

Determines which documents to keep based on the scores from score_document(). This overrides DocumentFilter.keep_document(); note that the override takes a pandas.Series (df).

Heuristic Filters

class nemo_curator.filters.NonAlphaNumericFilter(
max_non_alpha_numeric_to_text_ratio: float = 0.25,
)

If more than 25% of the document is non-alphanumeric, then discard. Intended to be applied only to English text. Source: Adapted from Gopher (Rae et al., 2021)
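A minimal sketch of the ratio behind this heuristic is shown below. It assumes every character that is not a letter or digit, including white space, counts as non-alphanumeric; the library's exact character classes may differ, and both function names are hypothetical.

```python
def non_alpha_numeric_ratio(text: str) -> float:
    """Fraction of characters that are neither letters nor digits."""
    if not text:
        return 0.0
    non_alnum = sum(1 for ch in text if not ch.isalnum())
    return non_alnum / len(text)


def keep(text: str, max_ratio: float = 0.25) -> bool:
    # Keep the document when the ratio does not exceed the threshold.
    return non_alpha_numeric_ratio(text) <= max_ratio


print(non_alpha_numeric_ratio("ab!!"), keep("hello world"), keep("#$%^ ok"))
```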

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.SymbolsToWordsFilter(max_symbol_to_word_ratio=0.1, lang='en')

Remove any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis. Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
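The symbol-to-word ratio for space-separated text might be computed along these lines. Summing hash and ellipsis counts into a single ratio, and treating both "..." and "…" as an ellipsis, are assumptions of this sketch; symbol_to_word_ratio is a hypothetical name.

```python
def symbol_to_word_ratio(text: str) -> float:
    """Number of hash and ellipsis symbols divided by the number of words."""
    words = text.split()  # assumes words are separated by white space
    if not words:
        return 0.0
    symbols = text.count("#") + text.count("...") + text.count("…")
    return symbols / len(words)


print(symbol_to_word_ratio("#a #b c d"), symbol_to_word_ratio("plain text here"))
```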

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.NumbersFilter(max_number_to_text_ratio=0.15)

If more than 15% of the document contains numbers, then discard.

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.UrlsFilter(max_url_to_text_ratio=0.2)

If more than 20% of the document is comprised of URLs, then discard.
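One way to estimate the share of a document taken up by URLs is sketched below. The regular expression is a deliberately simplified assumption and is not the pattern the library uses; url_to_text_ratio is a hypothetical name.

```python
import re

# Simplified, assumed URL pattern: scheme followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def url_to_text_ratio(text: str) -> float:
    """Characters inside URL matches divided by the total number of characters."""
    if not text:
        return 0.0
    url_chars = sum(len(match) for match in URL_PATTERN.findall(text))
    return url_chars / len(text)


print(url_to_text_ratio("see https://a.io now"))
```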

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.BulletsFilter(max_bullet_lines_ratio=0.9)

If more than 90% of the lines start with a bullet, then discard. Source: Gopher (Rae et al., 2021)
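The bullet-line ratio can be sketched as the fraction of non-empty lines that begin with a bullet marker. The set of markers below is an assumption made for illustration, as is the choice to ignore empty lines.

```python
BULLETS = ("•", "-", "*", "‣", "◦")  # assumed bullet markers


def bullet_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that start with a bullet marker."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    bulleted = sum(1 for line in lines if line.startswith(BULLETS))
    return bulleted / len(lines)


print(bullet_line_ratio("- a\n- b\ntext\n* c"))
```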

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.WhiteSpaceFilter(max_white_space_ratio=0.25)

If the ratio of white space characters in the document exceeds max_white_space_ratio (0.25 by default), then discard.

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.ParenthesesFilter(max_parentheses_ratio=0.1)

If more than 10% of the sentence is in parentheses, then discard.

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.LongWordFilter(max_word_length=1000, lang='en')

If the document contains a word longer than 1000 characters, then discard. NOTE: This seems to be catching things like minified .js files that don’t have spaces anywhere. Source: C4 (Google)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.WordCountFilter(min_words=50, max_words=100000, lang='en')

If the number of words in a document is not within the range from min_words to max_words (50 to 100000 by default), then discard.

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
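For space-separated languages, the word-count check reduces to a split and a range test, as in the sketch below. Treating both bounds as inclusive is an assumption, and keep_by_word_count is a hypothetical name.

```python
def keep_by_word_count(text: str, min_words: int = 50, max_words: int = 100000) -> bool:
    """Keep the document when its white-space-separated word count is within range."""
    num_words = len(text.split())
    # Both bounds are treated as inclusive in this sketch.
    return min_words <= num_words <= max_words


print(keep_by_word_count("word " * 60), keep_by_word_count("too few words"))
```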

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.BoilerPlateStringFilter(
remove_if_at_top_or_bottom=True,
max_boilerplate_string_ratio=0.4,
)

If more than 40% of paragraphs contain boilerplate strings, then discard. This includes things like “terms of use”, “privacy policy”, etc. Source: Adapted significantly from Google C4 processing.

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.MeanWordLengthFilter(
min_mean_word_length=3,
max_mean_word_length=10,
lang='en',
)

If the mean word length is not within the range from min_mean_word_length to max_mean_word_length (3 to 10 by default), then discard.

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
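A sketch of the mean-word-length score for space-separated text follows. Returning 0.0 for an empty document and treating the bounds as inclusive are assumptions; the function names are hypothetical.

```python
def mean_word_length(text: str) -> float:
    """Average number of characters per white-space-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def keep_by_mean_length(text: str, min_len: float = 3, max_len: float = 10) -> bool:
    # Bounds are treated as inclusive in this sketch.
    return min_len <= mean_word_length(text) <= max_len


print(mean_word_length("aa bbbb"), keep_by_mean_length("a b c"))
```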

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.RepeatedLinesFilter(max_repeated_line_fraction=0.7)

If the document shrinks by > 30% in terms of number of lines after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021)
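The relationship between the 0.7 default and the 30% figure can be seen in a small sketch: if the score is the fraction of lines that remain after deduplication, a document that shrinks by more than 30% scores below 0.7. The score definition and the handling of empty lines are assumptions, and the function names are hypothetical.

```python
def unique_line_fraction(text: str) -> float:
    """Number of distinct non-empty lines divided by the number of non-empty lines."""
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return 1.0  # nothing to deduplicate
    return len(set(lines)) / len(lines)


def keep_by_repeated_lines(text: str, threshold: float = 0.7) -> bool:
    # Keep the document when at least `threshold` of its lines survive deduplication.
    return unique_line_fraction(text) >= threshold


print(unique_line_fraction("a\nb\na\na"), keep_by_repeated_lines("a\nb\nc"))
```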

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.RepeatedParagraphsFilter(max_repeated_paragraphs_ratio=0.7)

If the document shrinks by > 30% in terms of number of paragraphs after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021)

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.RepeatedLinesByCharFilter(max_repeated_lines_char_ratio=0.8)

If the document shrinks by > 20% in terms of number of characters after removing duplicate lines, then discard. Source: Gopher (Rae et al., 2021)

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.RepeatedParagraphsByCharFilter(max_repeated_paragraphs_char_ratio=0.8)

If the document shrinks by > 20% (with the default ratio of 0.8) in terms of number of characters after removing duplicate paragraphs, then discard. Source: Gopher (Rae et al., 2021)

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.RepeatingTopNGramsFilter(
n=2,
max_repeating_ngram_ratio=0.2,
lang='en',
)

If the document shrinks by more than max_repeating_ngram_ratio (0.2, that is 20%, by default) in terms of number of characters after removing the top n-grams, then discard. Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.
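The top n-gram heuristic can be approximated as the share of word characters covered by the most frequent n-gram. This sketch is an interpretation of the description above rather than the library's algorithm: it ignores white space, returns 0.0 when no n-gram repeats, and caps the ratio at 1.0 because overlapping occurrences are counted in full. The function name is hypothetical.

```python
from collections import Counter


def top_ngram_char_ratio(text: str, n: int = 2) -> float:
    """Share of word characters covered by the most frequent word n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    top, count = Counter(ngrams).most_common(1)[0]
    if count < 2:
        return 0.0  # no n-gram is repeated
    total_chars = sum(len(word) for word in words)
    ngram_chars = sum(len(word) for word in top) * count
    # Overlapping occurrences are counted in full, so cap the ratio at 1.0.
    return min(1.0, ngram_chars / total_chars)


print(top_ngram_char_ratio("the cat the cat sat"))
```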

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.RepeatingDuplicateNGramsFilter(
n=2,
max_repeating_duplicate_ngram_ratio=0.2,
lang='en',
)

If the document shrinks by more than max_repeating_duplicate_ngram_ratio (0.2, that is 20%, by default) in terms of number of characters after removing all duplicate n-grams, then discard. Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text)

Computes this filter's score for the given document text. This overrides DocumentFilter.score_document(); the parameter and return conventions are described under Base Class.

keep_document(score)

Returns True if the document should be kept based on the score from score_document(), and False otherwise. This overrides DocumentFilter.keep_document().

class nemo_curator.filters.PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85)

If more than 85% of the sentences do not end with a punctuation mark, then discard. Source: Google C4 processing
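The ratio of sentences without an end mark can be sketched as below. Two simplifications are assumed: each non-empty line is treated as one sentence instead of running a sentence tokenizer, and the set of end marks is chosen for illustration only.

```python
END_MARKS = (".", "!", "?", '"')  # assumed set of terminal punctuation marks


def ratio_without_end_mark(text: str) -> float:
    """Fraction of non-empty lines that do not end with a terminal punctuation mark."""
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    if not sentences:
        return 0.0
    missing = sum(1 for sentence in sentences if not sentence.endswith(END_MARKS))
    return missing / len(sentences)


print(ratio_without_end_mark("Done.\nMenu\nHome\nReally?"))
```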

score_document(text)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

text (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
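As a rough illustration of the punctuation heuristic, the sketch below scores a document by the share of sentences that lack an end mark. It is an approximation under stated assumptions, not the library's code: each non-empty line is treated as one sentence, the set of end marks is a guess, and the function names are hypothetical.

```python
# Assumed set of characters that count as a sentence end mark.
END_MARKS = (".", "!", "?", '"')


def ratio_without_end_mark(text: str) -> float:
    """Share of sentences (here: non-empty lines) that do not end with an end mark."""
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    if not sentences:
        # Arbitrary choice for this sketch: an empty document scores worst.
        return 1.0
    missing = sum(1 for sentence in sentences if not sentence.endswith(END_MARKS))
    return missing / len(sentences)


def keep_by_punctuation(score: float, max_ratio: float = 0.85) -> bool:
    """Keep the document unless more than max_ratio of its sentences lack an end mark."""
    return score <= max_ratio
```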

class nemo_curator.filters.EllipsisFilter(max_num_lines_ending_with_ellipsis_ratio=0.3)#

If more than 30% of the sentences end with an ellipsis, then discard. Source: Google C4 processing

score_document(text)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

text (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
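The ellipsis heuristic can be sketched in a few lines. This is an illustration only: it works on non-empty lines (following the parameter name), the accepted ellipsis spellings are an assumption, and the function names are hypothetical.

```python
# Assumed spellings of an ellipsis at the end of a line.
ELLIPSES = ("...", "\u2026", ". . .")


def ellipsis_line_ratio(text: str) -> float:
    """Share of non-empty lines that end with an ellipsis."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    with_ellipsis = sum(1 for line in lines if line.endswith(ELLIPSES))
    return with_ellipsis / len(lines)


def keep_by_ellipsis(score: float, max_ratio: float = 0.3) -> bool:
    """Keep the document unless more than max_ratio of its lines end with an ellipsis."""
    return score <= max_ratio
```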

class nemo_curator.filters.CommonEnglishWordsFilter(min_num_common_words=2, stop_at_false=True)#

If the sentence contains at least 2 common English words, then keep it. NOTE: We purposefully check for the lowercase versions of those common words to remove documents with over-capitalization.

For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

text (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
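The effect of matching only lowercase words can be seen in a small sketch. The word list below is an illustrative assumption and not the list the library uses; the function names and the punctuation stripping are also hypothetical.

```python
import string

# Illustrative word list only; the library's actual list may differ.
COMMON_WORDS = frozenset({"the", "be", "to", "of", "and", "that", "have", "with"})


def count_common_words(text: str) -> int:
    """Count whitespace-separated words that match the lowercase common-word list."""
    count = 0
    for word in text.split():
        # The comparison is case-sensitive on purpose: "THE" does not match "the".
        if word.strip(string.punctuation) in COMMON_WORDS:
            count += 1
    return count


def keep_by_common_words(score: int, min_num_common_words: int = 2) -> bool:
    """Keep the text when it contains enough common words."""
    return score >= min_num_common_words
```

With this sketch, "the cat and the hat" is kept, while the all-caps "THE CAT AND THE HAT" counts zero common words and is discarded.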

class nemo_curator.filters.WordsWithoutAlphabetsFilter(min_words_with_alphabets=0.8, lang='en')#

At least 80% of words in a document must contain at least one alphabetic character. Source: Gopher (Rae et al., 2021)

For Chinese and Japanese text, we use external libraries to split the text because words in these languages are not separated by spaces. For all other languages, such as English, we assume words are separated by spaces.

score_document(text)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

text (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
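A minimal sketch of the alphabetic-word ratio follows. It assumes whitespace-separated words and an inclusive threshold; both the function names and the boundary behavior are assumptions, not the library's code.

```python
def alphabetic_word_ratio(text: str) -> float:
    """Share of whitespace-separated words that contain at least one alphabetic character."""
    words = text.split()
    if not words:
        return 0.0
    with_alpha = sum(1 for word in words if any(ch.isalpha() for ch in word))
    return with_alpha / len(words)


def keep_by_alphabetic_words(score: float, min_words_with_alphabets: float = 0.8) -> bool:
    """Keep the document when enough of its words contain a letter."""
    return score >= min_words_with_alphabets
```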

class nemo_curator.filters.PornographicUrlsFilter#

Check if any of the URLs within the document point to pornography.

score_document(text)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

text (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.

Code Filters#

class nemo_curator.filters.PythonCommentToCodeFilter(
min_comment_to_code_ratio=0.01,
max_comment_to_code_ratio=0.85,
)#

score_document(source)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

source (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
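The parameter names suggest that this filter keeps Python files whose comment-to-code ratio falls between the two bounds. The sketch below shows one way such a ratio could be computed with the standard library. It is an assumption, not the library's method: it counts only `#` comments (not docstrings), divides by the total source length, and uses hypothetical function names.

```python
import io
import tokenize


def python_comment_ratio(source: str) -> float:
    """Characters in '#' comments divided by the total number of characters.

    Note: tokenize raises an exception (for example tokenize.TokenError) on
    source it cannot tokenize; this sketch does not handle that case.
    """
    if not source:
        return 0.0
    comment_chars = 0
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type == tokenize.COMMENT:
            comment_chars += len(token.string)
    return comment_chars / len(source)


def keep_by_comment_ratio(
    score: float,
    min_comment_to_code_ratio: float = 0.01,
    max_comment_to_code_ratio: float = 0.85,
) -> bool:
    """Keep files that are neither nearly comment-free nor almost entirely comments."""
    return min_comment_to_code_ratio <= score <= max_comment_to_code_ratio
```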

class nemo_curator.filters.GeneralCommentToCodeFilter(
language,
min_comment_to_code_ratio=0.01,
max_comment_to_code_ratio=0.85,
)#

score_document(source)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

source (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.

class nemo_curator.filters.NumberOfLinesOfCodeFilter(min_lines=10, max_lines=20000)#

score_document(source)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

source (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
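Judging by its name and defaults, this filter keeps files whose line count lies between `min_lines` and `max_lines`. A minimal sketch of that check, assuming inclusive bounds and hypothetical function names:

```python
def count_lines(source: str) -> int:
    """Number of lines in the source, as split by str.splitlines()."""
    return len(source.splitlines())


def keep_by_line_count(score: int, min_lines: int = 10, max_lines: int = 20000) -> bool:
    """Keep files that are neither too short nor too long (inclusive bounds assumed)."""
    return min_lines <= score <= max_lines
```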

class nemo_curator.filters.TokenizerFertilityFilter(
path_to_tokenizer=None,
min_char_to_token_ratio=2.5,
)#

score_document(source)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

source (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
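The `min_char_to_token_ratio` parameter suggests a score of characters per token. The sketch below illustrates that ratio with a pluggable tokenizer. It does not load the tokenizer at `path_to_tokenizer`; the `tokenize` callable, the function names, and the inclusive threshold are assumptions for illustration.

```python
from typing import Callable, Sequence


def char_to_token_ratio(text: str, tokenize: Callable[[str], Sequence[str]]) -> float:
    """Average number of characters per token for the given tokenizer."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(text) / len(tokens)


def keep_by_fertility(score: float, min_char_to_token_ratio: float = 2.5) -> bool:
    """Keep documents that the tokenizer does not fragment into very short tokens."""
    return score >= min_char_to_token_ratio
```

For a quick check, `str.split` can stand in for a real tokenizer: `char_to_token_ratio("hello world", str.split)` gives 5.5.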

class nemo_curator.filters.XMLHeaderFilter(char_prefix_search_length=100)#

This filter tries to identify files that have incorrect file extensions. In many cases, these end up being XML files, and we try to identify them based on the header. (Source: StarCoder https://arxiv.org/abs/2305.06161)

score_document(source)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

source (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
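A header check of this kind can be sketched as a substring search over the start of the file. The exact marker string and the function names below are assumptions; the library may look for a different pattern.

```python
# Assumed marker for an XML declaration.
XML_MARKER = "<?xml version="


def has_xml_header(source: str, char_prefix_search_length: int = 100) -> bool:
    """True if the XML marker appears within the first characters of the file."""
    return XML_MARKER in source[:char_prefix_search_length]


def keep_without_xml_header(score: bool) -> bool:
    """Keep the file only when no XML header was found."""
    return not score
```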

class nemo_curator.filters.AlphaFilter(min_alpha_ratio=0.25)#

This filter tries to identify files that have large tensors or tables stored as raw text within code files. (Source: StarCoder https://arxiv.org/abs/2305.06161)

score_document(source)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

source (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
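Files that consist mostly of numeric data have a low share of alphabetic characters, which is what `min_alpha_ratio` appears to threshold. A minimal sketch, assuming the ratio is taken over all characters and the bound is inclusive (function names are hypothetical):

```python
def alpha_ratio(source: str) -> float:
    """Share of characters in the file that are alphabetic."""
    if not source:
        return 0.0
    return sum(1 for ch in source if ch.isalpha()) / len(source)


def keep_by_alpha_ratio(score: float, min_alpha_ratio: float = 0.25) -> bool:
    """Keep files with a large enough share of alphabetic characters."""
    return score >= min_alpha_ratio
```

A raw table such as `[[0.1, 0.2], [0.3, 0.4]]` scores 0.0 and would be discarded under this sketch.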

class nemo_curator.filters.HTMLBoilerplateFilter(
min_lang_content_ratio=0.2,
min_lang_content_num_chars=100,
)#

This filter tries to identify HTML that is largely boilerplate.

score_document(source)#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

source (str) – The text content of the document to be scored.

Returns:

A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.

Return type:

Any

Raises:

NotImplementedError – If the method is not implemented in a subclass.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.
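One way to estimate how much of an HTML file is real content is to compare the length of its visible text with the length of the whole file. The sketch below does this with the standard library parser. It is an illustration under assumptions: the library may extract text differently, the two thresholds are assumed to apply together, and all names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextCollector(HTMLParser):
    """Collects text that is not inside <script> or <style> elements."""

    def __init__(self):
        super().__init__()
        self.pieces = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.pieces.append(data.strip())


def visible_text(source: str) -> str:
    """Return the visible text of an HTML document, joined by single spaces."""
    parser = VisibleTextCollector()
    parser.feed(source)
    parser.close()
    return " ".join(parser.pieces)


def keep_html(
    source: str,
    min_lang_content_ratio: float = 0.2,
    min_lang_content_num_chars: int = 100,
) -> bool:
    """Keep HTML whose visible text is long enough in absolute and relative terms."""
    if not source:
        return False
    text = visible_text(source)
    ratio = len(text) / len(source)
    return ratio >= min_lang_content_ratio and len(text) >= min_lang_content_num_chars
```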

class nemo_curator.filters.PerExtensionFilter(lang, extension, metadata_file='code_meta.csv')#

This filter applies specific conditions depending on the file extension.

score_document(source)#

Filter files based on line length and the percentage of alphanumeric characters. The filtering parameters depend on the file extension, given by ext_to_filter.

keep_document(score)#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

score (Any) – The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns:

True if the document should be kept, False otherwise.

Return type:

bool

Raises:

NotImplementedError – If the method is not implemented in a subclass.