filters.code

Module Contents

Classes

AlphaFilter

This filter tries to identify files that have large tensors, or tables stored as raw text within code files. (Source: Starcoder https://arxiv.org/abs/2305.06161)

GeneralCommentToCodeFilter

This filter keeps code files whose comment-to-code ratio lies between a minimum and a maximum, for a language given as a MIME string.

HTMLBoilerplateFilter

This filter tries to identify HTML that is largely boilerplate.

NumberOfLinesOfCodeFilter

This filter keeps code files whose number of lines lies between a minimum and a maximum.

PerExtensionFilter

This filter applies specific conditions depending on the file extension.

PythonCommentToCodeFilter

This filter keeps Python files whose comment-to-code ratio lies between a minimum and a maximum.

TokenizerFertilityFilter

This filter discards files whose character-to-token ratio, measured with a tokenizer, falls below a minimum.

XMLHeaderFilter

This filter tries to identify files that have incorrect file extensions. In many cases, these end up being XML files and we try to identify them based on the header. (Source: Starcoder https://arxiv.org/abs/2305.06161)

API

class filters.code.AlphaFilter(min_alpha_ratio: float = 0.25)

Bases: nemo_curator.filters.doc_filter.DocumentFilter

This filter tries to identify files that have large tensors, or tables stored as raw text within code files. (Source: Starcoder https://arxiv.org/abs/2305.06161)

Initialization

keep_document(score: float) -> bool

Determine whether to keep a document based on its score.

Args: score (float): The score returned by score_document().

Returns: bool: True if the document should be kept, False otherwise.

score_document(source: str) -> float

Calculate a score for the given document text.

Args: source (str): The text content of the document to be scored.

Returns: float: The score for the document, in the form expected by keep_document().
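The idea behind this filter can be illustrated with a minimal sketch. It assumes the score is the fraction of alphabetic characters in the file and that files at or above min_alpha_ratio are kept; neither detail is stated on this page, and the helper names alpha_ratio and keep_by_alpha are hypothetical, not part of the library.

```python
def alpha_ratio(source: str) -> float:
    """Fraction of characters in `source` that are alphabetic (hypothetical helper)."""
    if not source:
        return 0.0
    return sum(ch.isalpha() for ch in source) / len(source)


def keep_by_alpha(source: str, min_alpha_ratio: float = 0.25) -> bool:
    """Assumption: keep the file when its alphabetic fraction reaches the threshold."""
    return alpha_ratio(source) >= min_alpha_ratio


code = "def add(a, b):\n    return a + b\n"
table = "0.1 0.2 0.3\n0.4 0.5 0.6\n" * 10  # a numeric table stored as raw text

print(keep_by_alpha(code))   # True: half of the characters are letters
print(keep_by_alpha(table))  # False: the text contains no letters at all
```

A file dominated by digits and punctuation, such as a serialized tensor, scores close to zero under this measure, which is why a simple ratio threshold separates it from ordinary source code.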

class filters.code.GeneralCommentToCodeFilter(
language: str,
min_comment_to_code_ratio: float = 0.01,
max_comment_to_code_ratio: float = 0.85,
)

Bases: nemo_curator.filters.doc_filter.DocumentFilter

This filter scores a code file by its ratio of comments to code and keeps files whose ratio lies between min_comment_to_code_ratio and max_comment_to_code_ratio.

Initialization

The comment delimiters (// or /* */) do not count toward the length of a comment.

Args: language (str): MIME string of the language.

keep_document(score: float) -> bool

Determine whether to keep a document based on its score.

Args: score (float): The score returned by score_document().

Returns: bool: True if the document should be kept, False otherwise.

score_document(source: str) -> float

Calculate a score for the given document text.

Args: source (str): The text content of the document to be scored.

Returns: float: The score for the document, in the form expected by keep_document().
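A rough sketch of a comment-to-code ratio for C-style languages follows. It is not the library's implementation: it assumes the ratio is comment characters divided by total characters, uses a simplified regular expression instead of a real parser, and treats the bounds as exclusive. The names comment_to_code_ratio and keep_by_comment_ratio are hypothetical. As in the note above, the delimiters themselves are not counted.

```python
import re

# Matches // line comments and /* ... */ block comments. Simplified: comment
# markers inside string literals are not handled.
_COMMENT_RE = re.compile(r"//(.*?)$|/\*(.*?)\*/", re.MULTILINE | re.DOTALL)


def comment_to_code_ratio(source: str) -> float:
    """Comment characters (without delimiters) divided by total characters."""
    if not source:
        return 0.0
    comment_chars = 0
    for match in _COMMENT_RE.finditer(source):
        # Group 1 holds a line-comment body, group 2 a block-comment body.
        body = match.group(1) if match.group(1) is not None else match.group(2)
        comment_chars += len(body)
    return comment_chars / len(source)


def keep_by_comment_ratio(
    source: str,
    min_comment_to_code_ratio: float = 0.01,
    max_comment_to_code_ratio: float = 0.85,
) -> bool:
    """Assumption: keep files whose ratio lies strictly between the bounds."""
    ratio = comment_to_code_ratio(source)
    return min_comment_to_code_ratio < ratio < max_comment_to_code_ratio


src = "// add two ints\nint add(int a, int b) { return a + b; }\n"
print(keep_by_comment_ratio(src))       # True: some comments, mostly code
print(keep_by_comment_ratio("int x;"))  # False: no comments at all
```

The two bounds serve different purposes: the lower one drops files with essentially no documentation, while the upper one drops files that are almost entirely comments, such as license headers or commented-out code.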

class filters.code.HTMLBoilerplateFilter(
min_lang_content_ratio: float = 0.2,
min_lang_content_num_chars: int = 100,
)

Bases: nemo_curator.filters.doc_filter.DocumentFilter

This filter tries to identify HTML that is largely boilerplate.

Initialization

keep_document(score: float) -> bool

Determine whether to keep a document based on its score.

Args: score (float): The score returned by score_document().

Returns: bool: True if the document should be kept, False otherwise.

score_document(source: str) -> float | None

Calculate a score for the given document text.

Args: source (str): The text content of the document to be scored.

Returns: float | None: The score for the document, or None if no score is available.
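One way to picture a boilerplate check is to compare the amount of visible text with the size of the whole HTML file. The sketch below uses only the standard library and makes several assumptions that this page does not confirm: that script and style contents are excluded, that the ratio is visible characters over total characters, and that both thresholds must be met. The names visible_text and keep_html are hypothetical.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside <script> or <style> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def visible_text(source: str) -> str:
    parser = _VisibleText()
    parser.feed(source)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def keep_html(
    source: str,
    min_lang_content_ratio: float = 0.2,
    min_lang_content_num_chars: int = 100,
) -> bool:
    """Assumption: keep only if visible text is both long enough and a large enough share."""
    if not source:
        return False
    text = visible_text(source)
    ratio = len(text) / len(source)
    return ratio >= min_lang_content_ratio and len(text) >= min_lang_content_num_chars


article = "<html><body><p>" + "Readable sentence. " * 20 + "</p></body></html>"
boilerplate = (
    "<html><head><script>" + "var x = 1;" * 50 + "</script></head>"
    "<body><p>Hi</p></body></html>"
)
print(keep_html(article))      # True
print(keep_html(boilerplate))  # False
```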

class filters.code.NumberOfLinesOfCodeFilter(min_lines: int = 10, max_lines: int = 20000)

Bases: nemo_curator.filters.doc_filter.DocumentFilter

This filter keeps code files whose number of lines lies between min_lines and max_lines.

Initialization

keep_document(score: int) -> bool

Determine whether to keep a document based on its score.

Args: score (int): The score returned by score_document().

Returns: bool: True if the document should be kept, False otherwise.

score_document(source: str) -> int

Calculate a score for the given document text.

Args: source (str): The text content of the document to be scored.

Returns: int: The score for the document, in the form expected by keep_document().
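A line-count filter reduces to a few lines of Python. The sketch below assumes the score is the number of lines in the file and that both bounds are inclusive; the page does not specify how lines are counted or whether the bounds are inclusive, and the helper names are hypothetical.

```python
def count_lines(source: str) -> int:
    """Number of lines in `source`, as split by str.splitlines()."""
    return len(source.splitlines())


def keep_by_line_count(source: str, min_lines: int = 10, max_lines: int = 20000) -> bool:
    """Assumption: keep files whose line count lies within the inclusive bounds."""
    return min_lines <= count_lines(source) <= max_lines


tiny = "x = 1\n"
normal = "\n".join(f"x{i} = {i}" for i in range(50))
huge = "pass\n" * 30000

print(keep_by_line_count(tiny))    # False: 1 line
print(keep_by_line_count(normal))  # True: 50 lines
print(keep_by_line_count(huge))    # False: 30000 lines
```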

class filters.code.PerExtensionFilter(
lang: str,
extension: str,
metadata_file: str = 'code_meta.csv',
)

Bases: nemo_curator.filters.doc_filter.DocumentFilter

This filter applies specific conditions depending on the file extension.

Initialization

keep_document(score: float | None) -> bool

Determine whether to keep a document based on its score.

Args: score (float | None): The score returned by score_document().

Returns: bool: True if the document should be kept, False otherwise.

score_document(source: str) -> float

Filter files based on line length and percentage of alphanumeric characters. The filtering parameters depend on the file extension, given by ext_to_filter.
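The following sketch shows what extension-dependent thresholds on line length and alphanumeric content might look like. Everything specific in it is an assumption for illustration: the Thresholds fields, the numeric values, the extension table, and the function name keep_for_extension. The library reads its parameters from metadata_file, whose format is not described on this page, so the sketch uses an inline dictionary instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_line_length: int         # longest allowed single line
    max_mean_line_length: float  # largest allowed average line length
    min_alnum_fraction: float    # smallest allowed share of alphanumeric chars


# Hypothetical per-extension settings; the values are made up.
EXT_TO_THRESHOLDS = {
    "py": Thresholds(1000, 100.0, 0.25),
    "json": Thresholds(10000, 5000.0, 0.10),
}
DEFAULT_THRESHOLDS = Thresholds(1000, 100.0, 0.25)


def keep_for_extension(source: str, extension: str) -> bool:
    """Apply the thresholds registered for `extension` to `source`."""
    lines = source.splitlines()
    if not lines:
        return False
    limits = EXT_TO_THRESHOLDS.get(extension, DEFAULT_THRESHOLDS)
    longest = max(len(line) for line in lines)
    mean = sum(len(line) for line in lines) / len(lines)
    alnum = sum(ch.isalnum() for ch in source) / len(source)
    return (
        longest <= limits.max_line_length
        and mean <= limits.max_mean_line_length
        and alnum >= limits.min_alnum_fraction
    )


data = '{"k": 1},' * 200  # a single 1800-character line

print(keep_for_extension(data, "json"))  # True under the lenient settings
print(keep_for_extension(data, "py"))    # False: the line exceeds 1000 characters
```

Keeping the thresholds in a lookup table means the same scoring code serves every extension, and data-like formats can be given looser limits than hand-written source.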

class filters.code.PythonCommentToCodeFilter(
min_comment_to_code_ratio: float = 0.01,
max_comment_to_code_ratio: float = 0.85,
)

Bases: nemo_curator.filters.doc_filter.DocumentFilter

This filter keeps Python files whose ratio of comments to code lies between min_comment_to_code_ratio and max_comment_to_code_ratio.

Initialization

keep_document(score: float) -> bool

Determine whether to keep a document based on its score.

Args: score (float): The score returned by score_document().

Returns: bool: True if the document should be kept, False otherwise.

score_document(source: str) -> float

Calculate a score for the given document text.

Args: source (str): The text content of the document to be scored.

Returns: float: The score for the document, in the form expected by keep_document().
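For Python sources, comments and docstrings can be extracted reliably with the standard tokenize and ast modules. The sketch below is one possible way to compute such a ratio, not the library's implementation. It assumes that both # comments and docstrings count as comments and that the ratio is taken over the total character count; python_comment_ratio is a hypothetical name. Invalid Python raises an exception here rather than producing a score.

```python
import ast
import io
import tokenize


def python_comment_ratio(source: str) -> float:
    """Characters in # comments and docstrings divided by total characters."""
    if not source:
        return 0.0
    comment_chars = 0

    # '#' comments, found via the tokenizer so that '#' inside string
    # literals is ignored. The token text includes the leading '#'.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += len(tok.string)

    # Docstrings of the module and of every class and function.
    documented = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, documented):
            doc = ast.get_docstring(node)
            if doc:
                comment_chars += len(doc)

    return comment_chars / len(source)


src = '"""Utilities."""\n\n# add two numbers\ndef add(a, b):\n    return a + b\n'
ratio = python_comment_ratio(src)
print(0.01 < ratio < 0.85)  # True: within the default bounds

print(python_comment_ratio('x = "# not a comment"\n'))  # 0.0
```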

class filters.code.TokenizerFertilityFilter(
path_to_tokenizer: str | None = None,
min_char_to_token_ratio: float = 2.5,
)

Bases: nemo_curator.filters.doc_filter.DocumentFilter

This filter uses a tokenizer to measure how many characters each token covers on average and discards files whose character-to-token ratio falls below min_char_to_token_ratio.

Initialization

keep_document(score: float) -> bool

Determine whether to keep a document based on its score.

Args: score (float): The score returned by score_document().

Returns: bool: True if the document should be kept, False otherwise.

score_document(source: str) -> float

Calculate a score for the given document text.

Args: source (str): The text content of the document to be scored.

Returns: float: The score for the document, in the form expected by keep_document().
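The character-to-token ratio can be sketched independently of any particular tokenizer. The example below assumes the score is the number of characters divided by the number of tokens and that files at or above min_char_to_token_ratio are kept. The toy_tokenize function is a stand-in for illustration only; the real filter loads a tokenizer from path_to_tokenizer, and this page does not say which tokenizer format it expects.

```python
import re
from typing import Callable, List


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_to_token_ratio(source: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of characters covered by each token."""
    tokens = tokenize(source)
    if not tokens:
        return 0.0
    return len(source) / len(tokens)


def keep_by_fertility(
    source: str,
    tokenize: Callable[[str], List[str]] = toy_tokenize,
    min_char_to_token_ratio: float = 2.5,
) -> bool:
    """Assumption: keep files whose ratio reaches the minimum."""
    return char_to_token_ratio(source, tokenize) >= min_char_to_token_ratio


code = "def compute_total(values):\n    return sum(values)\n"
noise = "{}[]();,." * 20  # every character becomes its own token

print(keep_by_fertility(code))   # True
print(keep_by_fertility(noise))  # False: ratio is exactly 1.0
```

A low ratio means the tokenizer needs many tokens to cover little text, which tends to happen with content it represents poorly, such as encoded blobs or dense symbol sequences.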

class filters.code.XMLHeaderFilter(char_prefix_search_length: int = 100)

Bases: nemo_curator.filters.doc_filter.DocumentFilter

This filter tries to identify files that have incorrect file extensions. In many cases, these end up being XML files and we try to identify them based on the header. (Source: Starcoder https://arxiv.org/abs/2305.06161)

Initialization

keep_document(score: float) -> bool

Determine whether to keep a document based on its score.

Args: score (float): The score returned by score_document().

Returns: bool: True if the document should be kept, False otherwise.

score_document(source: str) -> float

Calculate a score for the given document text.

Args: source (str): The text content of the document to be scored.

Returns: float: The score for the document, in the form expected by keep_document().
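Header-based detection of mislabeled XML can be sketched as a substring search over the first characters of the file. The marker string, the helper name has_xml_header, and the decision to discard any file where the marker is found are assumptions made for this illustration; the page only states that XML files are identified from their header.

```python
XML_MARKER = "<?xml version="  # assumed marker for an XML declaration


def has_xml_header(source: str, char_prefix_search_length: int = 100) -> bool:
    """True if an XML declaration appears within the searched prefix."""
    return XML_MARKER in source[:char_prefix_search_length]


def keep_non_xml(source: str, char_prefix_search_length: int = 100) -> bool:
    """Assumption: discard files that look like XML, keep everything else."""
    return not has_xml_header(source, char_prefix_search_length)


mislabeled = '<?xml version="1.0" encoding="UTF-8"?>\n<project></project>\n'
real_code = "import os\nprint(os.getcwd())\n"
late_marker = " " * 200 + XML_MARKER + '"1.0"?>'

print(keep_non_xml(mislabeled))   # False: declaration at the very start
print(keep_non_xml(real_code))    # True
print(keep_non_xml(late_marker))  # True: marker lies beyond the first 100 characters
```

Limiting the search to a short prefix keeps the check cheap and avoids discarding code that merely mentions an XML declaration somewhere deep in the file.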