filters.code
Module Contents
Classes
AlphaFilter | This filter tries to identify files that have large tensors or tables stored as raw text within code files. (Source: Starcoder https://arxiv.org/abs/2305.06161)
GeneralCommentToCodeFilter | An abstract base class for text-based document filters.
HTMLBoilerplateFilter | This filter tries to identify HTML that is largely boilerplate.
NumberOfLinesOfCodeFilter | An abstract base class for text-based document filters.
PerExtensionFilter | This filter applies specific conditions depending on the file extension.
PythonCommentToCodeFilter | An abstract base class for text-based document filters.
TokenizerFertilityFilter | An abstract base class for text-based document filters.
XMLHeaderFilter | This filter tries to identify files that have incorrect file extensions. In many cases, these end up being XML files, which are identified based on the header. (Source: Starcoder https://arxiv.org/abs/2305.06161)
API
- class filters.code.AlphaFilter(min_alpha_ratio: float = 0.25)
Bases: nemo_curator.filters.doc_filter.DocumentFilter
This filter tries to identify files that have large tensors or tables stored as raw text within code files. (Source: Starcoder https://arxiv.org/abs/2305.06161)
- keep_document(score: float) -> bool
Determine whether to keep a document based on its score.
Subclasses implement this method to define the criteria for keeping or discarding a document based on the score calculated by score_document().
Args: score (float): The score returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(source: str) -> float
Calculate a score for the given document text.
Subclasses implement this method to define how a document’s text is evaluated and scored.
Args: source (str): The text content of the document to be scored.
Returns: float: A score representing the document’s relevance or quality.
Raises: NotImplementedError: If the method is not implemented in a subclass.
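The score/keep pair for an alphabetic-ratio filter can be illustrated with a minimal standalone sketch. It does not import the library: the function names `alpha_ratio` and `keep_by_alpha` are hypothetical, and the choices to count only ASCII letters, to score empty input as 0.0, and to compare with `>=` are assumptions rather than confirmed library behavior.

```python
import re


def alpha_ratio(source: str) -> float:
    """Return the fraction of characters in `source` that are ASCII letters."""
    if not source:
        # Assumption: an empty file is treated as having no alphabetic content.
        return 0.0
    return len(re.findall(r"[a-zA-Z]", source)) / len(source)


def keep_by_alpha(score: float, min_alpha_ratio: float = 0.25) -> bool:
    """Keep a document whose alphabetic fraction reaches the threshold."""
    return score >= min_alpha_ratio


code = "def add(a, b):\n    return a + b\n"
tensor_dump = "0.12 0.98 0.33 0.47\n" * 50

print(keep_by_alpha(alpha_ratio(code)))
print(keep_by_alpha(alpha_ratio(tensor_dump)))
```

A file that is mostly numeric literals scores near zero and falls below the default threshold, which is the kind of raw tensor or table dump the class description targets.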
- class filters.code.GeneralCommentToCodeFilter(language: str, min_comment_to_code_ratio: float = 0.01, max_comment_to_code_ratio: float = 0.85)
Bases: nemo_curator.filters.doc_filter.DocumentFilter
An abstract base class for text-based document filters.
This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.
Initialization
The comment characters themselves (// or /* */) do not count toward the length of a comment.
Args: language (str): MIME type string of the language.
- keep_document(score: float) -> bool
Determine whether to keep a document based on its score.
Subclasses implement this method to define the criteria for keeping or discarding a document based on the score calculated by score_document().
Args: score (float): The score returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(source: str) -> float
Calculate a score for the given document text.
Subclasses implement this method to define how a document’s text is evaluated and scored.
Args: source (str): The text content of the document to be scored.
Returns: float: A score representing the document’s relevance or quality.
Raises: NotImplementedError: If the method is not implemented in a subclass.
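A comment-to-code ratio that excludes the comment delimiters can be sketched as follows for C-style sources. This is only an illustration under stated assumptions: it uses a naive regular expression instead of a real comment parser (so comment markers inside string literals are miscounted), the names `c_style_comment_ratio` and `keep_by_comment_ratio` are hypothetical, and the inclusive bounds are assumed.

```python
import re

# Matches /* block */ comments (group 1) or // line comments (group 2).
_C_COMMENT = re.compile(r"/\*(.*?)\*/|//([^\n]*)", re.DOTALL)


def c_style_comment_ratio(source: str) -> float:
    """Return comment characters (delimiters excluded) divided by file length."""
    if not source:
        return 0.0
    comment_chars = 0
    for match in _C_COMMENT.finditer(source):
        # Exactly one of the two groups participates in each match.
        body = match.group(1) if match.group(1) is not None else match.group(2)
        comment_chars += len(body)
    return comment_chars / len(source)


def keep_by_comment_ratio(
    score: float, min_ratio: float = 0.01, max_ratio: float = 0.85
) -> bool:
    """Keep files that are neither uncommented nor almost entirely comments."""
    return min_ratio <= score <= max_ratio


commented = "// add\nint add(int a, int b) { return a + b; }\n"
uncommented = "int x = 1;\n"

print(keep_by_comment_ratio(c_style_comment_ratio(commented)))
print(keep_by_comment_ratio(c_style_comment_ratio(uncommented)))
```

The two-sided bound reflects the constructor defaults: files with almost no comments and files that are nearly all comments both fall outside the kept range.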
- class filters.code.HTMLBoilerplateFilter(min_lang_content_ratio: float = 0.2, min_lang_content_num_chars: int = 100)
Bases: nemo_curator.filters.doc_filter.DocumentFilter
This filter tries to identify HTML that is largely boilerplate.
- keep_document(score: float) -> bool
Determine whether to keep a document based on its score.
Subclasses implement this method to define the criteria for keeping or discarding a document based on the score calculated by score_document().
Args: score (float): The score returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(source: str) -> float | None
Calculate a score for the given document text.
Subclasses implement this method to define how a document’s text is evaluated and scored.
Args: source (str): The text content of the document to be scored.
Returns: float | None: A score representing the document’s relevance or quality, or None.
Raises: NotImplementedError: If the method is not implemented in a subclass.
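One way to approximate a boilerplate score is the share of an HTML file that is visible text. The sketch below uses only the standard library `html.parser` module and is not the library's implementation. The names `visible_text_ratio` and `keep_html` are hypothetical, and the handling of short or empty input (scoring 0.0 or returning None) is an assumption chosen to match the `float | None` return type.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _VisibleText(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text_ratio(source: str, min_chars: int = 100) -> Optional[float]:
    """Return visible-text length over file length, or None for empty input."""
    if not source:
        return None  # Assumption: nothing to score.
    parser = _VisibleText()
    parser.feed(source)
    parser.close()
    # Collapse runs of whitespace so indentation does not inflate the score.
    text = " ".join(" ".join(parser.chunks).split())
    if len(text) < min_chars:
        return 0.0  # Assumption: too little content counts as boilerplate.
    return len(text) / len(source)


def keep_html(score: Optional[float], min_ratio: float = 0.2) -> bool:
    """Keep pages whose visible-text share reaches the threshold."""
    return score is not None and score >= min_ratio


page = "<html><body><p>" + "word " * 60 + "</p></body></html>"
boiler = (
    "<html><head><script>" + "var x = 1;" * 100
    + "</script></head><body><p>hi</p></body></html>"
)

print(keep_html(visible_text_ratio(page)))
print(keep_html(visible_text_ratio(boiler)))
```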
- class filters.code.NumberOfLinesOfCodeFilter(min_lines: int = 10, max_lines: int = 20000)
Bases: nemo_curator.filters.doc_filter.DocumentFilter
An abstract base class for text-based document filters.
This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.
- keep_document(score: int) -> bool
Determine whether to keep a document based on its score.
Subclasses implement this method to define the criteria for keeping or discarding a document based on the score calculated by score_document().
Args: score (int): The score returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(source: str) -> int
Calculate a score for the given document text.
Subclasses implement this method to define how a document’s text is evaluated and scored.
Args: source (str): The text content of the document to be scored.
Returns: int: A score representing the document’s relevance or quality.
Raises: NotImplementedError: If the method is not implemented in a subclass.
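Given the `min_lines` and `max_lines` parameters and the integer score, a line-count filter plausibly reduces to the sketch below. The names `count_lines` and `keep_by_line_count` are hypothetical, and both the use of `str.splitlines()` and the inclusive bounds are assumptions; the library may count a trailing newline differently.

```python
def count_lines(source: str) -> int:
    """Return the number of lines in `source`."""
    return len(source.splitlines())


def keep_by_line_count(
    score: int, min_lines: int = 10, max_lines: int = 20000
) -> bool:
    """Keep files whose line count falls within the inclusive range."""
    return min_lines <= score <= max_lines


short_file = "a\nb\nc\n"
ten_lines = "x = 1\n" * 10

print(count_lines(short_file), keep_by_line_count(count_lines(short_file)))
print(count_lines(ten_lines), keep_by_line_count(count_lines(ten_lines)))
```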
- class filters.code.PerExtensionFilter(lang: str, extension: str, metadata_file: str = 'code_meta.csv')
Bases: nemo_curator.filters.doc_filter.DocumentFilter
This filter applies specific conditions depending on the file extension.
- keep_document(score: float | None) -> bool
Determine whether to keep a document based on its score.
Subclasses implement this method to define the criteria for keeping or discarding a document based on the score calculated by score_document().
Args: score (float | None): The score returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(source: str) -> float
Filter files based on line length and the percentage of alphanumeric characters. The filtering parameters depend on the file extension, given by ext_to_filter.
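The line-length and alphanumeric checks described above can be sketched without the metadata file. Everything here is illustrative: `ExtensionThresholds`, `line_stats`, and `passes_extension_filter` are hypothetical names, the threshold values are placeholders rather than values read from code_meta.csv, and the sketch collapses scoring and keeping into a single pass/fail result.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExtensionThresholds:
    """Placeholder limits; real values would be chosen per file extension."""

    max_line_length: int = 1000
    max_avg_line_length: float = 100.0
    min_alnum_fraction: float = 0.25


def line_stats(source: str) -> Tuple[int, float, float]:
    """Return (longest line, average line length, alphanumeric fraction)."""
    lines = source.splitlines() or [""]
    lengths = [len(line) for line in lines]
    alnum = sum(ch.isalnum() for ch in source)
    alnum_fraction = alnum / len(source) if source else 0.0
    return max(lengths), sum(lengths) / len(lengths), alnum_fraction


def passes_extension_filter(
    source: str, thresholds: ExtensionThresholds = ExtensionThresholds()
) -> bool:
    """Apply all three checks against the thresholds for one extension."""
    longest, average, alnum_fraction = line_stats(source)
    return (
        longest <= thresholds.max_line_length
        and average <= thresholds.max_avg_line_length
        and alnum_fraction >= thresholds.min_alnum_fraction
    )


normal = "x = compute(alpha, beta)\n" * 5
minified = "a" * 5000
symbols = "{}[]();;\n" * 20

print(passes_extension_filter(normal))
print(passes_extension_filter(minified))
print(passes_extension_filter(symbols))
```

Keeping the thresholds in a small immutable record makes it straightforward to hold one instance per extension, for example in a dictionary keyed by extension.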
- class filters.code.PythonCommentToCodeFilter(min_comment_to_code_ratio: float = 0.01, max_comment_to_code_ratio: float = 0.85)
Bases: nemo_curator.filters.doc_filter.DocumentFilter
An abstract base class for text-based document filters.
This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.
- keep_document(score: float) -> bool
Determine whether to keep a document based on its score.
Subclasses implement this method to define the criteria for keeping or discarding a document based on the score calculated by score_document().
Args: score (float): The score returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(source: str) -> float
Calculate a score for the given document text.
Subclasses implement this method to define how a document’s text is evaluated and scored.
Args: source (str): The text content of the document to be scored.
Returns: float: A score representing the document’s relevance or quality.
Raises: NotImplementedError: If the method is not implemented in a subclass.
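For Python sources, a comment-to-code ratio can be computed with the standard library alone by combining `tokenize` (for `#` comments) with `ast` (for docstrings). This is a sketch under assumptions, not the library's code: the name `python_comment_ratio` is hypothetical, counting docstrings as comments is an assumption, and scoring unparsable files as 0.0 is an arbitrary choice.

```python
import ast
import io
import tokenize


def python_comment_ratio(source: str) -> float:
    """Return (comment + docstring characters) divided by file length."""
    if not source:
        return 0.0
    comment_chars = 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                # Drop the leading '#' so the delimiter is not counted.
                comment_chars += len(tok.string.lstrip("#").strip())
        tree = ast.parse(source)
    except (SyntaxError, tokenize.TokenError):
        # Assumption: files that do not parse get a score of 0.0.
        return 0.0
    documented = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, documented):
            docstring = ast.get_docstring(node)
            if docstring:
                comment_chars += len(docstring)
    return comment_chars / len(source)


documented_src = '"""Module doc."""\n# a comment\nx = 1\n'
bare_src = "x = 1\n"

print(python_comment_ratio(documented_src))
print(python_comment_ratio(bare_src))
```

The resulting score can be compared against `min_comment_to_code_ratio` and `max_comment_to_code_ratio` in the same way as for other languages.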
- class filters.code.TokenizerFertilityFilter(path_to_tokenizer: str | None = None, min_char_to_token_ratio: float = 2.5)
Bases: nemo_curator.filters.doc_filter.DocumentFilter
An abstract base class for text-based document filters.
This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.
- keep_document(score: float) -> bool
Determine whether to keep a document based on its score.
Subclasses implement this method to define the criteria for keeping or discarding a document based on the score calculated by score_document().
Args: score (float): The score returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(source: str) -> float
Calculate a score for the given document text.
Subclasses implement this method to define how a document’s text is evaluated and scored.
Args: source (str): The text content of the document to be scored.
Returns: float: A score representing the document’s relevance or quality.
Raises: NotImplementedError: If the method is not implemented in a subclass.
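The `min_char_to_token_ratio` parameter suggests a score of characters per token. The sketch below shows that computation with a pluggable tokenizer. It is an illustration only: `char_to_token_ratio`, `toy_tokenize`, and `keep_by_fertility` are hypothetical names, and the regex tokenizer is a stand-in for whatever tokenizer is loaded from `path_to_tokenizer`.

```python
import re
from typing import Callable, List


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_to_token_ratio(source: str, tokenize: Callable[[str], List[str]]) -> float:
    """Return the number of characters per token for `source`."""
    tokens = tokenize(source)
    if not tokens:
        return 0.0  # Assumption: no tokens means the lowest possible score.
    return len(source) / len(tokens)


def keep_by_fertility(score: float, min_char_to_token_ratio: float = 2.5) -> bool:
    """Keep documents that the tokenizer encodes reasonably compactly."""
    return score >= min_char_to_token_ratio


print(char_to_token_ratio("hello world", toy_tokenize))
print(char_to_token_ratio("a b c d", toy_tokenize))
```

A low ratio means the tokenizer needs many tokens for little text, which often indicates content the tokenizer handles poorly.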
- class filters.code.XMLHeaderFilter(char_prefix_search_length: int = 100)
Bases: nemo_curator.filters.doc_filter.DocumentFilter
This filter tries to identify files that have incorrect file extensions. In many cases, these end up being XML files, which are identified based on the header. (Source: Starcoder https://arxiv.org/abs/2305.06161)
- keep_document(score: float) -> bool
Determine whether to keep a document based on its score.
Subclasses implement this method to define the criteria for keeping or discarding a document based on the score calculated by score_document().
Args: score (float): The score returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(source: str) -> float
Calculate a score for the given document text.
Subclasses implement this method to define how a document’s text is evaluated and scored.
Args: source (str): The text content of the document to be scored.
Returns: float: A score representing the document’s relevance or quality.
Raises: NotImplementedError: If the method is not implemented in a subclass.
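Header-based XML detection can be sketched as a prefix search bounded by `char_prefix_search_length`. The function names `xml_header_score` and `keep_non_xml` are hypothetical, and both the marker string and the 1.0/0.0 scoring convention are assumptions rather than confirmed library behavior.

```python
def xml_header_score(source: str, char_prefix_search_length: int = 100) -> float:
    """Return 1.0 if an XML declaration appears near the start of the file."""
    prefix = source[:char_prefix_search_length]
    return 1.0 if "<?xml version=" in prefix else 0.0


def keep_non_xml(score: float) -> bool:
    """Keep only files that were not flagged as XML."""
    return score != 1.0


mislabeled = '<?xml version="1.0" encoding="UTF-8"?>\n<project></project>\n'
real_code = "int main(void) { return 0; }\n"

print(keep_non_xml(xml_header_score(mislabeled)))
print(keep_non_xml(xml_header_score(real_code)))
```

Limiting the search to a short prefix keeps the check cheap and avoids flagging source files that merely embed an XML snippet further down.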