filters.doc_filter#

Module Contents#

Classes#

DocumentFilter

An abstract base class for text-based document filters.

Functions#

import_filter

Imports a filter under nemo_curator.filters given the module path.

API#

class filters.doc_filter.DocumentFilter#

Bases: abc.ABC

An abstract base class for text-based document filters.

This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.

Initialization

property backend: Literal["pandas", "cudf", "any"]#

The dataframe backend the filter operates on. Can be ‘pandas’, ‘cudf’, or ‘any’. Defaults to ‘pandas’.

Returns: str: A string representing the dataframe backend the filter needs as input.

abstractmethod keep_document(scores: float | list[int | float]) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().

Returns: bool: True if the document should be kept, False otherwise.

Raises: NotImplementedError: If the method is not implemented in a subclass.
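As a rough illustration of the scoring and keep decision described above, the sketch below defines a hypothetical `WordCountFilter`. To stay runnable without the library installed, it subclasses a local stand-in `DocumentFilter` that only mirrors the two abstract methods documented on this page. The filter name, its `min_words`/`max_words` parameters, and the stand-in class are all assumptions made for the example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DocumentFilter(ABC):
    """Local stand-in for the documented base class (assumption, not library code)."""

    @abstractmethod
    def score_document(self, text: str) -> float | list[int | float]: ...

    @abstractmethod
    def keep_document(self, scores: float | list[int | float]) -> bool: ...


class WordCountFilter(DocumentFilter):
    """Hypothetical filter: keep documents whose word count is within bounds."""

    def __init__(self, min_words: int = 3, max_words: int = 1000) -> None:
        self.min_words = min_words
        self.max_words = max_words

    def score_document(self, text: str) -> float:
        # The score is the whitespace-delimited word count.
        return float(len(text.split()))

    def keep_document(self, scores: float) -> bool:
        # Receives exactly what score_document() returned: a single float.
        return self.min_words <= scores <= self.max_words


word_filter = WordCountFilter(min_words=3, max_words=10)
docs = ["too short", "this one has enough words"]
kept = [d for d in docs if word_filter.keep_document(word_filter.score_document(d))]
```

Keeping the threshold logic in `keep_document()` and the measurement in `score_document()` means the raw score can be inspected or stored separately from the keep/discard decision.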

property name: str#

property ngrams: dict#

property paragraphs: list#

abstractmethod score_document(text: str) → float | list[int | float]#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Args: text (str): The text content of the document to be scored.

Returns: float | list[int | float]: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent within a given subclass.

Raises: NotImplementedError: If the method is not implemented in a subclass.
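The signature also allows a list of scores. The sketch below shows one way a list-valued score might look, using a hypothetical `LineLengthFilter` whose `score_document()` returns one character count per line. As an assumption for the example, the base class is a local stand-in that mirrors only the two abstract methods documented here, so the snippet runs without the library installed.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DocumentFilter(ABC):
    """Local stand-in for the documented base class (assumption, not library code)."""

    @abstractmethod
    def score_document(self, text: str) -> float | list[int | float]: ...

    @abstractmethod
    def keep_document(self, scores: float | list[int | float]) -> bool: ...


class LineLengthFilter(DocumentFilter):
    """Hypothetical filter: discard documents containing an overly long line."""

    def __init__(self, max_line_chars: int = 200) -> None:
        self.max_line_chars = max_line_chars

    def score_document(self, text: str) -> list[int]:
        # Always returns a list of ints, one per line, so the structure
        # stays the same for every document this filter scores.
        return [len(line) for line in text.splitlines()]

    def keep_document(self, scores: list[int]) -> bool:
        # An empty document yields an empty list and is discarded.
        return bool(scores) and max(scores) <= self.max_line_chars


line_filter = LineLengthFilter(max_line_chars=10)
line_scores = line_filter.score_document("ab\ncdef")
```

Because `keep_document()` is written against a list, the same filter would break if `score_document()` sometimes returned a bare float, which is why a fixed return shape per subclass matters.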

property sentences: list#

filters.doc_filter.import_filter(
filter_path: str,
) → filters.doc_filter.DocumentFilter | nemo_curator.filters.bitext_filter.BitextFilter#

Imports a filter under nemo_curator.filters given the module path.

Args: filter_path (str): The path to the filter in the form of “nemo_curator.filters.filter_name”.

Returns: DocumentFilter | BitextFilter: The filter that is at the given path.

Raises: ValueError: If the filter_path does not point to a DocumentFilter.
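Resolving a class from a dotted path and validating its type, as described above, is a common pattern with the standard library's `importlib`. The sketch below is not the library's implementation. `load_class` and its `required_base` parameter are hypothetical names, and the example resolves a standard-library class so that it runs without nemo_curator installed.

```python
import importlib


def load_class(dotted_path: str, required_base: type) -> type:
    """Hypothetical helper: resolve 'package.module.ClassName' to a class."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    candidate = getattr(module, class_name, None)
    # Reject missing attributes, non-classes, and classes of the wrong type.
    if not (isinstance(candidate, type) and issubclass(candidate, required_base)):
        raise ValueError(
            f"{dotted_path!r} does not point to a {required_base.__name__}"
        )
    return candidate


# Stand-in usage with a standard-library class instead of a real filter.
ordered_dict_cls = load_class("collections.OrderedDict", dict)
```

With the library installed, the documented entry point is `import_filter(filter_path)`, which takes the same kind of dotted path string.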