nemo_curator.stages.text.filters.doc_filter
Module Contents
Classes
API
An abstract base class for text-based document filters.
This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.
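The template pattern described here can be sketched as follows. This is a minimal illustration, not the library's implementation: the `DocumentFilter` base class below is a local stand-in, and both its name and the `keep_document` method name are assumptions, since only `score_document()` is named in the text of this page. `MinWordCountFilter` and its `min_words` parameter are hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DocumentFilter(ABC):
    """Local stand-in for the base class (assumed name and method names)."""

    @abstractmethod
    def score_document(self, text: str) -> float | list[int | float]:
        raise NotImplementedError("score_document must be implemented")

    @abstractmethod
    def keep_document(self, scores: float | list[int | float]) -> bool:
        raise NotImplementedError("keep_document must be implemented")


class MinWordCountFilter(DocumentFilter):
    """Hypothetical filter: keep documents with at least `min_words` words."""

    def __init__(self, min_words: int = 5) -> None:
        self.min_words = min_words

    def score_document(self, text: str) -> float:
        # Score is the number of whitespace-separated tokens.
        return float(len(text.split()))

    def keep_document(self, scores: float) -> bool:
        # Receives exactly what score_document() returned.
        return scores >= self.min_words


word_filter = MinWordCountFilter(min_words=3)
score = word_filter.score_document("one two three four")
kept = word_filter.keep_document(score)
```

Scoring and the keep decision are separate steps, so the score can be computed once and inspected or stored before any document is discarded.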
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Parameters:
The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool
True if the document should be kept, False otherwise.
Raises:
NotImplementedError: If the method is not implemented in a subclass.
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Parameters:
The text content of the document to be scored.
Returns: float | list[int | float]
A score or set of scores for the document.
Raises:
NotImplementedError: If the method is not implemented in a subclass.
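When a filter returns a set of scores rather than a single float, the keep decision has to unpack the same structure that scoring produced. The sketch below illustrates that contract under stated assumptions: the `DocumentFilter` base class is a local stand-in (its name and the `keep_document` method name are assumed, not taken from this page), and `LengthAndDigitFilter` with its thresholds is hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DocumentFilter(ABC):
    """Local stand-in for the base class (assumed name and method names)."""

    @abstractmethod
    def score_document(self, text: str) -> float | list[int | float]:
        raise NotImplementedError("score_document must be implemented")

    @abstractmethod
    def keep_document(self, scores: float | list[int | float]) -> bool:
        raise NotImplementedError("keep_document must be implemented")


class LengthAndDigitFilter(DocumentFilter):
    """Hypothetical filter using two scores: length and digit ratio."""

    def __init__(self, min_chars: int = 10, max_digit_ratio: float = 0.5) -> None:
        self.min_chars = min_chars
        self.max_digit_ratio = max_digit_ratio

    def score_document(self, text: str) -> list[int | float]:
        n_chars = len(text)
        n_digits = sum(ch.isdigit() for ch in text)
        # Guard against division by zero for empty documents.
        digit_ratio = n_digits / n_chars if n_chars else 0.0
        return [n_chars, digit_ratio]

    def keep_document(self, scores: list[int | float]) -> bool:
        # Unpack in the same order score_document() packed the values.
        n_chars, digit_ratio = scores
        return n_chars >= self.min_chars and digit_ratio <= self.max_digit_ratio


doc_filter = LengthAndDigitFilter(min_chars=5, max_digit_ratio=0.5)
scores = doc_filter.score_document("abc123")
kept = doc_filter.keep_document(scores)
```

The ordering of the list is part of the contract between the two methods, so a subclass that changes one should change the other in the same commit.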