stages.text.filters.fasttext_filter
Module Contents
Classes
FastTextLangId | An abstract base class for text-based document filters.
FastTextQualityFilter | An abstract base class for text-based document filters.
API
- class stages.text.filters.fasttext_filter.FastTextLangId(model_path: str | None = None, min_langid_score: float = 0.3)
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
An abstract base class for text-based document filters.
This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.
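The template pattern described above can be sketched without the library installed. In this minimal illustration, `DocumentFilterSketch` and `MinWordCountFilter` are hypothetical names, not part of nemo_curator; the real `DocumentFilter` base class may define more methods than the two shown here.

```python
from abc import ABC, abstractmethod


class DocumentFilterSketch(ABC):
    """Hypothetical stand-in for the library's DocumentFilter base class."""

    @abstractmethod
    def score_document(self, text: str) -> float:
        """Return a score for one document."""

    @abstractmethod
    def keep_document(self, score: float) -> bool:
        """Decide whether a document with this score is kept."""


class MinWordCountFilter(DocumentFilterSketch):
    """Toy filter (assumption): keeps documents with at least `min_words` words."""

    def __init__(self, min_words: int = 3) -> None:
        self.min_words = min_words

    def score_document(self, text: str) -> float:
        # The score is simply the whitespace-delimited word count.
        return float(len(text.split()))

    def keep_document(self, score: float) -> bool:
        return score >= self.min_words


docs = ["too short", "this one has enough words"]
word_filter = MinWordCountFilter(min_words=3)
# Score each document, then pass the score to keep_document().
kept = [d for d in docs if word_filter.keep_document(word_filter.score_document(d))]
```

Splitting scoring from the keep decision lets a pipeline store scores once and re-apply different cutoffs later without re-scoring.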
Initialization
- keep_document(score: float | str) → bool
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: score (float | str): The score returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- load_model() → None
- model_check_or_download() → None
- score_document(text: str) → list[float | str]
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: list[float | str]: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
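The page does not say how FastTextLangId encodes its score, so the following is only a sketch of one plausible flow: take the top fastText prediction, return a `[confidence, language]` pair, and keep the document when the confidence reaches `min_langid_score`. `StubLangModel` is a hypothetical stand-in that mimics the `(labels, probabilities)` shape returned by fastText's `predict()`; the pair encoding and the upper-cased language code are assumptions.

```python
class StubLangModel:
    """Hypothetical stand-in mimicking fastText's predict() return shape."""

    def predict(self, text: str, k: int = 1):
        # A real language-ID model would infer this from the text.
        if "bonjour" in text.lower():
            return ("__label__fr",), [0.92]
        return ("__label__en",), [0.25]


def score_document(model, text: str) -> list:
    # fastText predicts one line at a time, so flatten newlines first.
    line = text.strip().replace("\n", " ")
    labels, probs = model.predict(line, k=1)
    lang = labels[0].replace("__label__", "").upper()
    return [float(probs[0]), lang]


def keep_document(score: list, min_langid_score: float = 0.3) -> bool:
    # Keep only documents whose top-language confidence clears the cutoff.
    return score[0] >= min_langid_score


model = StubLangModel()
confident = score_document(model, "Bonjour\nle monde")
uncertain = score_document(model, "asdf qwer zxcv")
```

With the default cutoff of 0.3 from the class signature, `confident` would be kept and `uncertain` would be dropped.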
- class stages.text.filters.fasttext_filter.FastTextQualityFilter(model_path: str | None = None, label: str = '__label__hq', alpha: float = 3, seed: int = 42)
Bases: nemo_curator.stages.text.filters.doc_filter.DocumentFilter
An abstract base class for text-based document filters.
This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.
Initialization
- keep_document(score: float) → bool
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: score (float): The score returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
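The page does not state how `alpha` and `seed` enter the keep decision. One plausible reading, shown here purely as an assumption, is a randomized rule in which a Pareto-distributed sample with shape `alpha` is compared against `1 - score`, so high-scoring documents are almost always kept and low-scoring ones are kept occasionally. The sketch uses only the standard library; the function name and the `rng` parameter are illustrative.

```python
import random


def keep_document(score: float, rng: random.Random, alpha: float = 3.0) -> bool:
    # random.paretovariate returns values >= 1; subtracting 1 shifts the
    # distribution to start at 0 so it can be compared with 1 - score.
    pareto_sample = rng.paretovariate(alpha) - 1.0
    return pareto_sample > 1.0 - score


# A seeded generator makes the random decisions reproducible.
rng = random.Random(42)
high = sum(keep_document(0.9, rng) for _ in range(2000))
low = sum(keep_document(0.1, rng) for _ in range(2000))
```

Under this rule a larger `alpha` concentrates the samples near zero, which makes the filter stricter toward low-scoring documents.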
- load_model() → None
- model_check_or_download() → None
- score_document(text: str) → float
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: float: A score representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
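As a rough illustration of how a single float could be derived for the `label` argument, the sketch below converts the top prediction of a two-class classifier into the probability of the target label. `StubQualityModel` and the `__label__lq` label are hypothetical; only `__label__hq` appears in the signature above, and the actual scoring logic may differ.

```python
class StubQualityModel:
    """Hypothetical stand-in mimicking fastText's predict() return shape."""

    def predict(self, text: str):
        # A real quality classifier would infer this from the text.
        if "peer-reviewed" in text:
            return ("__label__hq",), [0.9]
        return ("__label__lq",), [0.8]


def score_document(model, text: str, label: str = "__label__hq") -> float:
    # fastText predicts one line at a time, so flatten newlines first.
    line = text.replace("\n", " ")
    labels, probs = model.predict(line)
    prob = float(probs[0])
    # Convert "probability of the top label" into "probability of `label`",
    # which is valid only for a two-class model.
    return prob if labels[0] == label else 1.0 - prob


model = StubQualityModel()
hq_score = score_document(model, "a peer-reviewed study")
lq_score = score_document(model, "click here\nto win")
```

Returning the probability of one fixed label keeps scores comparable across documents regardless of which class the model ranks first.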