filters.classifier_filter
Module Contents
Classes
FastTextLangId | An abstract base class for text-based document filters.
FastTextQualityFilter | An abstract base class for text-based document filters.
QualityEstimationFilter | (Bitext filter) Use a Quality Estimation (QE) model to score individual segments and filter based on estimated quality score. (reference: https://arxiv.org/pdf/2311.05350)
API
- class filters.classifier_filter.FastTextLangId(model_path: str | None = None, min_langid_score: float = 0.3)
Bases: nemo_curator.filters.doc_filter.DocumentFilter
An abstract base class for text-based document filters.
This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.
Initialization
- keep_document(score: float) -> bool
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(df: pandas.Series) -> pandas.Series
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
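The template pattern described above (a base class whose methods raise NotImplementedError until a subclass overrides them) can be sketched in plain Python. This is a minimal illustration, not the library's code: `DocumentFilterSketch`, `MinScoreFilter`, and the toy scoring rule are hypothetical names and assumptions, and only the default threshold of 0.3 is borrowed from the `min_langid_score` signature above.

```python
class DocumentFilterSketch:
    """Hypothetical stand-in for the abstract base class described above."""

    def score_document(self, text):
        # Subclasses are expected to override this.
        raise NotImplementedError

    def keep_document(self, score):
        # Subclasses are expected to override this.
        raise NotImplementedError


class MinScoreFilter(DocumentFilterSketch):
    """Hypothetical subclass: keeps documents whose score meets a threshold."""

    def __init__(self, min_score=0.3):
        self.min_score = min_score

    def score_document(self, text):
        # Toy score (an assumption): fraction of alphabetic characters.
        if not text:
            return 0.0
        return sum(ch.isalpha() for ch in text) / len(text)

    def keep_document(self, score):
        # Keep the document when its score reaches the threshold.
        return score >= self.min_score


docs = ["hello world", "1234 5678 90", ""]
flt = MinScoreFilter(min_score=0.3)
kept = [d for d in docs if flt.keep_document(flt.score_document(d))]
print(kept)
```

The two-step split (score first, decide second) mirrors the contract in the docstrings: the value passed to keep_document should have the same type as whatever score_document returns.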
- class filters.classifier_filter.FastTextQualityFilter(model_path: str | None = None, label: str = '__label__hq', alpha: float = 3, seed: int = 42)
Bases: nemo_curator.filters.doc_filter.DocumentFilter
An abstract base class for text-based document filters.
This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods to define custom filtering behavior.
Initialization
- keep_document(df: pandas.Series) -> pandas.Series
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (float | list[int | float]): The score or set of scores returned by score_document(). The type should match what is returned by score_document().
Returns: bool: True if the document should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(df: pandas.Series) -> pandas.Series
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: text (str): The text content of the document to be scored.
Returns: Any: A score or set of scores representing the document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
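This page does not say how `alpha` and `seed` are used. One plausible reading, stated here purely as an assumption, is a seeded Pareto-sampling rule: a document with classifier score s is kept when a Pareto(alpha) draw exceeds 1 - s, so high-scoring documents are almost always kept while low-scoring ones survive occasionally. The sketch below uses only the standard library; `pareto_keep_flags` is a hypothetical helper, not part of the library.

```python
import random


def pareto_keep_flags(scores, alpha=3.0, seed=42):
    """Return one keep/drop flag per score under the assumed Pareto rule."""
    rng = random.Random(seed)  # seeded so repeated runs give the same flags
    flags = []
    for score in scores:
        # paretovariate returns values >= 1; subtracting 1 shifts the
        # support so that draws start at 0.
        draw = rng.paretovariate(alpha) - 1.0
        flags.append(draw > 1.0 - score)
    return flags


high = pareto_keep_flags([0.9] * 1000)
low = pareto_keep_flags([0.1] * 1000)
print(sum(high), sum(low))
```

Because both calls share the same seed, they see the same sequence of draws, so the high-score batch can never keep fewer documents than the low-score batch.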
- class filters.classifier_filter.QualityEstimationFilter(model_name: str, cutoff: float, mode: str = 'always_en_x', gpu: bool = False, **kwargs)
Bases: nemo_curator.filters.bitext_filter.BitextFilter
(Bitext filter) Use a Quality Estimation (QE) model to score individual segments and filter based on estimated quality score. (reference: https://arxiv.org/pdf/2311.05350)
Initialization
Args:
model_name (str): Name of the model, as listed in the SUPPORTED_MODELS variable.
cutoff (float): A cut-off threshold for filtering. All segments with scores lower than this threshold are discarded.
mode (str, optional): See _score_document_with_qe for definition. Defaults to "always_en_x".
gpu (bool, optional): Whether to use GPU. Defaults to False.
Raises: NotImplementedError: If a model name outside the supported model list is passed.
- SUPPORTED_MODELS: Final[dict[str, type[nemo_curator.filters.models.qe_models.QEModel]]]
None
- keep_bitext(score: float) -> bool
Decides whether a single document should be retained according to a threshold of estimated quality score.
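The cutoff rule stated in the constructor arguments (segments scoring lower than the threshold are discarded) amounts to a simple comparison. The sketch below is a standalone illustration under that reading; `keep_bitext_sketch` and the sample scores are hypothetical, and passing `cutoff` as an argument is a simplification of the class storing it at construction time.

```python
def keep_bitext_sketch(score: float, cutoff: float) -> bool:
    """Keep a segment pair unless its score is lower than the cutoff."""
    return score >= cutoff


# Made-up quality scores for three segment pairs.
scored_pairs = [
    ("Hello", "Hallo", 0.82),
    ("Good morning", "Gute Nacht", 0.40),
    ("Thank you", "Danke", 0.75),
]
cutoff = 0.75
kept_pairs = [(s, t) for s, t, q in scored_pairs if keep_bitext_sketch(q, cutoff)]
print(kept_pairs)
```

A score exactly equal to the cutoff is kept here, since the description only discards scores strictly lower than the threshold.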
- score_bitext(src: pandas.Series, tgt: pandas.Series, src_lang: pandas.Series, tgt_lang: pandas.Series)
Wrapper function that scores documents in a data frame. Most work is done in _score_document_with_qe.
Args: Takes two metadata fields: src_lang and tgt_lang. Refer to the _score_bitext_with_qe function for details.
Raises: RuntimeError: If the input data frame arguments don't have the same length.
Returns: pd.Series: Float scores, one for each document.
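The wrapper behaviour described above (check that all inputs have the same length, then score each row) can be sketched without pandas or a QE model. In this illustration plain lists stand in for pandas.Series, and `score_bitext_sketch`, `length_ratio`, and the toy scoring rule are hypothetical stand-ins, not the library's functions.

```python
def score_bitext_sketch(src, tgt, src_lang, tgt_lang, score_pair):
    """Score aligned segments row by row after validating input lengths."""
    if not (len(src) == len(tgt) == len(src_lang) == len(tgt_lang)):
        raise RuntimeError("All inputs must have the same length.")
    return [
        float(score_pair(s, t, sl, tl))
        for s, t, sl, tl in zip(src, tgt, src_lang, tgt_lang)
    ]


def length_ratio(s, t, src_lang, tgt_lang):
    # Toy scorer (an assumption): shorter length divided by longer length.
    # A real QE model would also make use of the language codes.
    if not s or not t:
        return 0.0
    return min(len(s), len(t)) / max(len(s), len(t))


scores = score_bitext_sketch(
    ["Hello", "Hi"], ["Hallo", "Hallo"], ["en", "en"], ["de", "de"], length_ratio
)
print(scores)
```

Passing the scorer in as a function keeps the length check and the row-wise loop separate from whatever model produces the scores.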