`filters.classifier_filter`#

Module Contents#

Classes#

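`FastTextLangId`	Document filter that scores text with a fastText language-identification model and keeps documents whose score reaches `min_langid_score`.
`FastTextQualityFilter`	Document filter that scores text with a fastText quality classifier, using `label` (default `'__label__hq'`) as the high-quality class.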
`QualityEstimationFilter`	(Bitext filter) Use a Quality Estimation (QE) model to score individual segments and filter based on estimated quality score. (reference: https://arxiv.org/pdf/2311.05350)

API#

class filters.classifier_filter.FastTextLangId( model_path: str | None = None, min_langid_score: float = 0.3, )#

Bases: nemo_curator.filters.doc_filter.DocumentFilter

A document filter that scores text with a fastText language-identification model loaded from model_path and keeps documents whose score reaches min_langid_score.

This class implements the DocumentFilter interface: score_document() computes the scores and keep_document() decides which documents to retain.

Initialization

keep_document(score: float) → bool#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Args: score (float): The score returned by score_document() for a single document.

Returns: bool: True if the document should be kept, False otherwise.

Raises: NotImplementedError: If the method is not implemented in a subclass.

score_document(df: pandas.Series) → pandas.Series#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Args: df (pandas.Series): The text content of the documents to be scored, one document per entry.

Returns: pandas.Series: One score entry per input document. The type and structure of each entry should be consistent for each subclass.

Raises: NotImplementedError: If the method is not implemented in a subclass.
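The two methods above form a score-then-decide pair: score_document() produces scores, and keep_document() turns each score into a keep/discard decision. The following is a minimal sketch of that contract in plain Python. It does not call the library or a real fastText model; `fake_langid_score`, `LangIdSketch`, the use of lists instead of pandas.Series, and the inclusive `>=` comparison are all assumptions made for illustration.

```python
from typing import Callable, List


def fake_langid_score(text: str) -> float:
    """Hypothetical stand-in for a language-ID model: share of ASCII letters."""
    stripped = text.strip().replace("\n", " ")
    if not stripped:
        return 0.0
    letters = sum(1 for ch in stripped if ch.isascii() and ch.isalpha())
    return letters / len(stripped)


class LangIdSketch:
    """Illustrative only; mirrors the score_document/keep_document split."""

    def __init__(
        self,
        scorer: Callable[[str], float] = fake_langid_score,
        min_langid_score: float = 0.3,  # same default as the documented signature
    ) -> None:
        self._scorer = scorer
        self._cutoff = min_langid_score

    def score_document(self, texts: List[str]) -> List[float]:
        # One score per input document.
        return [self._scorer(text) for text in texts]

    def keep_document(self, score: float) -> bool:
        # Assumption: a document is kept when its score reaches the cutoff.
        return score >= self._cutoff


docs = ["hello world", "12345 67890", ""]
sketch = LangIdSketch()
scores = sketch.score_document(docs)
kept = [doc for doc, score in zip(docs, scores) if sketch.keep_document(score)]
```

Here only "hello world" survives, because the stand-in scorer gives the digit-only and empty documents a score of 0.0, which is below the 0.3 cutoff.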

class filters.classifier_filter.FastTextQualityFilter( model_path: str | None = None, label: str = '__label__hq', alpha: float = 3, seed: int = 42, )#

Bases: nemo_curator.filters.doc_filter.DocumentFilter

A document filter that scores text with a fastText quality classifier loaded from model_path, using label (default '__label__hq') as the high-quality class.

This class implements the DocumentFilter interface: score_document() computes the scores and keep_document() decides which documents to retain.

Initialization

keep_document(df: pandas.Series) → pandas.Series#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Args: df (pandas.Series): The scores returned by score_document(), one entry per document.

Returns: pandas.Series: Boolean values, True where the document should be kept and False otherwise.

Raises: NotImplementedError: If the method is not implemented in a subclass.

score_document(df: pandas.Series) → pandas.Series#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Args: df (pandas.Series): The text content of the documents to be scored, one document per entry.

Returns: pandas.Series: One score per input document, representing the document's relevance or quality.

Raises: NotImplementedError: If the method is not implemented in a subclass.
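This page does not describe how `alpha` and `seed` enter the keep decision. One plausible reading, common in classifier-based quality filtering, is a seeded stochastic rule in which a Pareto-distributed draw is compared against `1 - score`, so that high-scoring documents are almost always kept and low-scoring ones are kept occasionally. The sketch below illustrates that idea with the standard library only. The rule itself, the function name `pareto_keep`, and the use of lists instead of pandas.Series are assumptions, not a statement of what the library does.

```python
import random
from typing import List


def pareto_keep(scores: List[float], alpha: float = 3.0, seed: int = 42) -> List[bool]:
    """Hypothetical stochastic keep rule driven by a seeded Pareto draw."""
    rng = random.Random(seed)  # a fixed seed makes the decisions reproducible
    decisions = []
    for score in scores:
        # paretovariate() returns values >= 1; subtracting 1 moves the
        # support to start at 0 so the draw is comparable with 1 - score.
        draw = rng.paretovariate(alpha) - 1.0
        decisions.append(draw > 1.0 - score)
    return decisions


high = pareto_keep([0.95] * 2000)  # documents the classifier rates highly
low = pareto_keep([0.05] * 2000)   # documents the classifier rates poorly
high_rate = sum(high) / len(high)
low_rate = sum(low) / len(low)
```

Under this assumed rule, a larger `alpha` makes large draws rarer, so fewer low-scoring documents slip through, and reusing the same `seed` reproduces the same decisions.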

class filters.classifier_filter.QualityEstimationFilter( model_name: str, cutoff: float, mode: str = 'always_en_x', gpu: bool = False, **kwargs, )#

Bases: nemo_curator.filters.bitext_filter.BitextFilter

(Bitext filter) Use a Quality Estimation (QE) model to score individual segments and filter based on estimated quality score. (reference: https://arxiv.org/pdf/2311.05350)

Initialization

Args:

model_name (str): Name of the model, as listed in the SUPPORTED_MODELS variable.

cutoff (float): The cut-off threshold for filtering. All segments with scores lower than this threshold are discarded.

mode (str, optional): See _score_bitext_with_qe for the definition. Defaults to 'always_en_x'.

gpu (bool, optional): Whether to use a GPU. Defaults to False.

Raises: NotImplementedError: If a model name outside the supported model list is passed.

SUPPORTED_MODELS: Final[dict[str, type[nemo_curator.filters.models.qe_models.QEModel]]]#: None
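The constructor's documented behavior (look up `model_name` in SUPPORTED_MODELS and raise NotImplementedError for unknown names) can be pictured as a simple registry lookup. The sketch below is self-contained and does not import the library; `QEModelSketch`, `DummyQEModel`, `SUPPORTED_MODELS_SKETCH`, `resolve_model`, and the model name "dummy-qe" are hypothetical names used only for illustration.

```python
from typing import Dict, Type


class QEModelSketch:
    """Hypothetical base class standing in for a QE model type."""


class DummyQEModel(QEModelSketch):
    """Hypothetical concrete model used only to populate the registry."""


# Hypothetical registry: maps a model name to the class that implements it.
SUPPORTED_MODELS_SKETCH: Dict[str, Type[QEModelSketch]] = {
    "dummy-qe": DummyQEModel,
}


def resolve_model(model_name: str) -> Type[QEModelSketch]:
    """Return the registered class, or raise for an unsupported name."""
    if model_name not in SUPPORTED_MODELS_SKETCH:
        raise NotImplementedError(f"Unsupported QE model: {model_name!r}")
    return SUPPORTED_MODELS_SKETCH[model_name]


model_cls = resolve_model("dummy-qe")
```

Keeping the mapping as a class-level constant lets callers inspect the supported names before constructing a filter.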

keep_bitext(score: float) → bool#

Decides whether a single document should be retained according to a threshold of estimated quality score.

score_bitext( src: pandas.Series, tgt: pandas.Series, src_lang: pandas.Series, tgt_lang: pandas.Series, ) → pandas.Series#

Wrapper function that scores the documents in a data frame. Most of the work is done in _score_bitext_with_qe.

Args: src and tgt hold the source and target segments; src_lang and tgt_lang are metadata fields giving their languages. Refer to the _score_bitext_with_qe function for details.

Raises: RuntimeError: If the input series do not all have the same length.

Returns: pd.Series: A series of float scores, one for each document.
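The wrapper-plus-threshold flow described for score_bitext() and keep_bitext() can be sketched without a real QE model. In the example below, `fake_qe_score` (a length-ratio heuristic), `score_bitext_sketch`, and `keep_bitext_sketch` are hypothetical stand-ins, lists replace pandas.Series, and the `>=` comparison follows the documented rule that scores lower than the cutoff are discarded.

```python
from typing import Callable, List, Sequence


def fake_qe_score(src: str, tgt: str, src_lang: str, tgt_lang: str) -> float:
    """Hypothetical stand-in for a QE model: ratio of shorter to longer length."""
    if not src or not tgt:
        return 0.0
    return min(len(src), len(tgt)) / max(len(src), len(tgt))


def score_bitext_sketch(
    src: Sequence[str],
    tgt: Sequence[str],
    src_lang: Sequence[str],
    tgt_lang: Sequence[str],
    scorer: Callable[[str, str, str, str], float] = fake_qe_score,
) -> List[float]:
    # All four inputs must describe the same rows.
    if not (len(src) == len(tgt) == len(src_lang) == len(tgt_lang)):
        raise RuntimeError("src, tgt, src_lang and tgt_lang must have the same length")
    return [scorer(s, t, sl, tl) for s, t, sl, tl in zip(src, tgt, src_lang, tgt_lang)]


def keep_bitext_sketch(score: float, cutoff: float) -> bool:
    # Scores lower than the cutoff are discarded.
    return score >= cutoff


scores = score_bitext_sketch(
    ["good morning", "hi"],
    ["guten Morgen", "eine sehr lange Antwort"],
    ["en", "en"],
    ["de", "de"],
)
kept = [keep_bitext_sketch(score, cutoff=0.5) for score in scores]
```

With the stand-in scorer, the first pair has equal lengths and is kept, while the second pair is badly mismatched in length and falls below the cutoff.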
