filters.bitext_filter

Module Contents

Classes

BitextFilter

A base class for bitext filters (such as length-ratio or QE filters). Unlike ParallelScoreFilter, these filters need to look at both the source and the target side of the bitext to compute a score.

API

class filters.bitext_filter.BitextFilter(
src_field: str = 'src',
tgt_field: str = 'tgt',
metadata_fields: list[str] | str | None = None,
metadata_field_name_mapping: dict[str, str] | None = None,
score_field: str | None = None,
score_type: type | str | None = None,
invert: bool = False,
)

Bases: abc.ABC

A base class for bitext filters (such as length-ratio or QE filters). Unlike ParallelScoreFilter, these filters need to look at both the source and the target side of the bitext to compute a score.

This is roughly equivalent to a ScoreFilter wrapping a DocumentFilter object. Besides operating on a ParallelDataset instead of a DocumentDataset, it differs in two ways:

  • It drops the ScoreFilter/DocumentFilter hierarchy, so filter classes can be used directly instead of being wrapped by ScoreFilter.

  • Unlike a DocumentFilter object, it allows extra metadata to be passed into the scoring function.

Initialization

Args:

  • src_field (str, optional): The field the source documents will be read from. Defaults to "src".

  • tgt_field (str, optional): The field the target documents will be read from. Defaults to "tgt".

  • metadata_fields (Union[List[str], str], optional): Names of the metadata fields, for cases where fields other than the source and target documents need to be accessed. Defaults to [].

  • metadata_field_name_mapping (Dict[str, str], optional): Mapping from field names in the data to argument names in the _score_bitext function, for cases where they differ. For example, if a field is called "src" in the data but should be passed to an argument called "source" in the _score_bitext function, add the entry {"src": "source"}. A field name without an entry is mapped to itself. Defaults to {}.

  • score_field (Optional[str], optional): The field the scores will be written to. If None, scores are discarded immediately after use. Defaults to None.

  • score_type (Union[type, str], optional): The datatype of the score computed for each document. Defaults to None.

  • invert (bool, optional): If True, keeps the documents that would normally be discarded. Defaults to False.
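The identity fallback described for metadata_field_name_mapping can be pictured with a small sketch. This is not the library's code: `build_score_kwargs` is a hypothetical helper, a plain dict stands in for one row of data, and the sketch assumes that the source and target values are always passed as `src` and `tgt` (matching the score_bitext signature on this page) while only metadata fields are renamed.

```python
def build_score_kwargs(
    row,
    src_field="src",
    tgt_field="tgt",
    metadata_fields=None,
    metadata_field_name_mapping=None,
):
    """Hypothetical helper: turn one row into keyword arguments for a scorer."""
    # Accept None, a single field name, or a list of field names.
    if metadata_fields is None:
        metadata_fields = []
    elif isinstance(metadata_fields, str):
        metadata_fields = [metadata_fields]
    mapping = metadata_field_name_mapping or {}

    # Assumption: source and target always arrive as `src` and `tgt`.
    kwargs = {"src": row[src_field], "tgt": row[tgt_field]}
    for field in metadata_fields:
        # A field without a mapping entry keeps its own name (identity map).
        kwargs[mapping.get(field, field)] = row[field]
    return kwargs


row = {"src": "Hello", "tgt": "Hallo", "src_lang": "en", "tgt_lang": "de"}
kwargs = build_score_kwargs(
    row,
    metadata_fields=["src_lang", "tgt_lang"],
    metadata_field_name_mapping={"src_lang": "source_language"},
)
print(kwargs)
```

Here `src_lang` is renamed to `source_language`, while `tgt_lang` has no entry and is passed through under its own name.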

Raises: ValueError: If the lengths of the source and target fields differ.
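As a rough illustration of how score_field and invert interact, the sketch below applies precomputed scores to a pandas DataFrame. It is an assumption-laden stand-in rather than the class's actual implementation: `apply_scores_sketch`, `keep_fn`, and the `qe_score` column name are hypothetical, and a plain DataFrame is used in place of a ParallelDataset.

```python
import pandas as pd


def apply_scores_sketch(df, scores, keep_fn, score_field=None, invert=False):
    """Hypothetical helper: filter rows by score, optionally storing the score."""
    # Decide row by row whether the pair passes the filter.
    keep_mask = scores.apply(keep_fn).astype(bool)
    if invert:
        # invert=True keeps exactly the rows that would normally be discarded.
        keep_mask = ~keep_mask

    out = df.copy()
    if score_field is not None:
        # Scores are only retained when a score field is named.
        out[score_field] = scores
    return out[keep_mask]


df = pd.DataFrame({"src": ["a", "b", "c"], "tgt": ["x", "y", "z"]})
scores = pd.Series([0.9, 0.2, 0.7])

kept = apply_scores_sketch(df, scores, lambda s: s >= 0.5, score_field="qe_score")
dropped = apply_scores_sketch(df, scores, lambda s: s >= 0.5, invert=True)
print(kept)
print(dropped)
```

With score_field set, the surviving rows carry their scores in the named column; with score_field left as None, the scores are used for filtering and then dropped.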

abstractmethod keep_bitext(**kwargs) -> bool

abstractmethod score_bitext(
src: pandas.Series,
tgt: pandas.Series,
**kwargs,
) -> pandas.Series

Scoring function for the bitext.
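A minimal sketch of what a concrete subclass might look like, using a character length ratio as the score. To stay self-contained it does not import the library: `BitextFilterSketch` is a hypothetical stand-in that declares only the two abstract methods listed above, `CharLengthRatioSketch` and `max_ratio` are illustrative names, and passing the score to keep_bitext as a `score` keyword argument is an assumption, not something this page specifies.

```python
from abc import ABC, abstractmethod

import pandas as pd


class BitextFilterSketch(ABC):
    """Hypothetical stand-in with the two abstract methods documented above."""

    @abstractmethod
    def score_bitext(self, src: pd.Series, tgt: pd.Series, **kwargs) -> pd.Series:
        ...

    @abstractmethod
    def keep_bitext(self, **kwargs) -> bool:
        ...


class CharLengthRatioSketch(BitextFilterSketch):
    """Scores each pair by the ratio of the longer side to the shorter side."""

    def __init__(self, max_ratio: float = 3.0):
        self.max_ratio = max_ratio

    def score_bitext(self, src: pd.Series, tgt: pd.Series, **kwargs) -> pd.Series:
        # Clip to 1 so empty strings do not cause a division by zero.
        src_len = src.str.len().clip(lower=1)
        tgt_len = tgt.str.len().clip(lower=1)
        longer = src_len.where(src_len >= tgt_len, tgt_len)
        shorter = src_len.where(src_len < tgt_len, tgt_len)
        return longer / shorter

    def keep_bitext(self, **kwargs) -> bool:
        # Assumption: the score is passed in as the keyword argument `score`.
        return bool(kwargs["score"] <= self.max_ratio)


src = pd.Series(["Hello world", "Hi"])
tgt = pd.Series(["Hallo Welt", "Dies ist ein sehr langer Satz"])

length_filter = CharLengthRatioSketch(max_ratio=3.0)
scores = length_filter.score_bitext(src, tgt)
keep = [length_filter.keep_bitext(score=s) for s in scores]
print(scores.tolist(), keep)
```

The first pair has nearly equal lengths and passes; the second pairs a 2-character source with a 29-character target, so its ratio exceeds `max_ratio` and it is rejected.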