filters.bitext_filter
Module Contents
Classes
| BitextFilter | A base class for bitext filter objects (such as length ratio, QE filter) on bitext. |
API
class filters.bitext_filter.BitextFilter(
    src_field: str = 'src',
    tgt_field: str = 'tgt',
    metadata_fields: list[str] | str | None = None,
    metadata_field_name_mapping: dict[str, str] | None = None,
    score_field: str | None = None,
    score_type: type | str | None = None,
    invert: bool = False,
)
Bases: abc.ABC

A base class for bitext filter objects (such as length ratio, QE filter) on bitext. Unlike ParallelScoreFilter, these filters need to look at both the source AND target sides of the bitext to compute a score.

This is roughly equivalent to a ScoreFilter wrapping a DocumentFilter object. Besides operating on ParallelDataset instead of DocumentDataset, it differs in the following ways:

- It discards the ScoreFilter/DocumentFilter hierarchy, so filter classes can be used directly instead of being wrapped by ScoreFilter.
- Unlike a DocumentFilter object, it allows passing extra metadata information into the scoring function.
Initialization

Args:
- src_field (str, optional): The field the source documents will be read from. Defaults to "src".
- tgt_field (str, optional): The field the target documents will be read from. Defaults to "tgt".
- metadata_fields (Union[List[str], str], optional): Names of the metadata fields, in case fields other than the source and target documents need to be accessed. Defaults to [].
- metadata_field_name_mapping (Dict[str, str], optional): Mapping of field names in the data to argument names in the _score_bitext function, in case they differ. For example, if a field is called "src" in the data but should be passed to an argument called "source" in the _score_bitext function, add an entry {"src": "source"}. An identity mapping is assumed for any field name without an entry. Defaults to {}.
- score_field (Optional[str], optional): The field to which the scores will be written. If None, scores are discarded immediately after use. Defaults to None.
- score_type (Union[type, str], optional): The datatype of the score computed for each document. Defaults to None.
- invert (bool, optional): If True, keeps all documents that would normally be discarded. Defaults to False.

Raises:
- ValueError: If the lengths of the source and target fields differ.

abstract keep_bitext(**kwargs) -> bool
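The interplay between metadata_fields and metadata_field_name_mapping described under Args can be pictured with a small standalone sketch. This is not the library's code: build_score_kwargs is a hypothetical helper, a plain dict stands in for one row of the dataset, and the "src_lang"/"tgt_lang" field names are made up for illustration. It only shows the documented rules: a single string is accepted in place of a list, and fields without a mapping entry keep their own name.

```python
from typing import Any, Optional, Union


def build_score_kwargs(
    row: dict[str, Any],
    metadata_fields: Union[list[str], str, None] = None,
    metadata_field_name_mapping: Optional[dict[str, str]] = None,
    src_field: str = "src",
    tgt_field: str = "tgt",
) -> dict[str, Any]:
    """Hypothetical helper: rename data fields to scoring-argument names."""
    # Accept a single field name, a list of names, or None (no metadata).
    if metadata_fields is None:
        metadata_fields = []
    elif isinstance(metadata_fields, str):
        metadata_fields = [metadata_fields]
    mapping = metadata_field_name_mapping or {}

    kwargs: dict[str, Any] = {}
    for field in [src_field, tgt_field, *metadata_fields]:
        # Identity mapping is assumed when a field has no explicit entry.
        kwargs[mapping.get(field, field)] = row[field]
    return kwargs


# Illustrative row; the metadata field names are assumptions.
row = {"src": "Hello world", "tgt": "Hallo Welt", "src_lang": "en", "tgt_lang": "de"}
kwargs = build_score_kwargs(
    row,
    metadata_fields=["src_lang", "tgt_lang"],
    metadata_field_name_mapping={"src": "source"},
)
print(kwargs)
```

With the mapping {"src": "source"}, the source text arrives under the argument name "source" while "tgt" and the two metadata fields keep their original names.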
abstract score_bitext(
    src: pandas.Series,
    tgt: pandas.Series,
    **kwargs,
)

Scoring function for the bitext.
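To make the division of labour between the two abstract methods concrete, here is a minimal self-contained sketch of a length-ratio filter. It deliberately does not import the library: ToyBitextFilter, ToyLengthRatioFilter, the filter method, and the max_ratio parameter are all hypothetical, plain lists stand in for pandas.Series, and passing the score to keep_bitext as a score keyword is an assumption. Only the scoring of both sides together and the meaning of invert follow the description above.

```python
from abc import ABC, abstractmethod


class ToyBitextFilter(ABC):
    """Stand-in base class for illustration; NOT the library's implementation."""

    def __init__(self, invert: bool = False):
        self.invert = invert

    @abstractmethod
    def score_bitext(self, src: list[str], tgt: list[str], **kwargs) -> list[float]:
        """Return one score per (source, target) pair."""

    @abstractmethod
    def keep_bitext(self, **kwargs) -> bool:
        """Decide whether a pair is kept, given its score."""

    def filter(self, src: list[str], tgt: list[str], **kwargs) -> list[tuple[str, str]]:
        scores = self.score_bitext(src, tgt, **kwargs)
        kept = []
        for s, t, score in zip(src, tgt, scores):
            keep = self.keep_bitext(score=score)
            # invert=True keeps exactly the pairs that are normally discarded.
            if self.invert:
                keep = not keep
            if keep:
                kept.append((s, t))
        return kept


class ToyLengthRatioFilter(ToyBitextFilter):
    """Keeps pairs whose longer/shorter token-count ratio is at most max_ratio."""

    def __init__(self, max_ratio: float = 2.0, invert: bool = False):
        super().__init__(invert=invert)
        self.max_ratio = max_ratio

    def score_bitext(self, src, tgt, **kwargs):
        scores = []
        for s, t in zip(src, tgt):
            n_src, n_tgt = len(s.split()), len(t.split())
            # Guard against division by zero for empty segments.
            scores.append(max(n_src, n_tgt) / max(1, min(n_src, n_tgt)))
        return scores

    def keep_bitext(self, **kwargs):
        return kwargs["score"] <= self.max_ratio


src = ["a b c", "hello", "one two"]
tgt = ["x y z", "this is a much longer sentence", "eins zwei drei"]

kept = ToyLengthRatioFilter(max_ratio=2.0).filter(src, tgt)
dropped = ToyLengthRatioFilter(max_ratio=2.0, invert=True).filter(src, tgt)
print(kept)
print(dropped)
```

The subclass is used directly, without a wrapper object, which mirrors the flattened hierarchy described for BitextFilter. The second call shows invert returning only the pair that the first call discarded.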