filters.bitext_filter
Module Contents
Classes
A base class for bitext filter objects (such as length ratio, QE filter) on bitext.
API
class filters.bitext_filter.BitextFilter(
    src_field: str = 'src',
    tgt_field: str = 'tgt',
    metadata_fields: list[str] | str | None = None,
    metadata_field_name_mapping: dict[str, str] | None = None,
    score_field: str | None = None,
    score_type: type | str | None = None,
    invert: bool = False,
)
Bases: abc.ABC

A base class for bitext filter objects (such as length-ratio or QE filters). Unlike ParallelScoreFilter, these filters must look at both the source AND target side of the bitext to compute a score.

This is roughly equivalent to a ScoreFilter wrapping a DocumentFilter object. Besides operating on ParallelDataset instead of DocumentDataset, it differs in two ways:

- It discards the ScoreFilter/DocumentFilter hierarchy, so filter classes can be used directly instead of being wrapped by ScoreFilter.
- Unlike a DocumentFilter object, it allows passing extra metadata into the scoring function.
Initialization

Args:
- src_field (str, optional): The field the source documents will be read from. Defaults to “src”.
- tgt_field (str, optional): The field the target documents will be read from. Defaults to “tgt”.
- metadata_fields (Union[List[str], str], optional): Names of the metadata fields, in case fields other than the source and target documents need to be accessed. Defaults to [].
- metadata_field_name_mapping (Dict[str, str], optional): Mapping from field names in the data to argument names in the _score_bitext function, in case they differ. For example, if a field is called “src” in the data but should be passed to an argument called “source” in the _score_bitext function, add an entry {"src": "source"}. An identity map is assumed for any field name without an entry. Defaults to {}.
- score_field (Optional[str], optional): The field to which the scores will be written. If None, scores are discarded immediately after use. Defaults to None.
- score_type (Union[type, str], optional): The datatype of the score computed for each document. Defaults to None.
- invert (bool, optional): If True, keeps all documents that would normally be discarded. Defaults to False.

Raises:
- ValueError: If the lengths of the source and target fields are different.
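The field-name mapping described above can be illustrated with a small sketch. The helpers `normalize_metadata_fields` and `build_score_kwargs` are hypothetical names introduced here for illustration; they are not part of the library, and the real class may resolve names differently.

```python
def normalize_metadata_fields(metadata_fields):
    """Accept None, a single field name, or a list of names; return a list.

    Hypothetical helper mirroring the documented argument types.
    """
    if metadata_fields is None:
        return []
    if isinstance(metadata_fields, str):
        return [metadata_fields]
    return list(metadata_fields)


def build_score_kwargs(row, fields, name_mapping=None):
    """Map data field names to scoring-function argument names.

    Fields without an entry in name_mapping keep their own name
    (identity map), as the docstring above describes.
    """
    name_mapping = name_mapping or {}
    return {name_mapping.get(field, field): row[field] for field in fields}


# One record of parallel data plus one metadata field (made-up values).
row = {"src": "Hello world", "tgt": "Hallo Welt", "lang_pair": "en-de"}
fields = ["src", "tgt"] + normalize_metadata_fields("lang_pair")

# "src" is renamed to "source"; "tgt" and "lang_pair" pass through unchanged.
kwargs = build_score_kwargs(row, fields, {"src": "source"})
print(kwargs)
```

With this mapping, a scoring function would receive the source text under the argument name `source`, while unmapped fields keep their data names.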
abstract keep_bitext(**kwargs) -> bool

abstract score_bitext(
    src: pandas.Series,
    tgt: pandas.Series,
    **kwargs,
)
Scoring function for the bitext.
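To show how the two abstract methods might fit together, here is a minimal sketch of a length-ratio filter. It is written against a stand-in base class (`SketchBitextFilter`) rather than the real BitextFilter, and it uses plain Python lists where the real API takes pandas.Series. The `filter_pairs` driver, the `max_ratio` parameter, and the `score` keyword passed to `keep_bitext` are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class SketchBitextFilter(ABC):
    """Hypothetical stand-in for BitextFilter; not the library class."""

    def __init__(self, invert=False):
        self.invert = invert

    @abstractmethod
    def score_bitext(self, src, tgt, **kwargs):
        """Return one score per (source, target) pair."""

    @abstractmethod
    def keep_bitext(self, **kwargs):
        """Return True if a pair with the given score should be kept."""

    def filter_pairs(self, src, tgt):
        # Hypothetical driver: score every pair, then keep or drop it.
        scores = self.score_bitext(src, tgt)
        kept = []
        for s, t, score in zip(src, tgt, scores):
            keep = self.keep_bitext(score=score)
            if keep != self.invert:  # invert=True flips the decision
                kept.append((s, t))
        return kept


class LengthRatioSketch(SketchBitextFilter):
    """Keeps pairs whose token-length ratio is at most max_ratio."""

    def __init__(self, max_ratio=2.0, invert=False):
        super().__init__(invert=invert)
        self.max_ratio = max_ratio

    def score_bitext(self, src, tgt, **kwargs):
        # Ratio of the longer side to the shorter side, in whitespace tokens.
        scores = []
        for s, t in zip(src, tgt):
            ls, lt = len(s.split()), len(t.split())
            scores.append(max(ls, lt) / max(1, min(ls, lt)))
        return scores

    def keep_bitext(self, score, **kwargs):
        return score <= self.max_ratio


src = ["a b c", "one two three four five six"]
tgt = ["x y z", "uno"]

kept = LengthRatioSketch(max_ratio=2.0).filter_pairs(src, tgt)
dropped = LengthRatioSketch(max_ratio=2.0, invert=True).filter_pairs(src, tgt)
print(kept)
print(dropped)
```

The score needs both sides of each pair, which is the property that separates this kind of filter from one that scores source and target independently.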