modules.filter
Module Contents
Classes
Filter | The module responsible for filtering records based on a metadata field. It accepts an arbitrary filter function that accepts a metadata field and returns True if the field should be kept. It also accepts a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter. Unlike ScoreFilter, it does not compute the metadata based on a document. It only filters using existing metadata. |
ParallelScoreFilter | Base class for all NeMo Curator modules. |
Score | The module responsible for adding metadata to records based on statistics about the text. It accepts an arbitrary scoring function that accepts a text field and returns a score. It also accepts a DocumentFilter object, in which case the score_fn will be the score_document method of the DocumentFilter. |
ScoreFilter | The module responsible for applying a filter to all documents in a DocumentDataset. It accepts an arbitrary DocumentFilter and first computes the score for a document. Then, it determines whether to keep the document based on the criteria in the DocumentFilter. |
API
class modules.filter.Filter(
    filter_fn: collections.abc.Callable | nemo_curator.filters.DocumentFilter,
    filter_field: str,
    invert: bool = False,
)
Bases: nemo_curator.modules.base.BaseModule
The module responsible for filtering records based on a metadata field. It accepts an arbitrary filter function that accepts a metadata field and returns True if the field should be kept. It also accepts a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter. Unlike ScoreFilter, it does not compute the metadata based on a document. It only filters using existing metadata.
Initialization
Constructs a Filter module
Args:
  filter_fn (Callable | DocumentFilter): A function that returns True if the document is to be kept, or a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter.
  filter_field (str): The field(s) to be passed into the filter function.
  invert (bool): Whether to invert the filter condition.
call(dataset: nemo_curator.datasets.DocumentDataset)
Applies the filtering to a dataset
Args: dataset (DocumentDataset): The dataset to apply the module to
Returns: DocumentDataset: A dataset with entries removed according to the filter
compute_filter_mask(dataset: nemo_curator.datasets.DocumentDataset)
Compute the bool mask to filter the dataset.
Args: dataset (DocumentDataset): The dataset to compute filter mask on.
Returns: Series or DataFrame: A mask corresponding to each data instance indicating whether it will be retained.
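The relationship between compute_filter_mask and call described above can be sketched in plain Python. This is an illustration of the semantics only, not the NeMo Curator implementation: records are modeled as a list of dicts rather than a DocumentDataset, the helper names compute_mask and apply_filter are hypothetical, and it assumes that invert simply flips each mask entry.

```python
def compute_mask(records, filter_fn, filter_field, invert=False):
    """Return one bool per record; True means the record is retained."""
    mask = [bool(filter_fn(rec[filter_field])) for rec in records]
    if invert:
        # Assumption: inverting flips the decision for every record.
        mask = [not keep for keep in mask]
    return mask


def apply_filter(records, filter_fn, filter_field, invert=False):
    """Drop every record whose mask entry is False."""
    mask = compute_mask(records, filter_fn, filter_field, invert)
    return [rec for rec, keep in zip(records, mask) if keep]


# The metadata field "word_count" already exists; nothing is computed from the text.
records = [
    {"text": "a b c", "word_count": 3},
    {"text": "a b c d e f", "word_count": 6},
]
kept = apply_filter(records, lambda n: n >= 5, "word_count")
dropped = apply_filter(records, lambda n: n >= 5, "word_count", invert=True)
```

Note that the filter function only ever sees the existing metadata value, which is the distinction from ScoreFilter drawn in the class description.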
class modules.filter.ParallelScoreFilter(
    src_filter_obj: nemo_curator.filters.DocumentFilter,
    tgt_filter_obj: nemo_curator.filters.DocumentFilter,
    src_field: str = 'src',
    tgt_field: str = 'tgt',
    src_score: str | None = None,
    tgt_score: str | None = None,
    score_type: str | None = None,
    invert: bool = False,
)
Bases: nemo_curator.modules.base.BaseModule
Base class for all NeMo Curator modules.
Handles validating that data lives on the correct device for each module
Initialization
A filter object wrapper class for applying monolingual filter objects on bitext. If either side of the bitext is discarded, the whole bitext pair is discarded. If you want to apply a bitext filter that takes both the source and target as input, check out the BitextFilter class.
Note that the goal of this wrapper class is to group the same or similar filters on bitext, making the logic clearer, which is why score_type and invert are forced to be the same for the source and target filters. If you need the extra flexibility, fall back to applying two filters one after the other.
Args:
  src_filter_obj (type): The score function that takes in a document string and outputs a score for the source document.
  tgt_filter_obj (type): The score function that takes in a document string and outputs a score for the target document.
  src_field (str, optional): The field the source documents will be read from. Defaults to “src”.
  tgt_field (str, optional): The field the target documents will be read from. Defaults to “tgt”.
  src_score (str, optional): The field to which the source scores will be written. If None, scores will be immediately discarded after use. Defaults to None.
  tgt_score (str, optional): The field to which the target scores will be written. If None, scores will be immediately discarded after use. Defaults to None.
  score_type (Optional[str]): The datatype of the score that will be made for each document. Defaults to None.
  invert (bool, optional): If True, will keep all documents that are normally discarded. Defaults to False.
call(dataset: nemo_curator.datasets.parallel_dataset.ParallelDataset)
Performs an arbitrary operation on a dataset
Args: dataset (DocumentDataset): The dataset to operate on
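The rule that a bitext pair is discarded when either side fails its filter can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the library's code: MinWordsFilter is a hypothetical stand-in for a DocumentFilter that exposes score_document and keep_document, filter_bitext is a hypothetical helper, pairs are modeled as dicts, and the score fields and invert option are omitted for brevity.

```python
class MinWordsFilter:
    """Hypothetical stand-in for a DocumentFilter: scores a text by word count."""

    def __init__(self, min_words):
        self.min_words = min_words

    def score_document(self, text):
        return len(text.split())

    def keep_document(self, score):
        return score >= self.min_words


def filter_bitext(pairs, src_filter, tgt_filter, src_field="src", tgt_field="tgt"):
    """Keep a pair only if both the source and the target side pass."""
    kept = []
    for pair in pairs:
        src_ok = src_filter.keep_document(src_filter.score_document(pair[src_field]))
        tgt_ok = tgt_filter.keep_document(tgt_filter.score_document(pair[tgt_field]))
        if src_ok and tgt_ok:
            kept.append(pair)
    return kept


pairs = [
    {"src": "good morning everyone", "tgt": "guten Morgen zusammen"},
    {"src": "hello", "tgt": "hallo an alle hier"},  # source side is too short
]
kept_pairs = filter_bitext(pairs, MinWordsFilter(2), MinWordsFilter(2))
```

The second pair is dropped even though its target side passes, because the two sides are never kept or removed independently.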
class modules.filter.Score(
    score_fn: collections.abc.Callable | nemo_curator.filters.DocumentFilter,
    score_field: str,
    text_field: str = 'text',
    score_type: type | str | None = None,
)
Bases: nemo_curator.modules.base.BaseModule
The module responsible for adding metadata to records based on statistics about the text. It accepts an arbitrary scoring function that accepts a text field and returns a score. It also accepts a DocumentFilter object, in which case the score_fn will be the score_document method of the DocumentFilter.
Unlike ScoreFilter, it does not filter based on the computed score. It only adds metadata to the record.
Initialization
Constructs a Score module.
Args:
  score_fn (Callable | DocumentFilter): The score function or the DocumentFilter object. If it is a DocumentFilter object, the score_fn will be the score_document method of the DocumentFilter.
  score_field (str): The field the score will be stored in.
  text_field (str): The field the documents will be read from.
  score_type (Union[type, str]): The datatype of the score that will be made for each document.
call(dataset: nemo_curator.datasets.DocumentDataset)
Applies the scoring to a dataset
Args: dataset (DocumentDataset): The dataset to apply the module to
Returns: DocumentDataset: A dataset with the new score
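A minimal sketch of the scoring step, assuming records are plain dicts rather than a DocumentDataset. The helper add_score is hypothetical, and for simplicity it only handles a score_type given as a Python type (the documented parameter also allows a string). Every record is retained; only a new field is added.

```python
def add_score(records, score_fn, score_field, text_field="text", score_type=None):
    """Return copies of the records with the score stored under score_field."""
    scored = []
    for rec in records:
        value = score_fn(rec[text_field])
        if score_type is not None:
            # Simplification: cast with a Python type such as int or float.
            value = score_type(value)
        scored.append({**rec, score_field: value})
    return scored


docs = [{"text": "one two three"}, {"text": "four"}]
scored_docs = add_score(
    docs,
    score_fn=lambda text: len(text.split()),
    score_field="word_count",
    score_type=float,
)
```

A later Filter-style step could then read the stored word_count field without rescoring the text.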
class modules.filter.ScoreFilter(
    filter_obj: nemo_curator.filters.DocumentFilter,
    text_field: str = 'text',
    score_field: str | None = None,
    score_type: type | str | None = None,
    invert: bool = False,
)
Bases: nemo_curator.modules.base.BaseModule
The module responsible for applying a filter to all documents in a DocumentDataset. It accepts an arbitrary DocumentFilter and first computes the score for a document. Then, determines whether to keep the document based on the criteria in the DocumentFilter.
The filter can be applied to any field in the dataset, and the score can be logged for later. Also, the filter can be inverted such that “rejected” documents are kept.
Initialization
Constructs a ScoreFilter module.
Args:
  filter_obj (DocumentFilter): The score function that takes in a document string and outputs a score for the document.
  text_field (str): The field the documents will be read from.
  score_field (str | None): The field to which the scores will be written. If None, scores will be immediately discarded after use.
  score_type (Union[type, str]): The datatype of the score that will be made for each document.
  invert (bool): If True, will keep all documents that are normally discarded.
call(dataset: nemo_curator.datasets.DocumentDataset)
Scores and filters all records in the dataset
Args: dataset (DocumentDataset): The dataset to apply the module to
Returns: DocumentDataset: A dataset with the score and filter applied
compute_filter_mask(dataset: nemo_curator.datasets.DocumentDataset)
Compute the bool mask to filter the dataset.
Args: dataset (DocumentDataset): The dataset to compute filter mask on.
Returns: Series or DataFrame: A mask corresponding to each data instance indicating whether it will be retained.
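The score-then-filter behavior can be sketched in plain Python as well. As with any illustration written outside the library, the names here are assumptions: MinWordsFilter is a hypothetical stand-in for a DocumentFilter, score_filter is a hypothetical helper, and records are dicts. The sketch assumes the score is written only when score_field is set and that invert flips the keep decision.

```python
class MinWordsFilter:
    """Hypothetical stand-in for a DocumentFilter: scores a text by word count."""

    def __init__(self, min_words):
        self.min_words = min_words

    def score_document(self, text):
        return len(text.split())

    def keep_document(self, score):
        return score >= self.min_words


def score_filter(records, filter_obj, text_field="text", score_field=None, invert=False):
    """Score each record's text, then keep it according to the filter's criteria."""
    kept = []
    for rec in records:
        score = filter_obj.score_document(rec[text_field])
        keep = filter_obj.keep_document(score)
        if invert:
            keep = not keep  # keep the documents that would normally be discarded
        if keep:
            out = dict(rec)
            if score_field is not None:
                out[score_field] = score  # log the score for later use
            kept.append(out)
    return kept


docs = [{"text": "one two three"}, {"text": "four"}]
passed = score_filter(docs, MinWordsFilter(2))
logged = score_filter(docs, MinWordsFilter(2), score_field="word_count")
rejected = score_filter(docs, MinWordsFilter(2), invert=True)
```

Compared with running a scoring step and a metadata filter separately, this form discards the score when score_field is None, so no extra field is left on the records.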