modules.filter

Module Contents

Classes

Filter

The module responsible for filtering records based on a metadata field. It accepts an arbitrary filter function that takes the value of a metadata field and returns True if the record should be kept. It also accepts a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter. Unlike ScoreFilter, it does not compute the metadata from the document. It only filters using existing metadata.

ParallelScoreFilter

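A wrapper module for applying monolingual DocumentFilter objects to the source and target sides of a bitext. If either side of a pair is discarded, the whole pair is discarded.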

Score

The module responsible for adding metadata to records based on statistics about the text. It accepts an arbitrary scoring function that accepts a text field and returns a score. It also accepts a DocumentFilter object, in which case the score_fn will be the score_document method of the DocumentFilter.

ScoreFilter

The module responsible for applying a filter to all documents in a DocumentDataset. It accepts an arbitrary DocumentFilter and first computes the score for a document. It then determines whether to keep the document based on the criteria in the DocumentFilter.

API

class modules.filter.Filter(
filter_fn: collections.abc.Callable | nemo_curator.filters.DocumentFilter,
filter_field: str,
invert: bool = False,
)

Bases: nemo_curator.modules.base.BaseModule

The module responsible for filtering records based on a metadata field. It accepts an arbitrary filter function that takes the value of a metadata field and returns True if the record should be kept. It also accepts a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter. Unlike ScoreFilter, it does not compute the metadata from the document. It only filters using existing metadata.

Initialization

Constructs a Filter module

Args:
    filter_fn (Callable | DocumentFilter): A function that returns True if the document is to be kept, or a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter.
    filter_field (str): The field(s) to be passed into the filter function.
    invert (bool): Whether to invert the filter condition.

call(
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset

Applies the filtering to a dataset

Args: dataset (DocumentDataset): The dataset to apply the module to

Returns: DocumentDataset: A dataset with entries removed according to the filter

compute_filter_mask(
dataset: nemo_curator.datasets.DocumentDataset,
) → pandas.Series | pandas.DataFrame

Compute the bool mask to filter the dataset.

Args: dataset (DocumentDataset): The dataset to compute filter mask on.

Returns: Series or DataFrame: A mask corresponding to each data instance indicating whether it will be retained.
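The masking behaviour described above can be illustrated without the library. The following is a minimal pure-Python sketch, not NeMo Curator's implementation: plain dicts stand in for dataset rows, a list of booleans stands in for the pandas mask, and the helper names `compute_mask` and `apply_mask` are hypothetical.

```python
from typing import Any, Callable


def compute_mask(
    records: list[dict[str, Any]],
    filter_fn: Callable[[Any], bool],
    filter_field: str,
    invert: bool = False,
) -> list[bool]:
    """Return one boolean per record; True means the record is retained."""
    mask = [bool(filter_fn(record[filter_field])) for record in records]
    if invert:
        # Inverting retains exactly the records that would normally be dropped.
        mask = [not keep for keep in mask]
    return mask


def apply_mask(records: list[dict[str, Any]], mask: list[bool]) -> list[dict[str, Any]]:
    """Keep only the records whose mask entry is True."""
    return [record for record, keep in zip(records, mask) if keep]


# Hypothetical records that already carry a "word_count" metadata field.
records = [
    {"id": 1, "word_count": 12},
    {"id": 2, "word_count": 250},
    {"id": 3, "word_count": 90},
]

mask = compute_mask(records, lambda count: count >= 80, "word_count")
kept_ids = [r["id"] for r in apply_mask(records, mask)]

inverted_mask = compute_mask(records, lambda count: count >= 80, "word_count", invert=True)
inverted_ids = [r["id"] for r in apply_mask(records, inverted_mask)]
```

The filter function only ever sees the existing metadata value, never the document text, which mirrors the distinction from ScoreFilter drawn in the class description.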

class modules.filter.ParallelScoreFilter(
src_filter_obj: nemo_curator.filters.DocumentFilter,
tgt_filter_obj: nemo_curator.filters.DocumentFilter,
src_field: str = 'src',
tgt_field: str = 'tgt',
src_score: str | None = None,
tgt_score: str | None = None,
score_type: str | None = None,
invert: bool = False,
)

Bases: nemo_curator.modules.base.BaseModule

A wrapper module for applying monolingual DocumentFilter objects to the source and target sides of a bitext.

Initialization

A filter object wrapper class for applying monolingual filter objects on bitext. If either side of the bitext is discarded, the whole bitext pair is discarded. If you want to apply a bitext filter that takes both the source and target as input, check out the BitextFilter class.

Note that the goal of this wrapper class is to group the same/similar filters on bitext thus making the logic clearer, which is why we force the score_type and invert to be the same among source/target filters. If you need the extra flexibility, you should fall back to applying two filters one after the other.

Args:
    src_filter_obj (DocumentFilter): The filter object used to score the source document and decide whether to keep it.
    tgt_filter_obj (DocumentFilter): The filter object used to score the target document and decide whether to keep it.
    src_field (str, optional): The field the source documents will be read from. Defaults to 'src'.
    tgt_field (str, optional): The field the target documents will be read from. Defaults to 'tgt'.
    src_score (str, optional): The field to which the source scores will be written. If None, scores will be immediately discarded after use. Defaults to None.
    tgt_score (str, optional): The field to which the target scores will be written. If None, scores will be immediately discarded after use. Defaults to None.
    score_type (Optional[str]): The datatype of the score that will be made for each document. Defaults to None.
    invert (bool, optional): If True, will keep all documents that are normally discarded. Defaults to False.
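The "discard the pair if either side is discarded" rule can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's code: `MinLengthFilter` is a hypothetical stand-in that exposes only the score_document and keep_document methods this page attributes to DocumentFilter, `parallel_filter` is a hypothetical helper, and the `invert` and score-writing options are left out.

```python
class MinLengthFilter:
    """Hypothetical stand-in for a monolingual DocumentFilter."""

    def __init__(self, min_chars: int) -> None:
        self.min_chars = min_chars

    def score_document(self, text: str) -> int:
        # The score is simply the character count of the text.
        return len(text)

    def keep_document(self, score: int) -> bool:
        return score >= self.min_chars


def parallel_filter(pairs, src_filter, tgt_filter, src_field="src", tgt_field="tgt"):
    """Keep a pair only when both its source and target sides pass."""
    kept = []
    for pair in pairs:
        src_ok = src_filter.keep_document(src_filter.score_document(pair[src_field]))
        tgt_ok = tgt_filter.keep_document(tgt_filter.score_document(pair[tgt_field]))
        if src_ok and tgt_ok:
            kept.append(pair)
    return kept


# Hypothetical bitext rows using the default field names.
pairs = [
    {"src": "Hello world", "tgt": "Hallo Welt"},           # both sides pass
    {"src": "Hi", "tgt": "Guten Tag zusammen"},            # source too short
    {"src": "Good morning everyone", "tgt": "Ok"},         # target too short
]

kept_pairs = parallel_filter(pairs, MinLengthFilter(5), MinLengthFilter(5))
```

Only the first pair survives, because each of the other two has one side that its filter rejects.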

call(
dataset: nemo_curator.datasets.parallel_dataset.ParallelDataset,
) → nemo_curator.datasets.parallel_dataset.ParallelDataset

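Applies the source and target filters to a bitext dataset

Args: dataset (ParallelDataset): The dataset to apply the module to

Returns: ParallelDataset: A dataset with every pair removed in which either side was discarded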

class modules.filter.Score(
score_fn: collections.abc.Callable | nemo_curator.filters.DocumentFilter,
score_field: str,
text_field: str = 'text',
score_type: type | str | None = None,
)

Bases: nemo_curator.modules.base.BaseModule

The module responsible for adding metadata to records based on statistics about the text. It accepts an arbitrary scoring function that accepts a text field and returns a score. It also accepts a DocumentFilter object, in which case the score_fn will be the score_document method of the DocumentFilter.

Unlike ScoreFilter, it does not filter based on the computed score. It only adds metadata to the record.

Initialization

Constructs a Score module.

Args:
    score_fn (Callable | DocumentFilter): The score function or the DocumentFilter object. If it is a DocumentFilter object, the score_fn will be the score_document method of the DocumentFilter.
    score_field (str): The field the score will be stored in.
    text_field (str): The field the documents will be read from. Defaults to 'text'.
    score_type (type | str | None): The datatype of the score that will be made for each document. Defaults to None.

call(
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset

Applies the scoring to a dataset

Args: dataset (DocumentDataset): The dataset to apply the module to

Returns: DocumentDataset: A dataset with the new score
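To make the "adds metadata, never removes records" behaviour concrete, here is a minimal pure-Python sketch. It is not the library's implementation: dicts stand in for dataset rows, `add_score` is a hypothetical helper, and the `score_type` argument is ignored.

```python
from typing import Any, Callable


def add_score(
    records: list[dict[str, Any]],
    score_fn: Callable[[str], Any],
    score_field: str,
    text_field: str = "text",
) -> list[dict[str, Any]]:
    """Return copies of the records with score_fn(text) stored in score_field."""
    return [{**record, score_field: score_fn(record[text_field])} for record in records]


# Hypothetical input rows.
records = [{"text": "a b c"}, {"text": "one two"}]

# The scoring function here counts whitespace-separated words.
scored = add_score(records, lambda text: len(text.split()), "word_count")
```

Every input record appears in the output, so the result always has the same number of rows as the input. A separate filtering step can later act on the stored field.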

class modules.filter.ScoreFilter(
filter_obj: nemo_curator.filters.DocumentFilter,
text_field: str = 'text',
score_field: str | None = None,
score_type: type | str | None = None,
invert: bool = False,
)

Bases: nemo_curator.modules.base.BaseModule

The module responsible for applying a filter to all documents in a DocumentDataset. It accepts an arbitrary DocumentFilter and first computes the score for a document. It then determines whether to keep the document based on the criteria in the DocumentFilter.

The filter can be applied to any field in the dataset, and the score can be written to a field for later use. The filter can also be inverted so that “rejected” documents are kept.

Initialization

Constructs a ScoreFilter module.

Args:
    filter_obj (DocumentFilter): The filter object used to score each document and decide whether to keep it.
    text_field (str): The field the documents will be read from. Defaults to 'text'.
    score_field (str | None): The field to which the scores will be written. If None, scores will be immediately discarded after use.
    score_type (type | str | None): The datatype of the score that will be made for each document. Defaults to None.
    invert (bool): If True, will keep all documents that are normally discarded.

call(
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset

Scores and filters all records in the dataset

Args: dataset (DocumentDataset): The dataset to apply the module to

Returns: DocumentDataset: A dataset with the score and filter applied
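The score-then-filter sequence, the optional score field, and inversion can be sketched as follows. This is a pure-Python illustration under stated assumptions, not NeMo Curator's code: `MinWordsFilter` is a hypothetical stand-in exposing only score_document and keep_document, `score_filter` is a hypothetical helper, and `score_type` is ignored.

```python
class MinWordsFilter:
    """Hypothetical stand-in for a DocumentFilter based on word count."""

    def __init__(self, min_words: int) -> None:
        self.min_words = min_words

    def score_document(self, text: str) -> int:
        return len(text.split())

    def keep_document(self, score: int) -> bool:
        return score >= self.min_words


def score_filter(records, filter_obj, text_field="text", score_field=None, invert=False):
    """Score each record, then keep it according to the filter's criteria."""
    kept = []
    for record in records:
        score = filter_obj.score_document(record[text_field])
        keep = filter_obj.keep_document(score)
        if invert:
            keep = not keep
        if keep:
            out = dict(record)
            if score_field is not None:
                # The score is stored only when a field name is given.
                out[score_field] = score
            kept.append(out)
    return kept


# Hypothetical input rows.
records = [{"text": "too short"}, {"text": "this sentence has five words"}]
word_filter = MinWordsFilter(min_words=3)

kept_with_score = score_filter(records, word_filter, score_field="n_words")
kept_without_score = score_filter(records, word_filter)
rejected_only = score_filter(records, word_filter, invert=True)
```

With `score_field=None` the score is used for the decision and then discarded, matching the argument description above.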

compute_filter_mask(
dataset: nemo_curator.datasets.DocumentDataset,
) → pandas.Series | pandas.DataFrame

Compute the bool mask to filter the dataset.

Args: dataset (DocumentDataset): The dataset to compute filter mask on.

Returns: Series or DataFrame: A mask corresponding to each data instance indicating whether it will be retained.