nemo_curator.stages.text.filters.score_filter
API
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
The Filter module filters records based on an existing metadata field. It accepts an arbitrary filter function that takes a metadata value and returns True if the record should be kept. It also accepts a DocumentFilter object, in which case filter_fn is the keep_document method of that DocumentFilter. Unlike ScoreFilter, it does not compute the metadata from the document; it only filters on metadata that already exists.
If a list of DocumentFilters is provided, the filters are applied in order. In this case, filter_field should be a list of strings with one entry per filter. If some filters should be inverted and others not, invert should be a list of booleans with one entry per filter.
Parameters:
filter_fn: A function (or list of functions) that returns True if the document is to be kept, or a DocumentFilter object, in which case filter_fn is the keep_document method of the DocumentFilter.
filter_field: The field (or list of fields) whose value is passed to the filter function.
invert: Whether to invert the filter condition (or a list of booleans, one per filter).
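The in-order application of several filters can be illustrated with a small pure-Python sketch. It is not the NeMo Curator implementation: apply_filters_in_order and the record fields word_count and lang_score are hypothetical names, and plain dictionaries stand in for a DocumentBatch. Only the roles of filter_fn, filter_field, and invert follow the description above.

```python
def apply_filters_in_order(records, filter_fns, filter_fields, inverts):
    """Keep the records that pass every filter, applying the filters in order.

    Hypothetical helper for illustration only; not part of nemo_curator.
    """
    if not (len(filter_fns) == len(filter_fields) == len(inverts)):
        raise ValueError("filter_fns, filter_fields, and inverts must have equal length")

    kept = list(records)
    for filter_fn, filter_field, invert in zip(filter_fns, filter_fields, inverts):
        # A record survives this step when the filter result differs from `invert`:
        # invert=False keeps records where filter_fn is True, invert=True keeps the rest.
        kept = [r for r in kept if bool(filter_fn(r[filter_field])) != invert]
    return kept


# Made-up records whose metadata fields already exist (nothing is computed here).
records = [
    {"id": "a", "word_count": 5, "lang_score": 0.9},
    {"id": "b", "word_count": 50, "lang_score": 0.2},
    {"id": "c", "word_count": 80, "lang_score": 0.95},
]
filter_fns = [lambda n: n >= 10, lambda s: s >= 0.5]
filter_fields = ["word_count", "lang_score"]

# Both filters applied normally, then with only the second filter inverted.
kept_default = apply_filters_in_order(records, filter_fns, filter_fields, [False, False])
kept_mixed = apply_filters_in_order(records, filter_fns, filter_fields, [False, True])
```

With these inputs, kept_default contains only record "c", while inverting the second filter leaves only record "b".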
Compute the bool mask to filter the dataset.
Parameters:
The dataset to compute filter mask on.
The filter function to use.
The field to read the filter from.
Whether to invert the filter condition.
Returns: pd.Series
A mask corresponding to each data instance indicating whether it will be retained.
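As a rough illustration of what such a mask looks like, the sketch below builds one with plain Python lists. The documented return type is pd.Series; a list of booleans is used here only to keep the example dependency-free. compute_mask is a hypothetical name, not the library's method.

```python
def compute_mask(values, filter_fn, invert=False):
    """Return one boolean per value: True means the instance is retained.

    Hypothetical stand-in; the real stage returns a pd.Series.
    """
    mask = [bool(filter_fn(v)) for v in values]
    if invert:
        # Inverting flips every entry, so previously rejected instances are retained.
        mask = [not keep for keep in mask]
    return mask


scores = [0.1, 0.7, 0.4, 0.9]
mask = compute_mask(scores, lambda s: s >= 0.5)
inverted_mask = compute_mask(scores, lambda s: s >= 0.5, invert=True)

# Applying the mask keeps only the instances marked True.
retained = [s for s, keep in zip(scores, mask) if keep]
```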
Applies the filtering to a batch.
Parameters:
The batch to apply the module to.
Returns: DocumentBatch | None
A batch with entries removed according to the filter.
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
The Score module adds metadata to records based on statistics about the text. It accepts an arbitrary scoring function that takes the value of a text field and returns a score. It also accepts a DocumentFilter object, in which case score_fn is the score_document method of that DocumentFilter.
Unlike ScoreFilter, it does not filter based on the computed score. It only adds metadata to the record.
If a list of DocumentFilters is provided, the filters are applied in order. In this case, the score_field parameter should be a list of strings corresponding to the filters. If different filters should be applied to different text fields, then text_field should be a list of strings corresponding to the filters.
Parameters:
score_fn: The score function or DocumentFilter object (or a list of either). If it is a DocumentFilter object, score_fn is the score_document method of the DocumentFilter.
score_field: The field (or list of fields) the score will be stored in.
text_field: The field (or list of fields) the documents will be read from.
Applies the scoring to a batch.
Parameters:
The batch to apply the module to.
Returns: DocumentBatch | None
A batch with the new score field(s) added.
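The scoring behavior (add a score field, remove nothing) can be sketched in pure Python as follows. add_scores and the field names text_word_count and title_char_count are hypothetical, and dictionaries stand in for the records of a DocumentBatch; this is an illustration under those assumptions rather than the library's code.

```python
def add_scores(records, score_fns, score_fields, text_fields):
    """Return copies of the records with one new score field per scoring function.

    Hypothetical helper for illustration only; not part of nemo_curator.
    """
    if not (len(score_fns) == len(score_fields) == len(text_fields)):
        raise ValueError("score_fns, score_fields, and text_fields must have equal length")

    scored = [dict(r) for r in records]  # copy so the input records stay untouched
    for score_fn, score_field, text_field in zip(score_fns, score_fields, text_fields):
        for record in scored:
            # Read from the chosen text field, write to the chosen score field.
            record[score_field] = score_fn(record[text_field])
    return scored


records = [
    {"title": "Hello", "text": "one two three"},
    {"title": "A longer title", "text": "four"},
]
scored = add_scores(
    records,
    score_fns=[lambda text: len(text.split()), len],
    score_fields=["text_word_count", "title_char_count"],
    text_fields=["text", "title"],
)
```

Every input record is still present in scored; only the two score fields have been added.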
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
The ScoreFilter module applies a filter (or chain of filters) to all documents in a dataset. It accepts an arbitrary DocumentFilter, first computes the score for each document, and then determines whether to keep the document based on the criteria in the DocumentFilter.
The filter can be applied to any field in the dataset, and the score can be stored for later use. The filter can also be inverted so that “rejected” documents are kept.
If a list of DocumentFilters is provided, the filters are applied in order. If different filters should be applied to different text fields, then text_field should be a list of strings corresponding to the filters. If different score fields should be created for each filter, then score_field should be a list of strings corresponding to the filters. If some filters should be inverted and others not, then invert should be a list of booleans corresponding to the filters.
Parameters:
The score function (or list of score functions) that takes in a document string and outputs a score for the document.
The field (or list of fields) the documents will be read from.
The field (or list of fields) to which the scores will be written. If None, the scores are discarded immediately after use.
If True, inverts the filter so that documents that would normally be discarded are kept instead.
Compute the bool mask to filter the dataset.
Parameters:
The dataset to compute filter mask on.
The filter object to use.
The field to read the text from.
The field to write the scores to.
Whether to invert the filter condition.
Returns: pd.Series
A mask corresponding to each data instance indicating whether it will be retained.
Scores and filters all records in the batch.
Parameters:
The batch to apply the module to.
Returns: DocumentBatch | None
A batch with the score and filter applied.
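The score-then-filter flow can be sketched with a toy filter object. MinWordsFilter is a hypothetical stand-in that only mimics the score_document and keep_document methods mentioned in this page; it does not subclass the real DocumentFilter. score_and_filter and the word_count field are likewise assumptions made for illustration.

```python
class MinWordsFilter:
    """Hypothetical filter: scores a document by word count, keeps it if long enough."""

    def __init__(self, min_words):
        self.min_words = min_words

    def score_document(self, text):
        return len(text.split())

    def keep_document(self, score):
        return score >= self.min_words


def score_and_filter(records, filter_obj, text_field="text", score_field=None, invert=False):
    """Score each record, optionally store the score, and drop rejected records."""
    kept = []
    for record in records:
        score = filter_obj.score_document(record[text_field])
        keep = bool(filter_obj.keep_document(score))
        if invert:
            keep = not keep  # keep the documents that would normally be rejected
        if keep:
            out = dict(record)
            if score_field is not None:
                out[score_field] = score  # with score_field=None the score is discarded
            kept.append(out)
    return kept


records = [{"text": "short"}, {"text": "this document has five words"}]
word_filter = MinWordsFilter(min_words=3)

kept = score_and_filter(records, word_filter, score_field="word_count")
rejected = score_and_filter(records, word_filter, invert=True)
```

Here kept holds the five-word document together with its stored score, while rejected holds the one-word document and no score field, because score_field was left as None.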
In the case of a list of DocumentFilters or Callables, format the relevant field (filter_field, score_field, text_field, invert) to a list of length equal to the number of filters.
Parameters:
The field to check and format.
The number of filters. This will be the length of the output list.
The name of the field, which is used in error messages.
The type of the field, which is used in an isinstance check.
Returns: list[str] | list[bool]
The reformatted field.
In the case of a single DocumentFilter or Callable, format the relevant field (filter_field, score_field, text_field, invert) to a list of length 1.
Parameters:
The field to check and format.
The name of the field, which is used in error messages.
The type of the field, which is used in an isinstance check.
Returns: list[str] | list[bool]
The reformatted field.
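One way to picture this normalization, for both the single-filter and the multi-filter case, is the sketch below. format_field is a hypothetical name, and two behaviors are assumptions not confirmed by this page: that a single value is repeated once per filter, and which exception types are raised.

```python
def format_field(field, n_filters, field_name, field_type):
    """Return `field` as a list of length n_filters, validating element types.

    Hypothetical helper; the repeat-a-single-value behavior is an assumption.
    """
    if isinstance(field, list):
        if len(field) != n_filters:
            raise ValueError(
                f"{field_name} has {len(field)} entries but there are {n_filters} filters"
            )
        for item in field:
            if not isinstance(item, field_type):
                raise TypeError(f"every entry of {field_name} must be {field_type.__name__}")
        return list(field)

    if not isinstance(field, field_type):
        raise TypeError(f"{field_name} must be {field_type.__name__} or a list of them")
    # Assumption: a single value applies to every filter.
    return [field] * n_filters


single = format_field("text", 1, "text_field", str)
repeated = format_field(False, 3, "invert", bool)
per_filter = format_field(["title", "text"], 2, "text_field", str)
```

Passing field_name through to the error messages makes a length or type mismatch easier to trace back to the offending argument.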
Derive the stage name from the provided score/filter functions.
Validate and normalize all parameters needed for the Score, Filter, and ScoreFilter modules. “Normalize” means to reformat all parameters to a list of length equal to the number of filters.
Parameters:
The filter object or list of filter objects.
The input field. For Score and ScoreFilter, this is the text field. For Filter, this is the filter field.
The invert flag. This is used for Filter and ScoreFilter.
The output field. For Score and ScoreFilter, this is the score field. For Filter, this is not used.
The type of the module.
Returns: tuple[str, list[DocumentFilter | Callable], list[str] | None, list[bool] | None, list[str] | None]
The first element is the name given to the DocumentFilter or Callable. The remaining elements are the normalized filters, input fields, invert flags, and output fields.