stages.text.modules.score_filter#

Module Contents#

Classes#

Filter

The module responsible for filtering records based on a metadata field. It accepts an arbitrary filter function that accepts a metadata field and returns True if the field should be kept. It also accepts a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter. Unlike ScoreFilter, it does not compute the metadata based on a document. It only filters using existing metadata.

Score

The module responsible for adding metadata to records based on statistics about the text. It accepts an arbitrary scoring function that accepts a text field and returns a score. It also accepts a DocumentFilter object, in which case the score_fn will be the score_document method of the DocumentFilter.

ScoreFilter

The module responsible for applying a filter (or chain of filters) to all documents in a dataset. It accepts an arbitrary DocumentFilter and first computes the score for a document. It then determines whether to keep the document based on the criteria in the DocumentFilter.

API#

class stages.text.modules.score_filter.Filter#

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

The module responsible for filtering records based on a metadata field. It accepts an arbitrary filter function that accepts a metadata field and returns True if the field should be kept. It also accepts a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter. Unlike ScoreFilter, it does not compute the metadata based on a document. It only filters using existing metadata.

If a list of DocumentFilters is provided, the filters are applied in order. In this case, the filter_field parameter should be a list of strings corresponding to the filters. If some filters should be inverted and others not, then invert should be a list of booleans corresponding to the filters.

Args:
    filter_fn (Callable | DocumentFilter | list[DocumentFilter]): A function (or list of functions) that returns True if the document is to be kept or a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter.
    filter_field (str | list[str]): The field (or list of fields) to be passed into the filter function.
    invert (bool | list[bool]): Whether to invert the filter condition.
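The chained behaviour described above (filters applied in order, one field and one invert flag per filter) can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: `apply_filters` and the sample records are hypothetical names introduced for illustration, and plain dictionaries stand in for the rows of a DocumentBatch.

```python
def apply_filters(records, filter_fns, filter_fields, invert=False):
    """Keep records whose metadata passes every filter, applied in order.

    Hypothetical helper: mimics the documented semantics, not the real stage.
    """
    # Normalize scalar arguments to lists so single and chained filters share one path.
    if callable(filter_fns):
        filter_fns = [filter_fns]
    if isinstance(filter_fields, str):
        filter_fields = [filter_fields]
    if isinstance(invert, bool):
        invert = [invert] * len(filter_fns)
    if not (len(filter_fns) == len(filter_fields) == len(invert)):
        raise ValueError("filter_fns, filter_fields and invert must have equal length")

    kept = list(records)
    for fn, field, inv in zip(filter_fns, filter_fields, invert):
        # Comparing against `inv` flips the decision when this filter is inverted.
        kept = [r for r in kept if bool(fn(r[field])) != inv]
    return kept


records = [
    {"text": "a", "word_count": 3, "lang_score": 0.9},
    {"text": "b", "word_count": 120, "lang_score": 0.4},
    {"text": "c", "word_count": 80, "lang_score": 0.95},
]
filters = [lambda n: n >= 50, lambda s: s >= 0.8]
fields = ["word_count", "lang_score"]

both_pass = apply_filters(records, filters, fields)
second_inverted = apply_filters(records, filters, fields, invert=[False, True])
```

Note that the filters only read metadata that already exists on each record; nothing is computed from the text itself.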

compute_filter_mask(
df: pandas.DataFrame,
filter_fn: collections.abc.Callable | nemo_curator.stages.text.filters.doc_filter.DocumentFilter,
filter_field: str,
invert: bool,
) pandas.Series#

Compute the bool mask to filter the dataset.

Args:
    df (pd.DataFrame): The dataset to compute filter mask on.
    filter_fn (Callable | DocumentFilter): The filter function to use.
    filter_field (str): The field to read the filter from.
    invert (bool): Whether to invert the filter condition.

Returns: Series: A mask corresponding to each data instance indicating whether it will be retained.
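As a rough illustration of what such a boolean mask looks like, the sketch below builds one with pandas. It is an assumption about the general technique, not the method's actual body: `compute_mask`, the `quality` column, and the threshold are hypothetical.

```python
import pandas as pd


def compute_mask(df, filter_fn, filter_field, invert=False):
    """Hypothetical stand-in: one boolean per row, True meaning 'retain'."""
    # Evaluate the predicate on each value of the existing metadata column.
    mask = df[filter_field].apply(filter_fn).astype(bool)
    # With invert=True, the rows the predicate rejects are the ones retained.
    return ~mask if invert else mask


df = pd.DataFrame({"text": ["a", "b", "c"], "quality": [0.2, 0.7, 0.9]})
mask = compute_mask(df, lambda q: q > 0.5, "quality")
inverted = compute_mask(df, lambda q: q > 0.5, "quality", invert=True)

# Boolean indexing with the mask drops the rows marked False.
kept = df[mask]
```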

filter_field: str | list[str]#

None

filter_fn: collections.abc.Callable | nemo_curator.stages.text.filters.doc_filter.DocumentFilter | list[nemo_curator.stages.text.filters.doc_filter.DocumentFilter]#

None

inputs() tuple[list[str], list[str]]#

Define stage input requirements.

Returns (tuple[list[str], list[str]]): Tuple of (required_attributes, required_columns) where:
- required_top_level_attributes: List of task attributes that must be present
- required_data_attributes: List of attributes within the data that must be present

invert: bool | list[bool]#

False

name: str#

'filter_fn'

outputs() tuple[list[str], list[str]]#

Define stage output specification.

Returns (tuple[list[str], list[str]]): Tuple of (output_attributes, output_columns) where:
- output_top_level_attributes: List of task attributes this stage adds/modifies
- output_data_attributes: List of attributes within the data that this stage adds/modifies

process(
batch: nemo_curator.tasks.DocumentBatch,
) nemo_curator.tasks.DocumentBatch | None#

Applies the filtering to a dataset

Args: batch (DocumentBatch): The batch to apply the module to

Returns: DocumentBatch: A batch with entries removed according to the filter

class stages.text.modules.score_filter.Score#

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

The module responsible for adding metadata to records based on statistics about the text. It accepts an arbitrary scoring function that accepts a text field and returns a score. It also accepts a DocumentFilter object, in which case the score_fn will be the score_document method of the DocumentFilter.

Unlike ScoreFilter, it does not filter based on the computed score. It only adds metadata to the record.

If a list of DocumentFilters is provided, the filters are applied in order. In this case, the score_field parameter should be a list of strings corresponding to the filters. If different filters should be applied to different text fields, then text_field should be a list of strings corresponding to the filters.

Args:
    score_fn (Callable | DocumentFilter | list[DocumentFilter]): The score function or the DocumentFilter object (or list of DocumentFilters). If it is a DocumentFilter object, the score_fn will be the score_document method of the DocumentFilter.
    score_field (str | list[str]): The field (or list of fields) the score will be stored in.
    text_field (str | list[str]): The field (or list of fields) the documents will be read from.
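To make the scoring semantics concrete, here is a minimal pure-Python sketch, assuming each record is a dictionary. `add_scores` and the field names `word_count` and `char_count` are hypothetical and only illustrate the described behaviour: every scoring function reads a text field and writes its result to the matching score field, and no record is dropped.

```python
def add_scores(records, score_fns, score_fields, text_fields="text"):
    """Hypothetical helper: attach one score per scoring function to each record."""
    # Normalize scalar arguments so a single scorer and a list of scorers share one path.
    if callable(score_fns):
        score_fns = [score_fns]
    if isinstance(score_fields, str):
        score_fields = [score_fields]
    if isinstance(text_fields, str):
        text_fields = [text_fields] * len(score_fns)

    scored = [dict(r) for r in records]  # copy so the input records stay untouched
    for fn, score_field, text_field in zip(score_fns, score_fields, text_fields):
        for record in scored:
            record[score_field] = fn(record[text_field])
    return scored


records = [{"text": "hello world"}, {"text": "NeMo Curator filters text"}]
scored = add_scores(
    records,
    [lambda t: len(t.split()), len],
    ["word_count", "char_count"],
)
```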

inputs() tuple[list[str], list[str]]#

Define stage input requirements.

Returns (tuple[list[str], list[str]]): Tuple of (required_attributes, required_columns) where:
- required_top_level_attributes: List of task attributes that must be present
- required_data_attributes: List of attributes within the data that must be present

name: str#

'score_fn'

outputs() tuple[list[str], list[str]]#

Define stage output specification.

Returns (tuple[list[str], list[str]]): Tuple of (output_attributes, output_columns) where:
- output_top_level_attributes: List of task attributes this stage adds/modifies
- output_data_attributes: List of attributes within the data that this stage adds/modifies

process(
batch: nemo_curator.tasks.DocumentBatch,
) nemo_curator.tasks.DocumentBatch | None#

Applies the scoring to a dataset

Args: batch (DocumentBatch): The batch to apply the module to

Returns: DocumentBatch: A batch with the new score

ray_stage_spec() dict[str, Any]#

Get Ray configuration for this stage. Note: This is only used for Ray Data, which is an experimental backend. The keys are defined in RayStageSpecKeys in backends/experimental/ray_data/utils.py.

Returns (dict[str, Any]): Dictionary containing Ray-specific configuration

score_field: str | list[str]#

None

score_fn: collections.abc.Callable[[str], float | str] | nemo_curator.stages.text.filters.doc_filter.DocumentFilter | list[nemo_curator.stages.text.filters.doc_filter.DocumentFilter]#

None

setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None,
) None#

Setup method called once before processing begins. Override this method to perform any initialization that should happen once per worker.

Args: worker_metadata (WorkerMetadata, optional): Information about the worker (provided by some backends)

setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None,
) None#

Setup method called once per node in distributed settings. Override this method to perform node-level initialization.

Args:
    node_info (NodeInfo, optional): Information about the node (provided by some backends)
    worker_metadata (WorkerMetadata, optional): Information about the worker (provided by some backends)

text_field: str | list[str]#

'text'

class stages.text.modules.score_filter.ScoreFilter#

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

The module responsible for applying a filter (or chain of filters) to all documents in a dataset. It accepts an arbitrary DocumentFilter and first computes the score for a document. It then determines whether to keep the document based on the criteria in the DocumentFilter.

The filter can be applied to any field in the dataset, and the score can be logged for later. Also, the filter can be inverted such that “rejected” documents are kept.

If a list of DocumentFilters is provided, the filters are applied in order. If different filters should be applied to different text fields, then text_field should be a list of strings corresponding to the filters. If different score fields should be created for each filter, then score_field should be a list of strings corresponding to the filters. If some filters should be inverted and others not, then invert should be a list of booleans corresponding to the filters.

Args:
    filter_obj (DocumentFilter | list[DocumentFilter]): The DocumentFilter (or list of DocumentFilters) used to score each document string and decide whether the document is kept.
    text_field (str | list[str]): The field (or list of fields) the documents will be read from.
    score_field (str | list[str] | None): The field (or list of fields) to which the scores will be written. If None, scores are discarded immediately after use.
    invert (bool | list[bool]): If True, keeps the documents that would normally be discarded.
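The score-then-filter flow can be sketched in plain Python as follows. The method names `score_document` and `keep_document` come from the descriptions above; everything else (`MinWordsFilter`, `score_and_filter`, the `word_count` field) is a hypothetical stand-in and not the library's code.

```python
class MinWordsFilter:
    """Hypothetical filter exposing the two methods attributed to DocumentFilter."""

    def __init__(self, min_words):
        self.min_words = min_words

    def score_document(self, text):
        # The score here is simply the whitespace-delimited word count.
        return len(text.split())

    def keep_document(self, score):
        return score >= self.min_words


def score_and_filter(records, filter_obj, text_field="text", score_field=None, invert=False):
    """Score each record, then keep it if the filter accepts it (or rejects it when inverted)."""
    kept = []
    for record in records:
        score = filter_obj.score_document(record[text_field])
        keep = bool(filter_obj.keep_document(score))
        if keep != invert:
            out = dict(record)
            # The score is only stored when a score field is requested.
            if score_field is not None:
                out[score_field] = score
            kept.append(out)
    return kept


docs = [{"text": "too short"}, {"text": "this document has enough words"}]
kept = score_and_filter(docs, MinWordsFilter(4), score_field="word_count")
rejected = score_and_filter(docs, MinWordsFilter(4), invert=True)
```

With `score_field=None` the score is used for the decision and then dropped, which matches the documented default; with `invert=True` the documents that would normally be rejected are the ones returned.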

compute_filter_mask(
df: pandas.DataFrame,
filter_obj: nemo_curator.stages.text.filters.doc_filter.DocumentFilter,
text_field: str,
score_field: str | None,
invert: bool,
) pandas.Series#

Compute the bool mask to filter the dataset.

Args:
    df (pd.DataFrame): The dataset to compute filter mask on.
    filter_obj (DocumentFilter): The filter object to use.
    text_field (str): The field to read the text from.
    score_field (str | None): The field to write the scores to.
    invert (bool): Whether to invert the filter condition.

Returns: Series: A mask corresponding to each data instance indicating whether it will be retained.

filter_obj: nemo_curator.stages.text.filters.doc_filter.DocumentFilter | list[nemo_curator.stages.text.filters.doc_filter.DocumentFilter]#

None

inputs() tuple[list[str], list[str]]#

Define stage input requirements.

Returns (tuple[list[str], list[str]]): Tuple of (required_attributes, required_columns) where:
- required_top_level_attributes: List of task attributes that must be present
- required_data_attributes: List of attributes within the data that must be present

invert: bool | list[bool]#

False

name: str#

'score_filter'

outputs() tuple[list[str], list[str]]#

Define stage output specification.

Returns (tuple[list[str], list[str]]): Tuple of (output_attributes, output_columns) where:
- output_top_level_attributes: List of task attributes this stage adds/modifies
- output_data_attributes: List of attributes within the data that this stage adds/modifies

process(
batch: nemo_curator.tasks.DocumentBatch,
) nemo_curator.tasks.DocumentBatch | None#

Scores and filters all records in the dataset

Args: batch (DocumentBatch): The batch to apply the module to

Returns: DocumentBatch: A batch with the score and filter applied

ray_stage_spec() dict[str, Any]#

Get Ray configuration for this stage. Note: This is only used for Ray Data, which is an experimental backend. The keys are defined in RayStageSpecKeys in backends/experimental/ray_data/utils.py.

Returns (dict[str, Any]): Dictionary containing Ray-specific configuration

score_field: str | list[str] | None#

None

setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None,
) None#

Setup method called once before processing begins. Override this method to perform any initialization that should happen once per worker.

Args: worker_metadata (WorkerMetadata, optional): Information about the worker (provided by some backends)

setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None,
) None#

Setup method called once per node in distributed settings. Override this method to perform node-level initialization.

Args:
    node_info (NodeInfo, optional): Information about the node (provided by some backends)
    worker_metadata (WorkerMetadata, optional): Information about the worker (provided by some backends)

text_field: str | list[str]#

'text'