nemo_curator.stages.text.filters.score_filter

Module Contents

Classes

| Name | Description |
| --- | --- |
| Filter | The module responsible for filtering records based on a metadata field. |
| Score | The module responsible for adding metadata to records based on statistics about the text. |
| ScoreFilter | The module responsible for applying a filter (or chain of filters) to all documents in a dataset. |

Functions

| Name | Description |
| --- | --- |
| _filter_name | - |
| _format_field_list | In the case of a list of DocumentFilters or Callables, format the relevant field to a list of length equal to the number of filters. |
| _format_single_field_list | In the case of a single DocumentFilter or Callable, format the relevant field to a list of length 1. |
| _get_filter_stage_name | Derive the stage name from the provided score/filter functions. |
| _validate_and_normalize_filters | Validate and normalize all parameters needed for the Score, Filter, and ScoreFilter modules. |

API

class nemo_curator.stages.text.filters.score_filter.Filter(
filter_fn: collections.abc.Callable | nemo_curator.stages.text.filters.doc_filter.DocumentFilter | list[nemo_curator.stages.text.filters.doc_filter.DocumentFilter],
filter_field: str | list[str],
invert: bool | list[bool] = False,
name: str = 'filter_fn'
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

The module responsible for filtering records based on a metadata field. It accepts an arbitrary filter function that takes the value of a metadata field and returns True if the record should be kept. It also accepts a DocumentFilter object, in which case the filter_fn will be the keep_document method of the DocumentFilter. Unlike ScoreFilter, it does not compute the metadata from the document; it only filters using existing metadata.

If a list of DocumentFilters is provided, the filters are applied in order. In this case, the filter_field parameter should be a list of strings corresponding to the filters. If some filters should be inverted and others not, then invert should be a list of booleans corresponding to the filters.
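The ordering and inversion rules described here can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `apply_metadata_filters` and the sample records are hypothetical, and the real Filter stage works on a DocumentBatch rather than a list of dictionaries.

```python
from collections.abc import Callable
from typing import Any


def apply_metadata_filters(
    records: list[dict[str, Any]],
    filter_fns: list[Callable[[Any], bool]],
    filter_fields: list[str],
    inverts: list[bool],
) -> list[dict[str, Any]]:
    """Hypothetical helper: apply each filter, in order, to an existing field."""
    kept = records
    for fn, field, invert in zip(filter_fns, filter_fields, inverts):
        # A filter returns True for records to keep; invert flips that decision.
        kept = [r for r in kept if bool(fn(r[field])) != invert]
    return kept


# Illustrative records whose metadata fields already exist.
records = [
    {"text": "a", "word_count": 120, "toxicity": 0.1},
    {"text": "b", "word_count": 5, "toxicity": 0.0},
    {"text": "c", "word_count": 300, "toxicity": 0.9},
]

result = apply_metadata_filters(
    records,
    filter_fns=[lambda n: n >= 50, lambda t: t > 0.5],
    filter_fields=["word_count", "toxicity"],
    inverts=[False, True],  # keep long documents, then drop the toxic ones
)
print([r["text"] for r in result])
```

Each list has one entry per filter, which mirrors how filter_field and invert are expected to line up with filter_fn when a list is passed.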

Parameters:

filter_fn
Callable | DocumentFilter | list[DocumentFilter]

A function (or list of functions) that returns True if the document is to be kept, or a DocumentFilter object (or list of DocumentFilters), in which case the filter_fn will be the keep_document method of the DocumentFilter.

filter_field
str | list[str]

The field (or list of fields) to be passed into the filter function.

invert
bool | list[bool], defaults to False

Whether to invert the filter condition.

filter_field
str | list[str]
filter_fn
Callable | DocumentFilter | list[DocumentFilter]
invert
bool | list[bool] = False
name
str = 'filter_fn'
nemo_curator.stages.text.filters.score_filter.Filter.__post_init__()
nemo_curator.stages.text.filters.score_filter.Filter.compute_filter_mask(
df: pandas.DataFrame,
filter_fn: collections.abc.Callable | nemo_curator.stages.text.filters.doc_filter.DocumentFilter,
filter_field: str,
invert: bool
) -> pandas.Series

Compute the Boolean mask used to filter the dataset.

Parameters:

df
pd.DataFrame

The dataset to compute filter mask on.

filter_fn
Callable | DocumentFilter

The filter function to use.

filter_field
str

The field to read the filter from.

invert
bool

Whether to invert the filter condition.

Returns: pd.Series

A mask corresponding to each data instance indicating whether it will be retained.

nemo_curator.stages.text.filters.score_filter.Filter.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.filters.score_filter.Filter.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.filters.score_filter.Filter.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch | None

Applies the filtering to a dataset

Parameters:

batch
DocumentBatch

The batch to apply the module to

Returns: DocumentBatch | None

A batch with entries removed according to the filter

class nemo_curator.stages.text.filters.score_filter.Score(
score_fn: collections.abc.Callable[[str], float | str] | nemo_curator.stages.text.filters.doc_filter.DocumentFilter | list[nemo_curator.stages.text.filters.doc_filter.DocumentFilter],
score_field: str | list[str],
text_field: str | list[str] = 'text',
name: str = 'score_fn'
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

The module responsible for adding metadata to records based on statistics about the text. It accepts an arbitrary scoring function that accepts a text field and returns a score. It also accepts a DocumentFilter object, in which case the score_fn will be the score_document method of the DocumentFilter.

Unlike ScoreFilter, it does not filter based on the computed score. It only adds metadata to the record.

If a list of DocumentFilters is provided, the filters are applied in order. In this case, the score_field parameter should be a list of strings corresponding to the filters. If different filters should be applied to different text fields, then text_field should be a list of strings corresponding to the filters.

Parameters:

score_fn
Callable | DocumentFilter | list[DocumentFilter]

The score function or the DocumentFilter object (or list of DocumentFilters). If it is a DocumentFilter object, the score_fn will be the score_document method of the DocumentFilter.

score_field
str | list[str]

The field (or list of fields) the score will be stored in.

text_field
str | list[str], defaults to 'text'

The field (or list of fields) the documents will be read from.

name
str = 'score_fn'
score_field
str | list[str]
score_fn
Callable[[str], float | str] | DocumentFilter | list[DocumentFilter]
text_field
str | list[str] = 'text'
nemo_curator.stages.text.filters.score_filter.Score.__post_init__()
nemo_curator.stages.text.filters.score_filter.Score.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.filters.score_filter.Score.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.filters.score_filter.Score.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch | None

Applies the scoring to a dataset

Parameters:

batch
DocumentBatch

The batch to apply the module to

Returns: DocumentBatch | None

A batch with the new score

nemo_curator.stages.text.filters.score_filter.Score.ray_stage_spec() -> dict[str, typing.Any]
nemo_curator.stages.text.filters.score_filter.Score.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.text.filters.score_filter.Score.setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
class nemo_curator.stages.text.filters.score_filter.ScoreFilter(
filter_obj: nemo_curator.stages.text.filters.doc_filter.DocumentFilter | list[nemo_curator.stages.text.filters.doc_filter.DocumentFilter],
text_field: str | list[str] = 'text',
score_field: str | list[str] | None = None,
invert: bool | list[bool] = False,
name: str = 'score_filter'
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

The module responsible for applying a filter (or chain of filters) to all documents in a dataset. It accepts an arbitrary DocumentFilter and first computes the score for each document. It then determines whether to keep the document based on the criteria in the DocumentFilter.

The filter can be applied to any field in the dataset, and the score can be written to a score field for later use. The filter can also be inverted so that “rejected” documents are kept.

If a list of DocumentFilters is provided, the filters are applied in order. If different filters should be applied to different text fields, then text_field should be a list of strings corresponding to the filters. If different score fields should be created for each filter, then score_field should be a list of strings corresponding to the filters. If some filters should be inverted and others not, then invert should be a list of booleans corresponding to the filters.

Parameters:

filter_obj
DocumentFilter | list[DocumentFilter]

The DocumentFilter (or list of DocumentFilters) that takes in a document string, outputs a score for the document, and decides from that score whether the document is kept.

text_field
str | list[str], defaults to 'text'

The field (or list of fields) the documents will be read from.

score_field
str | list[str] | None, defaults to None

The field (or list of fields) to which the scores will be written. If None, scores will be immediately discarded after use.

invert
bool | list[bool], defaults to False

If True, will keep all documents that are normally discarded.

filter_obj
DocumentFilter | list[DocumentFilter]
invert
bool | list[bool] = False
name
str = 'score_filter'
score_field
str | list[str] | None = None
text_field
str | list[str] = 'text'
nemo_curator.stages.text.filters.score_filter.ScoreFilter.__post_init__()
nemo_curator.stages.text.filters.score_filter.ScoreFilter.compute_filter_mask(
df: pandas.DataFrame,
filter_obj: nemo_curator.stages.text.filters.doc_filter.DocumentFilter,
text_field: str,
score_field: str | None,
invert: bool
) -> pandas.Series

Compute the Boolean mask used to filter the dataset.

Parameters:

df
pd.DataFrame

The dataset to compute filter mask on.

filter_obj
DocumentFilter

The filter object to use.

text_field
str

The field to read the text from.

score_field
str | None

The field to write the scores to.

invert
bool

Whether to invert the filter condition.

Returns: pd.Series

A mask corresponding to each data instance indicating whether it will be retained.

nemo_curator.stages.text.filters.score_filter.ScoreFilter.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.filters.score_filter.ScoreFilter.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.filters.score_filter.ScoreFilter.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch | None

Scores and filters all records in the dataset

Parameters:

batch
DocumentBatch

The batch to apply the module to

Returns: DocumentBatch | None

A batch with the score and filter applied

nemo_curator.stages.text.filters.score_filter.ScoreFilter.ray_stage_spec() -> dict[str, typing.Any]
nemo_curator.stages.text.filters.score_filter.ScoreFilter.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.text.filters.score_filter.ScoreFilter.setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.text.filters.score_filter._filter_name(
x: nemo_curator.stages.text.filters.doc_filter.DocumentFilter | collections.abc.Callable
) -> str
nemo_curator.stages.text.filters.score_filter._format_field_list(
_field: str | list[str] | None,
filter_count: int,
field_name: str,
field_type: type = str
) -> list[str] | list[bool]

In the case of a list of DocumentFilters or Callables, format the relevant field (filter_field, score_field, text_field, invert) to a list of length equal to the number of filters.

Parameters:

_field
str | list[str] | None

The field to check and format.

filter_count
int

The number of filters. This will be the length of the output list.

field_name
str

The name of the field, which is used in error messages.

field_type
type, defaults to str

The type of the field, which is used in an isinstance check.

Returns: list[str] | list[bool]

The reformatted field.

nemo_curator.stages.text.filters.score_filter._format_single_field_list(
_field: str | list[str] | None,
field_name: str,
field_type: type = str
) -> list[str] | list[bool]

In the case of a single DocumentFilter or Callable, format the relevant field (filter_field, score_field, text_field, invert) to a list of length 1.

Parameters:

_field
str | list[str] | None

The field to check and format.

field_name
str

The name of the field, which is used in error messages.

field_type
type, defaults to str

The type of the field, which is used in an isinstance check.

Returns: list[str] | list[bool]

The reformatted field.

nemo_curator.stages.text.filters.score_filter._get_filter_stage_name(
filters: list[nemo_curator.stages.text.filters.doc_filter.DocumentFilter | collections.abc.Callable],
prefix: str
) -> str

Derive the stage name from the provided score/filter functions.

nemo_curator.stages.text.filters.score_filter._validate_and_normalize_filters(
_filter: nemo_curator.stages.text.filters.doc_filter.DocumentFilter | collections.abc.Callable | list[nemo_curator.stages.text.filters.doc_filter.DocumentFilter | collections.abc.Callable],
input_field: str | list[str] | None,
invert: bool | list[bool] | None,
output_field: str | list[str] | None,
fn_type: typing.Literal['score', 'filter', 'score_filter']
) -> tuple[str, list[nemo_curator.stages.text.filters.doc_filter.DocumentFilter | collections.abc.Callable], list[str] | None, list[bool] | None, list[str] | None]

Validate and normalize all parameters needed for the Score, Filter, and ScoreFilter modules. “Normalize” means to reformat all parameters to a list of length equal to the number of filters.

Parameters:

_filter
DocumentFilter | Callable | list[DocumentFilter | Callable]

The filter object or list of filter objects.

input_field
str | list[str] | None

The input field. For Score and ScoreFilter, this is the text field. For Filter, this is the filter field.

invert
bool | list[bool] | None

The invert flag. This is used for Filter and ScoreFilter.

output_field
str | list[str] | None

The output field. For Score and ScoreFilter, this is the score field. For Filter, this is not used.

fn_type
Literal['score', 'filter', 'score_filter']

The type of the module.

Returns: tuple[str, list[DocumentFilter | Callable], list[str] | None, list[bool] | None, list[str] | None]

The first string returned corresponds to the name given to the DocumentFilter or Callable. The normalized filters, input fields, invert flags, and output fields make up the rest of the tuple.