nemo_curator.stages.text.filters.doc_filter

Module Contents

Classes

| Name | Description |
| --- | --- |
| DocumentFilter | An abstract base class for text-based document filters. |

API

class nemo_curator.stages.text.filters.doc_filter.DocumentFilter()
Abstract

An abstract base class for text-based document filters.

This class serves as a template for creating specific document filters in the library. Subclasses should implement the abstract methods score_document() and keep_document() to define custom filtering behavior.
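As an illustration of the subclassing pattern described above, here is a minimal sketch. To stay self-contained it defines a local stand-in base class that mirrors only the two abstract methods documented on this page; it is not the library class, and in real code you would import DocumentFilter from nemo_curator.stages.text.filters.doc_filter instead. MinWordCountFilter and its min_words parameter are hypothetical names invented for this example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DocumentFilter(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def score_document(self, text: str) -> float | list[int | float]:
        raise NotImplementedError

    @abstractmethod
    def keep_document(self, scores: float | list[int | float]) -> bool:
        raise NotImplementedError


class MinWordCountFilter(DocumentFilter):
    """Hypothetical filter: keep documents with at least `min_words` words."""

    def __init__(self, min_words: int = 5) -> None:
        self.min_words = min_words

    def score_document(self, text: str) -> float:
        # The score is the number of whitespace-separated tokens.
        return float(len(text.split()))

    def keep_document(self, scores: float) -> bool:
        # Keep the document when its word count reaches the threshold.
        return scores >= self.min_words


doc_filter = MinWordCountFilter(min_words=3)
score = doc_filter.score_document("a short example document")
print(score, doc_filter.keep_document(score))
```

Because both methods are abstract in the stand-in, instantiating the base class directly raises a TypeError; only a subclass that implements both can be created.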

_name = self.__class__.__name__
name: str
ngrams: dict
paragraphs: list
sentences: list
nemo_curator.stages.text.filters.doc_filter.DocumentFilter.keep_document(
scores: float | list[int | float]
) -> bool
abstract

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Parameters:

scores
float | list[int | float]

The score or set of scores returned by score_document(). Its type should match the return type of score_document().

Returns: bool

True if the document should be kept, False otherwise.

Raises:

  • NotImplementedError: If the method is not implemented in a subclass.
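
The requirement that scores match the return type of score_document() also covers the list case. The sketch below uses a hypothetical MaxLineLengthFilter (the class name and max_chars parameter are assumptions made for this example) whose score_document() returns one integer per line and whose keep_document() consumes that same list. The class is written standalone so the snippet runs without the library; in practice it would subclass DocumentFilter.

```python
from __future__ import annotations


class MaxLineLengthFilter:
    """Hypothetical filter; in practice it would subclass DocumentFilter."""

    def __init__(self, max_chars: int = 20) -> None:
        self.max_chars = max_chars

    def score_document(self, text: str) -> list[int]:
        # One score per line: the line's length in characters.
        return [len(line) for line in text.splitlines()]

    def keep_document(self, scores: list[int]) -> bool:
        # Keep the document only if every line is within the limit.
        return all(length <= self.max_chars for length in scores)


line_filter = MaxLineLengthFilter(max_chars=10)
scores = line_filter.score_document("short\nalso short\nthis line is too long")
print(scores, line_filter.keep_document(scores))
```
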
nemo_curator.stages.text.filters.doc_filter.DocumentFilter.score_document(
text: str
) -> float | list[int | float]
abstract

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Parameters:

text
str

The text content of the document to be scored.

Returns: float | list[int | float]

A score or set of scores representing the document. The type should match what keep_document() accepts.

Raises:

  • NotImplementedError: If the method is not implemented in a subclass.
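
The two methods are meant to be used together: score_document() produces a score and keep_document() decides on it. The sketch below shows that flow over a small list of strings. Both MeanWordLengthFilter and the filter_documents helper are hypothetical names made up for this example, and the loop is only an assumption about how a caller might combine the two methods, not a description of how the library applies filters internally.

```python
from __future__ import annotations


class MeanWordLengthFilter:
    """Hypothetical filter exposing the two documented methods."""

    def __init__(self, min_mean: float = 3.0) -> None:
        self.min_mean = min_mean

    def score_document(self, text: str) -> float:
        # Mean word length; empty documents score 0.0.
        words = text.split()
        return sum(len(w) for w in words) / len(words) if words else 0.0

    def keep_document(self, scores: float) -> bool:
        return scores >= self.min_mean


def filter_documents(doc_filter, documents: list[str]) -> list[str]:
    """Hypothetical helper: score each document, then keep or drop it."""
    kept = []
    for text in documents:
        score = doc_filter.score_document(text)
        if doc_filter.keep_document(score):
            kept.append(text)
    return kept


docs = ["a b c d", "curated text corpus", ""]
kept = filter_documents(MeanWordLengthFilter(min_mean=3.0), docs)
print(kept)
```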