stages.text.download.html_extractors.resiliparse

Module Contents

Classes

API

class stages.text.download.html_extractors.resiliparse.ResiliparseExtractor(
required_stopword_density: float = 0.32,
main_content: bool = True,
alt_texts: bool = False,
)

Bases: stages.text.download.html_extractors.base.HTMLExtractorAlgorithm

Initialization

Initialize the Resiliparse text extraction algorithm with specified parameters.
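As a rough illustration of what a parameter such as required_stopword_density might control, the sketch below keeps only paragraphs whose share of stopwords meets a threshold. This is a minimal sketch, not the class's actual implementation: the stopword list, the whitespace tokenizer, the `>=` comparison, and the helper names `stopword_density` and `keep_paragraphs` are all assumptions made for this example.

```python
# Hypothetical illustration only: a tiny English stopword list.
# A real extractor would use a full, language-specific list.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}
)


def stopword_density(paragraph: str, stopwords: frozenset = STOPWORDS) -> float:
    """Return the fraction of whitespace-separated words that are stopwords."""
    words = paragraph.lower().split()
    if not words:
        return 0.0
    return sum(word in stopwords for word in words) / len(words)


def keep_paragraphs(paragraphs, required_stopword_density: float = 0.32):
    """Keep paragraphs whose stopword density meets the threshold.

    Assumption: the comparison is inclusive (>=); the real behavior may differ.
    """
    return [
        paragraph
        for paragraph in paragraphs
        if stopword_density(paragraph) >= required_stopword_density
    ]


paragraphs = [
    "the cat is in the house",    # 4 of 6 words are stopwords
    "Home Login Cart Checkout",   # navigation-like text, no stopwords
]
kept = keep_paragraphs(paragraphs)
```

Under these assumptions, running prose tends to pass the filter, while navigation menus and similar boilerplate, which contain few stopwords, are dropped.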

The Resiliparse algorithm extracts structural or semantic information from noisy raw web data for further processing, such as (main) content extraction / boilerplate removal, schema extraction, general web data cleansing, and more.

It is implemented via the extract_plain_text function in the resiliparse.extract.html2text module. Resiliparse HTML2Text is a very fast, rule-based plain text extractor for HTML pages that uses the Resiliparse DOM parser. The extract_plain_text function extracts all visible text nodes inside the HTML document’s <body>. Only