stages.text.download.html_extractors.resiliparse
Module Contents
Classes
API
class stages.text.download.html_extractors.resiliparse.ResiliparseExtractor(
    required_stopword_density: float = 0.32,
    main_content: bool = True,
    alt_texts: bool = False,
)
Bases: stages.text.download.html_extractors.base.HTMLExtractorAlgorithm
Initialization
Initialize the Resiliparse text extraction algorithm with specified parameters.
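As an illustration of what the required_stopword_density parameter in the signature might control, here is a minimal sketch of a stopword density filter. It is not the library's implementation: the stopword list, the splitting of text into paragraphs on blank lines, the punctuation stripping, and the helper names stopword_density and keep_paragraphs are all assumptions made for this example. Only the default threshold of 0.32 comes from the signature.

```python
# Hypothetical illustration only: a tiny stopword list, not the one the library uses.
STOPWORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "on", "for", "with"}
)


def stopword_density(paragraph: str, stopwords: frozenset = STOPWORDS) -> float:
    """Return the fraction of whitespace-separated words that are stopwords."""
    # Strip common punctuation and lowercase each token before the lookup.
    words = [w.strip(".,;:!?\"'()").lower() for w in paragraph.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in stopwords for w in words) / len(words)


def keep_paragraphs(text: str, required_stopword_density: float = 0.32) -> list:
    """Keep paragraphs whose stopword density meets the threshold.

    Assumption: paragraphs are separated by blank lines.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p for p in paragraphs if stopword_density(p) >= required_stopword_density]


if __name__ == "__main__":
    sample = "The cat is on the mat and it is happy.\n\nHome | Products | Contact | Login"
    print(keep_paragraphs(sample))
```

Under these assumptions, the running-prose paragraph in the sample passes the default threshold, while the navigation-style line contains no stopwords and is dropped.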
The Resiliparse algorithm extracts structural or semantic information from noisy raw web data for further processing, such as (main) content extraction / boilerplate removal, schema extraction, general web data cleansing, and more.
It is implemented via the extract_plain_text function in the resiliparse.extract.html2text module. Resiliparse HTML2Text is a very fast and rule-based plain text extractor for HTML pages which uses the Resiliparse DOM parser. The extract_plain_text function extracts all visible text nodes inside the HTML document’s <body>. Only