stages.text.download.html_extractors.justext
Module Contents
Classes
API
class stages.text.download.html_extractors.justext.JusTextExtractor(
    length_low: int = 70,
    length_high: int = 200,
    stopwords_low: float = 0.3,
    stopwords_high: float = 0.32,
    max_link_density: float = 0.2,
    max_heading_distance: int = 200,
    no_headings: bool = False,
    is_boilerplate: bool | None = None,
)
Bases: stages.text.download.html_extractors.base.HTMLExtractorAlgorithm
Initialization
Initialize the jusText text extraction algorithm with specified parameters.
jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora. The key idea is that long blocks can often be classified with high confidence, while shorter blocks require context-based adjustments.
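To make the roles of the threshold parameters concrete, here is a minimal sketch of a context-free block classifier in the spirit of jusText. It is an illustration under stated assumptions, not the code used by JusTextExtractor or by the jusText library: the names Block, STOPWORDS, and classify_block are hypothetical, the stoplist is a toy English list, and the sketch leaves out the context-based pass that later resolves "short" and "neargood" blocks.

```python
from dataclasses import dataclass

# Toy stoplist for illustration only (assumption); a real run would use a
# full per-language stopword list.
STOPWORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "that", "for"}
)


@dataclass
class Block:
    """Hypothetical stand-in for one textual block of an HTML page."""

    text: str
    chars_in_links: int = 0  # number of characters that sit inside <a> tags


def classify_block(
    block: Block,
    length_low: int = 70,
    length_high: int = 200,
    stopwords_low: float = 0.3,
    stopwords_high: float = 0.32,
    max_link_density: float = 0.2,
) -> str:
    """Label a block using only its own length, link and stopword density."""
    length = len(block.text)
    words = block.text.split()
    link_density = block.chars_in_links / length if length else 0.0
    stopword_density = (
        sum(w.lower() in STOPWORDS for w in words) / len(words) if words else 0.0
    )

    if link_density > max_link_density:
        return "bad"  # mostly link text, e.g. a navigation bar
    if length < length_low:
        # Too short to judge alone; a later pass would look at neighbours.
        return "bad" if block.chars_in_links > 0 else "short"
    if stopword_density >= stopwords_high:
        # Long, stopword-rich blocks are confidently treated as real text.
        return "good" if length > length_high else "neargood"
    if stopword_density >= stopwords_low:
        return "neargood"
    return "bad"


sentence = "the cat is in the house and it is on the mat "
print(classify_block(Block(sentence * 6)))
print(classify_block(Block("Read more")))
```

In this sketch, only long blocks with a high share of stopwords receive a confident "good" label, which mirrors the idea that long blocks can be classified with high confidence while shorter ones are deferred to their context.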
Here is an overview of the jusText algorithm:
• Segmentation: The document is split into textual blocks based on HTML tags that typically define separate sections (e.g.,
,,
).
• Preprocessing: Contents of
,