nemo_curator.stages.text.download.html_extractors.base

Module Contents

Classes

Name | Description
HTMLExtractorAlgorithm | -

API

class nemo_curator.stages.text.download.html_extractors.base.HTMLExtractorAlgorithm()
Abstract
NON_SPACED_LANGUAGES
nemo_curator.stages.text.download.html_extractors.base.HTMLExtractorAlgorithm.extract_text(
html: str,
stop_words: frozenset[str],
language: str
) -> list[str] | None
abstract
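
Example

The signature above suggests that a concrete extractor subclasses HTMLExtractorAlgorithm and implements extract_text. The following is a minimal sketch of that pattern, not the library's implementation. To keep it runnable without nemo_curator installed, it defines a local stand-in base class that mirrors only the documented signature. The names _ParagraphCollector and StopWordParagraphExtractor, and the "keep paragraphs that contain a stop word" rule, are hypothetical and chosen for illustration. The language argument is accepted but unused here, and NON_SPACED_LANGUAGES is omitted because its contents are not documented on this page.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from typing import Optional


class HTMLExtractorAlgorithm(ABC):
    """Local stand-in that mirrors the documented interface (assumption:
    the real class lives in nemo_curator and may carry more behavior)."""

    @abstractmethod
    def extract_text(
        self, html: str, stop_words: frozenset[str], language: str
    ) -> Optional[list[str]]:  # equivalent to list[str] | None
        ...


class _ParagraphCollector(HTMLParser):
    """Hypothetical helper: gathers the text found inside <p> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: list[str] = []
        self._depth = 0
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace before storing the paragraph.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.paragraphs.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


class StopWordParagraphExtractor(HTMLExtractorAlgorithm):
    """Hypothetical extractor: keeps paragraphs with at least one stop word."""

    def extract_text(self, html, stop_words, language):
        collector = _ParagraphCollector()
        collector.feed(html)
        collector.close()
        kept = [
            p for p in collector.paragraphs
            if any(word.lower() in stop_words for word in p.split())
        ]
        # Return None when nothing survives, matching the optional return type.
        return kept or None


if __name__ == "__main__":
    page = "<html><body><p>This is the first paragraph.</p><p>Lorem ipsum dolor</p></body></html>"
    extractor = StopWordParagraphExtractor()
    print(extractor.extract_text(page, frozenset({"the", "is"}), "en"))
```

Splitting on whitespace only works for languages that separate words with spaces, which is presumably why the real class exposes NON_SPACED_LANGUAGES. A production extractor would need a different tokenization strategy for those languages.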