nemo_curator.stages.math.download.html_extractors.lynx


Module Contents

Classes

Name            Description
LynxExtractor   Extract text from HTML using the lynx command-line browser.

API

class nemo_curator.stages.math.download.html_extractors.lynx.LynxExtractor(
timeout_sec: int = 20
)

Extract text from HTML using the lynx command-line browser.

nemo_curator.stages.math.download.html_extractors.lynx.LynxExtractor.extract_text(
html: str
) -> str

Extract text from HTML content.

Returns an empty string on any failure (timeout, encoding errors, etc.).
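
The "empty string on any failure" contract can be illustrated with a minimal sketch. This is not the library's implementation: the function name `extract_text_sketch`, the `cmd` parameter, and the choice of lynx flags (`-stdin`, `-dump`, `-nolist`) are assumptions made for illustration. Only the `timeout_sec` default of 20 and the return-empty-on-failure behaviour come from the reference above.

```python
import subprocess


def extract_text_sketch(
    html: str,
    timeout_sec: int = 20,
    # Assumed invocation: read HTML from stdin, dump rendered text, omit link list.
    cmd: tuple[str, ...] = ("lynx", "-stdin", "-dump", "-nolist"),
) -> str:
    """Hypothetical stand-in for LynxExtractor.extract_text (illustration only)."""
    try:
        # Encoding can fail, e.g. on lone surrogate code points.
        payload = html.encode("utf-8")
        result = subprocess.run(
            list(cmd),
            input=payload,
            capture_output=True,
            timeout=timeout_sec,
            check=False,
        )
        if result.returncode != 0:
            return ""
        return result.stdout.decode("utf-8")
    except (subprocess.TimeoutExpired, OSError, UnicodeError):
        # Timeout, missing binary, or encoding/decoding problems all map to "".
        return ""
```

With the real class, the equivalent call would be `LynxExtractor(timeout_sec=20).extract_text(html)`. Because failures are reported as an empty string rather than an exception, callers that need to tell "no text" apart from "extraction failed" have to check for that themselves.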