nemo_curator.stages.math.download.extract
Module Contents
Classes
Functions
API
Dataclass
Bases: DocumentExtractor
Extractor that decodes bytes, detects type, and extracts text using Lynx for HTML.
_lock
_lynx
_magic
binary_column
lynx_timeout_sec
mime_type_column
url_column
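As a rough illustration of the decode-then-render flow described above, the sketch below decodes raw bytes and pipes HTML through the `lynx` command-line tool with a timeout. It is not the class's actual implementation: the function names `decode_bytes` and `html_to_text_with_lynx`, the Latin-1 fallback, and the 20-second default timeout are all assumptions made for this example, and content-type detection (the role the `_magic` attribute presumably plays) is not covered.

```python
import shutil
import subprocess
from typing import Optional


def decode_bytes(raw: bytes) -> str:
    """Decode raw bytes, trying UTF-8 first and falling back to Latin-1.

    Hypothetical helper: the fallback order is an assumption for this sketch.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never raises.
        return raw.decode("latin-1")


def html_to_text_with_lynx(html: str, timeout_sec: float = 20.0) -> Optional[str]:
    """Render HTML to plain text with the lynx CLI; return None on failure.

    Hypothetical helper: the default timeout is an assumed value.
    """
    if shutil.which("lynx") is None:
        # lynx is not installed, so nothing can be rendered.
        return None
    try:
        result = subprocess.run(
            # -stdin reads the page from standard input, -dump writes the
            # rendered text to standard output, -nolist omits the link list.
            ["lynx", "-stdin", "-dump", "-nolist"],
            input=html.encode("utf-8"),
            capture_output=True,
            timeout=timeout_sec,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Returning `None` on a timeout or a non-zero exit code mirrors the convention, described for the processing stage, that rows whose extraction returns `None` are dropped.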
Check if content is a Jupyter notebook.
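One plausible way to perform such a check is to parse the content as JSON and look for the top-level keys that the notebook format defines. The sketch below is an assumption about how this could be done, not the module's actual logic; the name `looks_like_notebook` is hypothetical.

```python
import json


def looks_like_notebook(text: str) -> bool:
    """Heuristic check: is this text a JSON object with notebook-style keys?

    Hypothetical helper; the real function may use different criteria.
    """
    try:
        data = json.loads(text)
    except (ValueError, TypeError):
        # Not valid JSON (or not a string), so not a notebook.
        return False
    # Jupyter notebooks are JSON objects with "cells" and "nbformat" keys.
    return isinstance(data, dict) and "cells" in data and "nbformat" in data
```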
Dataclass
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Processing stage that applies a DocumentExtractor row-by-row to a DocumentBatch.
Designed for use after CommonCrawlWARCReader, where binary content has already been fetched into a DocumentBatch. Each row is passed to the extractor, and rows for which extraction returns None are filtered out.
add_filename_column
extractor
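The row-by-row behaviour described for this stage can be sketched with plain Python dictionaries standing in for the rows of a DocumentBatch. Everything here is illustrative: `apply_extractor`, `toy_extract`, and the `url` and `binary_content` keys are hypothetical names, and whether the real stage keeps only the extractor's output or merges it back into the row is not stated in this reference.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def apply_extractor(
    rows: List[Row],
    extract: Callable[[Row], Optional[Row]],
) -> List[Row]:
    """Apply `extract` to each row and keep only the non-None results."""
    kept: List[Row] = []
    for row in rows:
        result = extract(row)
        if result is None:
            # Extraction failed or was not applicable: drop this row.
            continue
        kept.append(result)
    return kept


def toy_extract(row: Row) -> Optional[Row]:
    """Toy extractor (assumption): decode the bytes, reject empty content."""
    content = row.get("binary_content")
    if not isinstance(content, bytes) or not content:
        return None
    return {"url": row.get("url"), "text": content.decode("utf-8", errors="replace")}


rows = [
    {"url": "a", "binary_content": b"hello"},
    {"url": "b", "binary_content": b""},  # empty content, so it is filtered out
]
extracted = apply_extractor(rows, toy_extract)
```

In this sketch only the first row survives, because the toy extractor returns `None` for empty content.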