nemo_curator.stages.math.download.extract


Module Contents

Classes

Name | Description
MathContentExtractor | Extractor that decodes bytes, detects type, and extracts text using Lynx for HTML.
MathExtractStage | Processing stage that applies a DocumentExtractor row-by-row to a DocumentBatch.

Functions

API

class nemo_curator.stages.math.download.extract.MathContentExtractor(
binary_column: str = 'binary_content',
url_column: str = 'url',
mime_type_column: str = 'mime_type',
lynx_timeout_sec: int = 20
)
Dataclass

Bases: DocumentExtractor

Extractor that decodes bytes, detects type, and extracts text using Lynx for HTML.

_lock: Lock
_lynx: Any | None = field(default=None, init=False, repr=False)
_magic: Any | None = field(default=None, init=False, repr=False)
binary_column: str = 'binary_content'
lynx_timeout_sec: int = 20
mime_type_column: str = 'mime_type'
url_column: str = 'url'
nemo_curator.stages.math.download.extract.MathContentExtractor.__post_init__()
nemo_curator.stages.math.download.extract.MathContentExtractor._determine_type(
content: str | None,
magic_mime_type: str | None,
mime_type: str | None,
url: str | None
) -> str
nemo_curator.stages.math.download.extract.MathContentExtractor._is_html_document(
text: str
) -> bool
nemo_curator.stages.math.download.extract.MathContentExtractor._is_notebook_type(
content: str,
magic_mime_type: str | None,
url: str | None
) -> bool

Check if content is a Jupyter notebook.
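
The reference gives no detail on how the notebook check is made. As a minimal sketch, assuming the check amounts to parsing the content as JSON and looking for the top-level keys that nbformat notebooks carry, it might look like the following. The function name `looks_like_notebook` is hypothetical and is not part of this module; the real method also takes `magic_mime_type` and `url`, which this sketch ignores.

```python
from __future__ import annotations

import json


def looks_like_notebook(content: str) -> bool:
    """Hypothetical heuristic: does `content` parse as a Jupyter notebook?"""
    try:
        data = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON (or not a string at all), so not a notebook.
        return False
    # Assumption: a notebook is a JSON object with a "cells" list
    # and an "nbformat" version key at the top level.
    return (
        isinstance(data, dict)
        and isinstance(data.get("cells"), list)
        and "nbformat" in data
    )
```

A check like this rejects plain HTML and arbitrary JSON arrays, but accepts any JSON object that happens to carry both keys, so it is a heuristic rather than a validation.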

nemo_curator.stages.math.download.extract.MathContentExtractor.extract(
record: dict[str, typing.Any]
) -> dict[str, typing.Any] | None
nemo_curator.stages.math.download.extract.MathContentExtractor.input_columns() -> list[str]
nemo_curator.stages.math.download.extract.MathContentExtractor.output_columns() -> list[str]
class nemo_curator.stages.math.download.extract.MathExtractStage(
extractor: nemo_curator.stages.text.download.base.extract.DocumentExtractor,
add_filename_column: bool | str = False
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

Processing stage that applies a DocumentExtractor row-by-row to a DocumentBatch.

Designed for use after CommonCrawlWARCReader, where binary content has already been fetched into a DocumentBatch. Each row is passed to the extractor and rows where extraction returns None are filtered out.
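
The row-by-row behavior described above can be illustrated without the library. This is a simplified sketch over plain dictionaries, not the stage's actual implementation: `apply_extractor` and `toy_extract` are hypothetical names, and the `text` output key is an assumption, since the extractor's output columns are not listed here.

```python
from __future__ import annotations

from typing import Any, Callable, Optional

Record = dict[str, Any]


def toy_extract(record: Record) -> Optional[Record]:
    """Hypothetical extractor: returns None when there is nothing to extract."""
    raw = record.get("binary_content")
    if not raw:
        return None
    # Assumed output shape: keep the URL and add the decoded text.
    return {"url": record["url"], "text": raw.decode("utf-8", errors="replace")}


def apply_extractor(
    rows: list[Record],
    extract: Callable[[Record], Optional[Record]],
) -> list[Record]:
    """Apply `extract` to each row and drop rows where it returns None."""
    extracted = []
    for row in rows:
        result = extract(row)
        if result is not None:
            extracted.append(result)
    return extracted


rows = [
    {"url": "https://example.com/a", "binary_content": b"x = 1"},
    {"url": "https://example.com/b", "binary_content": None},
]
kept = apply_extractor(rows, toy_extract)
```

Here the second row is dropped because the extractor returns None for it, so `kept` holds a single record.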

add_filename_column: bool | str = False
extractor: DocumentExtractor
nemo_curator.stages.math.download.extract.MathExtractStage.__post_init__() -> None
nemo_curator.stages.math.download.extract.MathExtractStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.math.download.extract.MathExtractStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.math.download.extract.MathExtractStage.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch
nemo_curator.stages.math.download.extract._decode_bytes(
binary_content: bytes | None
) -> str | None
nemo_curator.stages.math.download.extract._is_notebook(
content: str
) -> bool
nemo_curator.stages.math.download.extract._notebook_to_text(
content: str
) -> str
nemo_curator.stages.math.download.extract._remove_xml_encoding_declaration(
text: str
) -> str
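
The module-level helpers above are listed by signature only. The sketch below shows one plausible shape for three of them under stated assumptions; it is not the library's code. The function names drop the leading underscore to mark them as hypothetical, the encoding fallback order (UTF-8, then cp1252) is an assumption, and the notebook conversion assumes the nbformat layout in which each cell's `source` is either a string or a list of lines.

```python
from __future__ import annotations

import json
import re


def decode_bytes(binary_content: bytes | None) -> str | None:
    """Try a short list of encodings; return None if none of them fits."""
    if binary_content is None:
        return None
    for encoding in ("utf-8", "cp1252"):  # assumed fallback order
        try:
            return binary_content.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
_XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>", re.IGNORECASE)


def remove_xml_declaration(text: str) -> str:
    """Strip a leading XML declaration, if present; otherwise return text unchanged."""
    return _XML_DECLARATION.sub("", text, count=1)


def notebook_to_text(content: str) -> str:
    """Join the non-empty cell sources of a notebook into one text block."""
    notebook = json.loads(content)
    parts = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # nbformat allows source as a single string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            parts.append(source)
    return "\n\n".join(parts)
```

For example, `decode_bytes(b"caf\xe9")` falls through the UTF-8 attempt and decodes as cp1252, while a byte sequence that is invalid in both encodings yields None, which matches the `str | None` return type shown in the signature.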