stages.text.download.common_crawl.extract

Module Contents

Classes

CommonCrawlHTMLExtractor

Document extractor that pulls text out of the HTML content of Common Crawl records.

API

class stages.text.download.common_crawl.extract.CommonCrawlHTMLExtractor(
    algorithm: nemo_curator.stages.text.download.html_extractors.HTMLExtractorAlgorithm | str | None = None,
    algorithm_kwargs: dict | None = None,
    stop_lists: dict[str, frozenset[str]] | None = None,
)

Bases: nemo_curator.stages.text.download.DocumentExtractor

Document extractor for Common Crawl HTML records, built on the DocumentExtractor base class.

An extractor takes a record dict and returns a processed record dict, or None to skip the record. It can transform any field in the input dict.
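The record-in, record-or-None-out contract described above can be illustrated with a small stand-in. This is only a sketch: `DocumentExtractorSketch` and `StripWhitespaceExtractor` are hypothetical names invented for illustration and are not part of the library. The three method names mirror the ones documented on this page.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class DocumentExtractorSketch(ABC):
    """Hypothetical stand-in for the extractor interface (not the library class)."""

    @abstractmethod
    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None:
        """Return a processed record, or None to skip it."""

    @abstractmethod
    def input_columns(self) -> list[str]:
        """Columns the extractor expects on each input record."""

    @abstractmethod
    def output_columns(self) -> list[str]:
        """Columns the extractor emits on each output record."""


class StripWhitespaceExtractor(DocumentExtractorSketch):
    """Toy extractor: trims the content and skips records that end up empty."""

    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None:
        text = str(record.get("content", "")).strip()
        if not text:
            return None  # signal "skip this record"
        return {"url": record.get("url"), "text": text}

    def input_columns(self) -> list[str]:
        return ["url", "content"]

    def output_columns(self) -> list[str]:
        return ["url", "text"]


records = [
    {"url": "https://example.com/a", "content": "  hello  "},
    {"url": "https://example.com/b", "content": "   "},
]
extractor = StripWhitespaceExtractor()
# Records for which extract() returns None are dropped.
kept = [out for r in records if (out := extractor.extract(r)) is not None]
```

Returning None rather than raising keeps skipping cheap for the caller, which only has to filter out the None results.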

extract(record: dict[str, Any]) → dict[str, Any] | None

Extract text from HTML content in the record.

Takes a record dict containing a “content” field with HTML and returns a new dict with only the output columns: url, warc_id, source_id, language, text.
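To make the shape of this transformation concrete, here is a minimal sketch that uses only the Python standard library. It is not the library's implementation: `extract_sketch` and `_TextCollector` are hypothetical names, the tag stripping via `html.parser` stands in for whichever extraction algorithm is configured, and the sketch assumes that url, warc_id, and source_id arrive on the input record and that the language is supplied by the caller rather than detected.

```python
from __future__ import annotations

from html.parser import HTMLParser
from typing import Any

# Output columns as listed in the description of extract().
OUTPUT_COLUMNS = ["url", "warc_id", "source_id", "language", "text"]


class _TextCollector(HTMLParser):
    """Collects visible text, ignoring <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag: str) -> None:
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data: str) -> None:
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_sketch(record: dict[str, Any], language: str = "UNKNOWN") -> dict[str, Any] | None:
    """Hypothetical extractor: HTML in "content" -> dict with only OUTPUT_COLUMNS."""
    content = record.get("content")
    if isinstance(content, bytes):
        content = content.decode("utf-8", errors="replace")
    if not content:
        return None  # nothing to extract, skip the record

    parser = _TextCollector()
    parser.feed(content)
    parser.close()
    text = "\n".join(parser.chunks)
    if not text:
        return None  # page had no visible text

    # Assumption: the identifying fields are carried over from the input record.
    merged = {
        "url": record.get("url"),
        "warc_id": record.get("warc_id"),
        "source_id": record.get("source_id"),
        "language": language,
        "text": text,
    }
    # Build a new dict so no other input field leaks into the output.
    return {col: merged[col] for col in OUTPUT_COLUMNS}


record = {
    "url": "https://example.com",
    "warc_id": "w1",
    "source_id": "s.warc.gz",
    "content": "<html><head><style>p{}</style></head><body><p>Hello</p>"
               "<script>var x=1;</script><p>World</p></body></html>",
    "extra": 1,
}
result = extract_sketch(record)
```

Building the result from a fixed column list, rather than mutating the input, is what guarantees that fields such as the raw HTML are dropped from the output.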

input_columns() → list[str]

Define the input columns. Produces a DocumentBatch with records.

output_columns() → list[str]

Define the output columns (url, warc_id, source_id, language, text). Produces a DocumentBatch with records.