`stages.text.download.common_crawl.extract`#

Module Contents#

Classes#

CommonCrawlHTMLExtractor

Extracts text from the HTML content of Common Crawl records.

API#

class stages.text.download.common_crawl.extract.CommonCrawlHTMLExtractor(algorithm: nemo_curator.stages.text.download.html_extractors.HTMLExtractorAlgorithm | str | None = None, algorithm_kwargs: dict | None = None, stop_lists: dict[str, frozenset[str]] | None = None)#

Bases: nemo_curator.stages.text.download.DocumentExtractor

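A concrete document extractor for Common Crawl HTML records. It follows the contract of its abstract base class, DocumentExtractor:

Takes a record dict and returns a processed record dict, or None to skip the record. Can transform any fields in the input dict.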
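The record-in, record-or-None-out contract can be illustrated with a minimal, self-contained sketch. The names `SketchExtractor`, `NonEmptyTextExtractor`, and `run_extractor` are hypothetical stand-ins invented for this example; they are not part of the library, and the real base class may define additional methods.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class SketchExtractor(ABC):
    """Hypothetical stand-in for the extractor contract (not the library's base class)."""

    @abstractmethod
    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None:
        """Return a processed record dict, or None to skip the record."""


class NonEmptyTextExtractor(SketchExtractor):
    """Toy extractor: trims the 'content' field and skips records left empty."""

    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None:
        text = str(record.get("content", "")).strip()
        if not text:
            return None  # signal "skip this record"
        # Any field may be transformed; here 'content' becomes 'text'.
        return {"url": record.get("url"), "text": text}


def run_extractor(
    extractor: SketchExtractor, records: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Apply the extractor to each record and drop the ones it skipped."""
    results = []
    for record in records:
        out = extractor.extract(record)
        if out is not None:
            results.append(out)
    return results
```

Returning None rather than raising keeps skipping cheap for the caller: a record with nothing usable is simply filtered out of the results.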

Initialization

extract(record: dict[str, Any]) → dict[str, Any] | None#

Extract text from HTML content in the record.

Takes a record dict containing a “content” field with HTML and returns a new dict containing only the output columns: url, warc_id, source_id, language, text.
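As a rough illustration of this input and output shape, the sketch below uses the standard-library `html.parser` in place of the configurable extraction algorithm. Everything here is an assumption made for the example: `sketch_extract` and `_TextCollector` are hypothetical names, the `language` value is a placeholder because no language detection is performed, and returning None for empty text is one plausible skip rule rather than the class's documented behavior.

```python
from __future__ import annotations

from html.parser import HTMLParser
from typing import Any

# Output columns as listed in the description above.
OUTPUT_COLUMNS = ["url", "warc_id", "source_id", "language", "text"]


class _TextCollector(HTMLParser):
    """Collects visible text, ignoring <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag: str) -> None:
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data: str) -> None:
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def sketch_extract(record: dict[str, Any]) -> dict[str, Any] | None:
    """Hypothetical stand-in for extract(): HTML in 'content' -> output columns only."""
    html = record.get("content")
    if isinstance(html, bytes):
        html = html.decode("utf-8", errors="replace")
    if not html:
        return None  # assumption: nothing to extract means skip
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    text = "\n".join(parser.chunks)
    if not text:
        return None  # assumption: empty extraction means skip
    return {
        "url": record.get("url"),
        "warc_id": record.get("warc_id"),
        "source_id": record.get("source_id"),
        # Placeholder only: this sketch does not detect language.
        "language": record.get("language", "UNKNOWN"),
        "text": text,
    }
```

Note that any extra fields on the input record are dropped: the returned dict is built from scratch with exactly the five output columns.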

input_columns() → list[str]#

Define input columns - produces DocumentBatch with records.

output_columns() → list[str]#

Define output columns - produces DocumentBatch with records.
