stages.text.download.common_crawl.extract
Module Contents
Classes
CommonCrawlHTMLExtractor | Abstract base class for document extractors.
API
- class stages.text.download.common_crawl.extract.CommonCrawlHTMLExtractor(
- algorithm: nemo_curator.stages.text.download.html_extractors.HTMLExtractorAlgorithm | str | None = None,
- algorithm_kwargs: dict | None = None,
- stop_lists: dict[str, frozenset[str]] | None = None,
- )
Bases:
nemo_curator.stages.text.download.DocumentExtractor
Abstract base class for document extractors.
Takes a record dict and returns a processed record dict, or None to skip the record. It can transform any field in the input dict.
Initialization
- extract(record: dict[str, Any]) → dict[str, Any] | None
Extract text from HTML content in the record.
Takes a record dict containing a “content” field with HTML and returns a new dict with only the output columns: url, warc_id, source_id, language, text.
- input_columns() → list[str]
Define input columns - produces DocumentBatch with records.
- output_columns() → list[str]
Define output columns - produces DocumentBatch with records.
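The contract described above (a record dict in, a dict with only the output columns out, or None to skip) can be illustrated with a minimal standalone sketch. This is not the NeMo Curator implementation: `SketchHTMLExtractor` and `_TextCollector` are hypothetical names, the tag stripping uses Python's standard `html.parser` rather than any `HTMLExtractorAlgorithm`, the input column list is an assumption (the documentation above only confirms the “content” field), and the language value is a placeholder because no language detection is performed.

```python
from __future__ import annotations

from html.parser import HTMLParser
from typing import Any


class _TextCollector(HTMLParser):
    """Collects visible text and ignores <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag: str) -> None:
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data: str) -> None:
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


class SketchHTMLExtractor:
    """Hypothetical stand-in that mirrors the documented method names only."""

    def input_columns(self) -> list[str]:
        # Assumption: only "content" is confirmed by the documentation.
        return ["url", "warc_id", "source_id", "content"]

    def output_columns(self) -> list[str]:
        # These five names are the ones listed in the extract() description.
        return ["url", "warc_id", "source_id", "language", "text"]

    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None:
        content = record.get("content")
        if isinstance(content, bytes):
            content = content.decode("utf-8", errors="replace")
        if not content:
            return None  # nothing to extract: skip the record

        parser = _TextCollector()
        parser.feed(content)
        parser.close()
        text = "\n".join(parser.chunks)
        if not text:
            return None  # page had no visible text: skip the record

        return {
            "url": record.get("url"),
            "warc_id": record.get("warc_id"),
            "source_id": record.get("source_id"),
            "language": "UNKNOWN",  # placeholder; no detection in this sketch
            "text": text,
        }
```

Calling `SketchHTMLExtractor().extract(record)` on a record whose “content” holds an HTML page returns a dict whose keys match `output_columns()`; a record with empty content, or with no visible text, returns None and would be dropped by the caller.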