stages.text.download.wikipedia.extract#

Module Contents#

Classes#

WikipediaExtractor

Extractor for Wikipedia articles from MediaWiki XML dumps.

Data#

CAT_ALIASES

MEDIA_ALIASES

API#

stages.text.download.wikipedia.extract.CAT_ALIASES#

None

stages.text.download.wikipedia.extract.MEDIA_ALIASES#

None

class stages.text.download.wikipedia.extract.WikipediaExtractor(language: str = 'en')#

Bases: nemo_curator.stages.text.download.DocumentExtractor

Extractor for Wikipedia articles from MediaWiki XML dumps.

Initialization

Initialize the Wikipedia extractor.

Args: language: Language code of the Wikipedia articles to extract (default: 'en')

extract(record: dict[str, Any]) -> dict[str, Any] | None#

Extract and clean Wikipedia article content.

Args: record: Record whose raw_content field holds the article's Wikipedia markup

Returns: Dict with cleaned text and metadata, or None if extraction fails
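The contract described above (a record with a raw_content field goes in, a cleaned dict or None comes out) can be illustrated with a small stand-in. This is a minimal sketch, not the library's implementation: ToyWikiExtractor, its regular expressions, and the output keys text and language are all assumptions made for illustration, and real wikitext needs a proper parser to handle nested templates, tables, and references.

```python
import re
from typing import Any, Optional


class ToyWikiExtractor:
    """Hypothetical stand-in that mirrors the extract() contract; not the library class."""

    def __init__(self, language: str = "en") -> None:
        self.language = language

    def extract(self, record: dict[str, Any]) -> Optional[dict[str, Any]]:
        raw = record.get("raw_content")
        if not raw:
            # Nothing to extract: signal failure with None, as the contract allows.
            return None
        # Drop simple, non-nested {{templates}}.
        text = re.sub(r"\{\{[^{}]*\}\}", "", raw)
        # Turn [[target|label]] into label and [[target]] into target.
        text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
        # Remove the quote runs used for ''italic'' and '''bold'''.
        text = re.sub(r"'{2,}", "", text)
        # Collapse runs of spaces and tabs, then trim.
        text = re.sub(r"[ \t]+", " ", text).strip()
        if not text:
            return None
        # The output keys here are assumed for the sketch.
        return {"text": text, "language": self.language}


extractor = ToyWikiExtractor()
cleaned = extractor.extract(
    {"raw_content": "'''Python''' is a [[programming language|language]].{{citation needed}}"}
)
print(cleaned)
```

Callers should treat a None result as "skip this record" rather than as an error, since a page that reduces to nothing after cleaning is an expected case.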

input_columns() -> list[str]#

Define the input columns expected by this extractor.

output_columns() -> list[str]#

Define the output columns produced by this extractor.
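One way the two column declarations can be put to use is to check records before and after extraction. The sketch below assumes a stand-in class and two helper functions (ColumnAwareExtractor, missing_columns, run_checked), none of which belong to the library; the column names title, raw_content, and text are likewise assumptions, since this page does not list the actual columns.

```python
from typing import Any, Optional


class ColumnAwareExtractor:
    """Hypothetical stand-in; the column names below are assumptions."""

    def input_columns(self) -> list[str]:
        return ["title", "raw_content"]

    def output_columns(self) -> list[str]:
        return ["title", "text"]

    def extract(self, record: dict[str, Any]) -> Optional[dict[str, Any]]:
        text = record["raw_content"].strip()
        if not text:
            return None
        return {"title": record["title"], "text": text}


def missing_columns(row: dict[str, Any], expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from row."""
    return [name for name in expected if name not in row]


def run_checked(extractor: ColumnAwareExtractor, record: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Validate the record against input_columns(), extract, then validate the result."""
    missing = missing_columns(record, extractor.input_columns())
    if missing:
        raise KeyError(f"record is missing input columns: {missing}")
    result = extractor.extract(record)
    if result is None:
        return None
    absent = missing_columns(result, extractor.output_columns())
    if absent:
        raise KeyError(f"result is missing output columns: {absent}")
    return result


row = run_checked(ColumnAwareExtractor(), {"title": "A", "raw_content": " body "})
print(row)
```

Declaring the columns on the extractor keeps the schema next to the code that produces it, so a mismatch surfaces as an explicit error instead of a missing field further down the pipeline.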