stages.text.download.wikipedia.extract
Module Contents
Classes
WikipediaExtractor | Extractor for Wikipedia articles from MediaWiki XML dumps.
Data
API
- stages.text.download.wikipedia.extract.CAT_ALIASES
None
- stages.text.download.wikipedia.extract.MEDIA_ALIASES
None
- class stages.text.download.wikipedia.extract.WikipediaExtractor(language: str = 'en')
Bases: nemo_curator.stages.text.download.DocumentExtractor
Extractor for Wikipedia articles from MediaWiki XML dumps.
Initialization
Initialize the Wikipedia extractor.
Args: language: Language code for the Wikipedia articles (defaults to 'en')
- extract(record: dict[str, Any]) -> dict[str, Any] | None
Extract and clean Wikipedia article content.
Args: record: Record containing a raw_content field with Wikipedia markup
Returns: Dict with cleaned text and metadata, or None if extraction fails
- input_columns() -> list[str]
Define the input columns expected by this extractor.
- output_columns() -> list[str]
Define the output columns produced by this extractor.
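Example
The three methods above form a small contract: declare the expected columns, declare the produced columns, and map one record to a cleaned dict or None. The following is a minimal sketch of that contract, not the library's implementation. ToyWikiExtractor is a hypothetical stand-in that does not import nemo_curator. Only the raw_content field comes from this page; the title, text, and language column names and the regex-based cleanup are assumptions made for illustration.

```python
from __future__ import annotations

import re
from typing import Any


class ToyWikiExtractor:
    """Hypothetical stand-in that mirrors the documented method shapes."""

    def __init__(self, language: str = "en") -> None:
        self.language = language

    def input_columns(self) -> list[str]:
        # Only "raw_content" is named in the docs; "title" is an assumption.
        return ["title", "raw_content"]

    def output_columns(self) -> list[str]:
        # Assumed column names, chosen for illustration only.
        return ["text", "title", "language"]

    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None:
        raw = record.get("raw_content")
        if not raw:
            # Mirror the documented behaviour: None when extraction fails.
            return None
        # [[target|label]] becomes label, [[target]] becomes target.
        text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", raw)
        # Drop bold/italic quote runs such as '' and '''.
        text = re.sub(r"'{2,}", "", text).strip()
        if not text:
            return None
        return {
            "text": text,
            "title": record.get("title", ""),
            "language": self.language,
        }


extractor = ToyWikiExtractor()
record = {
    "title": "Python",
    "raw_content": "'''Python''' is a [[programming language|language]].",
}
result = extractor.extract(record)               # cleaned dict
missing = extractor.extract({"title": "Empty"})  # no raw_content, so None
```

Callers should treat a None result as a skipped record rather than an error, since the documented return type allows it.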