nemo_curator.stages.text.download.wikipedia.extract
Module Contents
Classes
Data
API
Bases: DocumentExtractor
Extractor for Wikipedia articles from MediaWiki XML dumps.
Create filter functions for Wikipedia content processing.
Create regex patterns for filtering Wikipedia content.
Process sections of the Wikipedia article.
Extract and clean Wikipedia article content.
Parameters:
record: Record containing a raw_content field with Wikipedia markup
Returns: dict[str, Any] | None
Dict with cleaned text and metadata, or None if extraction fails
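The contract described above (a record with a raw_content field goes in; a dict of cleaned text and metadata, or None, comes out) can be illustrated with a small standalone sketch. This is not the library's implementation: the function name extract_sketch, the regex patterns, and the output keys "text" and "title" are assumptions made for illustration, and the markup handling is deliberately simplistic (no nested templates, tables, or references).

```python
from __future__ import annotations

import re
from typing import Any

# Hypothetical patterns for illustration only; the real extractor's rules
# are not shown on this page.
_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")  # non-nested {{...}} templates
_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")  # [[target|label]] -> label
_QUOTES = re.compile(r"'{2,}")  # ''italic'' and '''bold''' markers


def extract_sketch(record: dict[str, str]) -> dict[str, Any] | None:
    """Return cleaned text plus metadata, or None if nothing usable remains."""
    raw = record.get("raw_content")
    if not raw:
        # Missing or empty markup: treat as a failed extraction.
        return None

    text = _TEMPLATE.sub("", raw)      # drop templates
    text = _LINK.sub(r"\1", text)      # keep only the visible link label
    text = _QUOTES.sub("", text)       # strip bold/italic quote runs
    text = re.sub(r"[ \t]+", " ", text).strip()

    if not text:
        return None
    # "text" and "title" are assumed output keys, not confirmed by the source.
    return {"text": text, "title": record.get("title", "")}
```

Returning None rather than raising lets a caller skip unusable records without wrapping every call in error handling, which matches the "None if extraction fails" wording above.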
Define the input columns expected by this extractor.
Define the output columns produced by this extractor.
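One way to picture the input and output column declarations is as a simple contract a pipeline can check before and after extraction. The sketch below is a standalone illustration under stated assumptions: the class name ColumnContractSketch, the method names input_columns and output_columns, and the column names themselves are hypothetical and are not taken from this page.

```python
from __future__ import annotations


class ColumnContractSketch:
    """Hypothetical stand-in showing how column declarations might be used."""

    def input_columns(self) -> list[str]:
        # Assumed column names, for illustration only.
        return ["title", "raw_content"]

    def output_columns(self) -> list[str]:
        # Assumed column names, for illustration only.
        return ["text", "title"]


def missing_columns(record: dict, expected: list[str]) -> list[str]:
    """List the expected columns that are absent from a record, in order."""
    return [name for name in expected if name not in record]


contract = ColumnContractSketch()
incoming = {"title": "Python"}
# Reports which declared inputs the record lacks before extraction is attempted.
print(missing_columns(incoming, contract.input_columns()))
```

Declaring columns separately from the extraction logic lets a caller validate records up front instead of discovering a missing field partway through processing.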