nemo_curator.stages.text.download.wikipedia.iterator
Module Contents
Classes
API
Bases: DocumentIterator
Processes downloaded Wikipedia dump files and extracts article content.
_counter
Extract text from an XML element.
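As a rough illustration of extracting text from an XML element, here is a minimal sketch using only the standard library. The helper name `get_text`, the `mw` prefix, and the namespace URI are assumptions made for this example. They are not taken from this module, and real dumps declare their own export version.

```python
import xml.etree.ElementTree as ET

# Illustrative namespace mapping (assumption); real dumps declare their own URI.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}


def get_text(elem, tag, ns=NS):
    """Return the stripped text of child `tag`, or None if missing or empty.

    Hypothetical helper for illustration, not the module's own method.
    """
    child = elem.find(f"mw:{tag}", ns)
    if child is None or child.text is None:
        return None
    # Treat whitespace-only text as missing.
    return child.text.strip() or None


# A tiny hand-written <page> element standing in for one dump entry.
page = ET.fromstring(
    '<page xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    "<title> Anarchism </title><id>12</id><ns>0</ns></page>"
)
title = get_text(page, "title")
```

Returning `None` for both a missing child and an empty one lets callers handle the two cases with a single check.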
Extract raw content from Wikipedia article element.
Extract metadata from a Wikipedia article element.
Check if progress should be logged based on counter.
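A counter-based progress check is commonly written as a modulo test. The sketch below assumes a fixed logging interval. The function name `should_log_progress` and the default interval of 1000 are hypothetical and may differ from what this class does.

```python
def should_log_progress(counter: int, interval: int = 1000) -> bool:
    """Return True on every `interval`-th processed article.

    Hypothetical helper; the interval value is an assumption.
    """
    # Skip counter == 0 so nothing is logged before any work is done.
    return counter > 0 and counter % interval == 0
```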
Check if article should be skipped based on metadata and content.
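The exact skip criteria are not listed here. Wikipedia dumps do mark each page with a namespace number (`<ns>`) and flag redirects with a `<redirect>` element, so one plausible filter looks like the sketch below. The function name, its arguments, and the three rules are assumptions for illustration only.

```python
def should_skip(ns, is_redirect, raw_content):
    """Decide whether a page should be dropped.

    Hypothetical rules (assumptions, not confirmed by this module):
    keep only main-namespace, non-redirect pages that have content.
    """
    if ns != "0":          # namespace 0 holds regular articles
        return True
    if is_redirect:        # redirect pages carry no article body
        return True
    if raw_content is None or not raw_content.strip():
        return True        # nothing to extract
    return False
```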
Process a Wikipedia dump file and extract article content.
Parameters:
file_path: Path to the downloaded .bz2 file.
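To give a sense of what processing a `.bz2` dump involves, the following sketch streams pages out of a compressed XML file with `bz2.open` and `xml.etree.ElementTree.iterparse`. It is not this class's implementation. The function names `iter_pages` and `local_name` and the record keys are hypothetical, and no markup cleaning is attempted.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(file_path):
    """Yield one dict per <page> element in a bz2-compressed XML dump.

    Hypothetical sketch: the keys 'title', 'id' and 'text' are assumptions.
    """
    with bz2.open(file_path, "rb") as fh:
        # 'end' events fire once an element has been fully parsed.
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            record = {"title": None, "id": None, "text": None}
            for node in elem.iter():
                name = local_name(node.tag)
                # Keep the first match, so the page <id> wins over the
                # revision <id> that appears later in document order.
                if name in record and record[name] is None:
                    record[name] = node.text
            yield record
            # Release the page's children to keep memory use bounded.
            elem.clear()
```

Streaming with `iterparse` avoids loading the whole dump into memory, which matters because full Wikipedia dumps are many gigabytes when decompressed.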
Define the output columns produced by this iterator.