`stages.text.download.wikipedia.iterator`

Module Contents

WikipediaIterator: Processes downloaded Wikipedia dump files and extracts article content.

class stages.text.download.wikipedia.iterator.WikipediaIterator(language: str = 'en', log_frequency: int = 1000)

Bases: nemo_curator.stages.text.download.DocumentIterator

Processes downloaded Wikipedia dump files and extracts article content.

Initialization

Initialize the Wikipedia iterator.

Args:
    language: Language code for the Wikipedia dump
    log_frequency: How often to log progress (every N articles)

iterate(file_path: str) → collections.abc.Iterator[dict[str, Any]]

Process a Wikipedia dump file and extract article content.

Args:
    file_path: Path to the downloaded .bz2 file

Yields:
    Dict containing article metadata and raw content
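
To make the behavior of iterate concrete, the following is a minimal standard-library sketch of streaming a .bz2 MediaWiki XML dump and yielding one dict per article. It is not the WikipediaIterator implementation: the function name `iter_dump_pages`, the dict keys (`title`, `id`, `language`, `raw_content`), and the choice to keep only namespace 0 pages are all assumptions made for illustration.

```python
import bz2
import logging
import xml.etree.ElementTree as ET
from collections.abc import Iterator
from typing import Any

logger = logging.getLogger(__name__)


def _local(tag: str) -> str:
    """Strip an XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def _first_text(elem: ET.Element, name: str) -> str:
    """Return the text of the first descendant with the given local tag name."""
    for node in elem.iter():
        if _local(node.tag) == name:
            return node.text or ""
    return ""


def iter_dump_pages(
    file_path: str, language: str = "en", log_frequency: int = 1000
) -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in: stream <page> elements from a .bz2 dump."""
    count = 0
    # bz2.open decompresses lazily, so the whole dump is never held in memory.
    with bz2.open(file_path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if _local(elem.tag) != "page":
                continue
            record = {
                "title": _first_text(elem, "title"),
                "id": _first_text(elem, "id"),  # page id precedes the revision id
                "language": language,
                "raw_content": _first_text(elem, "text"),
            }
            is_article = _first_text(elem, "ns") == "0"  # assumption: articles only
            elem.clear()  # release the parsed subtree before moving on
            if not is_article:
                continue
            count += 1
            if count % log_frequency == 0:
                logger.info("Processed %d articles", count)
            yield record
```

Because the function is a generator, the dump file stays open until iteration finishes, so callers would typically consume it in a single loop.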

output_columns() → list[str]

Define the output columns produced by this iterator.
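
The relationship between output_columns and iterate can be sketched as a simple contract: the declared column names should match the keys of every yielded dict. The base class below (`SketchIterator`) is a hypothetical stand-in, not the real DocumentIterator, and the column names are invented for illustration; the actual columns of WikipediaIterator are not listed in this page.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator
from typing import Any


class SketchIterator(ABC):
    """Hypothetical stand-in for a document-iterator base class."""

    @abstractmethod
    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        """Yield one dict per document found in file_path."""

    @abstractmethod
    def output_columns(self) -> list[str]:
        """Name the keys that every yielded dict contains."""


class TwoColumnIterator(SketchIterator):
    """Toy subclass with made-up columns, used only to show the contract."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        yield {"title": "Example", "raw_content": f"read from {file_path}"}

    def output_columns(self) -> list[str]:
        return ["title", "raw_content"]


def columns_match(iterator: SketchIterator, file_path: str) -> bool:
    """Check that each yielded record has exactly the declared columns."""
    expected = set(iterator.output_columns())
    return all(set(record) == expected for record in iterator.iterate(file_path))
```

A check like `columns_match` is one way a downstream stage could validate records before building a table from them.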