stages.text.download.wikipedia.iterator

Module Contents

Classes

WikipediaIterator

Processes downloaded Wikipedia dump files and extracts article content.

API

class stages.text.download.wikipedia.iterator.WikipediaIterator(language: str = 'en', log_frequency: int = 1000)

Bases: nemo_curator.stages.text.download.DocumentIterator

Processes downloaded Wikipedia dump files and extracts article content.

Initialization

Initialize the Wikipedia iterator.

Args:
    language: Language code for the Wikipedia dump
    log_frequency: How often to log progress (every N articles)

iterate(file_path: str) -> collections.abc.Iterator[dict[str, Any]]

Process a Wikipedia dump file and extract article content.

Args:
    file_path: Path to the downloaded .bz2 file

Yields: Dict containing article metadata and raw content
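The general technique behind a method like iterate can be sketched with the standard library alone. The example below is not the WikipediaIterator implementation: it is a minimal illustration, assuming a bz2-compressed MediaWiki XML export, of streaming pages one at a time and logging progress every N articles. The function name iter_dump_pages and the keys title, id, and raw_content are hypothetical and chosen only for this sketch.

```python
import bz2
import logging
import xml.etree.ElementTree as ET
from collections.abc import Iterator
from typing import Any

logger = logging.getLogger(__name__)


def _local(tag: str) -> str:
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_dump_pages(file_path: str, log_frequency: int = 1000) -> Iterator[dict[str, Any]]:
    """Hypothetical helper: stream <page> elements from a .bz2 XML dump."""
    count = 0
    with bz2.open(file_path, "rb") as stream:
        # iterparse reads incrementally, so the dump is never fully loaded.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if _local(elem.tag) != "page":
                continue

            title = page_id = text = None
            for child in elem.iter():
                name = _local(child.tag)
                if name == "title" and title is None:
                    title = child.text
                elif name == "id" and page_id is None:
                    # The first <id> in document order is the page id;
                    # later ones belong to the revision or contributor.
                    page_id = child.text
                elif name == "text":
                    text = child.text

            # Key names are assumptions made for this sketch.
            yield {"title": title, "id": page_id, "raw_content": text or ""}

            # Drop the page's children to keep memory use bounded.
            elem.clear()

            count += 1
            if count % log_frequency == 0:
                logger.info("Processed %d articles", count)
```

Because the function is a generator, a caller can consume it lazily, for example `for record in iter_dump_pages(path): ...`, and stop early without parsing the rest of the file.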

output_columns() -> list[str]

Define the output columns produced by this iterator.
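This page does not list the column names, so the following is only a sketch of how a caller might check yielded records against a declared column list. It assumes that each dict produced by iterate has exactly the keys returned by output_columns, which the source does not state. The helper validate_records and the column names used in the usage note are hypothetical.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def validate_records(
    records: Iterable[dict[str, Any]], columns: list[str]
) -> Iterator[dict[str, Any]]:
    """Hypothetical helper: pass records through, failing on a key mismatch."""
    expected = set(columns)
    for index, record in enumerate(records):
        keys = set(record)
        if keys != expected:
            missing = sorted(expected - keys)
            extra = sorted(keys - expected)
            raise ValueError(f"record {index}: missing={missing}, extra={extra}")
        yield record
```

With an iterator instance `it`, usage might look like `validate_records(it.iterate(path), it.output_columns())`, so a schema drift surfaces at the first offending record rather than downstream.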