stages.text.download.wikipedia.iterator
Module Contents
Classes
WikipediaIterator | Processes downloaded Wikipedia dump files and extracts article content.
API
- class stages.text.download.wikipedia.iterator.WikipediaIterator(language: str = 'en', log_frequency: int = 1000)
Bases: nemo_curator.stages.text.download.DocumentIterator
Processes downloaded Wikipedia dump files and extracts article content.
Initialization
Initialize the Wikipedia iterator.
Args:
  language: Language code for the Wikipedia dump.
  log_frequency: How often to log progress (every N articles).
- iterate(file_path: str) → collections.abc.Iterator[dict[str, Any]]
Process a Wikipedia dump file and extract article content.
Args: file_path: Path to the downloaded .bz2 file
Yields: Dict containing article metadata and raw content
- output_columns() → list[str]
Define the output columns produced by this iterator.
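
To illustrate the kind of work iterate() describes, here is a minimal standard-library sketch that streams a .bz2-compressed MediaWiki XML dump and yields one dict per page. It is not NeMo Curator's implementation: the function name iterate_dump and the output keys (title, id, language, raw_content) are assumptions made for this example, and the real WikipediaIterator may filter, parse, and name its columns differently.

```python
import bz2
import xml.etree.ElementTree as ET
from collections.abc import Iterator
from typing import Any, Optional


def _local_name(tag: str) -> str:
    """Strip the XML namespace prefix: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def _child_text(elem: ET.Element, name: str) -> Optional[str]:
    """Return the text of the first direct child with the given local name."""
    for child in elem:
        if _local_name(child.tag) == name:
            return child.text
    return None


def iterate_dump(file_path: str, language: str = "en") -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for an iterate() method: yield one dict per <page>."""
    with bz2.open(file_path, "rb") as stream:
        # iterparse streams the file, so the full dump is never held in memory.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if _local_name(elem.tag) != "page":
                continue
            raw_content = None
            for child in elem:
                if _local_name(child.tag) == "revision":
                    raw_content = _child_text(child, "text")
            yield {
                "title": _child_text(elem, "title"),
                "id": _child_text(elem, "id"),
                "language": language,
                "raw_content": raw_content,
            }
            # Release the page's children once the consumer has moved on.
            elem.clear()
```

Because the function is a generator, a caller can process articles one at a time, for example `for article in iterate_dump(path): ...`, without loading the whole dump.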