stages.text.download.wikipedia.stage
Module Contents
Classes
WikipediaDownloadExtractStage: Composite stage for downloading and processing Wikipedia data.
API
- class stages.text.download.wikipedia.stage.WikipediaDownloadExtractStage(
- language: str = 'en',
- download_dir: str = './wikipedia_downloads',
- dump_date: str | None = None,
- wikidumps_index_prefix: str = 'https://dumps.wikimedia.org',
- verbose: bool = False,
- url_limit: int | None = None,
- record_limit: int | None = None,
- add_filename_column: bool | str = True,
- log_frequency: int = 1000,
)
Bases:
nemo_curator.stages.text.download.DocumentDownloadExtractStage
Composite stage for downloading and processing Wikipedia data.
This pipeline:
1. Generates Wikipedia dump URLs for the specified language and date
2. Downloads Wikipedia .bz2 dump files
3. Extracts articles from the dump files
4. Cleans and extracts text from Wikipedia markup
Initialization
Initialize the Wikipedia download and extract stage.
Args:
- language: Language code for the Wikipedia dump (e.g., "en", "es", "fr")
- download_dir: Directory to store downloaded .bz2 files
- dump_date: Specific dump date in "YYYYMMDD" format (if None, uses the latest dump)
- wikidumps_index_prefix: Base URL for Wikipedia dumps
- verbose: If True, enables verbose logging
- url_limit: Maximum number of dump URLs to process
- record_limit: Maximum number of articles to extract per file
- add_filename_column: Whether to add a filename column to the output
- log_frequency: How often to log progress during iteration
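A minimal construction sketch based on the signature above; the specific argument values (dump date handling, limits) are illustrative only, and the exact execution API for running the stage in a pipeline is not shown here.

```python
from nemo_curator.stages.text.download.wikipedia.stage import (
    WikipediaDownloadExtractStage,
)

# Construct the composite stage using the documented parameters.
stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date=None,      # None selects the latest available dump
    url_limit=1,         # keep the example small: process a single dump file
    record_limit=100,    # extract at most 100 articles per dump file
    verbose=True,
)

# The stage would typically be added to a NeMo Curator pipeline and executed;
# see the pipeline documentation for the exact API in your installed version.
```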
- decompose() -> list[nemo_curator.stages.base.ProcessingStage]
Decompose this composite stage into its constituent stages.
- get_description() -> str
Get a description of this composite stage.
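The sketch below, continuing the construction example above, shows how the composite stage can be inspected before it is run: get_description() returns a human-readable summary, and decompose() yields the constituent ProcessingStage objects. Whether the constituent stages map one-to-one onto the four pipeline steps listed earlier is an assumption.

```python
# Print the composite stage's summary description.
print(stage.get_description())

# List the constituent stages produced by decomposition
# (likely corresponding to the download/extract steps described above).
for sub_stage in stage.decompose():
    print(type(sub_stage).__name__)
```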