nemo_curator.stages.text.download.wikipedia.stage


Module Contents

Classes

WikipediaDownloadExtractStage: Composite stage for downloading and processing Wikipedia data.

API

class nemo_curator.stages.text.download.wikipedia.stage.WikipediaDownloadExtractStage(
language: str = 'en',
download_dir: str = './wikipedia_downloads',
dump_date: str | None = None,
wikidumps_index_prefix: str = 'https://dumps.wikimedia.org',
verbose: bool = False,
url_limit: int | None = None,
record_limit: int | None = None,
add_filename_column: bool | str = True,
log_frequency: int = 1000
)

Bases: DocumentDownloadExtractStage

Composite stage for downloading and processing Wikipedia data.

This pipeline:

  1. Generates Wikipedia dump URLs for the specified language and date
  2. Downloads Wikipedia .bz2 dump files
  3. Extracts articles from the dump files
  4. Cleans and extracts text from Wikipedia markup
downloader

extractor
= WikipediaExtractor(language=language)

iterator

name
= f'wikipedia_{self.language}_pipeline'

url_generator
nemo_curator.stages.text.download.wikipedia.stage.WikipediaDownloadExtractStage.decompose() -> list[nemo_curator.stages.base.ProcessingStage]

Decompose this composite stage into its constituent stages.
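The composite pattern can be sketched with self-contained stand-in classes (hypothetical names; not the real nemo_curator implementation): the composite holds one sub-stage per pipeline step listed above, and decompose() returns them in execution order.

```python
# Hypothetical stand-ins for ProcessingStage and the composite stage;
# only the decompose()/get_description()/name behavior described in
# this page is mirrored here.
class ProcessingStage:
    def __init__(self, name: str) -> None:
        self.name = name


class WikipediaPipelineSketch(ProcessingStage):
    def __init__(self, language: str = "en") -> None:
        self.language = language
        # Mirrors the documented name attribute:
        # f'wikipedia_{self.language}_pipeline'
        super().__init__(f"wikipedia_{self.language}_pipeline")
        # One stand-in stage per documented pipeline step
        self._stages = [
            ProcessingStage("url_generation"),  # 1. generate dump URLs
            ProcessingStage("download"),        # 2. download .bz2 dump files
            ProcessingStage("iterate"),         # 3. extract articles from dumps
            ProcessingStage("extract"),         # 4. clean Wikipedia markup
        ]

    def decompose(self) -> list[ProcessingStage]:
        # Return the constituent stages in execution order
        return list(self._stages)

    def get_description(self) -> str:
        return f"Composite stage: {self.name}"


pipeline = WikipediaPipelineSketch("en")
print(pipeline.name)                                 # wikipedia_en_pipeline
print([s.name for s in pipeline.decompose()])
```

A pipeline runner would call decompose() to flatten the composite into its four constituent stages before execution.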

nemo_curator.stages.text.download.wikipedia.stage.WikipediaDownloadExtractStage.get_description() -> str

Get a description of this composite stage.