nemo_curator.stages.text.download.wikipedia.stage
Module Contents
Classes
API
Bases: DocumentDownloadExtractStage
Composite stage for downloading and processing Wikipedia data.
This pipeline:
- Generates Wikipedia dump URLs for the specified language and date
- Downloads Wikipedia .bz2 dump files
- Extracts articles from the dump files
- Cleans and extracts text from Wikipedia markup
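The four steps above can be illustrated with a minimal, standard-library-only sketch. This is not the stage's actual implementation: the helper names `dump_url`, `iterate_articles`, and `clean_markup` are hypothetical, the URL assumes the usual `dumps.wikimedia.org` layout with a single multistream file (real dumps are typically split into many parts), and the regex-based cleaner is a deliberately naive stand-in for a proper wikitext parser.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET


def dump_url(language: str, dump_date: str) -> str:
    """Build one plausible dump URL (hypothetical helper, assumed URL layout)."""
    wiki = f"{language}wiki"
    return (
        f"https://dumps.wikimedia.org/{wiki}/{dump_date}/"
        f"{wiki}-{dump_date}-pages-articles-multistream.xml.bz2"
    )


def iterate_articles(bz2_bytes: bytes):
    """Yield (title, raw_markup) pairs from a bz2-compressed MediaWiki XML export."""
    with bz2.open(io.BytesIO(bz2_bytes), "rb") as stream:
        title, text = None, ""
        for _, elem in ET.iterparse(stream, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                title, text = None, ""
                elem.clear()  # free memory held by the finished page


def clean_markup(markup: str) -> str:
    """Naively strip templates, link brackets, and bold/italic quote runs."""
    out = re.sub(r"\{\{[^{}]*\}\}", "", markup)  # non-nested {{templates}}
    out = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", out)  # [[a|b]] -> b
    out = re.sub(r"'{2,}", "", out)  # ''italic'' and '''bold''' markers
    return re.sub(r"\s+", " ", out).strip()


# A tiny in-memory "dump" stands in for a downloaded .bz2 file.
SAMPLE_DUMP = bz2.compress(
    b"<mediawiki><page><title>Ada</title><revision><text>"
    b"{{Infobox}}'''Ada Lovelace''' was a "
    b"[[mathematician|mathematician and writer]] from [[London]]."
    b"</text></revision></page></mediawiki>"
)

ARTICLES = [(title, clean_markup(raw)) for title, raw in iterate_articles(SAMPLE_DUMP)]
```

Streaming the XML with `iterparse` and clearing each finished page keeps memory use flat, which matters because real dump files are far larger than this sample.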
downloader
extractor
iterator
name
url_generator
Decompose this composite stage into its constituent stages.
Get a description of this composite stage.
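As a rough illustration of what decomposing and describing a composite stage can look like, the sketch below models the four components listed above as plain Python objects. Everything here is an assumption made for illustration: `Step`, `CompositeSketch`, and the method names `decompose` and `get_description` are hypothetical stand-ins and do not reproduce the library's classes or signatures.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """Hypothetical single stage: a name plus a function over a list of items."""
    name: str
    run: Callable[[list], list]


@dataclass
class CompositeSketch:
    """Hypothetical composite holding the four components named in this page."""
    url_generator: Step
    downloader: Step
    iterator: Step
    extractor: Step
    name: str = "composite_sketch"

    def decompose(self) -> List[Step]:
        # Return the constituent steps in execution order.
        return [self.url_generator, self.downloader, self.iterator, self.extractor]

    def get_description(self) -> str:
        # Summarize the composite as an ordered chain of step names.
        return f"{self.name}: " + " -> ".join(s.name for s in self.decompose())


def run_all(stage: CompositeSketch, items: list) -> list:
    """Feed the output of each decomposed step into the next one."""
    for step in stage.decompose():
        items = step.run(items)
    return items


# Toy components that only tag their input, to make the ordering visible.
SKETCH = CompositeSketch(
    url_generator=Step("url_generator", lambda xs: [f"url({x})" for x in xs]),
    downloader=Step("downloader", lambda xs: [f"file({x})" for x in xs]),
    iterator=Step("iterator", lambda xs: [f"record({x})" for x in xs]),
    extractor=Step("extractor", lambda xs: [f"text({x})" for x in xs]),
)
```

Keeping the composite as a thin wrapper over an ordered list of steps means an executor only ever has to schedule the simple steps, while users configure one object.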