Download and extract text from Wikipedia Dumps using Curator.
Wikipedia releases compressed dumps of all its content in XML format twice per month. Curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps.
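To make the parse step more concrete, here is a minimal, library-independent sketch of streaming articles out of a bz2-compressed dump. It is not Curator's implementation: the helper names (`local_name`, `iter_articles`), the inline sample XML, and the choice to keep only main-namespace pages (`ns` equal to 0) are assumptions made for illustration.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (id, title, raw_wikitext) for main-namespace pages in a dump stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        for child in elem:
            name = local_name(child.tag)
            if name in ("title", "ns", "id"):
                fields[name] = child.text
            elif name == "revision":
                for sub in child:
                    if local_name(sub.tag) == "text":
                        fields["text"] = sub.text or ""
        # Assumption: only namespace 0 (regular articles) is of interest.
        if fields.get("ns") == "0":
            yield fields["id"], fields["title"], fields.get("text", "")
        elem.clear()  # release the page's children once it has been handled


# Tiny hand-written stand-in for a real dump file (hypothetical content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Example</title><ns>0</ns><id>1</id>
    <revision><id>10</id><text>'''Example''' is a sample page.</text></revision>
  </page>
  <page><title>Talk:Example</title><ns>1</ns><id>2</id>
    <revision><id>11</id><text>Discussion.</text></revision>
  </page>
</mediawiki>"""

compressed = bz2.compress(SAMPLE)
with bz2.open(io.BytesIO(compressed), "rb") as stream:
    articles = list(iter_articles(stream))

print(articles)
```

The text yielded here is still raw wiki markup; turning it into clean text is a separate extraction step that this sketch does not attempt.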
The Wikipedia pipeline in Curator consists of four stages, one of which downloads the compressed .bz2 dump files using wget.
Wikipedia publishes new dumps around the first and twentieth of each month. Refer to the English Wikipedia dumps index at https://dumps.wikimedia.org/enwiki/ for available dates.
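As a small illustration of how the dumps index is laid out, the sketch below builds the index URL for a given language and dump date. The function name `dump_index_url` and the example date are hypothetical, and the code only formats and validates the string; whether a dump actually exists for that date still has to be checked against the index.

```python
from datetime import datetime

DUMPS_ROOT = "https://dumps.wikimedia.org"


def dump_index_url(language="en", dump_date=None):
    """Return the dumps index URL for a language, optionally for one dump date.

    dump_date is expected as a YYYYMMDD string, the form used in the index.
    """
    base = f"{DUMPS_ROOT}/{language}wiki/"
    if dump_date is None:
        return base  # top-level index listing the available dates
    if len(dump_date) != 8 or not dump_date.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {dump_date!r}")
    datetime.strptime(dump_date, "%Y%m%d")  # rejects impossible calendar dates
    return f"{base}{dump_date}/"


print(dump_index_url())                  # English index, all dates
print(dump_index_url("de", "20240420"))  # illustrative date, not verified
```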
Curator uses wget to download Wikipedia dumps. You must have wget installed on your system:
macOS (Homebrew): brew install wget
Debian/Ubuntu: sudo apt-get install wget
RHEL/CentOS: sudo yum install wget

Here’s how to download and extract Wikipedia data using Curator:
For executor options and configuration, refer to Execution Backends.
The processed Wikipedia articles become DocumentBatch objects, with each line containing the following fields:
- text: The cleaned main text content of the article
- title: The title of the Wikipedia article
- id: Wikipedia’s unique identifier for the article
- url: The constructed Wikipedia URL for the article
- language: The language code of the article
- source_id: Identifier of the source dump file

If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).
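The fields above can be checked with a short sketch like the following. It assumes the records have been written out as one JSON object per line, which the source does not state, and every value in the sample record is made up. The `article_url` helper follows the usual Wikipedia convention of replacing spaces with underscores; Curator's own URL construction may differ.

```python
import json
from urllib.parse import quote

REQUIRED_FIELDS = ("text", "title", "id", "url", "language", "source_id")


def article_url(title, language="en"):
    """Build an article URL using the common 'spaces become underscores' convention."""
    return f"https://{language}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


def missing_fields(record):
    """Return the documented field names that are absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Hypothetical record; none of these values come from a real dump.
sample_line = json.dumps({
    "text": "Alan Turing was a mathematician and computer scientist.",
    "title": "Alan Turing",
    "id": "12345",
    "url": article_url("Alan Turing", "en"),
    "language": "en",
    "source_id": "example-dump-file.bz2",
})

record = json.loads(sample_line)
print(missing_fields(record))
print(record["url"])
```

A check like `missing_fields` is a cheap way to catch schema drift before the records feed into later processing.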