Wikipedia
Download and extract text from Wikipedia Dumps using Curator.
Wikipedia releases compressed dumps of all its content in XML format twice per month. Curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps.
How it Works
The Wikipedia pipeline in Curator consists of four stages:
- URL Generation: Automatically discovers Wikipedia dump URLs for the specified language and date
- Download: Downloads compressed .bz2 dump files using wget
- Iteration: Parses XML content and extracts individual articles
- Extraction: Cleans Wikipedia markup and converts to plain text
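The URL-generation stage can be sketched as plain string construction. The helper below is illustrative, not part of Curator's API; the file-name pattern follows the public dumps index (for example, enwiki-20240401-pages-articles-multistream.xml.bz2):

```python
def wikipedia_dump_url(language: str = "en", dump_date: str = "20240401") -> str:
    """Build the URL of a pages-articles multistream dump for a given
    language and date (illustrative helper, not Curator's actual API)."""
    wiki = f"{language}wiki"
    filename = f"{wiki}-{dump_date}-pages-articles-multistream.xml.bz2"
    return f"https://dumps.wikimedia.org/{wiki}/{dump_date}/{filename}"

print(wikipedia_dump_url())
```

Curator performs this discovery automatically; the sketch only shows how a language code and a YYYYMMDD date map to a concrete dump file.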
Before You Start
Wikipedia publishes new dumps around the first and twentieth of each month. Refer to the English Wikipedia dumps index at https://dumps.wikimedia.org/enwiki/ for available dates.
Curator uses wget to download Wikipedia dumps. You must have wget installed on your system:
- On macOS: brew install wget
- On Ubuntu/Debian: sudo apt-get install wget
- On CentOS/RHEL: sudo yum install wget
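Because the download stage shells out to wget, it is worth checking for it before starting a long run. A minimal check (the messages are ours, not Curator's):

```python
import shutil

# shutil.which returns the full path to wget if it is on PATH, else None.
wget_path = shutil.which("wget")
if wget_path is None:
    print("wget not found on PATH; install it before running the pipeline")
else:
    print(f"wget found at {wget_path}")
```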
Usage
Here’s how to download and extract Wikipedia data using Curator:
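A minimal sketch, assuming Curator exposes a download_wikipedia helper; the import path, function name, and parameter names below are assumptions that may differ across Curator versions, so check the API reference for your release:

```python
# Sketch only: the import path and signature below are assumed, not verified.
try:
    from nemo_curator.download import download_wikipedia
except ImportError:  # Curator is not installed in this environment
    download_wikipedia = None


def fetch_wikipedia(output_dir: str, language: str = "en",
                    dump_date: str = "20240401"):
    """Run the download -> iterate -> extract pipeline for one dump
    (assumed API; dump_date selects a YYYYMMDD snapshot)."""
    if download_wikipedia is None:
        raise RuntimeError("Curator is not installed")
    return download_wikipedia(output_dir, language=language,
                              dump_date=dump_date)
```

The returned object is the processed dataset of extracted articles, which you can then write out or feed into downstream Curator stages.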
For executor options and configuration, refer to Execution Backends.
Parameters
Output Format
The processed Wikipedia articles become DocumentBatch objects, with each line containing the following fields:
- text: The cleaned main text content of the article
- title: The title of the Wikipedia article
- id: Wikipedia’s unique identifier for the article
- url: The constructed Wikipedia URL for the article
- language: The language code of the article
- source_id: Identifier of the source dump file
If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).
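Serialized as one JSON object per line, a record with these fields looks like the following; the values are invented sample data, not real pipeline output:

```python
import json

# Illustrative record with the documented fields (sample values only).
record = {
    "text": "Python is a high-level programming language...",
    "title": "Python (programming language)",
    "id": "23862",
    "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "language": "en",
    "source_id": "enwiki-20240401-pages-articles-multistream.xml.bz2",
}

# Round-trip one JSONL line and read a field back.
line = json.dumps(record)
parsed = json.loads(line)
print(parsed["title"])
```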