Wikipedia#
Download and extract text from Wikipedia Dumps using Curator.
Wikipedia releases compressed dumps of all its content in XML format twice per month. Curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps.
How it Works#
The Wikipedia pipeline in Curator consists of four stages:
URL Generation: Automatically discovers Wikipedia dump URLs for the specified language and date
Download: Downloads compressed .bz2 dump files using wget
Iteration: Parses XML content and extracts individual articles
Extraction: Cleans Wikipedia markup and converts to plain text
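To make the iteration step concrete, here is a minimal conceptual sketch of streaming pages out of a downloaded .bz2 dump using only the Python standard library. This is an illustration of what that stage does, not Curator's actual implementation, and the dump file name in the usage comment is a placeholder.
import bz2
import xml.etree.ElementTree as ET

# Conceptual illustration of the iteration stage: stream <page> elements out of a
# local .bz2 dump without loading the whole file into memory. The real pipeline
# additionally cleans the wiki markup (the extraction stage).
def iter_pages(dump_path):
    with bz2.open(dump_path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag.endswith("}page"):  # tags carry the MediaWiki export namespace
                ns = elem.tag[: elem.tag.rindex("}") + 1]
                title = elem.findtext(f"{ns}title")
                text = elem.findtext(f"{ns}revision/{ns}text")
                yield title, text
                elem.clear()  # release parsed children to keep memory flat

# Example usage (placeholder file name):
# for title, text in iter_pages("enwiki-20240401-pages-articles-multistream1.xml.bz2"):
#     print(title)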
Before You Start#
Wikipedia publishes new dumps around the first and twentieth of each month. Refer to the English Wikipedia dumps index at https://dumps.wikimedia.org/enwiki/ for available dates.
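If you want to see which dump dates are currently listed before choosing a dump date, a quick way (outside of Curator) is to scrape the index page. A minimal sketch using the standard library:
import re
import urllib.request

# List the date-named directories (YYYYMMDD/) on the English Wikipedia dumps index.
with urllib.request.urlopen("https://dumps.wikimedia.org/enwiki/") as resp:
    html = resp.read().decode("utf-8", errors="replace")

dates = sorted(set(re.findall(r'href="(\d{8})/"', html)))
print(dates[-3:])  # the three most recent dump dates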
Curator uses wget to download Wikipedia dumps. You must have wget installed on your system:
On macOS:
brew install wget
On Ubuntu/Debian:
sudo apt-get install wget
On CentOS/RHEL:
sudo yum install wget
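Before launching a long-running pipeline, it can help to confirm that wget is actually on your PATH; a small optional check:
import shutil

# Fail early with a clear error if wget is missing.
if shutil.which("wget") is None:
    raise RuntimeError("wget not found on PATH; install it before running the Wikipedia pipeline")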
Usage#
Here’s how to download and extract Wikipedia data using Curator:
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import WikipediaDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create and configure pipeline
pipeline = Pipeline(
    name="wikipedia_pipeline",
    description="Download and process Wikipedia dumps",
)

# Create the Wikipedia processing stage
wikipedia_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date=None,     # None uses the latest available dump
    url_limit=5,        # Optional: limit the number of dump files (useful for testing)
    record_limit=1000,  # Optional: limit articles per dump file
    verbose=True,
)
pipeline.add_stage(wikipedia_stage)

# Create writer stage to save results
writer_stage = JsonlWriter(path="./wikipedia_output")
pipeline.add_stage(writer_stage)

# Execute the pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()
For executor options and configuration, refer to Pipeline Execution Backends.
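As a sketch of what passing an executor explicitly might look like, assuming an executor class named XennaExecutor importable from nemo_curator.backends.xenna (verify both the class name and the import path against the Pipeline Execution Backends page):
# Assumed class name and import path -- confirm against the executor documentation.
from nemo_curator.backends.xenna import XennaExecutor

executor = XennaExecutor()
results = pipeline.run(executor)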
Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| language | str | "en" | Language code for the Wikipedia dump |
| download_dir | str | "./wikipedia_downloads" | Directory to store downloaded .bz2 dump files |
| dump_date | Optional[str] | None | Specific dump date in "YYYYMMDD" format (for example, "20240401"). Dumps are published around the first and twentieth of each month. If None, uses the latest available dump |
| | str | "https://dumps.wikimedia.org" | Base URL for the Wikipedia dumps index |
| verbose | bool | False | Enable verbose logging during download |
| url_limit | Optional[int] | None | Maximum number of dump URLs to process (useful for testing) |
| record_limit | Optional[int] | None | Maximum number of articles to extract per dump file |
| add_filename_column | bool or str | True | Whether to add a source filename column to the output; if a str, it is used as the column name (default name: "file_name") |
| log_frequency | int | 1000 | How often to log progress during article processing |
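Putting several of these parameters together, the sketch below pins a specific dump date and renames the filename column. It assumes the constructor accepts add_filename_column under that name, following the table above and the output description below.
# Pin a specific dump and record which dump file each article came from.
wikipedia_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date="20240401",             # "YYYYMMDD"; dumps appear around the 1st and 20th
    url_limit=2,                      # keep the run small while testing
    add_filename_column="dump_file",  # custom column name instead of the default "file_name"
    verbose=True,
)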
Output Format#
The processed Wikipedia articles are emitted as DocumentBatch objects; each output record contains the following fields:
text: The cleaned main text content of the article
title: The title of the Wikipedia article
id: Wikipedia's unique identifier for the article
url: The constructed Wikipedia URL for the article
language: The language code of the article
source_id: Identifier of the source dump file
If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).
Example Output Record#
{
  "text": "Python is a high-level, general-purpose programming language...",
  "title": "Python (programming language)",
  "id": "23862",
  "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
  "language": "en",
  "source_id": "enwiki-20240401-pages-articles-multistream1.xml"
}
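Records like the one above can be read back with any JSON Lines reader. A minimal sketch, assuming the writer emits *.jsonl files under the output path used in the usage example:
import glob
import json

# Load every record written by JsonlWriter (assumes a .jsonl extension).
records = []
for path in glob.glob("./wikipedia_output/*.jsonl"):
    with open(path, encoding="utf-8") as f:
        records.extend(json.loads(line) for line in f)

print(len(records), "articles")
print([r["title"] for r in records[:3]])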