Wikipedia#
Download and extract text from Wikipedia dumps using Curator.
Wikipedia releases compressed dumps of all its content in XML format twice per month. Curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps.
How it Works#
The Wikipedia pipeline in Curator consists of four stages:
URL Generation: Automatically discovers Wikipedia dump URLs for the specified language and date (a standalone sketch of this step follows the list)
Download: Downloads compressed .bz2 dump files using wget
Iteration: Parses XML content and extracts individual articles
Extraction: Cleans Wikipedia markup and converts to plain text
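To make the URL-generation step concrete, the sketch below lists the compressed article dump files for one language and dump date by scraping the public dumps index page. This is an illustration with requests, not Curator's internal implementation; the index URL layout and file-name pattern are assumptions based on the public Wikimedia dumps site.
import re
import requests

# Illustration only: find the pages-articles-multistream .bz2 files for one
# language and dump date by scanning the public dumps index page.
language = "en"
dump_date = "20240401"
index_url = f"https://dumps.wikimedia.org/{language}wiki/{dump_date}/"

html = requests.get(index_url, timeout=30).text
pattern = rf'{language}wiki-{dump_date}-pages-articles-multistream\d*\.xml[^"\s<>]*\.bz2'
dump_files = sorted(set(re.findall(pattern, html)))

dump_urls = [index_url + name for name in dump_files]
print(f"Found {len(dump_urls)} dump files")
print(dump_urls[:3])
Curator performs this discovery for you; the sketch only shows what the stage resolves before downloading begins.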
Before You Start#
Wikipedia publishes new dumps around the first and twentieth of each month. Refer to the English Wikipedia dumps index at https://dumps.wikimedia.org/enwiki/ for available dates.
Curator uses wget to download Wikipedia dumps. You must have wget installed on your system:
On macOS:
brew install wget
On Ubuntu/Debian:
sudo apt-get install wget
On CentOS/RHEL:
sudo yum install wget
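Before starting a long download, you can confirm from Python that wget is actually on your PATH. This is a convenience check, not part of Curator:
import shutil
import subprocess

# Fail fast if wget is missing rather than partway into the pipeline.
wget_path = shutil.which("wget")
if wget_path is None:
    raise RuntimeError("wget not found on PATH; install it with your package manager")

# Print the installed wget version for reference.
version_line = subprocess.run(
    [wget_path, "--version"], capture_output=True, text=True
).stdout.splitlines()[0]
print(version_line)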
Usage#
Here’s how to download and extract Wikipedia data using Curator:
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import WikipediaDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Create the Wikipedia processing stage
wikipedia_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date="20240401",  # Optional: specific dump date (YYYYMMDD format)
    url_limit=5,           # Optional: limit number of dump files (useful for testing)
    record_limit=1000,     # Optional: limit articles per dump file
    verbose=True,
)

# Create writer stage to save results
writer_stage = JsonlWriter(path="./wikipedia_output")

# Create and configure pipeline
pipeline = Pipeline(
    name="wikipedia_pipeline",
    description="Download and process Wikipedia dumps",
)
pipeline.add_stage(wikipedia_stage)
pipeline.add_stage(writer_stage)

# Execute the pipeline
results = pipeline.run()
For executor options and configuration, refer to Pipeline Execution Backends.
Multi-Language Processing#
You can process several languages by creating separate pipelines:
languages = ["en", "es", "fr", "de"]

for lang in languages:
    # Create language-specific pipeline
    wikipedia_stage = WikipediaDownloadExtractStage(
        language=lang,
        download_dir=f"./downloads/{lang}",
        dump_date="20240401",
    )
    writer_stage = JsonlWriter(path=f"./output/{lang}")

    pipeline = Pipeline(name=f"wikipedia_{lang}")
    pipeline.add_stage(wikipedia_stage)
    pipeline.add_stage(writer_stage)

    # Execute
    results = pipeline.run()
Parameters#
Parameter | Type | Default | Description
---|---|---|---
language | str | "en" | Language code for the Wikipedia dump (for example, "en", "es")
download_dir | str | "./wikipedia_downloads" | Directory to store downloaded .bz2 files
dump_date | Optional[str] | None | Specific dump date in "YYYYMMDD" format (for example, "20240401"). Dumps are published around the first and twentieth of each month. If None, uses the latest available dump
 | str | "https://dumps.wikimedia.org" | Base URL for the Wikipedia dumps index
verbose | bool | False | Enable verbose logging during download
url_limit | Optional[int] | None | Maximum number of dump URLs to process (useful for testing)
record_limit | Optional[int] | None | Maximum number of articles to extract per dump file
add_filename_column | bool or str | True | Whether to add a source filename column to the output; if a str, it is used as the column name (default name: "file_name")
 | int | 1000 | How often to log progress during article processing
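For example, assuming add_filename_column is passed to the stage constructor like the other parameters above, a string value renames the filename column (a minimal sketch, not a complete configuration):
wikipedia_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date="20240401",
    add_filename_column="source_file",  # use "source_file" instead of the default "file_name"
)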
Known Limitations#
Parsing relies on mwparserfromhell. Complex templates might not be fully rendered, so template-heavy pages can yield incomplete text. Customize the extractor if you need different behavior.
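As a starting point for customization, the sketch below uses mwparserfromhell directly to show what strip_code() keeps and which template nodes it drops; it is not Curator's extractor.
import mwparserfromhell

# Toy wikitext with formatting, a wikilink, and a template.
wikitext = "'''Python''' is a [[programming language]] first released in {{start date|1991}}."
parsed = mwparserfromhell.parse(wikitext)

# strip_code() removes markup and drops templates entirely, so the date is lost.
print(parsed.strip_code())

# The unrendered templates remain available if you want to handle them yourself.
for template in parsed.filter_templates():
    print(template.name, list(template.params))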
Output Format#
The processed Wikipedia articles are written to JSONL files, with each line containing a JSON object with these fields:
text: The cleaned main text content of the article
title: The title of the Wikipedia article
id: Wikipedia’s unique identifier for the article
url: The constructed Wikipedia URL for the article
language: The language code of the article
source_id: Identifier of the source dump file
If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).
Example Output Record#
{
  "text": "Python is a high-level, general-purpose programming language...",
  "title": "Python (programming language)",
  "id": "23862",
  "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
  "language": "en",
  "source_id": "enwiki-20240401-pages-articles-multistream1.xml"
}
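To spot-check the output, you can load the JSONL files back into Python. The snippet below assumes the writer places .jsonl files directly under the output path used earlier; adjust the glob pattern if your layout differs.
import glob
import json

# Read every JSONL record written under the output directory.
records = []
for path in sorted(glob.glob("./wikipedia_output/*.jsonl")):
    with open(path, encoding="utf-8") as f:
        records.extend(json.loads(line) for line in f if line.strip())

print(f"Loaded {len(records)} articles")
print(records[0]["title"], records[0]["url"])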