Wikipedia#

Download and extract text from Wikipedia Dumps using Curator.

Wikipedia releases compressed dumps of all its content in XML format twice per month. Curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps.

How it Works#

The Wikipedia pipeline in Curator consists of four stages:

  1. URL Generation: Automatically discovers Wikipedia dump URLs for the specified language and date (see the URL sketch after this list)

  2. Download: Downloads compressed .bz2 dump files using wget

  3. Iteration: Parses XML content and extracts individual articles

  4. Extraction: Cleans Wikipedia markup and converts to plain text
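
For orientation, stage 1 composes dump locations from the language code and dump date. The sketch below illustrates the general layout of Wikipedia dump URLs; it is not Curator's internal implementation, and the file name shown is only an example.

# Illustrative sketch of the Wikipedia dump URL layout; not Curator's internal code.
language = "en"
dump_date = "20240401"  # "YYYYMMDD"

# Dumps for a language edition live under <index prefix>/<lang>wiki/<date>/ and are
# split into several compressed .bz2 files; the exact file names vary per dump.
index_url = f"https://dumps.wikimedia.org/{language}wiki/{dump_date}/"
example_file = f"{language}wiki-{dump_date}-pages-articles-multistream1.xml-p1p41242.bz2"
print(index_url + example_file)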

Before You Start#

Wikipedia publishes new dumps around the first and twentieth of each month. Refer to the English Wikipedia dumps index at https://dumps.wikimedia.org/enwiki/ for available dates.
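
If you prefer to check available dates programmatically, a small helper like the one below can scrape the index page. This is a convenience sketch, not part of Curator, and assumes the requests package is installed:

import re
import requests

# Fetch the dumps index for a language edition and list its date folders.
index = requests.get("https://dumps.wikimedia.org/enwiki/", timeout=30)
index.raise_for_status()
dates = sorted(set(re.findall(r'href="(\d{8})/"', index.text)))
print(dates[-5:])  # the five most recent dump dates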

Curator uses wget to download Wikipedia dumps. You must have wget installed on your system (a quick way to verify this follows the list):

  • On macOS: brew install wget

  • On Ubuntu/Debian: sudo apt-get install wget

  • On CentOS/RHEL: sudo yum install wget
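
Before running the pipeline, you can confirm that wget is on your PATH with a quick standard-library check:

import shutil

# Curator shells out to wget for the download stage; fail early if it is missing.
if shutil.which("wget") is None:
    raise RuntimeError("wget not found on PATH; install it before running the Wikipedia pipeline")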


Usage#

Here’s how to download and extract Wikipedia data using Curator:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import WikipediaDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create and configure pipeline
pipeline = Pipeline(
    name="wikipedia_pipeline",
    description="Download and process Wikipedia dumps"
)

# Create the Wikipedia processing stage
wikipedia_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date=None,        # None uses latest dump
    url_limit=5,           # Optional: limit number of dump files (useful for testing)
    record_limit=1000,     # Optional: limit articles per dump file
    verbose=True
)
pipeline.add_stage(wikipedia_stage)

# Create writer stage to save results
writer_stage = JsonlWriter(
    path="./wikipedia_output"
)
pipeline.add_stage(writer_stage)

# Execute the pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()

For executor options and configuration, refer to Pipeline Execution Backends.

Parameters#

Table 8 WikipediaDownloadExtractStage Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| language | str | "en" | Language code for the Wikipedia dump (for example, en, es, fr). Most codes follow ISO 639-1, with project-specific exceptions such as simple. Refer to the Meta-Wiki List of Wikipedia language editions for supported edition codes and the List of ISO 639 language codes for general codes. |
| download_dir | str | "./wikipedia_downloads" | Directory to store downloaded .bz2 files |
| dump_date | Optional[str] | None | Specific dump date in "YYYYMMDD" format (for example, "20240401"). Dumps are published around the first and twentieth of each month. If None, the latest available dump is used |
| wikidumps_index_prefix | str | "https://dumps.wikimedia.org" | Base URL for the Wikipedia dumps index |
| verbose | bool | False | Enable verbose logging during download |
| url_limit | Optional[int] | None | Maximum number of dump URLs to process (useful for testing) |
| record_limit | Optional[int] | None | Maximum number of articles to extract per dump file |
| add_filename_column | bool or str | True | Whether to add a source filename column to the output; if a str is given, it is used as the column name (default name: "file_name") |
| log_frequency | int | 1000 | How often (in number of articles) to log progress during article processing |
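
For example, a stage configured for a quick smoke test might pin a dump date, cap the amount of work, and rename the filename column. The parameter values below are illustrative only:

from nemo_curator.stages.text.download import WikipediaDownloadExtractStage

# Illustrative configuration for a small test run; values are examples only.
test_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date="20240401",               # pin a specific dump for reproducibility
    url_limit=1,                        # process only the first dump file
    record_limit=100,                   # extract only the first 100 articles per file
    add_filename_column="source_file",  # custom column name instead of "file_name"
    verbose=True,
)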

Output Format#

The pipeline emits processed Wikipedia articles as DocumentBatch objects; when written with JsonlWriter, each output line is a JSON record containing the following fields:

  • text: The cleaned main text content of the article

  • title: The title of the Wikipedia article

  • id: Wikipedia’s unique identifier for the article

  • url: The constructed Wikipedia URL for the article

  • language: The language code of the article

  • source_id: Identifier of the source dump file

If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).

Example Output Record#

{
  "text": "Python is a high-level, general-purpose programming language...",
  "title": "Python (programming language)",
  "id": "23862",
  "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
  "language": "en",
  "source_id": "enwiki-20240401-pages-articles-multistream1.xml"
}
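
To inspect the results, you can load the JSONL output with pandas. This sketch assumes pandas is installed and that the writer produced .jsonl files under ./wikipedia_output:

import glob
import pandas as pd

# Read every JSONL shard the writer produced and look at a few records.
files = glob.glob("./wikipedia_output/*.jsonl")
df = pd.concat(pd.read_json(f, lines=True) for f in files)
print(df[["id", "title", "language"]].head())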