
Wikipedia


Download and extract text from Wikipedia dumps using Curator.

Wikipedia releases compressed dumps of all its content in XML format twice per month. Curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps.

How it Works

The Wikipedia pipeline in Curator consists of four stages:

  1. URL Generation: Automatically discovers Wikipedia dump URLs for the specified language and date
  2. Download: Downloads compressed .bz2 dump files using wget
  3. Iteration: Parses XML content and extracts individual articles
  4. Extraction: Cleans Wikipedia markup and converts to plain text

Before You Start

Wikipedia publishes new dumps around the first and twentieth of each month. Refer to the English Wikipedia dumps index at https://dumps.wikimedia.org/enwiki/ for available dates.
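Dump file URLs follow a predictable pattern built from the language code and dump date. A minimal sketch of that pattern (the helper name `build_dump_url` is hypothetical, and the exact part numbering of real multistream files varies; the authoritative file list is the dump's index page):

```python
def build_dump_url(language: str, dump_date: str, part: int = 1) -> str:
    """Construct the URL of one pages-articles multistream file.

    Illustrative only: real dumps are split into several numbered
    parts, and the definitive file list lives on the dump index page.
    """
    wiki = f"{language}wiki"  # e.g. "enwiki" for English
    return (
        f"https://dumps.wikimedia.org/{wiki}/{dump_date}/"
        f"{wiki}-{dump_date}-pages-articles-multistream{part}.xml.bz2"
    )

print(build_dump_url("en", "20240401"))
```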

Curator uses wget to download Wikipedia dumps. You must have wget installed on your system:

  • On macOS: brew install wget
  • On Ubuntu/Debian: sudo apt-get install wget
  • On CentOS/RHEL: sudo yum install wget
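Because the download stage shells out to wget, it is worth confirming the binary is on your PATH before starting a long run. A quick check using only the standard library:

```python
import shutil


def wget_available() -> bool:
    """Return True if a wget executable is found on PATH."""
    return shutil.which("wget") is not None


print("wget found" if wget_available() else "wget missing -- install it first")
```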

Usage

Here’s how to download and extract Wikipedia data using Curator:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import WikipediaDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create and configure pipeline
pipeline = Pipeline(
    name="wikipedia_pipeline",
    description="Download and process Wikipedia dumps",
)

# Create the Wikipedia processing stage
wikipedia_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date=None,  # None uses the latest dump
    url_limit=5,  # Optional: limit number of dump files (useful for testing)
    record_limit=1000,  # Optional: limit articles per dump file
    verbose=True,
)
pipeline.add_stage(wikipedia_stage)

# Create writer stage to save results
writer_stage = JsonlWriter(path="./wikipedia_output")
pipeline.add_stage(writer_stage)

# Execute the pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()
```

For executor options and configuration, refer to Execution Backends.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `language` | `str` | `"en"` | Language code for the Wikipedia dump (for example, `en`, `es`, `fr`). Most codes follow ISO 639-1, with project-specific exceptions such as `simple`. Refer to the Meta-Wiki List of Wikipedia language editions for supported edition codes and the List of ISO 639 language codes for general codes. |
| `download_dir` | `str` | `"./wikipedia_downloads"` | Directory in which to store downloaded `.bz2` files |
| `dump_date` | `Optional[str]` | `None` | Specific dump date in `YYYYMMDD` format (for example, `"20240401"`). Dumps are published around the first and twentieth of each month. If `None`, the latest available dump is used |
| `wikidumps_index_prefix` | `str` | `"https://dumps.wikimedia.org"` | Base URL for the Wikipedia dumps index |
| `verbose` | `bool` | `False` | Enable verbose logging during download |
| `url_limit` | `Optional[int]` | `None` | Maximum number of dump URLs to process (useful for testing) |
| `record_limit` | `Optional[int]` | `None` | Maximum number of articles to extract per dump file |
| `add_filename_column` | `bool \| str` | `True` | Whether to add a source-filename column to the output; if a `str`, it is used as the column name (default name: `"file_name"`) |
| `log_frequency` | `int` | `1000` | How often to log progress during article processing |

Output Format

The processed Wikipedia articles flow through the pipeline as DocumentBatch objects and are written out as JSONL, with each record containing the following fields:

  • text: The cleaned main text content of the article
  • title: The title of the Wikipedia article
  • id: Wikipedia’s unique identifier for the article
  • url: The constructed Wikipedia URL for the article
  • language: The language code of the article
  • source_id: Identifier of the source dump file

If you enable `add_filename_column`, the output includes an extra `file_name` field (or your custom column name).

Example Output Record

```json
{
  "text": "Python is a high-level, general-purpose programming language...",
  "title": "Python (programming language)",
  "id": "23862",
  "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
  "language": "en",
  "source_id": "enwiki-20240401-pages-articles-multistream1.xml"
}
```
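Because the output is plain JSONL (one JSON object per line), it can be inspected with the standard library alone. A minimal sketch: the exact shard file names under `./wikipedia_output` are an assumption, so this snippet first writes one sample record to make itself self-contained:

```python
import json
from pathlib import Path

record = {
    "text": "Python is a high-level, general-purpose programming language...",
    "title": "Python (programming language)",
    "id": "23862",
    "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "language": "en",
    "source_id": "enwiki-20240401-pages-articles-multistream1.xml",
}

# Write one record in JSONL form: one JSON object per line
shard = Path("sample_shard.jsonl")
shard.write_text(json.dumps(record) + "\n", encoding="utf-8")

# Read it back and pull out the fields described above
for line in shard.read_text(encoding="utf-8").splitlines():
    article = json.loads(line)
    print(article["title"], "->", article["url"])
```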