---
description: Download and extract text from Wikipedia dumps using Curator.
categories:
  - how-to-guides
tags:
  - wikipedia
  - dumps
  - multilingual
  - articles
  - data-loading
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: text-only
---

# Wikipedia

Download and extract text from [Wikipedia Dumps](https://dumps.wikimedia.org/backup-index.html) using Curator.

Wikipedia releases compressed dumps of all its content in XML format twice per month. Curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps.

## How it Works

The Wikipedia pipeline in Curator consists of four stages:

1. **URL Generation**: Automatically discovers Wikipedia dump URLs for the specified language and date
2. **Download**: Downloads compressed `.bz2` dump files using `wget`
3. **Iteration**: Parses XML content and extracts individual articles
4. **Extraction**: Cleans Wikipedia markup and converts to plain text
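To picture what the extraction stage does, here is a toy sketch. This is *not* Curator's extractor (which handles templates, tables, references, and much more); the function name is ours, and it only illustrates the flavor of the markup-to-text transformation on two common constructs:

```python
import re

def strip_basic_wikitext(text: str) -> str:
    # Toy illustration only: remove bold/italic quote runs and
    # rewrite [[target|label]] / [[target]] links to their display text.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
    return text

print(strip_basic_wikitext("'''Python''' is a [[programming language|language]]."))
# Python is a language.
```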

## Before You Start

Wikipedia publishes new dumps around the **first** and **twentieth** of each month. Refer to the English Wikipedia dumps index at [https://dumps.wikimedia.org/enwiki/](https://dumps.wikimedia.org/enwiki/) for available dates.
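The per-language index pages follow a predictable path layout. A minimal sketch, assuming only that layout (the helper name is ours, and the exact dump filenames listed inside each index vary from dump to dump):

```python
def dump_index_url(language: str = "en", dump_date: str = "20240401",
                   prefix: str = "https://dumps.wikimedia.org") -> str:
    # Index pages live at <prefix>/<language>wiki/<YYYYMMDD>/
    return f"{prefix}/{language}wiki/{dump_date}/"

print(dump_index_url("en", "20240401"))
# https://dumps.wikimedia.org/enwiki/20240401/
```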

Curator uses `wget` to download Wikipedia dumps. You must have `wget` installed on your system:

* **On macOS**: `brew install wget`
* **On Ubuntu/Debian**: `sudo apt-get install wget`
* **On CentOS/RHEL**: `sudo yum install wget`
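A quick preflight check using only the standard library: `shutil.which` reports where, if anywhere, `wget` lives on your `PATH`:

```python
import shutil

# Curator shells out to wget for downloads; verify it is on PATH first.
wget_path = shutil.which("wget")
if wget_path:
    print(f"wget found at {wget_path}")
else:
    print("wget missing -- install it with your package manager before running the pipeline")
```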

***

## Usage

Here's how to download and extract Wikipedia data using Curator:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import WikipediaDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create and configure pipeline
pipeline = Pipeline(
    name="wikipedia_pipeline",
    description="Download and process Wikipedia dumps"
)

# Create the Wikipedia processing stage
wikipedia_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date=None,        # None uses latest dump
    url_limit=5,           # Optional: limit number of dump files (useful for testing)
    record_limit=1000,     # Optional: limit articles per dump file
    verbose=True
)
pipeline.add_stage(wikipedia_stage)

# Create writer stage to save results
writer_stage = JsonlWriter(
    path="./wikipedia_output"
)
pipeline.add_stage(writer_stage)

# Execute the pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()
```

For executor options and configuration, refer to [Execution Backends](/reference/infra/execution-backends).

### Parameters

| Parameter                | Type           | Default                                                      | Description                                                                                                                                                                                                                                                                                                                                                                                                |
| ------------------------ | -------------- | ------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `language`               | str            | "en"                                                         | Language code for Wikipedia dump (for example, `en`, `es`, `fr`). Most follow ISO 639‑1, with project-specific exceptions such as `simple`. Refer to Meta‑Wiki [List of Wikipedia language editions](https://meta.wikimedia.org/wiki/List_of_Wikipedias) for supported edition codes and [List of ISO 639 language codes](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) for general codes. |
| `download_dir`           | str            | "./wikipedia_downloads"                                      | Directory to store downloaded `.bz2` files                                                                                                                                                                                                                                                                                                                                                                 |
| `dump_date`              | Optional\[str] | None                                                         | Specific dump date in "YYYYMMDD" format (for example, "20240401"). Dumps are published around the first and twentieth of each month. If None, uses the latest available dump                                                                                                                                                                                                                               |
| `wikidumps_index_prefix` | str            | `https://dumps.wikimedia.org`                                | Base URL for the Wikipedia dumps index                                                                                                                                                                                                                                                                                                                                                                     |
| `verbose`                | bool           | False                                                        | Enable verbose logging during download                                                                                                                                                                                                                                                                                                                                                                     |
| `url_limit`              | Optional\[int] | None                                                         | Maximum number of dump URLs to process (useful for testing)                                                                                                                                                                                                                                                                                                                                                |
| `record_limit`           | Optional\[int] | None                                                         | Maximum number of articles to extract per dump file                                                                                                                                                                                                                                                                                                                                                        |
| `add_filename_column`    | bool \| str    | True                                                         | Whether to add a source-filename column to the output; if a str, it is used as the column name (default name: "file_name")                                                                                                                                                                                                                                                                                 |
| `log_frequency`          | int            | 1000                                                         | How often to log progress during article processing                                                                                                                                                                                                                                                                                                                                                        |
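The `add_filename_column` parameter accepts either a bool or a str. A hedged sketch of how the documented behavior maps to a column name (illustrative only, not Curator's internal code; the helper name is ours):

```python
def resolve_filename_column(add_filename_column):
    """Map the documented bool | str setting to an output column name (or None)."""
    if isinstance(add_filename_column, str):
        return add_filename_column        # custom column name
    return "file_name" if add_filename_column else None

print(resolve_filename_column(True))        # file_name
print(resolve_filename_column("src_file"))  # src_file
print(resolve_filename_column(False))       # None
```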

## Output Format

The pipeline outputs processed Wikipedia articles as `DocumentBatch` objects; each record contains the following fields:

* `text`: The cleaned main text content of the article
* `title`: The title of the Wikipedia article
* `id`: Wikipedia's unique identifier for the article
* `url`: The constructed Wikipedia URL for the article
* `language`: The language code of the article
* `source_id`: Identifier of the source dump file

If you enable `add_filename_column`, the output includes an extra field `file_name` (or your custom column name).

### Example Output Record

```json
{
  "text": "Python is a high-level, general-purpose programming language...",
  "title": "Python (programming language)",
  "id": "23862",
  "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
  "language": "en",
  "source_id": "enwiki-20240401-pages-articles-multistream1.xml"
}
```
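Records like the example above can be read back with the standard library. A minimal sketch, assuming the default `JsonlWriter` layout of one JSON object per line in `.jsonl` files (the function name is ours):

```python
import json
from pathlib import Path

def iter_wikipedia_records(output_dir):
    """Yield one dict per article from every .jsonl file under output_dir."""
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)
```

For example, `titles = [r["title"] for r in iter_wikipedia_records("./wikipedia_output")]` collects all article titles.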
