Copyright © 2026, NVIDIA Corporation.

Wikipedia


Download and extract text from Wikipedia dumps using Curator.

Wikipedia releases compressed dumps of all its content in XML format twice per month. Curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps.

How it Works

The Wikipedia pipeline in Curator consists of four stages:

  1. URL Generation: Automatically discovers Wikipedia dump URLs for the specified language and date
  2. Download: Downloads compressed .bz2 dump files using wget
  3. Iteration: Parses XML content and extracts individual articles
  4. Extraction: Cleans Wikipedia markup and converts to plain text
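
The Iteration and Extraction stages can be pictured with a small standard-library sketch. This is not Curator's implementation: `iter_articles` and `strip_markup` are hypothetical helpers, the sample XML is made up for the demonstration, and the cleanup handles only bold/italic quotes and `[[links]]`, which is far less than real Wikipedia markup requires.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET


def iter_articles(dump_stream):
    """Yield (title, page_id, wikitext) for each <page> in a MediaWiki XML stream."""
    title = page_id = text = None
    for _, elem in ET.iterparse(dump_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if tag == "title":
            title = elem.text
        elif tag == "id" and page_id is None:
            page_id = elem.text  # the page <id> closes before any revision <id>
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, page_id, text
            title = page_id = text = None
            elem.clear()  # release the finished page's children


def strip_markup(wikitext):
    """Toy cleanup: unwrap [[target|label]] links and drop ''/''' quote runs."""
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", wikitext)
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


# Made-up sample dump, compressed in memory to mimic a .bz2 file.
SAMPLE_XML = b"""<mediawiki>
  <page>
    <title>Python (programming language)</title>
    <id>23862</id>
    <revision>
      <id>1</id>
      <text>'''Python''' is a [[programming language|language]].</text>
    </revision>
  </page>
</mediawiki>"""

compressed = bz2.compress(SAMPLE_XML)
with bz2.open(io.BytesIO(compressed), "rb") as stream:
    articles = [(t, i, strip_markup(x)) for t, i, x in iter_articles(stream)]

print(articles)
```

Streaming with `iterparse` keeps memory use bounded, which matters because real dump files are far larger than this sample.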

Before You Start

Wikipedia publishes new dumps around the first and twentieth of each month. Refer to the English Wikipedia dumps index at https://dumps.wikimedia.org/enwiki/ for available dates.
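
To browse the index for another language or a specific date, a small helper can build the URL. This is a sketch under assumptions: `dump_index_url` is a hypothetical function, not part of Curator, and it assumes the dumps site layout `<prefix>/<language>wiki/<YYYYMMDD>/`, with hyphens in edition codes written as underscores.

```python
def dump_index_url(language="en", dump_date=None, prefix="https://dumps.wikimedia.org"):
    """Build the dumps index URL for a Wikipedia edition (assumed site layout)."""
    # Assumption: hyphenated edition codes use underscores on the dumps site.
    wiki = f"{language.replace('-', '_')}wiki"
    base = f"{prefix.rstrip('/')}/{wiki}/"
    # Without a date, return the per-edition index that lists available dates.
    return base if dump_date is None else f"{base}{dump_date}/"


print(dump_index_url())
print(dump_index_url("es", "20240401"))
```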

Curator uses wget to download Wikipedia dumps. You must have wget installed on your system:

  • On macOS: brew install wget
  • On Ubuntu/Debian: sudo apt-get install wget
  • On CentOS/RHEL: sudo yum install wget
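
Before launching a long run, it can help to confirm that wget is actually on the PATH. The check below is a minimal sketch using only the standard library; `find_wget` is a hypothetical helper, not a Curator function.

```python
import shutil


def find_wget():
    """Return the full path to the wget executable, or None if it is not on PATH."""
    return shutil.which("wget")


wget_path = find_wget()
if wget_path is None:
    print("wget not found; install it before running the pipeline")
else:
    print(f"wget found at {wget_path}")
```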

Usage

Here’s how to download and extract Wikipedia data using Curator:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import WikipediaDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create and configure pipeline
pipeline = Pipeline(
    name="wikipedia_pipeline",
    description="Download and process Wikipedia dumps"
)

# Create the Wikipedia processing stage
wikipedia_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date=None,  # None uses latest dump
    url_limit=5,  # Optional: limit number of dump files (useful for testing)
    record_limit=1000,  # Optional: limit articles per dump file
    verbose=True
)
pipeline.add_stage(wikipedia_stage)

# Create writer stage to save results
writer_stage = JsonlWriter(
    path="./wikipedia_output"
)
pipeline.add_stage(writer_stage)

# Execute the pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()
```

For executor options and configuration, refer to Execution Backends.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| language | str | "en" | Language code for Wikipedia dump (for example, en, es, fr). Most follow ISO 639-1, with project-specific exceptions such as simple. Refer to Meta-Wiki List of Wikipedia language editions for supported edition codes and List of ISO 639 language codes for general codes. |
| download_dir | str | "./wikipedia_downloads" | Directory to store downloaded .bz2 files |
| dump_date | Optional[str] | None | Specific dump date in "YYYYMMDD" format (for example, "20240401"). Dumps are published around the first and twentieth of each month. If None, uses the latest available dump |
| wikidumps_index_prefix | str | "https://dumps.wikimedia.org" | Base URL for Wikipedia dumps index |
| verbose | bool | False | Enable verbose logging during download |
| url_limit | Optional[int] | None | Maximum number of dump URLs to process (useful for testing) |
| record_limit | Optional[int] | None | Maximum number of articles to extract per dump file |
| add_filename_column | bool or str | True | Whether to add source filename column to output; if str, uses it as the column name (default name: "file_name") |
| log_frequency | int | 1000 | How often to log progress during article processing |
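
Because dump_date must be a "YYYYMMDD" string, a quick check before building the stage can catch typos early. The helper below is a sketch: `validate_dump_date` is a hypothetical name, and the documentation does not say whether Curator performs similar validation itself.

```python
from datetime import datetime


def validate_dump_date(dump_date):
    """Return dump_date unchanged if it is None or a valid YYYYMMDD string."""
    if dump_date is None:
        return None  # None means "use the latest available dump"
    if len(dump_date) != 8 or not (dump_date.isascii() and dump_date.isdigit()):
        raise ValueError(f"dump_date must be 8 digits (YYYYMMDD), got {dump_date!r}")
    # strptime raises ValueError for impossible dates such as month 13.
    datetime.strptime(dump_date, "%Y%m%d")
    return dump_date


print(validate_dump_date("20240401"))
```

A format check like this only confirms that the string is a real calendar date; whether a dump exists for that date still has to be checked against the dumps index.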

Output Format

The pipeline emits processed Wikipedia articles as DocumentBatch objects. Each record (one line per article in the JSONL output) contains the following fields:

  • text: The cleaned main text content of the article
  • title: The title of the Wikipedia article
  • id: Wikipedia’s unique identifier for the article
  • url: The constructed Wikipedia URL for the article
  • language: The language code of the article
  • source_id: Identifier of the source dump file

If add_filename_column is enabled (the default), the output includes an extra field, file_name (or your custom column name).
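
The url field is described as constructed rather than read from the dump. One plausible construction, shown here only as a sketch, replaces spaces in the title with underscores and percent-encodes the rest. `article_url` is a hypothetical helper; the documentation does not state the exact rule Curator applies, so treat the encoding choices as assumptions.

```python
from urllib.parse import quote


def article_url(title, language="en"):
    """Build a Wikipedia article URL from a title (assumed construction rule)."""
    # Assumption: spaces become underscores and parentheses stay unescaped.
    slug = quote(title.replace(" ", "_"), safe="()")
    return f"https://{language}.wikipedia.org/wiki/{slug}"


print(article_url("Python (programming language)"))
```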

Example Output Record

```json
{
  "text": "Python is a high-level, general-purpose programming language...",
  "title": "Python (programming language)",
  "id": "23862",
  "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
  "language": "en",
  "source_id": "enwiki-20240401-pages-articles-multistream1.xml"
}
```
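
After a run, the output can be spot-checked with a few lines of standard-library Python. The sketch below assumes the writer produces files ending in `.jsonl` inside the output directory (the exact file names are not documented here). `read_records` and `missing_fields` are hypothetical helpers, and the demo writes the example record above to a temporary directory instead of reading a real run.

```python
import json
import tempfile
from pathlib import Path

EXPECTED_FIELDS = {"text", "title", "id", "url", "language", "source_id"}


def read_records(output_dir):
    """Yield one dict per non-empty line across all .jsonl files in output_dir."""
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def missing_fields(record):
    """Return the documented fields that are absent from a record."""
    return EXPECTED_FIELDS - set(record)


# Demo on a temporary directory; point read_records at "./wikipedia_output" for a real run.
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "text": "Python is a high-level, general-purpose programming language...",
        "title": "Python (programming language)",
        "id": "23862",
        "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
        "language": "en",
        "source_id": "enwiki-20240401-pages-articles-multistream1.xml",
    }
    sample_path = Path(tmp) / "sample.jsonl"  # hypothetical file name
    sample_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    records = list(read_records(tmp))

print(len(records), missing_fields(records[0]))
```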