# Wikipedia
Download and extract text from Wikipedia dumps using NeMo Curator utilities.
Wikipedia regularly releases dumps of all its content, which include articles, talk pages, user pages, and more. These dumps are available in various formats, including XML and SQL.
## How it Works
NeMo Curator simplifies the process of:

- Downloading the latest Wikipedia dump
- Extracting the article content
- Converting the content to a usable format for language model training
## Before You Start
NeMo Curator uses `wget` to download Wikipedia dumps. You must have `wget` installed on your system:
- On macOS: `brew install wget`
- On Ubuntu/Debian: `sudo apt-get install wget`
- On CentOS/RHEL: `sudo yum install wget`
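If you are unsure whether `wget` is available, a quick check from Python (a minimal sketch using only the standard library) can confirm it before starting a long download:

```python
import shutil

# Verify that wget is on the PATH before kicking off a download
if shutil.which("wget") is None:
    raise RuntimeError("wget not found; install it with your system package manager.")
```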
## Usage
Here’s how to download and extract Wikipedia data using NeMo Curator:
```python
from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.download import download_wikipedia

# Initialize a Dask client
client = get_client(cluster_type="cpu")

# Download and extract Wikipedia
wikipedia_dataset = download_wikipedia(
    output_path="/extracted/output/folder",
    dump_date="20240401"  # Optional: specific dump date
)

# The dataset is now available as a DocumentDataset object
print(f"Downloaded {len(wikipedia_dataset)} articles")
print(wikipedia_dataset.head())

# Write the dataset to disk as JSONL files
wikipedia_dataset.to_json(output_path="/path/to/output/files")
```
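If you want each document written to the file named in its `file_name` field (see Output Format below), `to_json` can group records by filename. This is a minimal sketch assuming the same dataset and output path as above:

```python
# Write one JSONL file per source file, using the dataset's "file_name" column
wikipedia_dataset.to_json(
    output_path="/path/to/output/files",
    write_to_filename=True,
)
```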
NeMo Curator also provides a CLI for downloading Wikipedia data.
### Step 1: Generate Wikipedia URLs
First, generate a list of Wikipedia dump URLs for the desired language:
```bash
get_wikipedia_urls \
  --language=en \
  --output-url-file=./wikipedia_urls.txt
```
### Step 2: Create Configuration File
Create a configuration file (`wikipedia_builder.yaml`):
```yaml
download_module: nemo_curator.download.wikipedia.WikipediaDownloader
download_params: {}
iterator_module: nemo_curator.download.wikipedia.WikipediaIterator
iterator_params:
  language: 'en'
extract_module: nemo_curator.download.wikipedia.WikipediaExtractor
extract_params:
  language: 'en'
format:
  text: str
  title: str
  id: str
  url: str
  language: str
  source_id: str
```
### Step 3: Run Download and Extraction
```bash
download_and_extract \
  --input-url-file=./wikipedia_urls.txt \
  --builder-config-file=./wikipedia_builder.yaml \
  --output-json-dir=/datasets/wikipedia/json
```
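To continue working with the CLI output in Python, the extracted JSONL files can be loaded back into a `DocumentDataset`. This is a sketch assuming the output directory from Step 3:

```python
from nemo_curator.datasets import DocumentDataset

# Load the JSONL files produced by download_and_extract
wikipedia_dataset = DocumentDataset.read_json("/datasets/wikipedia/json")
print(wikipedia_dataset.head())
```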
## Parameters
The `download_wikipedia` function accepts the following parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `output_path` | str | Required | Path where the extracted files will be placed |
| `dump_date` | Optional[str] | None | Specifies a particular Wikipedia dump date. The format must be "YYYYMMDD" (for example, "20250401" for April 1, 2025). Wikipedia creates new dumps approximately twice a month (around the 1st and 20th). You can find available dump dates by visiting https://dumps.wikimedia.org/enwiki/. If not specified, NeMo Curator will automatically use the latest available dump. |
| `language` | str | "en" | Language code to download (for example, "en" for English) |
| `url_limit` | Optional[int] | None | Limits the number of URLs downloaded (useful for testing) |
If no `dump_date` is specified, NeMo Curator will download the latest available dump.
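For a quick smoke test, the URL limit described above keeps the run small. A sketch, assuming the `url_limit` keyword shown in the table:

```python
from nemo_curator.download import download_wikipedia

# Download only a handful of dump files to validate the pipeline end to end
sample_dataset = download_wikipedia(
    output_path="/extracted/output/folder",
    dump_date="20240401",
    url_limit=5,  # assumed keyword; limits how many dump URLs are fetched
)
```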
## Output Format
The extracted Wikipedia articles are stored in `.jsonl` files, with each line containing a JSON object with fields:
- `text`: The main text content of the article
- `id`: A unique identifier for the article
- `title`: The title of the Wikipedia article
- `url`: The URL of the Wikipedia article
- `language`: The language code of the article
- `source_id`: The source file identifier
- `file_name`: The output file name (when using `write_to_filename=True`)
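To sanity-check the output, you can read a single record from one of the extracted files using only the standard library. A sketch, assuming the output directory from the Python example above and `.jsonl` file extensions:

```python
import json
from pathlib import Path

# Read the first record from the first .jsonl file in the output directory
jsonl_file = next(Path("/path/to/output/files").glob("*.jsonl"))
with open(jsonl_file) as f:
    record = json.loads(f.readline())

print(record["title"], record["url"])
print(record["text"][:200])  # first 200 characters of the article text
```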