Download Data#

Load text data from ArXiv, Common Crawl, Wikipedia, and custom sources using Curator.

Curator provides a task-centric pipeline for downloading and processing large-scale public text datasets. It runs on Ray and converts raw formats like Common Crawl’s .warc.gz into JSONL.

How it Works#

Curator uses a four-step pipeline pattern in which data flows through stages as tasks: generate the URLs to fetch, download the raw archives, iterate over the records they contain, and extract text from each record. Each step is implemented as a ProcessingStage that transforms incoming tasks and passes the results to the next stage.
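
As a conceptual sketch only (plain Python, not Curator’s actual classes; every name below is illustrative), the four steps and the flow of data between them look like this:

from dataclasses import dataclass

@dataclass
class Document:
    url: str
    text: str

def generate_urls(snapshot: str) -> list[str]:
    # Step 1: produce the archive URLs to fetch for a snapshot (illustrative URLs)
    return [f"https://data.example.com/{snapshot}/segment-{i}.warc.gz" for i in range(2)]

def download(url: str) -> bytes:
    # Step 2: fetch the raw archive (stubbed here)
    return b"<raw archive bytes>"

def iterate(raw_archive: bytes):
    # Step 3: split the raw archive into individual records
    yield {"url": "https://example.com/page", "html": "<p>Hello world</p>"}

def extract(record: dict) -> Document:
    # Step 4: extract clean text from one record
    return Document(url=record["url"], text="Hello world")

# Tasks flow through the steps: URLs -> raw archives -> records -> documents
documents = [
    extract(record)
    for url in generate_urls("2020-50")
    for record in iterate(download(url))
]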

Data sources provide composite stages that combine these steps into complete download-extract pipelines, producing DocumentBatch tasks for further processing.

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Create a pipeline for downloading Common Crawl data
pipeline = Pipeline(
    name="common_crawl_download",
    description="Download and process Common Crawl web archives"
)

# Add data loading stage
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="/tmp/cc_downloads",
    crawl_type="main",
    url_limit=10  # Limit for testing
)
pipeline.add_stage(cc_stage)

# Add writer stage to save as JSONL
writer = JsonlWriter(path="/output/folder")
pipeline.add_stage(writer)

# Build and execute pipeline
pipeline.build()
results = pipeline.run()
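
When the run completes, the extracted documents are written as JSONL files under /output/folder and can be read back into later pipelines with Curator’s reader stages (see Read Existing Data below).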

Data Sources & File Formats#

Load data from public datasets and custom data sources using Curator stages.

Read Existing Data
Read existing JSONL and Parquet datasets using Curator’s reader stages (see the sketch after this list).

Common Crawl
Download and process web archive data from Common Crawl.

Wikipedia
Download and extract articles from Wikipedia dumps.

Custom Data
Read and process your own text datasets in standard formats.
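
For example, the JSONL output written above can be loaded back into a pipeline with Curator’s JSONL reader stage. The following is a minimal sketch; the module path and the file_paths and fields parameters are assumptions that mirror the JsonlWriter API shown earlier.

from nemo_curator.pipeline import Pipeline
# Assumed module path, mirroring the JsonlWriter import used above
from nemo_curator.stages.text.io.reader import JsonlReader

# Read previously written JSONL documents back into DocumentBatch tasks
pipeline = Pipeline(
    name="read_existing_jsonl",
    description="Read JSONL output from an earlier download run"
)
pipeline.add_stage(
    JsonlReader(
        file_paths="/output/folder",  # directory produced by JsonlWriter above
        fields=["text"]  # assumed parameter: columns to load into each DocumentBatch
    )
)

# Build and execute pipeline
pipeline.build()
results = pipeline.run()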