Download Data#

Load text data from ArXiv, Common Crawl, Wikipedia, and custom sources using Curator.

Curator provides a task-centric pipeline for downloading and processing large-scale public text datasets. It runs on Ray and converts raw formats like Common Crawl’s .warc.gz into JSONL.

How it Works#

Curator uses a four-step pipeline pattern in which data flows through stages as tasks: generate the URLs to fetch, download the raw archives, iterate over the records they contain, and extract text from each record. Each step is implemented as a ProcessingStage that transforms incoming tasks and passes the results to the next stage.
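
As a conceptual sketch only (plain Python, not Curator’s actual classes; every name below is illustrative), the four steps and the flow of data between them look like this:

from dataclasses import dataclass

@dataclass
class Document:
    url: str
    text: str

def generate_urls(snapshot: str) -> list[str]:
    # Step 1: produce the archive URLs to fetch for a snapshot (illustrative URLs)
    return [f"https://data.example.com/{snapshot}/segment-{i}.warc.gz" for i in range(2)]

def download(url: str) -> bytes:
    # Step 2: fetch the raw archive (stubbed here)
    return b"<raw archive bytes>"

def iterate(raw_archive: bytes):
    # Step 3: split the raw archive into individual records
    yield {"url": "https://example.com/page", "html": "<p>Hello world</p>"}

def extract(record: dict) -> Document:
    # Step 4: extract clean text from one record
    return Document(url=record["url"], text="Hello world")

# Tasks flow through the steps: URLs -> raw archives -> records -> documents
documents = [
    extract(record)
    for url in generate_urls("2020-50")
    for record in iterate(download(url))
]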

Data sources provide composite stages that combine these steps into complete download-extract pipelines, producing DocumentBatch tasks for further processing.

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Create a pipeline for downloading Common Crawl data
pipeline = Pipeline(
    name="common_crawl_download",
    description="Download and process Common Crawl web archives"
)

# Add data loading stage
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="/tmp/cc_downloads",
    crawl_type="main",
    url_limit=10  # Limit for testing
)
pipeline.add_stage(cc_stage)

# Add writer stage to save as JSONL
writer = JsonlWriter(path="/output/folder")
pipeline.add_stage(writer)

# Build and execute pipeline
pipeline.build()
results = pipeline.run()
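
When the run completes, the extracted documents are written as JSONL files under /output/folder and can be read back into later pipelines with Curator’s reader stages (see Read Existing Data below).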

Data Sources & File Formats#

Load data from public datasets and custom data sources using Curator stages.

Read Existing Data
Read existing JSONL and Parquet datasets using Curator’s reader stages (see the sketch after this list).

Common Crawl
Download and process web archive data from Common Crawl.

Wikipedia
Download and extract articles from Wikipedia dumps.

Custom Data
Read and process your own text datasets in standard formats.
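
For example, the JSONL output written above can be loaded back into a pipeline with Curator’s JSONL reader stage. The following is a minimal sketch; the module path and the file_paths and fields parameters are assumptions that mirror the JsonlWriter API shown earlier.

from nemo_curator.pipeline import Pipeline
# Assumed module path, mirroring the JsonlWriter import used above
from nemo_curator.stages.text.io.reader import JsonlReader

# Read previously written JSONL documents back into DocumentBatch tasks
pipeline = Pipeline(
    name="read_existing_jsonl",
    description="Read JSONL output from an earlier download run"
)
pipeline.add_stage(
    JsonlReader(
        file_paths="/output/folder",  # directory produced by JsonlWriter above
        fields=["text"]  # assumed parameter: columns to load into each DocumentBatch
    )
)

# Build and execute pipeline
pipeline.build()
results = pipeline.run()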