Download Data
Load text data from ArXiv, Common Crawl, Wikipedia, and custom sources using Curator.
Curator provides a task-centric pipeline for downloading and processing large-scale public text datasets. It runs on Ray and converts raw formats like Common Crawl’s .warc.gz into JSONL.
How it Works
Curator uses a four-step pipeline pattern in which data flows through stages as tasks. Each step is implemented as a ProcessingStage that transforms tasks, following Curator's pipeline-based architecture.
Data sources provide composite stages that combine these steps into complete download-and-extract pipelines, producing DocumentBatch tasks for further processing.
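To make the pattern concrete, here is a minimal, self-contained sketch of the four-step flow (URL generation, download, iterate, extract) ending in a batch of documents. All class and method names here are illustrative placeholders, not Curator's actual API, and the download/extract logic is faked for brevity:

```python
from dataclasses import dataclass

@dataclass
class DocumentBatch:
    # One {"text": ...} dict per extracted document.
    records: list

class UrlGenerationStage:
    def process(self, source: str) -> list:
        # Expand a dataset identifier into concrete archive URLs (stubbed).
        return [f"https://example.com/{source}/segment-{i}.warc.gz" for i in range(2)]

class DownloadStage:
    def process(self, url: str) -> str:
        # Would download the archive to local disk; here we just derive a path.
        return f"/tmp/{url.rsplit('/', 1)[-1]}"

class IterateStage:
    def process(self, path: str) -> list:
        # Would iterate raw records out of the archive (e.g. WARC entries).
        return [{"raw": f"<html>doc from {path}</html>"}]

class ExtractStage:
    def process(self, record: dict) -> dict:
        # Would extract clean text from the raw record.
        text = record["raw"].replace("<html>", "").replace("</html>", "")
        return {"text": text}

def run_pipeline(source: str) -> DocumentBatch:
    # Chain the four stages; a composite stage bundles this wiring for you.
    records = []
    for url in UrlGenerationStage().process(source):
        path = DownloadStage().process(url)
        for raw in IterateStage().process(path):
            records.append(ExtractStage().process(raw))
    return DocumentBatch(records=records)

batch = run_pipeline("common-crawl")
print(len(batch.records))  # one extracted document per downloaded segment
```

In the real system these stages run distributed on Ray rather than in a local loop, but the task-in, task-out shape of each step is the same.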
Data Sources & File Formats
Load data from public datasets and custom data sources using Curator stages.
- Read existing JSONL and Parquet datasets using Curator's reader stages (jsonl, parquet)
- Download and extract web archive data from Common Crawl (web-data, warc, html-extraction)
- Download and extract Wikipedia articles from Wikipedia dumps (articles, multilingual, xml-dumps)
- Implement a download-and-extract pipeline for a custom data source (jsonl, parquet, file-partitioning)