Download and extract text from Common Crawl snapshots using Curator.
Common Crawl provides petabytes of web data collected over years of web crawling. The data uses a compressed web archive format (.warc.gz), which requires processing to extract useful text for language model training.
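To make the archive format concrete, the following is a minimal sketch of reading records from a gzip-compressed WARC byte string using only the Python standard library. It is not Curator's implementation. The helper names (make_sample_warc_gz, read_warc_records) and the sample record are hypothetical, and the sample payload is simplified: in real response records the block also starts with the HTTP response headers.

```python
import gzip
import io


def make_sample_warc_gz() -> bytes:
    """Build a tiny gzip-compressed WARC record in memory (hypothetical sample)."""
    payload = b"<html><body><p>Hello, Common Crawl!</p></body></html>"
    header_lines = [
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Target-URI: https://example.com/",
        b"Content-Length: " + str(len(payload)).encode("ascii"),
    ]
    # Headers end with a blank line; each record ends with two CRLF pairs.
    record = b"\r\n".join(header_lines) + b"\r\n\r\n" + payload + b"\r\n\r\n"
    return gzip.compress(record)


def read_warc_records(data: bytes):
    """Yield (headers, body) for each record in a gzip-compressed WARC byte string."""
    stream = io.BytesIO(gzip.decompress(data))
    while True:
        line = stream.readline()
        if not line:
            break  # end of data
        if not line.strip():
            continue  # blank separator lines between records
        # `line` is the version line (for example WARC/1.0); named headers follow.
        headers = {}
        for raw in iter(stream.readline, b""):
            if raw in (b"\r\n", b"\n"):
                break  # blank line marks the end of the header block
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


records = list(read_warc_records(make_sample_warc_gz()))
for headers, body in records:
    print(headers["WARC-Target-URI"], len(body))
```

The sketch loads the whole archive into memory for brevity; real WARC files are large, so a production reader would stream them instead.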
Curator’s Common Crawl processing pipeline consists of four sequential stages: URL generation, download, iteration over WARC records, and text extraction.
The pipeline outputs structured data that you can write to JSONL or Parquet files for further processing.
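As an illustration of the JSONL layout (one JSON object per line), here is a small sketch that writes and reads back a few documents using only the standard library. It does not use Curator's writer stages. The write_jsonl and read_jsonl helpers and the field names in the sample records are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical documents; the field names are illustrative only.
docs = [
    {"url": "https://example.com/a", "language": "ENGLISH", "text": "First document."},
    {"url": "https://example.com/b", "language": "ENGLISH", "text": "Second document."},
]

out_path = Path(tempfile.mkdtemp()) / "sample.jsonl"
write_jsonl(docs, out_path)
round_tripped = read_jsonl(out_path)
```

Because each line is an independent JSON object, downstream tools can process JSONL files in a streaming fashion or split them by line without parsing the whole file.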
Choose your download method and ensure you have the prerequisites:
S3 downloads (use_aws_to_download=True): requires s5cmd installed for fast S3 listing and copy operations.
Here’s how to create and run a Common Crawl processing pipeline:
For executor options and configuration, refer to the Execution Backends reference.
To write to Parquet files instead of JSONL, use ParquetWriter:
CommonCrawlDownloadExtractStage Parameters
The pipeline processes Common Crawl data through several stages and ultimately produces structured documents. Each output document includes the following fields:
Output Fields
If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).
Curator supports several HTML text extraction algorithms:
Available HTML Extractors
You can customize language detection and extraction by providing stop words for different languages:
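To show why stop words are useful in this setting, the sketch below computes the fraction of tokens in a text that are stop words. Running prose tends to score high, while navigation menus and other boilerplate tend to score low. This only illustrates the general idea and is not Curator's extraction logic. The stop_word_ratio function, the tiny English stop-word set, and the sample strings are all hypothetical.

```python
def stop_word_ratio(text, stop_words):
    """Return the fraction of whitespace-separated tokens that are stop words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    # Strip common punctuation so that "garden." and "garden" are treated alike.
    hits = sum(1 for token in tokens if token.strip(".,;:!?\"'()") in stop_words)
    return hits / len(tokens)


# Hypothetical, deliberately tiny stop-word list for English.
ENGLISH_STOP_WORDS = frozenset({"the", "and", "is", "of", "to", "in", "a"})

prose = "The cat is in the garden."
menu = "Home | About | Contact | Login"

prose_ratio = stop_word_ratio(prose, ENGLISH_STOP_WORDS)
menu_ratio = stop_word_ratio(menu, ENGLISH_STOP_WORDS)
print(f"prose: {prose_ratio:.2f}, menu: {menu_ratio:.2f}")
```

A per-language stop-word set lets the same check work across languages, which is why supplying your own lists can change which text blocks are kept.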
For Common Crawl News data, use the news crawl type with month-based snapshots:
See https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html for more information.
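Month-based snapshots can be sketched as follows: expand a start and end month into the list of months they cover, then build the WARC path listing URL for each month under the CC-NEWS directory. The YYYY-MM input format and the helper names (month_range, news_warc_paths_urls) are assumptions for this example, not Curator's API. The code only builds URL strings and performs no network requests.

```python
BASE_URL = "https://data.commoncrawl.org/crawl-data/CC-NEWS"


def month_range(start, end):
    """Yield (year, month) pairs from start to end inclusive; inputs are 'YYYY-MM'."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    if (start_year, start_month) > (end_year, end_month):
        raise ValueError("start must not be after end")
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        yield year, month
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1


def news_warc_paths_urls(start, end):
    """Build one warc.paths.gz listing URL per month in the range."""
    return [
        f"{BASE_URL}/{year}/{month:02d}/warc.paths.gz"
        for year, month in month_range(start, end)
    ]


urls = news_warc_paths_urls("2020-11", "2021-01")
for url in urls:
    print(url)
```

Each listing file contains the relative paths of the WARC files for that month, so the range above corresponds to three listings that span a year boundary.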
For production workloads, consider these optimizations: