Common Crawl
Download and extract text from Common Crawl snapshots using Curator.
Common Crawl provides petabytes of web data collected over years of web crawling. The data uses a compressed web archive format (.warc.gz), which requires processing to extract useful text for language model training.
How it Works
Curator’s Common Crawl processing pipeline consists of four sequential stages:
- URL Generation: Generates WARC file URLs from Common Crawl’s index for the specified snapshot range
- Download: Downloads the compressed WARC files from Common Crawl’s servers (optionally using S3 for faster downloads)
- Iteration: Extracts individual records from WARC files and decodes HTML content
- Extraction: Performs language detection and extracts clean text using configurable HTML extraction algorithms
The pipeline outputs structured data that you can write to JSONL or Parquet files for further processing.
Before You Start
Choose your download method and ensure you have the prerequisites:
- HTTPS downloads (default): No AWS account required.
- S3 downloads (set `use_aws_to_download=True`):
  - An AWS account with credentials configured (profile, environment, or instance role).
  - Common Crawl’s S3 access uses Requester Pays; you incur charges for requests and data transfer.
  - `s5cmd` installed for fast S3 listing and copy operations.
Usage
Here’s how to create and run a Common Crawl processing pipeline:
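A minimal sketch of such a pipeline is shown below. The import paths, stage name (`CommonCrawlDownloadExtractStage`), writer name, and parameter names are assumptions for illustration and may differ across Curator versions; check your installed version's API reference.

```python
# Sketch only: import paths, class names, and parameters are
# assumptions and may differ in your Curator version.
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="common_crawl_example")

# Download + extract stage covering a range of snapshots.
pipeline.add_stage(
    CommonCrawlDownloadExtractStage(
        start_snapshot="2021-04",           # first snapshot to process
        end_snapshot="2021-10",             # last snapshot to process
        download_dir="/data/cc/downloads",  # where WARC files are cached
        crawl_type="main",                  # "main" or "news"
        use_aws_to_download=False,          # set True for S3 + s5cmd
    )
)

# Write the extracted records as JSONL.
pipeline.add_stage(JsonlWriter(path="/data/cc/output"))

pipeline.run()
```

The snapshot identifiers follow Common Crawl's `YYYY-WW` naming for main crawls (for example, `2021-04` is the CC-MAIN-2021-04 snapshot).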
For executor options and configuration, refer to Execution Backends.
Writing to Parquet
To write to Parquet files instead of JSONL, use ParquetWriter:
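For example (the import path below is an assumption; only the output stage of your pipeline changes):

```python
# Import path is an assumption for illustration.
from nemo_curator.stages.text.io.writer import ParquetWriter

# Replace the JSONL writer stage in your pipeline with a Parquet writer:
pipeline.add_stage(ParquetWriter(path="/data/cc/output_parquet"))
```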
Parameters
Output Format
The pipeline processes Common Crawl data through several stages, ultimately producing structured documents. Each output record includes fields such as the extracted text, the detected language, and the source URL.
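As an illustration, a single extracted record might look like the following. The field names here are assumptions based on typical Curator output, not a guaranteed schema:

```python
import json

# Hypothetical example of one extracted record (field names are
# illustrative assumptions, not a guaranteed schema).
record = {
    "text": "Clean text extracted from the page ...",
    "language": "ENGLISH",         # detected during extraction
    "url": "https://example.com/page.html",
    "warc_id": "<urn:uuid:1234>",  # ID of the source WARC record
    "source_id": "CC-MAIN-2021-04",
}

# Each record becomes one line of JSONL (or one Parquet row).
line = json.dumps(record)
```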
If you enable `add_filename_column`, the output includes an extra `file_name` field (or your custom column name).
Customization Options
HTML Text Extraction Algorithms
Curator supports several HTML text extraction algorithms, including jusText, Resiliparse, and Trafilatura.
Configuring HTML Extractors
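For instance, you might select and tune the jusText extractor as sketched below. The class name, import path, and parameter names are assumptions for illustration; the thresholds mirror jusText's standard knobs for classifying HTML blocks as boilerplate or content:

```python
# Class name, import path, and parameters are assumptions.
from nemo_curator.download import JusTextExtractor

# jusText scores HTML blocks using link density, block length,
# and stop-word density thresholds.
extractor = JusTextExtractor(
    length_low=70,       # blocks shorter than this lean toward boilerplate
    length_high=200,     # blocks longer than this lean toward content
    stopwords_low=0.30,  # minimum stop-word density for "maybe content"
    stopwords_high=0.32, # stop-word density for "likely content"
)

# Pass the configured extractor to the processing stage, e.g.:
# CommonCrawlDownloadExtractStage(..., html_extraction=extractor)
```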
Language Processing
You can customize language detection and extraction by providing stop words for different languages:
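A small sketch of supplying per-language stop words follows; the `stop_lists` parameter name and the language keys are assumptions for illustration:

```python
# Mapping of language name -> stop words, used for language-aware
# boilerplate detection. Parameter name is an assumption.
stop_lists = {
    "ENGLISH": frozenset(["the", "and", "is", "of", "to"]),
    "SPANISH": frozenset(["el", "la", "y", "de", "que"]),
}

# Supply the lists when constructing the stage, e.g.:
# CommonCrawlDownloadExtractStage(..., stop_lists=stop_lists)
```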
Advanced Usage
Processing CC-NEWS Data
For Common Crawl News data, use the news crawl type with month-based snapshots:
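A sketch of a news configuration is shown below; the stage name and parameters are assumptions. Note that CC-NEWS snapshots are organized by month (`YYYY-MM`) rather than by year-week like the main crawl:

```python
# Stage name and parameters are assumptions for illustration.
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage

stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2021-04",  # April 2021
    end_snapshot="2021-06",    # June 2021
    download_dir="/data/cc_news/downloads",
    crawl_type="news",
)
```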
See https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html for more information.
Large-Scale Processing
For production workloads, consider optimizations such as enabling S3 downloads with `s5cmd`, validating your configuration on a small subset of WARC files before launching a full run, and writing Parquet output for efficient downstream processing.
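One practical pattern is a cheap smoke test: cap the number of WARC files processed while you verify snapshot ranges, credentials, and output paths. The `url_limit` parameter name below is an assumption for illustration:

```python
# url_limit is an assumed parameter name for capping the number of
# WARC files processed - useful for a cheap smoke test before a
# full-scale run.
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage

smoke_test_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2021-04",
    end_snapshot="2021-04",
    download_dir="/data/cc/smoke",
    url_limit=5,  # process only the first 5 WARC URLs
)
```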