Common Crawl
Download and extract text from Common Crawl snapshots using Curator.
Common Crawl provides petabytes of web data collected over years of web crawling. The data uses a compressed web archive format (.warc.gz), which requires processing to extract useful text for language model training.
How it Works
Curator’s Common Crawl processing pipeline consists of four sequential stages:
- URL Generation: Generates WARC file URLs from Common Crawl’s index for the specified snapshot range
- Download: Downloads the compressed WARC files from Common Crawl’s servers (optionally using S3 for faster downloads)
- Iteration: Extracts individual records from WARC files and decodes HTML content
- Extraction: Performs language detection and extracts clean text using configurable HTML extraction algorithms
The pipeline outputs structured data that you can write to JSONL or Parquet files for further processing.
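As a concrete illustration of the first stage, each snapshot publishes a gzipped list of relative WARC paths at a well-known location under `data.commoncrawl.org`. The helpers below are not part of Curator's API; they only sketch the index layout that URL generation works from:

```python
# Illustration of the URL-generation step (these helpers are not Curator API;
# the layout is Common Crawl's published one: each snapshot lists its WARC
# files in crawl-data/<SNAPSHOT>/warc.paths.gz).
BASE = "https://data.commoncrawl.org"

def warc_paths_url(snapshot: str) -> str:
    """URL of the gzipped WARC path listing for one snapshot, e.g. CC-MAIN-2021-04."""
    return f"{BASE}/crawl-data/{snapshot}/warc.paths.gz"

def warc_file_urls(paths: list[str]) -> list[str]:
    """Expand relative paths from warc.paths.gz into full download URLs."""
    return [f"{BASE}/{p}" for p in paths]

print(warc_paths_url("CC-MAIN-2021-04"))
# https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-04/warc.paths.gz
```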
WARC Record Reader
For pipelines that already have WARC metadata (such as warc_filename, warc_record_offset, and warc_record_length columns from a CC Index lookup), use CommonCrawlWARCReader to fetch individual WARC records directly via byte-range requests — without downloading full WARC files.
CommonCrawlWARCReader supports two transport modes:
- HTTPS (default): Fetches records from `data.commoncrawl.org` using the `requests` library. No AWS credentials required.
- S3: Fetches records from the `commoncrawl` S3 bucket using `boto3` range requests. Activate with `use_s3=True` or by setting the `CC_USE_S3=1` environment variable. Credentials are resolved through boto3's standard chain (environment variables, `~/.aws/config`, instance profiles).
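A minimal sketch of what the HTTPS transport does under the hood: each record in a Common Crawl WARC file is an independently compressed gzip member, so a single `Range` request for `(offset, length)` returns bytes that decompress on their own. The function names here are illustrative, not Curator API:

```python
import gzip

import requests

def range_header(offset: int, length: int) -> dict:
    """HTTP Range header covering exactly one WARC record."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

def fetch_warc_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Fetch and decompress one record identified by CC Index metadata."""
    resp = requests.get(
        f"https://data.commoncrawl.org/{warc_filename}",
        headers=range_header(offset, length),
        timeout=60,
    )
    resp.raise_for_status()
    # Each record is its own gzip member, so the ranged bytes decompress cleanly.
    return gzip.decompress(resp.content)
```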
Before You Start
Choose your download method and ensure you have the prerequisites:
- HTTPS downloads (default): No AWS account required.
- S3 downloads (set `use_aws_to_download=True`):
  - An AWS account with credentials configured (profile, environment, or instance role).
  - Common Crawl's S3 access uses Requester Pays; you incur charges for requests and data transfer.
  - `s5cmd` installed for fast S3 listing and copy operations.
Usage
Here’s how to create and run a Common Crawl processing pipeline:
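The sketch below wires up the four stages. The import paths and most constructor arguments are assumptions (apart from `use_aws_to_download` and the crawl type, which appear elsewhere on this page); confirm them against your installed Curator version:

```python
# Hedged sketch -- import paths and most parameter names are assumptions.
def build_pipeline(download_dir: str = "./cc_downloads"):
    from nemo_curator.pipeline import Pipeline  # hypothetical import path
    from nemo_curator.stages.text.download.common_crawl import (
        CommonCrawlDownloadExtractStage,  # hypothetical import path
    )
    from nemo_curator.stages.text.io.writer import JsonlWriter  # hypothetical

    pipeline = Pipeline(name="common_crawl")
    pipeline.add_stage(
        CommonCrawlDownloadExtractStage(
            start_snapshot="2021-04",   # first snapshot in the range
            end_snapshot="2021-10",     # last snapshot in the range
            download_dir=download_dir,  # where .warc.gz files land
            crawl_type="main",          # "news" for CC-NEWS (see Advanced Usage)
            use_aws_to_download=False,  # True -> S3 downloads via s5cmd
        )
    )
    pipeline.add_stage(JsonlWriter(path="./cc_output"))  # JSONL output
    return pipeline
```

Calling `pipeline.run()` then executes all four stages with the default executor.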
For executor options and configuration, refer to Execution Backends.
Writing to Parquet
To write to Parquet files instead of JSONL, use ParquetWriter:
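A sketch of the writer swap; the import path is an assumption:

```python
def build_parquet_writer(output_dir: str = "./cc_output_parquet"):
    from nemo_curator.stages.text.io.writer import ParquetWriter  # hypothetical path
    # Drop-in replacement for a JsonlWriter stage at the end of the pipeline.
    return ParquetWriter(path=output_dir)
```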
Parameters
Output Format
The pipeline processes Common Crawl data through several stages, ultimately producing structured documents. The extracted text includes the following fields:
If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).
Customization Options
HTML Text Extraction Algorithms
Curator supports several HTML text extraction algorithms:
Configuring HTML Extractors
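As one hedged example, the `JusTextExtractor` named later on this page can be passed to the download stage. The import paths and the `html_extraction` parameter name are assumptions:

```python
def build_stage_with_justext():
    from nemo_curator.stages.text.download.common_crawl import (
        CommonCrawlDownloadExtractStage,  # hypothetical import path
        JusTextExtractor,                 # hypothetical import path
    )
    return CommonCrawlDownloadExtractStage(
        start_snapshot="2021-04",
        end_snapshot="2021-10",
        download_dir="./cc_downloads",
        html_extraction=JusTextExtractor(),  # parameter name is an assumption
    )
```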
Language Processing
You can customize language detection and extraction by providing stop words for different languages:
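A hedged sketch: the stage is assumed to accept a mapping from language name to stop-word set, and the `stop_lists` parameter name and import path are assumptions:

```python
def build_stage_with_stop_lists():
    from nemo_curator.stages.text.download.common_crawl import (
        CommonCrawlDownloadExtractStage,  # hypothetical import path
    )
    # Tiny illustrative stop-word set; real lists are far larger.
    stop_lists = {"THAI": frozenset(["ที่", "และ", "ใน"])}
    return CommonCrawlDownloadExtractStage(
        start_snapshot="2021-04",
        end_snapshot="2021-10",
        download_dir="./cc_downloads",
        stop_lists=stop_lists,  # parameter name is an assumption
    )
```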
WARC Record Reader Usage
Use CommonCrawlWARCReader when your dataset already contains WARC metadata columns from a CC Index lookup:
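A sketch of wiring the reader into a pipeline. Only `use_s3` is documented in the transport section above; the import path is an assumption:

```python
def build_warc_reader(use_s3: bool = False):
    from nemo_curator.stages.text.download.common_crawl import (
        CommonCrawlWARCReader,  # hypothetical import path
    )
    # Input rows must already carry warc_filename, warc_record_offset, and
    # warc_record_length columns (e.g. from a CC Index lookup).
    return CommonCrawlWARCReader(use_s3=use_s3)
```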
WARC Record Reader Parameters
Environment Variables
You can configure CommonCrawlWARCReader S3 transport using environment variables instead of constructor parameters:
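For example, in the shell that launches your job (`CC_USE_S3` is documented above; the profile name is illustrative):

```shell
# Select the S3 transport without code changes.
export CC_USE_S3=1
# Credentials resolve through boto3's standard chain.
export AWS_PROFILE=my-cc-profile
```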
Advanced Usage
Processing CC-NEWS Data
For Common Crawl News data, use the news crawl type with month-based snapshots:
See https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html for more information.
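A sketch of the news configuration; the import path and the exact `crawl_type` literal are assumptions:

```python
def build_news_stage():
    from nemo_curator.stages.text.download.common_crawl import (
        CommonCrawlDownloadExtractStage,  # hypothetical import path
    )
    return CommonCrawlDownloadExtractStage(
        start_snapshot="2021-04",   # CC-NEWS snapshots are year-month values
        end_snapshot="2021-10",
        download_dir="./cc_news",
        crawl_type="news",          # exact literal assumed from this section
    )
```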
Large-Scale Processing
For production workloads, consider these optimizations:
Memory Management for Extraction
JusText extraction relies on lxml/libxml2, which can cause C-heap memory fragmentation during long-running jobs. Over many WARC files, this fragmentation causes resident memory to grow until workers run out of memory.
To mitigate this, Curator automatically sets extractor_max_calls_per_worker=2 when using JusTextExtractor. This restarts Ray Data worker processes every two tasks, reclaiming fragmented memory. You can override this value:
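For example (import path assumed; `extractor_max_calls_per_worker` is the documented override):

```python
def build_stage_with_recycling(max_calls: int = 5):
    from nemo_curator.stages.text.download.common_crawl import (
        CommonCrawlDownloadExtractStage,  # hypothetical import path
    )
    return CommonCrawlDownloadExtractStage(
        start_snapshot="2021-04",
        end_snapshot="2021-10",
        download_dir="./cc_downloads",
        # Restart each Ray Data worker after `max_calls` tasks instead of
        # the automatic default of 2 applied with JusTextExtractor.
        extractor_max_calls_per_worker=max_calls,
    )
```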
Worker recycling is only supported with the Ray Data executor and applies to task-based stages (not actor-based). For custom extraction stages that use C libraries prone to memory fragmentation, set max_calls_per_worker on DocumentIterateExtractStage directly.