Download and extract text from Common Crawl snapshots using Curator.
Common Crawl provides petabytes of web data collected over years of web crawling. The data uses a compressed web archive format (.warc.gz), which requires processing to extract useful text for language model training.
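To make the archive format concrete, the sketch below builds one gzip-compressed WARC response record in memory and reads its headers and payload back using only the Python standard library. The helper names (build_warc_record, parse_warc_record) and the record contents are illustrative assumptions, and the record omits fields that real WARC files carry (such as WARC-Record-ID and WARC-Date). This is not how Curator parses WARC files.

```python
import gzip


def build_warc_record(target_uri: str, http_payload: bytes) -> bytes:
    """Return one gzip-compressed, simplified WARC/1.0 response record (illustrative)."""
    header_block = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # A record is its header block, the payload, and a trailing pair of CRLFs.
    return gzip.compress(header_block + http_payload + b"\r\n\r\n")


def parse_warc_record(compressed: bytes):
    """Decompress one record and split it into (headers dict, payload bytes)."""
    raw = gzip.decompress(compressed)
    # The first blank line ends the WARC headers; everything after is the payload.
    header_block, _, rest = raw.partition(b"\r\n\r\n")
    lines = header_block.decode("utf-8").split("\r\n")
    headers = {}
    for line in lines[1:]:  # lines[0] is the "WARC/1.0" version line
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    return headers, rest[:length]


# Hypothetical payload: an HTTP response wrapping a small HTML page.
payload = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html><body>Hello</body></html>"
)
record = build_warc_record("https://example.com/", payload)
headers, body = parse_warc_record(record)
```

The payload returned here is still a raw HTTP response containing HTML, which is why a separate text extraction step is needed before the content is usable for training.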
Curator’s Common Crawl processing pipeline consists of four sequential stages:
The pipeline outputs structured data that you can write to JSONL or Parquet files for further processing.
For pipelines that already have WARC metadata (such as warc_filename, warc_record_offset, and warc_record_length columns from a CC Index lookup), use CommonCrawlWARCReader to fetch individual WARC records directly via byte-range requests, without downloading full WARC files.
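As a sketch of what a byte-range fetch involves, the function below turns the three metadata columns into a URL and an HTTP Range header. It only illustrates the arithmetic and is not the CommonCrawlWARCReader implementation. The name build_range_request and the example file path are hypothetical.

```python
def build_range_request(
    warc_filename: str,
    offset: int,
    length: int,
    base_url: str = "https://data.commoncrawl.org/",
):
    """Return (url, headers) for fetching a single WARC record by byte range."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends, hence the "- 1".
    last_byte = offset + length - 1
    url = base_url.rstrip("/") + "/" + warc_filename.lstrip("/")
    return url, {"Range": f"bytes={offset}-{last_byte}"}


# Placeholder path, not a real Common Crawl file.
url, headers = build_range_request("crawl-data/example.warc.gz", 1000, 500)
```

Passing these headers to an HTTP client would return only the requested 500 bytes. Common Crawl compresses each record separately, so a slice like this should decompress with gzip without needing the rest of the file.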
CommonCrawlWARCReader supports two transport modes:
HTTPS (default): fetches from data.commoncrawl.org using the requests library. No AWS credentials required.
S3: fetches from the commoncrawl S3 bucket using boto3 range requests. Activate with use_s3=True or by setting the CC_USE_S3=1 environment variable. Credentials are resolved through boto3’s standard chain (environment variables, ~/.aws/config, instance profiles).
Choose your download method and ensure you have the prerequisites:
S3 download (use_aws_to_download=True):
s5cmd installed for fast S3 listing and copy operations.
Here’s how to create and run a Common Crawl processing pipeline:
For executor options and configuration, refer to Execution Backends.
To write to Parquet files instead of JSONL, use ParquetWriter:
The pipeline processes Common Crawl data through several stages, ultimately producing structured documents. The extracted text includes the following fields:
If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).
Curator supports several HTML text extraction algorithms:
You can customize language detection and extraction by providing stop words for different languages:
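To show why stop word lists matter, here is a minimal sketch of stop-word density scoring, the general idea behind stop-word-based extraction and language checks. It is not Curator's implementation. The function names, the tiny word lists, the uppercase language keys, and the 0.2 threshold are all assumptions chosen for illustration.

```python
import re


def stopword_density(text: str, stop_words) -> float:
    """Fraction of word tokens in `text` that appear in `stop_words`."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in stop_words for token in tokens) / len(tokens)


def guess_language(text: str, stop_lists, min_density: float = 0.2):
    """Return the language whose stop words are densest, or None if none qualifies."""
    best_language, best_density = None, 0.0
    for language, stop_words in stop_lists.items():
        density = stopword_density(text, stop_words)
        if density > best_density:
            best_language, best_density = language, density
    return best_language if best_density >= min_density else None


# Deliberately tiny, illustrative stop word lists.
STOP_LISTS = {
    "ENGLISH": frozenset({"the", "and", "is", "of", "to", "in", "a"}),
    "GERMAN": frozenset({"der", "die", "und", "ist", "das", "in", "zu"}),
}

english = guess_language("The cat is in the garden and the dog is too", STOP_LISTS)
german = guess_language("Der Hund ist in dem Garten und die Katze ist zu Hause", STOP_LISTS)
```

Text with few or no stop words, such as navigation menus or lists of numbers, scores low for every language. That is what makes stop word density useful for separating running prose from boilerplate.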
Use CommonCrawlWARCReader when your dataset already contains WARC metadata columns from a CC Index lookup:
You can configure CommonCrawlWARCReader S3 transport using environment variables instead of constructor parameters:
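The sketch below shows one way the two activation routes described earlier (use_s3=True or CC_USE_S3=1) could combine into a single decision. The helper resolve_use_s3 is hypothetical and only mirrors the documented behavior; the reader's actual resolution logic may differ.

```python
import os


def resolve_use_s3(use_s3: bool = False, environ=None) -> bool:
    """Return True when S3 transport is requested by argument or environment."""
    env = os.environ if environ is None else environ
    # Only the documented value "1" is treated as enabling S3 here.
    return bool(use_s3) or env.get("CC_USE_S3") == "1"


# Explicit dictionaries stand in for the process environment in this example.
default_mode = resolve_use_s3(environ={})
env_enabled = resolve_use_s3(environ={"CC_USE_S3": "1"})
arg_enabled = resolve_use_s3(use_s3=True, environ={})
```

In a real job, the variable would be exported in the worker environment before the pipeline starts, for example with os.environ["CC_USE_S3"] = "1" in the driver script.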
For Common Crawl News data, use the news crawl type with month-based snapshots:
See https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html for more information.
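As a sketch of what month-based snapshots mean in practice, the helpers below expand a start and end month into the list of YYYY-MM snapshots between them and build the per-month WARC path listing URL. Both function names are hypothetical, and the URL layout is an assumption based on how CC-NEWS is organized by year and month on data.commoncrawl.org.

```python
def monthly_snapshots(start: str, end: str):
    """List every YYYY-MM snapshot from `start` to `end`, inclusive."""
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    for month in (start_month, end_month):
        if not 1 <= month <= 12:
            raise ValueError("months must be between 01 and 12")
    snapshots = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        snapshots.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return snapshots


def news_warc_paths_url(snapshot: str) -> str:
    """Build the assumed URL of the WARC path listing for one CC-NEWS month."""
    year, month = snapshot.split("-")
    return f"https://data.commoncrawl.org/crawl-data/CC-NEWS/{year}/{month}/warc.paths.gz"


months = monthly_snapshots("2020-11", "2021-02")
listing_url = news_warc_paths_url("2020-08")
```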
For production workloads, consider these optimizations:
JusText extraction relies on lxml/libxml2, which can cause C-heap memory fragmentation during long-running jobs. Over many WARC files, this fragmentation causes resident memory to grow until workers run out of memory.
To mitigate this, Curator automatically sets extractor_max_calls_per_worker=2 when using JusTextExtractor. This restarts Ray Data worker processes every two tasks, reclaiming fragmented memory. You can override this value:
Worker recycling is only supported with the Ray Data executor and applies to task-based stages (not actor-based). For custom extraction stages that use C libraries prone to memory fragmentation, set max_calls_per_worker on DocumentIterateExtractStage directly.
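The recycling idea can be illustrated outside Ray with the standard library alone. The sketch below is not Curator's or Ray's implementation. It runs each batch of max_calls_per_worker tasks in a fresh Python subprocess, so any memory the worker fragmented is returned to the operating system when that process exits. The function run_with_recycling and the trivial worker script are hypothetical.

```python
import json
import subprocess
import sys

# Source of a throwaway worker: reads a JSON list of tasks from stdin and
# reports which process handled each one.
WORKER_SOURCE = """
import json, os, sys
tasks = json.loads(sys.stdin.read())
print(json.dumps([{"task": task, "pid": os.getpid()} for task in tasks]))
"""


def run_with_recycling(tasks, max_calls_per_worker: int = 2):
    """Process tasks in order, starting a new worker process every N tasks."""
    if max_calls_per_worker < 1:
        raise ValueError("max_calls_per_worker must be at least 1")
    results = []
    for start in range(0, len(tasks), max_calls_per_worker):
        batch = tasks[start:start + max_calls_per_worker]
        # Each batch gets its own interpreter, which exits when the batch is done.
        completed = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE],
            input=json.dumps(batch),
            capture_output=True,
            text=True,
            check=True,
        )
        results.extend(json.loads(completed.stdout))
    return results


results = run_with_recycling(["a", "b", "c", "d", "e"], max_calls_per_worker=2)
```

A lower value reclaims memory more often but pays the process startup cost more frequently, which is the trade-off to weigh when overriding the default.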