Common Crawl

Download and extract text from Common Crawl snapshots using Curator.

Common Crawl provides petabytes of web data collected over years of web crawling. The data uses a compressed web archive format (.warc.gz), which requires processing to extract useful text for language model training.

How it Works

Curator’s Common Crawl processing pipeline consists of four sequential stages:

  1. URL Generation: Generates WARC file URLs from Common Crawl’s index for the specified snapshot range
  2. Download: Downloads the compressed WARC files from Common Crawl’s servers (optionally using S3 for faster downloads)
  3. Iteration: Extracts individual records from WARC files and decodes HTML content
  4. Extraction: Performs language detection and extracts clean text using configurable HTML extraction algorithms

The pipeline outputs structured data that you can write to JSONL or Parquet files for further processing.
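Independent of Curator, the index layout the URL-generation stage relies on can be sketched: each CC-MAIN snapshot publishes a `warc.paths.gz` file listing its WARC files. The URL pattern below is an assumption based on the public data.commoncrawl.org layout, not Curator's internal code.

```python
# Sketch of Common Crawl's public index layout (assumption: the
# standard data.commoncrawl.org path scheme for CC-MAIN snapshots).
def warc_paths_url(snapshot: str) -> str:
    """Return the URL of the file listing all WARC files in a snapshot."""
    return f"https://data.commoncrawl.org/crawl-data/CC-MAIN-{snapshot}/warc.paths.gz"

print(warc_paths_url("2020-50"))
```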

WARC Record Reader

For pipelines that already have WARC metadata (such as warc_filename, warc_record_offset, and warc_record_length columns from a CC Index lookup), use CommonCrawlWARCReader to fetch individual WARC records directly via byte-range requests — without downloading full WARC files.

CommonCrawlWARCReader supports two transport modes:

  • HTTPS (default): Fetches records from data.commoncrawl.org using the requests library. No AWS credentials required.
  • S3: Fetches records from the commoncrawl S3 bucket using boto3 range requests. Activate with use_s3=True or by setting the CC_USE_S3=1 environment variable. Credentials are resolved through boto3’s standard chain (environment variables, ~/.aws/config, instance profiles).
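The byte-range mechanism underlying both transports is plain HTTP/S3 range semantics. As an illustration only (a sketch, not CommonCrawlWARCReader's actual code), the header selecting a single record from its offset and length looks like this:

```python
# Illustration of the byte-range request used to fetch one WARC record
# (a sketch, not the library's implementation).
def record_range_header(offset: int, length: int) -> dict[str, str]:
    """HTTP Range header selecting one WARC record; byte ranges are inclusive."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

# A record at offset 1000 with length 500 spans bytes 1000..1499.
print(record_range_header(1000, 500))
```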

Before You Start

Choose your download method and ensure you have the prerequisites:

  • HTTPS downloads (default): No AWS account required.
  • S3 downloads (set use_aws_to_download=True):
    • An AWS account with credentials configured (profile, environment, or instance role).
    • Common Crawl’s S3 access uses Requester Pays; you incur charges for requests and data transfer.
    • s5cmd installed for fast S3 listing and copy operations:
```shell
# Install s5cmd for faster S3 downloads
pip install s5cmd
```

Usage

Here’s how to create and run a Common Crawl processing pipeline:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter


def main():
    # Initialize Ray client
    ray_client = RayClient()
    ray_client.start()

    # Create pipeline
    pipeline = Pipeline(
        name="common_crawl_pipeline",
        description="Download and process Common Crawl data",
    )

    # Add Common Crawl processing stage
    cc_stage = CommonCrawlDownloadExtractStage(
        start_snapshot="2020-50",  # YYYY-WW format for CC-MAIN
        end_snapshot="2020-50",
        download_dir="./cc_downloads",
        crawl_type="main",  # or "news"
        use_aws_to_download=True,  # Faster S3 downloads (requires s5cmd)
        url_limit=10,  # Limit number of WARC files for testing
        record_limit=1000,  # Limit records per WARC file
    )
    pipeline.add_stage(cc_stage)

    # Add output writer stage
    writer = JsonlWriter("./cc_output")
    pipeline.add_stage(writer)

    # Run pipeline
    results = pipeline.run()

    # Stop Ray client
    ray_client.stop()


if __name__ == "__main__":
    main()
```

For executor options and configuration, refer to Execution Backends.

Writing to Parquet

To write to Parquet files instead of JSONL, use ParquetWriter:

```python
from nemo_curator.stages.text.io.writer import ParquetWriter

# Replace the JSONL writer with ParquetWriter
writer = ParquetWriter("./cc_output_parquet")
pipeline.add_stage(writer)
```

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `start_snapshot` | `str` | First snapshot to include (format: `"YYYY-WW"` for main, `"YYYY-MM"` for news). Not every year and week has a snapshot; refer to the official list at https://data.commoncrawl.org/. | Required |
| `end_snapshot` | `str` | Last snapshot to include (same format as `start_snapshot`). Ensure your range includes at least one valid snapshot. | Required |
| `download_dir` | `str` | Directory to store downloaded WARC files | Required |
| `crawl_type` | `Literal["main", "news"]` | Whether to use the CC-MAIN or CC-NEWS dataset | `"main"` |
| `html_extraction` | `HTMLExtractorAlgorithm \| str \| None` | Text extraction algorithm to use | `JusTextExtractor()` if not specified |
| `html_extraction_kwargs` | `dict \| None` | Additional arguments for the HTML extractor. Ignored when `html_extraction` is a concrete extractor object (for example, `JusTextExtractor()`); pass kwargs to the extractor constructor instead. When `html_extraction` is a string (`"justext"`, `"resiliparse"`, or `"trafilatura"`), kwargs are forwarded. | `None` |
| `stop_lists` | `dict[str, frozenset[str]] \| None` | Language-specific stop words for text quality assessment. If not provided, Curator uses jusText defaults with additional support for Thai, Chinese, and Japanese. | `None` |
| `use_aws_to_download` | `bool` | Use S3 downloads via s5cmd instead of HTTPS (requires s5cmd installation) | `False` |
| `verbose` | `bool` | Enable verbose logging for download operations | `False` |
| `url_limit` | `int \| None` | Maximum number of WARC files to download (useful for testing) | `None` |
| `record_limit` | `int \| None` | Maximum number of records to extract per WARC file | `None` |
| `add_filename_column` | `bool \| str` | Whether to add a source filename column to the output; if a string, it is used as the column name (default name: `"file_name"`) | `True` |
| `extractor_max_calls_per_worker` | `int \| None` | Restart Ray Data worker processes after this many tasks to mitigate memory fragmentation | Auto (`2` for jusText, `None` otherwise) |

Output Format

The pipeline produces structured documents with the following fields:

```json
{
  "url": "http://example.com/page.html",
  "warc_id": "a515a7b6-b6ec-4bed-998b-8be2f86f8eac",
  "source_id": "CC-MAIN-20201123153826-20201123183826-00000.warc.gz",
  "language": "ENGLISH",
  "text": "Extracted web page content..."
}
```

| Field | Description |
| --- | --- |
| `url` | Original URL of the web page |
| `warc_id` | Unique identifier for the WARC record |
| `source_id` | Name of the source WARC file |
| `language` | Detected language of the content (for example, `"ENGLISH"`, `"SPANISH"`) |
| `text` | Extracted and cleaned text content |

If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).

Customization Options

HTML Text Extraction Algorithms

Curator supports several HTML text extraction algorithms:

| Extractor | Library |
| --- | --- |
| `JusTextExtractor` | jusText |
| `ResiliparseExtractor` | Resiliparse |
| `TrafilaturaExtractor` | Trafilatura |

Configuring HTML Extractors

```python
from nemo_curator.stages.text.download.html_extractors import ResiliparseExtractor
from nemo_curator.stages.text.download.html_extractors import TrafilaturaExtractor

# Use Resiliparse for extraction
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    html_extraction=ResiliparseExtractor(
        required_stopword_density=0.25,
        main_content=True,
    ),
)

# Or use Trafilatura with custom parameters
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    html_extraction=TrafilaturaExtractor(
        min_extracted_size=200,
        max_repetitions=3,
    ),
)
```

Language Processing

You can customize language detection and extraction by providing stop words for different languages:

```python
# Define custom stop words for specific languages
stop_lists = {
    "ENGLISH": frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"]),
    "SPANISH": frozenset(["el", "la", "de", "que", "y", "en", "un", "es", "se", "no"]),
}

cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    stop_lists=stop_lists,
)
```

WARC Record Reader Usage

Use CommonCrawlWARCReader when your dataset already contains WARC metadata columns from a CC Index lookup:

```python
from nemo_curator.stages.text.download.common_crawl.download import CommonCrawlWARCReader

# HTTPS transport (default)
warc_reader = CommonCrawlWARCReader(
    warc_filename_col="warc_filename",
    warc_record_offset_col="warc_record_offset",
    warc_record_length_col="warc_record_length",
    max_workers=16,
)

# S3 transport
warc_reader = CommonCrawlWARCReader(
    warc_filename_col="warc_filename",
    warc_record_offset_col="warc_record_offset",
    warc_record_length_col="warc_record_length",
    use_s3=True,
    max_workers=16,
)
```

WARC Record Reader Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `warc_filename_col` | `str` | Column name containing the WARC filename | `"warc_filename"` |
| `warc_record_offset_col` | `str` | Column name containing the byte offset | `"warc_record_offset"` |
| `warc_record_length_col` | `str` | Column name containing the record length | `"warc_record_length"` |
| `binary_content_col` | `str` | Output column name for fetched content | `"binary_content"` |
| `drop_failed` | `bool` | Drop rows where the fetch failed | `True` |
| `max_workers` | `int` | Number of parallel threads for fetching | `16` |
| `timeout` | `int` | Request timeout in seconds | `30` |
| `max_retries` | `int` | Number of retries for failed requests | `3` |
| `use_s3` | `bool \| None` | Use S3 transport instead of HTTPS. If `None`, reads the `CC_USE_S3` environment variable (accepted values: `1`, `true`, `yes`). | `None` |
| `s3_bucket` | `str \| None` | S3 bucket name. Falls back to the `CC_S3_BUCKET` environment variable, then `"commoncrawl"`. | `None` |
| `s3_key_prefix` | `str \| None` | Prefix to strip from `warc_filename` when building the S3 object key. Falls back to the `CC_S3_KEY_PREFIX` environment variable. | `None` |
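How the S3 object key is derived from these parameters can be sketched as follows. This is an assumption based on the parameter descriptions above, not the library's verbatim code, and the example filename is hypothetical:

```python
from __future__ import annotations

# Sketch of S3 key construction from the reader's parameters
# (assumption: the prefix is stripped only when it actually matches).
def build_s3_key(warc_filename: str, s3_key_prefix: str | None = None) -> str:
    """Strip an optional prefix from the WARC filename to form the object key."""
    if s3_key_prefix and warc_filename.startswith(s3_key_prefix):
        return warc_filename[len(s3_key_prefix):]
    return warc_filename

# Hypothetical filename: strip a duplicated "crawl-data/" prefix
print(build_s3_key("crawl-data/example-00000.warc.gz", "crawl-data/"))
```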

Environment Variables

You can configure CommonCrawlWARCReader S3 transport using environment variables instead of constructor parameters:

| Variable | Description | Example |
| --- | --- | --- |
| `CC_USE_S3` | Enable S3 transport | `1`, `true`, `yes` |
| `CC_S3_BUCKET` | Override the S3 bucket name | `commoncrawl` |
| `CC_S3_KEY_PREFIX` | Prefix to strip from `warc_filename` for S3 key construction | `crawl-data/` |
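For example, to enable S3 transport for a whole shell session (values shown are the documented examples above):

```shell
# Enable S3 transport for CommonCrawlWARCReader via the environment
export CC_USE_S3=1
# Optional overrides, shown with the documented example values
export CC_S3_BUCKET=commoncrawl
export CC_S3_KEY_PREFIX=crawl-data/
```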

Advanced Usage

Processing CC-NEWS Data

For Common Crawl News data, use the news crawl type with month-based snapshots:

```python
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-08",  # YYYY-MM format for CC-NEWS
    end_snapshot="2020-10",
    download_dir="./news_downloads",
    crawl_type="news",  # Use CC-NEWS instead of CC-MAIN
)
```

See https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html for more information.

Large-Scale Processing

For production workloads, consider these optimizations:

```python
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="/fast_storage/cc_downloads",
    use_aws_to_download=True,  # Faster S3 downloads
    verbose=False,  # Reduce logging overhead
    # Remove limits for full processing
    # url_limit=None,
    # record_limit=None
)
```

Memory Management for Extraction

JusText extraction relies on lxml/libxml2, which can cause C-heap memory fragmentation during long-running jobs. Over many WARC files, this fragmentation causes resident memory to grow until workers run out of memory.

To mitigate this, Curator automatically sets extractor_max_calls_per_worker=2 when using JusTextExtractor. This restarts Ray Data worker processes every two tasks, reclaiming fragmented memory. You can override this value:

```python
# Increase recycling frequency for very memory-constrained environments
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    extractor_max_calls_per_worker=1,  # Recycle after every task
)

# Disable worker recycling (not recommended for large jobs with jusText)
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    extractor_max_calls_per_worker=None,  # No recycling
)
```

Worker recycling is only supported with the Ray Data executor and applies to task-based stages (not actor-based). For custom extraction stages that use C libraries prone to memory fragmentation, set max_calls_per_worker on DocumentIterateExtractStage directly.