Common Crawl | NeMo Curator

Download and extract text from Common Crawl snapshots using Curator.

Common Crawl provides petabytes of web data collected over years of web crawling. The data uses a compressed web archive format (.warc.gz), which requires processing to extract useful text for language model training.

How it Works

Curator’s Common Crawl processing pipeline consists of four sequential stages:

1. URL Generation: Generates WARC file URLs from Common Crawl’s index for the specified snapshot range
2. Download: Downloads the compressed WARC files from Common Crawl’s servers (optionally using S3 for faster downloads)
3. Iteration: Extracts individual records from WARC files and decodes HTML content
4. Extraction: Performs language detection and extracts clean text using configurable HTML extraction algorithms

The pipeline outputs structured data that you can write to JSONL or Parquet files for further processing.
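
To make the URL Generation stage more concrete, here is a minimal sketch of how WARC URLs can be derived for one CC-MAIN snapshot. It relies on Common Crawl's published convention of a gzipped path listing (warc.paths.gz) per snapshot. The function names are hypothetical and this is not Curator's internal implementation. The sketch only builds and parses URLs; fetching the listing is left to the caller.

```python
import gzip
import io
from typing import List, Optional

# Public HTTPS prefix for Common Crawl data.
CC_BASE_URL = "https://data.commoncrawl.org/"


def warc_paths_index_url(snapshot: str) -> str:
    """Return the URL of the gzipped WARC path listing for a CC-MAIN snapshot.

    `snapshot` uses the same "YYYY-WW" form as start_snapshot, e.g. "2020-50".
    """
    return f"{CC_BASE_URL}crawl-data/CC-MAIN-{snapshot}/warc.paths.gz"


def parse_warc_paths(gz_bytes: bytes, url_limit: Optional[int] = None) -> List[str]:
    """Turn the raw bytes of a warc.paths.gz file into full download URLs.

    `url_limit` mirrors the idea of the stage's url_limit parameter
    (hypothetical re-implementation for illustration only).
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        paths = [line.strip() for line in fh if line.strip()]
    if url_limit is not None:
        paths = paths[:url_limit]
    return [CC_BASE_URL + path for path in paths]
```

Each line of the listing is a path relative to the data root, so prefixing it with the base URL yields one downloadable .warc.gz file per line.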

Before You Start

Choose your download method and ensure you have the prerequisites:

HTTPS downloads (default): No AWS account required.
S3 downloads (set use_aws_to_download=True):
- An AWS account with credentials configured (profile, environment, or instance role).
- Common Crawl’s S3 access uses Requester Pays; you incur charges for requests and data transfer.
- s5cmd installed for fast S3 listing and copy operations:

# Install s5cmd for faster S3 downloads
pip install s5cmd
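
If the same script has to run on machines with and without s5cmd, one option is to decide the download method at runtime. The helper below is a hypothetical sketch, not part of Curator. It only checks that an s5cmd executable is on the PATH and does not verify AWS credentials.

```python
import shutil


def pick_download_method(prefer_s3: bool = True) -> bool:
    """Return a value suitable for use_aws_to_download.

    S3 is chosen only when the caller prefers it and an s5cmd
    executable can be found on the PATH; otherwise fall back to HTTPS.
    """
    s5cmd_path = shutil.which("s5cmd")
    return prefer_s3 and s5cmd_path is not None
```

The result can then be passed as use_aws_to_download when constructing the stage.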

Usage

Here’s how to create and run a Common Crawl processing pipeline:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

def main():
    # Initialize Ray client
    ray_client = RayClient()
    ray_client.start()

    # Create pipeline
    pipeline = Pipeline(
        name="common_crawl_pipeline",
        description="Download and process Common Crawl data"
    )

    # Add Common Crawl processing stage
    cc_stage = CommonCrawlDownloadExtractStage(
        start_snapshot="2020-50",  # YYYY-WW format for CC-MAIN
        end_snapshot="2020-50",
        download_dir="./cc_downloads",
        crawl_type="main",  # or "news"
        use_aws_to_download=True,  # Faster S3 downloads (requires s5cmd)
        url_limit=10,  # Limit number of WARC files for testing
        record_limit=1000,  # Limit records per WARC file
    )
    pipeline.add_stage(cc_stage)

    # Add output writer stage
    writer = JsonlWriter("./cc_output")
    pipeline.add_stage(writer)

    # Run pipeline
    results = pipeline.run()

    # Stop Ray client
    ray_client.stop()

if __name__ == "__main__":
    main()

For executor options and configuration, refer to Reference Execution Backends.

Writing to Parquet

To write to Parquet files instead of JSONL, use ParquetWriter:

from nemo_curator.stages.text.io.writer import ParquetWriter

# Replace the JSONL writer with ParquetWriter
writer = ParquetWriter("./cc_output_parquet")
pipeline.add_stage(writer)

Parameters

CommonCrawlDownloadExtractStage Parameters

Parameter	Type	Description	Default
`start_snapshot`	str	First snapshot to include (format: "YYYY-WW" for main, "YYYY-MM" for news). Not every year and week has a snapshot; refer to the official list at https://data.commoncrawl.org/.	Required
`end_snapshot`	str	Last snapshot to include (same format as `start_snapshot`). Ensure your range includes at least one valid snapshot.	Required
`download_dir`	str	Directory to store downloaded WARC files	Required
`crawl_type`	Literal["main", "news"]	Whether to use the CC-MAIN or CC-NEWS dataset	"main"
`html_extraction`	HTMLExtractorAlgorithm | str | None	HTML text extraction algorithm, given either as an extractor object (for example, `JusTextExtractor()`) or as a string ("justext", "resiliparse", or "trafilatura")	None
`html_extraction_kwargs`	dict | None	Additional arguments for the HTML extractor. Forwarded when `html_extraction` is a string ("justext", "resiliparse", or "trafilatura"). Ignored when `html_extraction` is a concrete extractor object (for example, `JusTextExtractor()`); pass kwargs to the extractor constructor instead.	None
`stop_lists`	dict[str, frozenset[str]] | None	Language-specific stop words for text quality assessment. If not provided, Curator uses jusText defaults with additional support for Thai, Chinese, and Japanese.	None
`use_aws_to_download`	bool	Use S3 downloads via s5cmd instead of HTTPS (requires s5cmd installation)	False
`verbose`	bool	Enable verbose logging for download operations	False
`url_limit`	int | None	Maximum number of WARC files to download (useful for testing)	None
`record_limit`	int | None	Maximum number of records to extract per WARC file	None
`add_filename_column`	bool | str	Whether to add a source filename column to the output; if a str, it is used as the column name (default name: "file_name")
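
Because the snapshot format depends on crawl_type, it can help to validate the strings before building the stage. The validator below is a hypothetical sketch based only on the formats listed in the table above. It checks the shape of the string, not whether Common Crawl actually published a snapshot for that week or month.

```python
import re
from typing import Tuple


def validate_snapshot(snapshot: str, crawl_type: str = "main") -> Tuple[int, int]:
    """Check a snapshot string and return (year, week_or_month).

    "main" expects "YYYY-WW" (week 01-53); "news" expects "YYYY-MM" (month 01-12).
    Raises ValueError on malformed input.
    """
    match = re.fullmatch(r"(\d{4})-(\d{2})", snapshot)
    if match is None:
        raise ValueError(f"Snapshot must look like 'YYYY-NN', got {snapshot!r}")
    year, part = int(match.group(1)), int(match.group(2))

    if crawl_type == "main":
        upper = 53  # ISO weeks run from 1 to 53
    elif crawl_type == "news":
        upper = 12  # calendar months
    else:
        raise ValueError(f"Unknown crawl_type: {crawl_type!r}")

    if not 1 <= part <= upper:
        raise ValueError(f"{snapshot!r} is out of range for crawl_type={crawl_type!r}")
    return year, part
```

For example, "2020-50" passes for the main crawl but is rejected for the news crawl, where the second component must be a month.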

Output Format

The pipeline produces one structured document per extracted web page. Each document includes the following fields:

{
  "url": "http://example.com/page.html",
  "warc_id": "a515a7b6-b6ec-4bed-998b-8be2f86f8eac",
  "source_id": "CC-MAIN-20201123153826-20201123183826-00000.warc.gz",
  "language": "ENGLISH",
  "text": "Extracted web page content..."
}

Output Fields

Field	Description
`url`	Original URL of the web page
`warc_id`	Unique identifier for the WARC record
`source_id`	Name of the source WARC file
`language`	Detected language of the content (e.g., “ENGLISH”, “SPANISH”)
`text`	Extracted and cleaned text content

If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).
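
As a quick sanity check on the output, the fields above can be read back with the standard library. The sketch below assumes the writer produced files ending in .jsonl directly inside the output directory, with one JSON object per line. That layout is an assumption, so adjust the glob pattern if your output differs. The function name is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_languages(output_dir: str) -> Counter:
    """Count output documents per detected language.

    Assumes one JSON document per line in each *.jsonl file under output_dir.
    Records without a "language" field are counted as "UNKNOWN".
    """
    counts = Counter()
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():  # skip blank lines
                    record = json.loads(line)
                    counts[record.get("language", "UNKNOWN")] += 1
    return counts
```

Calling count_languages("./cc_output") after a run gives a rough view of the language mix before any further filtering.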

Customization Options

HTML Text Extraction Algorithms

Curator supports several HTML text extraction algorithms:

Available HTML Extractors

Extractor	Library
`JusTextExtractor`	jusText
`ResiliparseExtractor`	Resiliparse
`TrafilaturaExtractor`	Trafilatura

Configuring HTML Extractors

from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.download.html_extractors import ResiliparseExtractor
from nemo_curator.stages.text.download.html_extractors import TrafilaturaExtractor

# Use Resiliparse for extraction
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    html_extraction=ResiliparseExtractor(
        required_stopword_density=0.25,
        main_content=True
    )
)

# Or use Trafilatura with custom parameters
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    html_extraction=TrafilaturaExtractor(
        min_extracted_size=200,
        max_repetitions=3
    )
)

Language Processing

You can customize language detection and extraction by providing stop words for different languages:

from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage

# Define custom stop words for specific languages
stop_lists = {
    "ENGLISH": frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"]),
    "SPANISH": frozenset(["el", "la", "de", "que", "y", "en", "un", "es", "se", "no"])
}

cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    stop_lists=stop_lists
)
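
To build intuition for how a stop list feeds into quality assessment, the sketch below computes the fraction of tokens in a text that appear in a stop list. It is a simplified illustration, not the algorithm used by jusText or Curator. The function name is hypothetical, and whitespace tokenization is an assumption that does not suit languages written without spaces, such as Thai, Chinese, or Japanese.

```python
from typing import FrozenSet


def stopword_density(text: str, stop_words: FrozenSet[str]) -> float:
    """Return the share of whitespace-separated tokens found in stop_words.

    Natural prose tends to have a higher density than boilerplate such as
    navigation menus or tag clouds. Returns 0.0 for empty text.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return hits / len(tokens)
```

A sentence like "the cat is in the house" scores high against an English stop list, while a run of menu labels such as "Home Products Contact" scores zero.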

Advanced Usage

Processing CC-NEWS Data

For Common Crawl News data, use the news crawl type with month-based snapshots:

cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-08",  # YYYY-MM format for CC-NEWS
    end_snapshot="2020-10",
    download_dir="./news_downloads",
    crawl_type="news"  # Use CC-NEWS instead of CC-MAIN
)

See https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html for more information.
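
Since start_snapshot and end_snapshot are both included, a month-based range such as "2020-08" to "2020-10" covers three months. The helper below is a hypothetical sketch of that inclusive expansion, useful for estimating how much data a range spans. It does not reflect how Curator enumerates CC-NEWS files internally.

```python
from typing import List


def month_range(start: str, end: str) -> List[str]:
    """List every "YYYY-MM" month from start to end, both inclusive."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    if (start_year, start_month) > (end_year, end_month):
        raise ValueError("start must not be later than end")

    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return months
```

For the range in the example above, this yields "2020-08", "2020-09", and "2020-10".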

Large-Scale Processing

For production workloads, consider these optimizations:

cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="/fast_storage/cc_downloads",
    use_aws_to_download=True,  # Faster S3 downloads
    verbose=False,  # Reduce logging overhead
    # Remove limits for full processing
    # url_limit=None,
    # record_limit=None
)