Common Crawl#
Download and extract text from Common Crawl snapshots using NeMo Curator utilities.
Common Crawl provides petabytes of web data collected over years of web crawling. The data is stored in a compressed web archive format (.warc.gz), which needs to be processed to extract useful text for language model training.
How it Works#
NeMo Curator’s Common Crawl extraction process:
Downloads the compressed WARC files from Common Crawl’s servers (optionally using S3 for faster downloads)
Decodes the HTML within each record from binary to text
Performs language detection using pyCLD2 (see the sketch after this list)
Extracts the relevant text using one of several text extraction algorithms
Outputs the extracted text as .jsonl files for further processing
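As a point of reference for the language-detection step, the snippet below calls pyCLD2 directly on a piece of decoded text. It is a minimal, standalone sketch of what the library reports, not the code path NeMo Curator uses internally.

import pycld2 as cld2

# Sample text standing in for the decoded HTML of one WARC record
text = "Common Crawl provides petabytes of web data collected over years of web crawling."

is_reliable, bytes_found, details = cld2.detect(text)

# `details` is a tuple of (language_name, language_code, percent, score) entries;
# the top entry (e.g. "ENGLISH") corresponds to the language label seen in the output records.
print(is_reliable, details[0][0])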
Usage#
Here’s how to download and extract Common Crawl data:
import os

from nemo_curator import get_client
from nemo_curator.download import download_common_crawl
from nemo_curator.datasets import DocumentDataset


def main():
    # Initialize a Dask client
    client = get_client(cluster_type="cpu")

    # Set parameters for downloading
    output_path = "/extracted/output/folder"
    start_snapshot = "2020-50"
    end_snapshot = "2021-04"
    output_type = "jsonl"
    os.makedirs(output_path, exist_ok=True)

    # Download and extract Common Crawl data
    common_crawl_dataset = download_common_crawl(
        output_path, start_snapshot, end_snapshot, output_type=output_type
    )

    # Write the dataset to disk
    common_crawl_dataset.to_json(output_path=output_path, write_to_filename=True)
    print("Extracted dataset saved to:", output_path)


if __name__ == "__main__":
    main()
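The same call also supports Parquet output. Below is a hedged variant, added inside main() after the JSONL write above; it assumes DocumentDataset.to_parquet mirrors to_json and reuses the variables defined in main().

    # Variant: request Parquet files instead of JSONL and write them out.
    # Assumes DocumentDataset.to_parquet is available alongside to_json.
    parquet_dataset = download_common_crawl(
        output_path, start_snapshot, end_snapshot, output_type="parquet"
    )
    parquet_dataset.to_parquet(output_path=output_path)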
Alternatively, you can use the command-line utilities. First, generate a list of WARC URLs:
get_common_crawl_urls \
  --starting-snapshot="2020-50" \
  --ending-snapshot="2020-50" \
  --output-warc-url-file=./url_data/warc_urls_cc_2020_50.txt
Then download and extract:
download_and_extract \
  --input-url-file=./url_data/warc_urls_cc_2020_50.txt \
  --builder-config-file=./config/cc_warc_builder.yaml \
  --output-json-dir=/datasets/CC-MAIN-2020-50/json
The config file should look like:
download_module: nemo_curator.download.commoncrawl.CommonCrawlWARCDownloader
download_params:
  aws: True  # Optional: Set to True to use S3 for faster downloads
iterator_module: nemo_curator.download.commoncrawl.CommonCrawlWARCIterator
iterator_params: {}
extract_module: nemo_curator.download.commoncrawl.CommonCrawlWARCExtractor
extract_params: {}
Note
The download_params section can include optional parameters like aws: True for S3 downloads or verbose: True for detailed logging. If no custom parameters are needed, use download_params: {}.
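The *_module entries are dotted import paths, and the *_params entries supply keyword arguments for those classes. As a rough, illustrative sketch (not NeMo Curator's own loader), the config can be resolved in plain Python like this, assuming PyYAML is installed:

import importlib

import yaml

with open("./config/cc_warc_builder.yaml") as f:
    config = yaml.safe_load(f)

def resolve(dotted_path):
    # "package.module.ClassName" -> the class object
    module_name, class_name = dotted_path.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)

downloader_cls = resolve(config["download_module"])
iterator_cls = resolve(config["iterator_module"])
extractor_cls = resolve(config["extract_module"])
print(downloader_cls.__name__, config.get("download_params", {}))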
Parameters#
| Parameter | Type | Description | Default |
|---|---|---|---|
| output_path | str | Path where the extracted files will be placed | Required |
| start_snapshot | str | First Common Crawl snapshot to include (format: "YYYY-WW" for CC-MAIN, "YYYY-MM" for CC-NEWS) | Required |
| end_snapshot | str | Last Common Crawl snapshot to include | Required |
| output_type | Literal["jsonl", "parquet"] | File format for storing data | "jsonl" |
| algorithm | HTMLExtractorAlgorithm | Text extraction algorithm to use (JusTextExtractor, ResiliparseExtractor, or TrafilaturaExtractor) | JusTextExtractor() |
| stop_lists | Optional[Dict[str, frozenset]] | Dictionary of language-specific stop words | None |
| news | bool | Whether to use the CC-NEWS dataset instead of CC-MAIN | False |
| aws | bool | Whether to download from S3 using s5cmd instead of HTTPS (requires s5cmd to be installed) | False |
| raw_download_dir | Optional[str] | Directory to store raw WARC files | None |
| keep_raw_download | bool | Whether to keep the raw downloaded files | False |
| force_download | bool | Whether to force re-download even if files exist | False |
| url_limit | Optional[int] | Maximum number of WARC files to download | None |
| record_limit | Optional[int] | Maximum number of records to extract per file | None |
Snapshot Availability
Not every year and week has a snapshot. Ensure your range includes at least one valid Common Crawl snapshot. See the official site for a list of valid snapshots.
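Several of the options above can be combined in a single call. The sketch below is illustrative only: the parameter names (news, aws, url_limit) are taken from the table above, so confirm them against the download_common_crawl docstring in your NeMo Curator version. It pulls CC-NEWS snapshots over S3 with a small file cap for testing.

from nemo_curator.download import download_common_crawl

# Illustrative only: parameter names follow the table above; verify against the docstring.
news_dataset = download_common_crawl(
    "/extracted/news/folder",
    "2021-04",            # CC-NEWS snapshots use the "YYYY-MM" format
    "2021-10",
    output_type="jsonl",
    news=True,            # use CC-NEWS instead of CC-MAIN
    aws=True,             # download from S3 with s5cmd instead of HTTPS
    url_limit=10,         # cap the number of WARC files for a quick test
)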
Output Format#
The extracted text is stored in .jsonl files with the following format:
{
  "text": "Extracted web page content...",
  "warc_id": "a515a7b6-b6ec-4bed-998b-8be2f86f8eac",
  "source_id": "CC-MAIN-20201123153826-20201123183826-00000.warc.gz",
  "url": "http://example.com/page.html",
  "language": "ENGLISH",
  "file_name": "CC-MAIN-20201123153826-20201123183826-00000.warc.gz.jsonl"
}
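The extracted files can be loaded back into a DocumentDataset for the next curation stage. A minimal sketch, assuming the output directory from the earlier example:

from nemo_curator.datasets import DocumentDataset

# Load the extracted JSONL files back into a Dask-backed DocumentDataset
dataset = DocumentDataset.read_json("/extracted/output/folder")
print(dataset.df.head())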
Customization Options#
Text Extraction#
NeMo Curator supports multiple HTML text extraction algorithms:
JusTextExtractor (default): Uses jusText to extract main content
ResiliparseExtractor: Uses Resiliparse for extraction
TrafilaturaExtractor: Uses Trafilatura for extraction
You can select a different extractor as follows:
from nemo_curator.download import (
    ResiliparseExtractor,
    TrafilaturaExtractor,
    download_common_crawl,
)

# Use Resiliparse for extraction
extraction_algorithm = ResiliparseExtractor()

common_crawl_dataset = download_common_crawl(
    output_path,
    start_snapshot,
    end_snapshot,
    output_type=output_type,
    algorithm=extraction_algorithm,
)
Each extractor has unique parameters – check their docstrings for details.
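For a feel of how the three algorithms differ, the sketch below calls the underlying libraries (jusText, Resiliparse, Trafilatura) directly on the same HTML string. This runs outside NeMo Curator; the extractor classes above wrap these libraries, so their constructor arguments and defaults may differ.

import justext
import trafilatura
from resiliparse.extract.html2text import extract_plain_text

html = "<html><body><p>Main article text goes here.</p><p>Footer boilerplate.</p></body></html>"

# jusText: keep paragraphs that are not classified as boilerplate
justext_text = "\n".join(
    p.text
    for p in justext.justext(html, justext.get_stoplist("English"))
    if not p.is_boilerplate
)

# Resiliparse: plain-text extraction, optionally restricted to the main content
resiliparse_text = extract_plain_text(html, main_content=True)

# Trafilatura: returns the extracted main text, or None if nothing qualifies
trafilatura_text = trafilatura.extract(html)

print(justext_text, resiliparse_text, trafilatura_text, sep="\n---\n")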
Language Processing#
You can customize language detection and extraction by providing stop words for different languages:
from nemo_curator.download import download_common_crawl

# Define custom stop words for specific languages
stop_lists = {"ENGLISH": frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"])}

common_crawl = download_common_crawl(
    "/extracted/output/folder",
    "2020-50",
    "2021-04",
    output_type="jsonl",
    stop_lists=stop_lists,
)
Note
If no custom stop lists are provided, NeMo Curator uses jusText’s default stop lists with additional support for Thai, Chinese, and Japanese languages.
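If you want broader coverage for a language rather than a hand-written word list, one option is to start from jusText's bundled stop lists. A minimal sketch; the dictionary keys should be the upper-case language names produced by pyCLD2, as in the example above:

import justext

from nemo_curator.download import download_common_crawl

# Reuse jusText's bundled English stop list as the basis for a custom stop_lists entry
stop_lists = {"ENGLISH": frozenset(justext.get_stoplist("English"))}

common_crawl = download_common_crawl(
    "/extracted/output/folder",
    "2020-50",
    "2021-04",
    output_type="jsonl",
    stop_lists=stop_lists,
)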