---
description: Download and extract text from Common Crawl web archives using Curator.
categories:
  - how-to-guides
tags:
  - common-crawl
  - web-data
  - warc
  - language-detection
  - distributed
  - html-extraction
  - pipeline
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: text-only
---

# Common Crawl

Download and extract text from Common Crawl snapshots using Curator. Common Crawl provides petabytes of web data collected over years of web crawling. The data uses a compressed web archive format (`.warc.gz`), which requires processing to extract useful text for language model training.

## How it Works

Curator's Common Crawl processing pipeline consists of four sequential stages:

1. **URL Generation**: Generates WARC file URLs from Common Crawl's index for the specified snapshot range
2. **Download**: Downloads the compressed WARC files from Common Crawl's servers (optionally using S3 for faster downloads)
3. **Iteration**: Extracts individual records from WARC files and decodes HTML content
4. **Extraction**: Performs language detection and extracts clean text using configurable HTML extraction algorithms

The pipeline outputs structured data that you can write to JSONL or Parquet files for further processing.

## Before You Start

Choose your download method and ensure you have the prerequisites:

* HTTPS downloads (default): No AWS account required.
* S3 downloads (set `use_aws_to_download=True`):
  * An AWS account with credentials configured (profile, environment, or instance role).
  * Common Crawl's S3 access uses Requester Pays; you incur charges for requests and data transfer.
  * `s5cmd` installed for fast S3 listing and copy operations:

    ```bash
    # Install s5cmd for faster S3 downloads
    pip install s5cmd
    ```

***

## Usage

Here's how to create and run a Common Crawl processing pipeline:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter


def main():
    # Initialize Ray client
    ray_client = RayClient()
    ray_client.start()

    # Create pipeline
    pipeline = Pipeline(
        name="common_crawl_pipeline",
        description="Download and process Common Crawl data"
    )

    # Add Common Crawl processing stage
    cc_stage = CommonCrawlDownloadExtractStage(
        start_snapshot="2020-50",  # YYYY-WW format for CC-MAIN
        end_snapshot="2020-50",
        download_dir="./cc_downloads",
        crawl_type="main",  # or "news"
        use_aws_to_download=True,  # Faster S3 downloads (requires s5cmd)
        url_limit=10,  # Limit number of WARC files for testing
        record_limit=1000,  # Limit records per WARC file
    )
    pipeline.add_stage(cc_stage)

    # Add output writer stage
    writer = JsonlWriter("./cc_output")
    pipeline.add_stage(writer)

    # Run pipeline
    results = pipeline.run()

    # Stop Ray client
    ray_client.stop()


if __name__ == "__main__":
    main()
```

For executor options and configuration, refer to [Execution Backends](/reference/infra/execution-backends).
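
Because a malformed snapshot string only surfaces once the pipeline starts, it can help to check the values first. The following is a minimal sketch, not part of Curator: `check_snapshot` and `paths_listing_url` are hypothetical helpers, and the URL layout is an assumption based on Common Crawl's public `warc.paths.gz` listings rather than on how Curator's URL generation stage works internally.

```python
import re

# Hypothetical helpers for illustration only; they are not Curator APIs.
_SNAPSHOT_RE = re.compile(r"(\d{4})-(\d{2})")


def check_snapshot(snapshot: str, crawl_type: str = "main") -> tuple[int, int]:
    """Validate "YYYY-WW" (main) or "YYYY-MM" (news) and return (year, part)."""
    if crawl_type not in ("main", "news"):
        raise ValueError(f"crawl_type must be 'main' or 'news', got {crawl_type!r}")
    match = _SNAPSHOT_RE.fullmatch(snapshot)
    if match is None:
        raise ValueError(f"snapshot {snapshot!r} is not in YYYY-WW / YYYY-MM form")
    year, part = int(match.group(1)), int(match.group(2))
    # Weeks run 1-53 for CC-MAIN; months run 1-12 for CC-NEWS.
    upper = 53 if crawl_type == "main" else 12
    if not 1 <= part <= upper:
        raise ValueError(f"{snapshot!r}: second field must be between 1 and {upper}")
    return year, part


def paths_listing_url(snapshot: str, crawl_type: str = "main") -> str:
    """Build the URL of the WARC path listing (assumed public layout)."""
    year, part = check_snapshot(snapshot, crawl_type)
    base = "https://data.commoncrawl.org/crawl-data"
    if crawl_type == "main":
        return f"{base}/CC-MAIN-{year}-{part:02d}/warc.paths.gz"
    return f"{base}/CC-NEWS/{year}/{part:02d}/warc.paths.gz"


print(paths_listing_url("2020-50"))
print(paths_listing_url("2020-08", crawl_type="news"))
```

A string that passes this check can still name a snapshot that does not exist, since not every year and week has a crawl; the check only catches formatting mistakes.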
### Writing to Parquet

To write to Parquet files instead of JSONL, use `ParquetWriter`:

```python
from nemo_curator.stages.text.io.writer import ParquetWriter

# Replace the JSONL writer with ParquetWriter
writer = ParquetWriter("./cc_output_parquet")
pipeline.add_stage(writer)
```

### Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `start_snapshot` | str | First snapshot to include (format: "YYYY-WW" for main, "YYYY-MM" for news). Not every year and week has a snapshot; refer to the official list at [https://data.commoncrawl.org/](https://data.commoncrawl.org/). | Required |
| `end_snapshot` | str | Last snapshot to include (same format as `start_snapshot`). Ensure your range includes at least one valid snapshot. | Required |
| `download_dir` | str | Directory to store downloaded WARC files | Required |
| `crawl_type` | Literal\["main", "news"] | Whether to use the CC-MAIN or CC-NEWS dataset | "main" |
| `html_extraction` | HTMLExtractorAlgorithm \| str \| None | Text extraction algorithm to use. | `JusTextExtractor()` if not specified |
| `html_extraction_kwargs` | dict \| None | Additional arguments for the HTML extractor. Ignored when `html_extraction` is a concrete extractor object (for example, `JusTextExtractor()`); pass kwargs to the extractor constructor instead. When `html_extraction` is a string ("justext", "resiliparse", or "trafilatura"), kwargs are forwarded. | None |
| `stop_lists` | dict\[str, frozenset\[str]] \| None | Language-specific stop words for text quality assessment. If not provided, Curator uses jusText defaults with additional support for Thai, Chinese, and Japanese languages. | None |
| `use_aws_to_download` | bool | Use S3 downloads via s5cmd instead of HTTPS (requires s5cmd installation) | False |
| `verbose` | bool | Enable verbose logging for download operations | False |
| `url_limit` | int \| None | Maximum number of WARC files to download (useful for testing) | None |
| `record_limit` | int \| None | Maximum number of records to extract per WARC file | None |
| `add_filename_column` | bool \| str | Whether to add a source filename column to the output; if str, uses it as the column name (default name: "file\_name") | True |

## Output Format

The pipeline processes Common Crawl data through several stages, ultimately producing structured documents. Each extracted document includes the following fields:

```json
{
  "url": "http://example.com/page.html",
  "warc_id": "a515a7b6-b6ec-4bed-998b-8be2f86f8eac",
  "source_id": "CC-MAIN-20201123153826-20201123183826-00000.warc.gz",
  "language": "ENGLISH",
  "text": "Extracted web page content..."
}
```

| Field | Description |
| --- | --- |
| `url` | Original URL of the web page |
| `warc_id` | Unique identifier for the WARC record |
| `source_id` | Name of the source WARC file |
| `language` | Detected language of the content (e.g., "ENGLISH", "SPANISH") |
| `text` | Extracted and cleaned text content |

When `add_filename_column` is enabled (the default), the output includes an extra field `file_name` (or your custom column name).
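
As a quick check on a test run, the records can be tallied by the `language` field described above. This is a minimal sketch using only the standard library; it assumes the writer emits one JSON object per line, and `count_languages` and the sample records are hypothetical, made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_languages(jsonl_lines: Iterable[str]) -> Counter:
    """Count records per detected language in an iterable of JSONL lines."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        counts[record.get("language", "UNKNOWN")] += 1
    return counts


# Made-up sample records shaped like the output fields documented above.
sample_lines = [
    json.dumps({"url": "http://example.com/a", "language": "ENGLISH", "text": "..."}),
    json.dumps({"url": "http://example.com/b", "language": "SPANISH", "text": "..."}),
    json.dumps({"url": "http://example.com/c", "language": "ENGLISH", "text": "..."}),
]

language_counts = count_languages(sample_lines)
print(language_counts.most_common())
```

To run it on real output, pass an open file handle from the output directory instead of `sample_lines`; the exact file names the writer produces are not covered here, so inspect the directory first.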
## Customization Options

### HTML Text Extraction Algorithms

Curator supports several HTML text extraction algorithms:

| Extractor | Library |
| --- | --- |
| `JusTextExtractor` | [jusText](https://github.com/miso-belica/jusText) |
| `ResiliparseExtractor` | [Resiliparse](https://github.com/chatnoir-eu/chatnoir-resiliparse) |
| `TrafilaturaExtractor` | [Trafilatura](https://trafilatura.readthedocs.io/) |

#### Configuring HTML Extractors

```python
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.download.html_extractors import ResiliparseExtractor
from nemo_curator.stages.text.download.html_extractors import TrafilaturaExtractor

# Use Resiliparse for extraction
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    html_extraction=ResiliparseExtractor(
        required_stopword_density=0.25,
        main_content=True
    )
)

# Or use Trafilatura with custom parameters
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    html_extraction=TrafilaturaExtractor(
        min_extracted_size=200,
        max_repetitions=3
    )
)
```

### Language Processing

You can customize language detection and extraction by providing stop words for different languages:

```python
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage

# Define custom stop words for specific languages
stop_lists = {
    "ENGLISH": frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"]),
    "SPANISH": frozenset(["el", "la", "de", "que", "y", "en", "un", "es", "se", "no"])
}

cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    stop_lists=stop_lists
)
```

## Advanced Usage

### Processing CC-NEWS Data

For Common Crawl News data, use the `news` crawl type with month-based snapshots:

```python
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage

cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-08",  # YYYY-MM format for CC-NEWS
    end_snapshot="2020-10",
    download_dir="./news_downloads",
    crawl_type="news"  # Use CC-NEWS instead of CC-MAIN
)
```

See [https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html](https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html) for more information.

### Large-Scale Processing

For production workloads, consider these optimizations:

```python
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage

cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="/fast_storage/cc_downloads",
    use_aws_to_download=True,  # Faster S3 downloads
    verbose=False,  # Reduce logging overhead
    # Remove limits for full processing
    # url_limit=None,
    # record_limit=None
)
```
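
Before removing the limits for a full run, a small pre-flight check can catch a missing `s5cmd` binary or a nearly full disk early. The sketch below uses only the standard library; `s5cmd_available` and `free_gib` are hypothetical helper names, not Curator APIs, and the check does not verify AWS credentials or estimate how much space a snapshot needs.

```python
import shutil


def s5cmd_available() -> bool:
    """Return True if an `s5cmd` executable is found on PATH."""
    return shutil.which("s5cmd") is not None


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1024**3


# "." stands in for the download directory in this sketch.
use_aws_to_download = s5cmd_available()
print(f"s5cmd found: {use_aws_to_download}")
print(f"free space: {free_gib('.'):.1f} GiB")
```

One way to use this is to pass the resulting `use_aws_to_download` value to `CommonCrawlDownloadExtractStage`, so the run falls back to HTTPS downloads when `s5cmd` is not installed.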