stages.text.download.common_crawl.stage#

Module Contents#

Classes#

CommonCrawlDownloadExtractStage

Composite stage for downloading and processing Common Crawl data.

API#

class stages.text.download.common_crawl.stage.CommonCrawlDownloadExtractStage(
start_snapshot: str,
end_snapshot: str,
download_dir: str,
crawl_type: Literal['main', 'news'] = 'main',
html_extraction: nemo_curator.stages.text.download.html_extractors.HTMLExtractorAlgorithm | str | None = None,
html_extraction_kwargs: dict | None = None,
stop_lists: dict[str, frozenset[str]] | None = None,
use_aws_to_download: bool = False,
verbose: bool = False,
url_limit: int | None = None,
record_limit: int | None = None,
add_filename_column: bool | str = True,
)#

Bases: nemo_curator.stages.text.download.DocumentDownloadExtractStage

Composite stage for downloading and processing Common Crawl data.

This pipeline:

  1. Generates WARC URLs (from either the main or news crawl)

  2. Downloads WARC files

  3. Extracts content from WARC files

  4. Extracts text from HTML content

Initialization
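
A minimal construction sketch is shown below, assuming the import path follows the module path documented above. The snapshot identifiers, download directory, and limit values are placeholder assumptions for a quick local test, not recommended settings.

```python
from nemo_curator.stages.text.download.common_crawl.stage import (
    CommonCrawlDownloadExtractStage,
)

# Placeholder snapshot range and output directory -- adjust to your own data.
stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2023-06",            # first crawl snapshot to include (assumed format)
    end_snapshot="2023-10",              # last crawl snapshot to include (assumed format)
    download_dir="/tmp/common_crawl",    # where downloaded WARC files are written
    crawl_type="main",                   # use the main crawl (or "news")
    url_limit=5,                         # optional: cap the number of WARC URLs for testing
    record_limit=100,                    # optional: cap records extracted per WARC file
)
```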

decompose() list[nemo_curator.stages.base.ProcessingStage]#

Decompose this composite stage into its constituent stages.
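
For illustration, a short sketch of inspecting the constituent stages returned by `decompose()`, using the `stage` object from the construction sketch above; the exact stage classes in the list are determined by the implementation.

```python
# List the processing stages this composite stage expands into
# (URL generation, download, and extraction steps per the pipeline description).
for sub_stage in stage.decompose():
    print(type(sub_stage).__name__)
```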

get_description() str#

Get a description of this composite stage.
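
A brief usage note, again assuming the `stage` object constructed earlier:

```python
# Print a human-readable summary of what this composite stage does.
print(stage.get_description())
```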