stages.text.download.common_crawl.download

Module Contents

Classes

CommonCrawlWARCDownloader

Downloads WARC files from the Common Crawl to a local directory

API

class stages.text.download.common_crawl.download.CommonCrawlWARCDownloader(
download_dir: str,
use_aws_to_download: bool = False,
verbose: bool = False,
)

Bases: nemo_curator.stages.text.download.DocumentDownloader

Downloads WARC files from the Common Crawl to a local directory

Initialization

Creates a downloader.

Args:
    download_dir: Path to store raw compressed WARC files.
    use_aws_to_download: If True, uses the s5cmd command to download from the Common Crawl's S3 bucket. If False, uses wget.
    verbose: If True, logs stdout and stderr of the download command (s5cmd/wget).
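
The effect of `use_aws_to_download` and `verbose` can be illustrated with a small standalone sketch. This is not the class's actual implementation: `build_download_command`, `run_download`, the `s3://commoncrawl/` bucket prefix, and the output file naming scheme are assumptions made for illustration only. The sketch only shows how one flag might select between an `s5cmd cp` and a `wget -O` invocation, and how the other might control whether the command's output is shown.

```python
import os
import subprocess
from typing import List
from urllib.parse import urlparse

# Assumption: Common Crawl data is served from this S3 bucket.
S3_BUCKET = "s3://commoncrawl/"


def build_download_command(
    url: str, download_dir: str, use_aws_to_download: bool = False
) -> List[str]:
    """Hypothetical helper: build the command that would fetch one WARC file."""
    # Path portion of the URL, without the leading slash.
    url_path = urlparse(url).path.lstrip("/")
    # Assumed naming scheme: flatten the remote path into a single file name.
    output_file = os.path.join(download_dir, url_path.replace("/", "-"))
    if use_aws_to_download:
        # s5cmd copies directly from the S3 bucket.
        return ["s5cmd", "cp", S3_BUCKET + url_path, output_file]
    # Otherwise fetch over HTTPS with wget.
    return ["wget", url, "-O", output_file]


def run_download(cmd: List[str], verbose: bool = False) -> int:
    """Hypothetical helper: run the command, hiding its output unless verbose."""
    # None lets the child inherit the parent's streams; DEVNULL discards them.
    sink = None if verbose else subprocess.DEVNULL
    return subprocess.run(cmd, stdout=sink, stderr=sink, check=False).returncode
```

With a placeholder URL such as `https://data.commoncrawl.org/crawl-data/example/file.warc.gz`, `build_download_command(url, "/tmp/warc")` returns a `wget` command, while passing `use_aws_to_download=True` returns an `s5cmd cp` command that writes to the same output file. Using `s5cmd` requires the tool to be installed and, typically, AWS credentials to be configured.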