stages.text.download.common_crawl.download
Module Contents
Classes
CommonCrawlWARCDownloader | Downloads WARC files from the Common Crawl to a local directory
API
class stages.text.download.common_crawl.download.CommonCrawlWARCDownloader(
    download_dir: str,
    use_aws_to_download: bool = False,
    verbose: bool = False,
)
Bases: nemo_curator.stages.text.download.DocumentDownloader
Downloads WARC files from the Common Crawl to a local directory
Initialization
Creates a downloader
Args:
    download_dir: Path to store raw compressed WARC files.
    use_aws_to_download: If True, uses the s5cmd command to download from the Common Crawl’s S3 bucket. If False, uses wget.
    verbose: If True, logs stdout and stderr of the download command (s5cmd/wget).
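The effect of use_aws_to_download can be illustrated with a minimal sketch of how a download command might be chosen. This is not the library's implementation: the helper name build_download_command, the output file naming, and the example URL are all hypothetical, and the sketch assumes the public bucket is addressed as s3://commoncrawl/ and that files are copied with "s5cmd cp" or fetched with "wget -O".

```python
import os
from urllib.parse import urlparse


def build_download_command(
    url: str,
    download_dir: str,
    use_aws_to_download: bool = False,
) -> list[str]:
    """Hypothetical helper: pick the command line for one WARC file.

    Illustrates the s5cmd/wget switch only; the real downloader may
    name output files and build its commands differently.
    """
    # Strip the leading slash so the path can be reused as an S3 key.
    url_path = urlparse(url).path.lstrip("/")
    # Assumption: flatten the remote path into a single local file name.
    output_path = os.path.join(download_dir, url_path.replace("/", "-"))

    if use_aws_to_download:
        # Assumption: the Common Crawl bucket is s3://commoncrawl/.
        return ["s5cmd", "cp", f"s3://commoncrawl/{url_path}", output_path]
    # Otherwise fetch the same file over HTTPS with wget.
    return ["wget", url, "-O", output_path]


# Hypothetical URL, used only to show the shape of the two commands.
example_url = "https://data.commoncrawl.org/crawl-data/example/warc/file.warc.gz"
wget_cmd = build_download_command(example_url, "/tmp/warc")
s5cmd_cmd = build_download_command(example_url, "/tmp/warc", use_aws_to_download=True)
```

The resulting argument list could be passed to subprocess.run; capturing stdout and stderr there is where a verbose flag would plausibly decide whether the output is logged.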