nemo_curator.stages.text.download.common_crawl.download
Module Contents
Classes
Data
API
Bases: DocumentDownloader
Downloads WARC files from Common Crawl to a local directory.
Download a file to a temporary file.
Parameters:
URL to download
Local path to save file
Returns: tuple[bool, str | None]
Tuple of (success, error_message). If success is True, error_message is None.
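The download-to-a-temporary-file pattern and the (success, error_message) return convention can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name download_to_path, the ".tmp" suffix, and the use of urllib as the transport are all assumptions made for the example.

```python
import os
import shutil
import urllib.request
from typing import Optional, Tuple


def download_to_path(url: str, path: str) -> Tuple[bool, Optional[str]]:
    """Hypothetical helper: download `url` to `path` via a temporary file.

    Returns (success, error_message); error_message is None on success.
    """
    tmp_path = path + ".tmp"  # assumed naming scheme for the temporary file
    try:
        with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
            shutil.copyfileobj(response, out)
        # Rename only after the full body is written, so `path` never
        # holds a partially downloaded file.
        os.replace(tmp_path, path)
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        return False, str(exc)
    return True, None
```

Writing to a temporary file first means an interrupted download leaves no file at the final path, so a later existence check on that path can be trusted.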
Generate output filename from URL.
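One plausible way to derive a flat output filename from a URL is to take the URL path and replace its separators. The sketch below assumes that scheme; the function name url_to_output_name and the "-" separator are illustrative choices, not confirmed details of the library.

```python
from urllib.parse import urlparse


def url_to_output_name(url: str) -> str:
    """Hypothetical helper: flatten a URL path into a single filename."""
    # Drop the scheme and host, strip the leading slash, then replace the
    # remaining slashes so the result contains no directory components.
    return urlparse(url).path.lstrip("/").replace("/", "-")


name = url_to_output_name(
    "https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-10/segments/123/warc/file.warc.gz"
)
print(name)  # crawl-data-CC-MAIN-2024-10-segments-123-warc-file.warc.gz
```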
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Reads WARC records directly from Common Crawl using HTTPS range requests.
This stage fetches raw HTML content from Common Crawl’s public servers using byte-range requests. No AWS credentials or s5cmd required.
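A byte-range request asks the server for only the bytes of one record inside a much larger WARC file. The sketch below shows how such a request could be assembled from a record's location. The function name build_range_request and the parameter names (warc_filename, offset, length) are assumptions for illustration; the Range header uses an inclusive end byte.

```python
from typing import Dict, Tuple

# Public Common Crawl HTTPS endpoint.
CC_BASE_URL = "https://data.commoncrawl.org/"


def build_range_request(warc_filename: str, offset: int, length: int) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: return (url, headers) for one WARC record."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    url = CC_BASE_URL + warc_filename.lstrip("/")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    last_byte = offset + length - 1
    return url, {"Range": f"bytes={offset}-{last_byte}"}


url, headers = build_range_request("crawl-data/example.warc.gz", 100, 50)
print(headers["Range"])  # bytes=100-149
```

The returned URL and headers can be passed to any HTTP client; a server that honors the range replies with status 206 (Partial Content).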
Get or create a requests session for connection pooling.
Fetch a single WARC record using an HTTPS range request.
This method:
- Fetches gzip-compressed WARC record bytes via an HTTPS range request
- Decompresses the gzip content
- Parses the WARC record format using warcio
- Extracts and returns the HTTP response body (the actual content)
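The decompress, parse, and extract steps above can be illustrated on bytes that have already been fetched. The stage itself uses warcio for parsing; the sketch below deliberately swaps in a simplified standard-library parser so it runs without extra dependencies. The function name extract_http_body is hypothetical, and the parser does not handle chunked transfer encoding or compressed HTTP bodies.

```python
import gzip


def extract_http_body(compressed: bytes) -> bytes:
    """Hypothetical helper: return the HTTP response body of one gzipped WARC record."""
    raw = gzip.decompress(compressed)

    # The WARC header block ends at the first blank line.
    warc_head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no WARC header terminator found")

    # The WARC Content-Length gives the size of the record block, which for a
    # response record is the HTTP headers plus the HTTP body.
    length = None
    for line in warc_head.split(b"\r\n")[1:]:  # skip the "WARC/1.0" version line
        field, _, value = line.partition(b":")
        if field.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("WARC record has no Content-Length")
    block = rest[:length]

    # Inside the block, a second blank line separates HTTP headers from the body.
    _http_head, sep, body = block.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no HTTP header terminator found")
    return body
```

Because each Common Crawl record is stored as its own gzip member, the bytes returned by a single range request can be decompressed independently of the rest of the file.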
Fetch multiple records in parallel using ThreadPoolExecutor.
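Because each record fetch is dominated by network latency, running fetches on a thread pool can overlap the waiting time. The sketch below shows the general pattern with a caller-supplied fetch function. The names fetch_batch, fetch_one, and Location, the default of 8 workers, and the choice to map failures to None are assumptions for the example, not the stage's actual behavior.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

# Assumed record location: (warc_filename, offset, length).
Location = Tuple[str, int, int]


def fetch_batch(
    fetch_one: Callable[[Location], Optional[bytes]],
    locations: List[Location],
    max_workers: int = 8,
) -> List[Optional[bytes]]:
    """Hypothetical helper: fetch many records concurrently, preserving input order."""

    def safe_fetch(location: Location) -> Optional[bytes]:
        # One failed record should not abort the whole batch.
        try:
            return fetch_one(location)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which fetch finishes first.
        return list(pool.map(safe_fetch, locations))
```

Returning results in input order keeps each fetched body aligned with the row it came from, which matters when the results are written back into a batch of documents.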