nemo_curator.stages.text.download.common_crawl.download

Module Contents

Classes

CommonCrawlWARCDownloader: Downloads WARC files from Common Crawl to a local directory.
CommonCrawlWARCReader: Reads WARC records directly from Common Crawl using HTTPS range requests.

Data

CC_BASE_URL

HTTP_OK

HTTP_PARTIAL_CONTENT

API

class nemo_curator.stages.text.download.common_crawl.download.CommonCrawlWARCDownloader(
download_dir: str,
use_aws_to_download: bool = False,
verbose: bool = False
)

Bases: DocumentDownloader

Downloads WARC files from Common Crawl to a local directory.

nemo_curator.stages.text.download.common_crawl.download.CommonCrawlWARCDownloader._download_to_path(
url: str,
path: str
) -> tuple[bool, str | None]

Download a file from url to path, staging through a temporary file.

Parameters:

url
str

URL to download

path
str

Local path to save file

Returns: tuple[bool, str | None]

Tuple of (success, error_message). If success is True, error_message is None.

nemo_curator.stages.text.download.common_crawl.download.CommonCrawlWARCDownloader._get_output_filename(
url: str
) -> str

Generate output filename from URL.
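The exact derivation is not shown here, but a common pattern for this kind of helper (a hypothetical sketch, not the stage's actual implementation) is to flatten the URL path into a single safe local filename:

```python
from urllib.parse import urlparse

def output_filename_from_url(url: str) -> str:
    # Hypothetical sketch: take the URL path and flatten its directory
    # separators so the WARC file lands as one flat local filename.
    path = urlparse(url).path.lstrip("/")
    return path.replace("/", "-")
```

Flattening keeps one download directory instead of mirroring Common Crawl's deep `crawl-data/.../segments/...` hierarchy locally.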

class nemo_curator.stages.text.download.common_crawl.download.CommonCrawlWARCReader(
warc_filename_col: str = 'warc_filename',
warc_record_offset_col: str = 'warc_record_offset',
warc_record_length_col: str = 'warc_record_length',
binary_content_col: str = 'binary_content',
drop_failed: bool = True,
max_workers: int = 16,
timeout: int = 30,
max_retries: int = 3
)

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

Reads WARC records directly from Common Crawl using HTTPS range requests.

This stage fetches raw HTML content from Common Crawl’s public servers using byte-range requests. No AWS credentials or s5cmd required.
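A byte-range fetch of one record can be sketched as below, reusing the module's CC_BASE_URL and HTTP_PARTIAL_CONTENT values. The `fetch_record` helper is illustrative only, and it uses the standard library rather than the pooled requests session the stage actually maintains:

```python
import urllib.request

CC_BASE_URL = "https://data.commoncrawl.org/"
HTTP_PARTIAL_CONTENT = 206

def range_header(offset: int, length: int) -> dict[str, str]:
    # HTTP Range bounds are inclusive, so the last byte is offset + length - 1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

def fetch_record(warc_filename: str, offset: int, length: int,
                 timeout: int = 30) -> bytes:
    # Illustrative helper: fetch the gzip-compressed bytes of a single
    # WARC record identified by (filename, offset, length).
    req = urllib.request.Request(CC_BASE_URL + warc_filename,
                                 headers=range_header(offset, length))
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if resp.status != HTTP_PARTIAL_CONTENT:
            raise RuntimeError(f"expected 206 Partial Content, got {resp.status}")
        return resp.read()
```

A server honoring the Range header answers 206 Partial Content with only the requested slice, which is why no credentials or whole-file download are needed.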

name
= 'CommonCrawlWARCReader'
nemo_curator.stages.text.download.common_crawl.download.CommonCrawlWARCReader._get_session() -> requests.Session

Get or create a requests session for connection pooling.

nemo_curator.stages.text.download.common_crawl.download.CommonCrawlWARCReader._read_warc_record(
row: pandas.Series
) -> bytes | None

Fetch a single WARC record using HTTPS range request.

This method:

  1. Fetches gzip-compressed WARC record bytes via HTTP range request
  2. Decompresses the gzip content
  3. Parses the WARC record format using warcio
  4. Extracts and returns the HTTP response body (the actual content)
nemo_curator.stages.text.download.common_crawl.download.CommonCrawlWARCReader._read_warc_records_batch(
df_partition: pandas.DataFrame
) -> list[bytes | None]

Fetch multiple records in parallel using ThreadPoolExecutor.

nemo_curator.stages.text.download.common_crawl.download.CommonCrawlWARCReader.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.download.common_crawl.download.CommonCrawlWARCReader.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.download.common_crawl.download.CommonCrawlWARCReader.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch
nemo_curator.stages.text.download.common_crawl.download.CC_BASE_URL = 'https://data.commoncrawl.org/'
nemo_curator.stages.text.download.common_crawl.download.HTTP_OK = 200
nemo_curator.stages.text.download.common_crawl.download.HTTP_PARTIAL_CONTENT = 206