stages.text.download.common_crawl.url_generation#

Module Contents#

Classes#

BaseCommonCrawlUrlGenerator

Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.

MainCommonCrawlUrlGenerator

Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.

NewsCommonCrawlUrlGenerator

Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.

API#

class stages.text.download.common_crawl.url_generation.BaseCommonCrawlUrlGenerator#

Bases: nemo_curator.stages.text.download.URLGenerator, abc.ABC

Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.

data_prefix: str#

'https://data.commoncrawl.org'

end_snapshot_str: str#

None

generate_data_urls(path_urls: str | list[str] | None = None) -> list[str]#

Fetches all relevant warc.paths.gz files, decompresses them, and returns a list of all individual WARC file URLs.
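The decompress-and-expand step described above can be sketched in isolation. This is a minimal illustration, not the library's implementation: the helper name `paths_gz_to_urls` is hypothetical, and it assumes each line of a decompressed warc.paths.gz file is a path relative to `data_prefix`. It works on bytes already in memory, so the network fetch is left out.

```python
import gzip

# Default taken from the documented data_prefix attribute.
DATA_PREFIX = "https://data.commoncrawl.org"


def paths_gz_to_urls(compressed: bytes, data_prefix: str = DATA_PREFIX) -> list[str]:
    """Hypothetical helper: turn the bytes of one warc.paths.gz file into WARC URLs.

    Assumes one relative path per line; blank lines are skipped.
    """
    text = gzip.decompress(compressed).decode("utf-8")
    return [f"{data_prefix}/{line.strip()}" for line in text.splitlines() if line.strip()]


# Build a small in-memory stand-in for a downloaded warc.paths.gz file.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2023-50/segments/0001/warc/a.warc.gz\n"
    b"crawl-data/CC-MAIN-2023-50/segments/0001/warc/b.warc.gz\n"
)
urls = paths_gz_to_urls(sample)
```

Keeping the parsing separate from the download makes the step easy to check without network access.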

abstractmethod generate_path_urls() -> list[str]#

Generates the list of URLs pointing to warc.paths.gz files.

generate_urls() -> list[str]#

Process the task and return a list of WARC URLs.

limit: int | None#

None

start_snapshot_str: str#

None

class stages.text.download.common_crawl.url_generation.MainCommonCrawlUrlGenerator#

Bases: stages.text.download.common_crawl.url_generation.BaseCommonCrawlUrlGenerator

Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.

generate_path_urls() -> list[str]#

Generates the list of URLs pointing to warc.paths.gz files.
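As a rough illustration of what path-URL generation for the main crawl could look like, the sketch below filters a list of snapshot IDs by date range and builds one warc.paths.gz URL per match. Several things here are assumptions rather than documented behavior: the helper names `parse_main_snapshot` and `main_path_urls`, the "YYYY-WW" (ISO year and week) form of the snapshot strings, and passing the snapshot IDs in directly instead of looking them up via `index_prefix`.

```python
from datetime import date

DATA_PREFIX = "https://data.commoncrawl.org"
ID_PREFIX = "CC-MAIN-"


def parse_main_snapshot(snapshot: str) -> date:
    """Hypothetical parser: 'YYYY-WW' -> the Monday of that ISO week."""
    year_str, week_str = snapshot.split("-")
    return date.fromisocalendar(int(year_str), int(week_str), 1)


def main_path_urls(
    snapshot_ids: list[str],
    start: str,
    end: str,
    data_prefix: str = DATA_PREFIX,
) -> list[str]:
    """Keep snapshots whose week lies in [start, end] and build their warc.paths.gz URLs."""
    start_date, end_date = parse_main_snapshot(start), parse_main_snapshot(end)
    urls = []
    for snapshot_id in snapshot_ids:
        # 'CC-MAIN-2023-50' -> '2023-50'
        snapshot_date = parse_main_snapshot(snapshot_id[len(ID_PREFIX):])
        if start_date <= snapshot_date <= end_date:
            urls.append(f"{data_prefix}/crawl-data/{snapshot_id}/warc.paths.gz")
    return urls


main_urls = main_path_urls(
    ["CC-MAIN-2023-40", "CC-MAIN-2023-50", "CC-MAIN-2024-10"],
    start="2023-45",
    end="2024-01",
)
```

Only the snapshot that falls between the two weeks is kept; the other two IDs lie outside the requested range.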

index_prefix: str#

'https://index.commoncrawl.org'

class stages.text.download.common_crawl.url_generation.NewsCommonCrawlUrlGenerator#

Bases: stages.text.download.common_crawl.url_generation.BaseCommonCrawlUrlGenerator

Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.

generate_path_urls() -> list[str]#

Generates the list of URLs pointing to warc.paths.gz files.
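For the news crawl, one plausible shape for this step is a walk over calendar months, emitting one warc.paths.gz URL per month. The sketch below is illustrative only: the helper name `news_path_urls` is hypothetical, and it assumes snapshot strings of the form "YYYY-MM" and the `crawl-data/CC-NEWS/<year>/<month>/` layout used by the public Common Crawl data host.

```python
DATA_PREFIX = "https://data.commoncrawl.org"


def news_path_urls(start: str, end: str, data_prefix: str = DATA_PREFIX) -> list[str]:
    """Hypothetical helper: one warc.paths.gz URL per month in [start, end], inclusive."""
    year, month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    urls = []
    # Tuple comparison orders (year, month) pairs chronologically.
    while (year, month) <= (end_year, end_month):
        urls.append(f"{data_prefix}/crawl-data/CC-NEWS/{year}/{month:02d}/warc.paths.gz")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


news_urls = news_path_urls("2022-11", "2023-01")
```

The range crosses a year boundary, so the result covers November 2022, December 2022, and January 2023.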