stages.text.download.common_crawl.url_generation#
Module Contents#
Classes#
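BaseCommonCrawlUrlGenerator | Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.
MainCommonCrawlUrlGenerator | Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.
NewsCommonCrawlUrlGenerator | Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.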
API#
- class stages.text.download.common_crawl.url_generation.BaseCommonCrawlUrlGenerator#
Bases:
nemo_curator.stages.text.download.URLGenerator, abc.ABC
Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.
- data_prefix: str#
‘https://data.commoncrawl.org’
- end_snapshot_str: str#
None
- generate_data_urls(path_urls: str | list[str] | None = None)#
Fetches all relevant warc.paths.gz files, decompresses them, and returns a list of all individual WARC file URLs.
- abstractmethod generate_path_urls() list[str]#
Generates the list of URLs pointing to warc.paths.gz files.
- generate_urls() list[str]#
Process the task and return a list of WARC URLs.
- limit: int | None#
None
- start_snapshot_str: str#
None
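As a rough illustration of what generate_data_urls describes, the sketch below decompresses a warc.paths.gz payload and prefixes each relative path with data_prefix. It is not the library's implementation: the helper name paths_gz_to_urls, the in-memory sample listing, and the way limit truncates the result are assumptions made for the example, and no network request is performed.

```python
import gzip
from typing import Optional

# Same value as the documented data_prefix default.
DATA_PREFIX = "https://data.commoncrawl.org"


def paths_gz_to_urls(
    payload: bytes,
    data_prefix: str = DATA_PREFIX,
    limit: Optional[int] = None,
) -> list:
    """Hypothetical helper: turn a gzip-compressed path listing into full URLs."""
    text = gzip.decompress(payload).decode("utf-8")
    # One relative path per line; skip blank lines.
    urls = [f"{data_prefix}/{line.strip()}" for line in text.splitlines() if line.strip()]
    # Assumed behaviour: limit=None keeps everything, otherwise keep the first `limit` URLs.
    return urls if limit is None else urls[:limit]


# Made-up relative paths standing in for the contents of a real warc.paths.gz file.
sample_listing = (
    "crawl-data/EXAMPLE/segments/0/warc/a.warc.gz\n"
    "crawl-data/EXAMPLE/segments/0/warc/b.warc.gz\n"
)
sample_payload = gzip.compress(sample_listing.encode("utf-8"))

all_urls = paths_gz_to_urls(sample_payload)
first_url_only = paths_gz_to_urls(sample_payload, limit=1)
```

In real use the payload would come from downloading each URL returned by generate_path_urls; the compression step here only exists to keep the example self-contained.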
- class stages.text.download.common_crawl.url_generation.MainCommonCrawlUrlGenerator#
Bases:
stages.text.download.common_crawl.url_generation.BaseCommonCrawlUrlGenerator
Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.
- generate_path_urls() list[str]#
Generates the list of URLs pointing to warc.paths.gz files.
- index_prefix: str#
‘https://index.commoncrawl.org’
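The Common Crawl index server publishes the list of available crawls at /collinfo.json. A minimal sketch of how crawl IDs from such a listing could be turned into warc.paths.gz URLs follows. The helper name main_path_urls and the trimmed inline sample standing in for the server response are illustrative, and this page does not confirm that the class works this way.

```python
import json

INDEX_PREFIX = "https://index.commoncrawl.org"
DATA_PREFIX = "https://data.commoncrawl.org"

# Where the full crawl listing lives; it is not fetched in this sketch.
COLLINFO_URL = f"{INDEX_PREFIX}/collinfo.json"

# Trimmed stand-in for the JSON served at COLLINFO_URL;
# real entries carry more fields than "id".
sample_collinfo = '[{"id": "CC-MAIN-2023-50"}, {"id": "CC-MAIN-2023-40"}]'


def main_path_urls(collinfo_json: str, data_prefix: str = DATA_PREFIX) -> list:
    """Hypothetical helper: build one warc.paths.gz URL per listed crawl ID."""
    crawl_ids = [entry["id"] for entry in json.loads(collinfo_json)]
    return [f"{data_prefix}/crawl-data/{crawl_id}/warc.paths.gz" for crawl_id in crawl_ids]


path_urls = main_path_urls(sample_collinfo)
```

A date-range filter driven by start_snapshot_str and end_snapshot_str would sit between parsing the listing and building the URLs; it is omitted because the snapshot string format is not stated here.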
- class stages.text.download.common_crawl.url_generation.NewsCommonCrawlUrlGenerator#
Bases:
stages.text.download.common_crawl.url_generation.BaseCommonCrawlUrlGenerator
Get URLs for Common Crawl data. Each concrete implementation must implement _parse_datetime_from_snapshot_string and generate_path_urls.
- generate_path_urls() list[str]#
Generates the list of URLs pointing to warc.paths.gz files.
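Common Crawl publishes its news dataset under crawl-data/CC-NEWS/<year>/<month>/, so a news generator plausibly emits one warc.paths.gz URL per month in the requested range. The sketch below assumes snapshot strings of the form YYYY-MM, which this page does not confirm, and the function name news_path_urls is hypothetical.

```python
from datetime import date

DATA_PREFIX = "https://data.commoncrawl.org"


def news_path_urls(start: str, end: str, data_prefix: str = DATA_PREFIX) -> list:
    """Hypothetical helper: one CC-NEWS warc.paths.gz URL per month, end inclusive."""
    # Assumed snapshot format: "YYYY-MM".
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    current = date(start_year, start_month, 1)
    last = date(end_year, end_month, 1)

    urls = []
    while current <= last:
        urls.append(
            f"{data_prefix}/crawl-data/CC-NEWS/{current.year}/{current.month:02d}/warc.paths.gz"
        )
        # Step to the first day of the next month, rolling over the year in December.
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)
    return urls


news_urls = news_path_urls("2022-11", "2023-01")
```

If the start month is later than the end month, the loop body never runs and the function returns an empty list.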