utils.download_utils

Module Contents

Functions

get_arxiv_urls

get_common_crawl_snapshot_index

get_common_crawl_urls

Retrieves the URLs for all the compressed WARC files between the given Common Crawl snapshots.

get_main_warc_paths

get_news_warc_paths

get_wikipedia_urls

Retrieves all URLs pointing to the Wikipedia dump files for a language (the latest dump unless dump_date is given).

API

utils.download_utils.get_arxiv_urls() → list[str]
utils.download_utils.get_common_crawl_snapshot_index(index_prefix: str) → list[dict]
utils.download_utils.get_common_crawl_urls(
starting_snapshot: str,
ending_snapshot: str,
data_domain_prefix: str = 'https://data.commoncrawl.org',
index_prefix: str = 'https://index.commoncrawl.org',
news: bool = False,
) → list[str]

Retrieves the URLs for all the compressed WARC files between the given Common Crawl snapshots.

Args:
    starting_snapshot: The first Common Crawl snapshot to include. Snapshots must be specified as YYYY-WeekNumber (e.g., '2020-50' or '2021-04'). For the CC-NEWS dataset (selected with news=True), the format is Year-Month (YYYY-MM) instead.
    ending_snapshot: The last Common Crawl snapshot to include. Must be chronologically after the starting snapshot.
    data_domain_prefix: The prefix that will be prepended to each WARC file path to create its URL.
    index_prefix: The prefix of the URL to the Common Crawl index.
    news: If True, gets WARC URLs for the CC-NEWS dataset instead of the CC-MAIN datasets. Also assumes that the start and end snapshots use the 'YYYY-MM' (Year-Month) format.
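The snapshot format and prefix handling described above can be sketched as follows. This is an illustrative sketch, not the module's implementation: `parse_snapshot`, `check_snapshot_range`, and `build_warc_url` are hypothetical helpers, the two-digit requirement and the week bound of 53 are assumptions drawn from the examples, and the WARC path is a placeholder.

```python
import re
from typing import Tuple

# Hypothetical helpers for illustration; they are NOT part of utils.download_utils.
# Assumption: the number after the dash always has two digits, as in '2021-04'.
_SNAPSHOT_RE = re.compile(r"(\d{4})-(\d{2})")


def parse_snapshot(snapshot: str, news: bool = False) -> Tuple[int, int]:
    """Split 'YYYY-WW' (or 'YYYY-MM' when news=True) into (year, number)."""
    match = _SNAPSHOT_RE.fullmatch(snapshot)
    if match is None:
        raise ValueError(f"Expected 'YYYY-NN', got {snapshot!r}")
    year, number = int(match.group(1)), int(match.group(2))
    upper = 12 if news else 53  # months for CC-NEWS, week numbers for CC-MAIN
    if not 1 <= number <= upper:
        raise ValueError(f"{number} is outside 1..{upper} in {snapshot!r}")
    return year, number


def check_snapshot_range(start: str, end: str, news: bool = False) -> None:
    """Reject ranges whose end precedes the start (tuples compare chronologically)."""
    if parse_snapshot(end, news) < parse_snapshot(start, news):
        raise ValueError(f"{end!r} is earlier than {start!r}")


def build_warc_url(prefix: str, warc_path: str) -> str:
    """Prepend the data domain prefix to a relative WARC path."""
    return f"{prefix.rstrip('/')}/{warc_path.lstrip('/')}"


check_snapshot_range("2020-50", "2021-04")
# Placeholder path, not a real WARC file.
print(build_warc_url("https://data.commoncrawl.org", "crawl-data/example.warc.gz"))
```

Whether a range with identical start and end snapshots is accepted by the real function is not stated here; the sketch allows it.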

utils.download_utils.get_main_warc_paths(
snapshot_index: list[dict],
start_snapshot: str,
end_snapshot: str,
prefix: str = 'https://data.commoncrawl.org',
) → list[str]
utils.download_utils.get_news_warc_paths(
start_date: str,
end_date: str,
prefix: str = 'https://data.commoncrawl.org',
) → list[str]
utils.download_utils.get_wikipedia_urls(
language: str = 'en',
wikidumps_index_prefix: str = 'https://dumps.wikimedia.org',
dump_date: str | None = None,
) → list[str]

Retrieves all URLs pointing to the Wikipedia dump files for a language (the latest dump unless dump_date is given).

Args:
    language: Desired language of the Wikipedia dump.
    wikidumps_index_prefix: The base URL for all Wikipedia dumps.
    dump_date: A string formatted as 'YYYYMMDD' for the Wikipedia dump to use. If None, the latest dump is used.
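To make the dump_date format concrete, here is a minimal sketch of how a 'YYYYMMDD' string might be validated and combined with the other arguments. `validate_dump_date` and `dump_index_url` are hypothetical helpers, not part of this module, and the `<prefix>/<language>wiki/<dump_date>/` layout is an assumption about how the dump site is organized rather than a description of what get_wikipedia_urls does internally.

```python
from datetime import datetime
from typing import Optional


def validate_dump_date(dump_date: Optional[str]) -> Optional[str]:
    """Return dump_date unchanged if it is None or a valid 'YYYYMMDD' string."""
    if dump_date is None:
        return None  # the caller wants the latest dump
    # strptime alone tolerates one-digit months and days, so check the shape first.
    if len(dump_date) != 8 or not dump_date.isdigit():
        raise ValueError(f"Expected 'YYYYMMDD', got {dump_date!r}")
    datetime.strptime(dump_date, "%Y%m%d")  # raises ValueError for e.g. 20240230
    return dump_date


def dump_index_url(
    language: str = "en",
    wikidumps_index_prefix: str = "https://dumps.wikimedia.org",
    dump_date: Optional[str] = None,
) -> str:
    """Build the index URL for one language (assumed layout, see lead-in)."""
    base = f"{wikidumps_index_prefix.rstrip('/')}/{language}wiki"
    date = validate_dump_date(dump_date)
    return f"{base}/{date}/" if date else f"{base}/"


print(dump_index_url("de", dump_date="20240101"))
```

Checking the string length before calling `strptime` matters because `%m` and `%d` also accept single-digit values, so a seven-digit string could otherwise pass.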