utils.download_utils
Module Contents
Functions
| get_common_crawl_urls | Retrieves the URLs for all the compressed WARC files between given Common Crawl snapshots |
| get_wikipedia_urls | Retrieves all URLs pointing to the latest Wikipedia dumps |
API
- utils.download_utils.get_arxiv_urls() -> list[str]
- utils.download_utils.get_common_crawl_snapshot_index(index_prefix: str) -> list[dict]
- utils.download_utils.get_common_crawl_urls(
      starting_snapshot: str,
      ending_snapshot: str,
      data_domain_prefix: str = 'https://data.commoncrawl.org',
      index_prefix: str = 'https://index.commoncrawl.org',
      news: bool = False,
  )
Retrieves the URLs for all the compressed WARC files between the given Common Crawl snapshots.
Args:
    starting_snapshot: The first Common Crawl snapshot to include. Snapshots must be specified as YYYY-WeekNumber (e.g., ‘2020-50’ or ‘2021-04’). For the CC-NEWS dataset (selected with the news=True flag), the format changes to Year-Month (YYYY-MM).
    ending_snapshot: The last Common Crawl snapshot to include. Must be chronologically after the starting snapshot.
    data_domain_prefix: The prefix that will be prepended to each WARC file path to create the URL.
    index_prefix: The prefix of the URL to the Common Crawl index.
    news: If True, gets WARC URLs for the CC-NEWS dataset instead of the CC-MAIN datasets. Also assumes that the format for the start and end snapshots is ‘YYYY-MM’ (Year-Month).
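The two snapshot formats described above are easy to mix up, so a small pre-check can help before calling get_common_crawl_urls. The following is a minimal sketch, not part of utils.download_utils: the helper names parse_snapshot and check_snapshot_range are hypothetical, the upper bound of 53 for week numbers is an assumption based on ISO week numbering, and treating an ending snapshot equal to the starting one as acceptable is also an assumption.

```python
import re

# Both formats share the shape "YYYY-NN"; only the meaning of NN differs.
_SNAPSHOT_RE = re.compile(r"^(\d{4})-(\d{2})$")


def parse_snapshot(snapshot: str, news: bool = False) -> tuple[int, int]:
    """Parse 'YYYY-WeekNumber' (CC-MAIN) or 'YYYY-MM' (CC-NEWS) into (year, part).

    Hypothetical helper for illustration; the library may validate differently.
    """
    match = _SNAPSHOT_RE.match(snapshot)
    if match is None:
        raise ValueError(f"Snapshot {snapshot!r} is not of the form YYYY-NN")
    year, part = int(match.group(1)), int(match.group(2))
    # Assumption: months run 1-12, week numbers run 1-53.
    upper = 12 if news else 53
    if not 1 <= part <= upper:
        kind = "month" if news else "week number"
        raise ValueError(f"Snapshot {snapshot!r} has an out-of-range {kind}")
    return year, part


def check_snapshot_range(
    starting_snapshot: str, ending_snapshot: str, news: bool = False
) -> tuple[tuple[int, int], tuple[int, int]]:
    """Check that the ending snapshot does not precede the starting one."""
    start = parse_snapshot(starting_snapshot, news)
    end = parse_snapshot(ending_snapshot, news)
    # Tuples compare year first, then week or month.
    if end < start:
        raise ValueError("ending_snapshot is chronologically before starting_snapshot")
    return start, end
```

Strings that pass this check could then be supplied as starting_snapshot and ending_snapshot, with the same news flag used for both the check and the call.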
- utils.download_utils.get_main_warc_paths(
      snapshot_index: list[dict],
      start_snapshot: str,
      end_snapshot: str,
      prefix: str = 'https://data.commoncrawl.org',
  )
- utils.download_utils.get_news_warc_paths(
      start_date: str,
      end_date: str,
      prefix: str = 'https://data.commoncrawl.org',
  )
- utils.download_utils.get_wikipedia_urls(
      language: str = 'en',
      wikidumps_index_prefix: str = 'https://dumps.wikimedia.org',
      dump_date: str | None = None,
  )
Retrieves all URLs pointing to the latest Wikipedia dumps.
Args:
    language: Desired language of the Wikipedia dump.
    wikidumps_index_prefix: The base URL for all Wikipedia dumps.
    dump_date: A string formatted as “YYYYMMDD” for the Wikipedia dump to use. If None, the latest dump is used.