nemo_curator.stages.text.download.common_crawl.url_generation
Module Contents
Classes
API
Dataclass, Abstract
Bases: URLGenerator
Get URLs for Common Crawl data
Each concrete subclass must implement _parse_datetime_from_snapshot_string and generate_path_urls.
data_prefix
end_snapshot_str
limit
start_snapshot_str
abstract
Parses a snapshot string (YYYY-WW or YYYY-MM) into a datetime object.
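As a rough illustration of the two snapshot formats, the sketch below parses each into a datetime. The function names are hypothetical and are not part of this module. It also assumes that the WW component is an ISO week number and that a week resolves to its Monday; the library's concrete subclasses may use a different convention.

```python
from datetime import datetime


def parse_week_snapshot(snapshot: str) -> datetime:
    """Parse 'YYYY-WW' into the Monday of that week (assumes ISO weeks)."""
    year_str, week_str = snapshot.split("-")
    # fromisocalendar(year, week, day): day=1 selects Monday.
    return datetime.fromisocalendar(int(year_str), int(week_str), 1)


def parse_month_snapshot(snapshot: str) -> datetime:
    """Parse 'YYYY-MM' into the first day of that month."""
    return datetime.strptime(snapshot, "%Y-%m")


week_start = parse_week_snapshot("2023-50")
month_start = parse_month_snapshot("2021-04")
```

Malformed input such as "2023" or "2023-99" raises ValueError in both helpers, which is usually the desired behavior for a user-supplied snapshot string.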
Parses the start and end snapshot strings into date objects. For ‘news’ (YYYY-MM), the day is set to 1 for start_date and to the last day of the month for end_date, so that the full end month is covered.
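The month-boundary rule for YYYY-MM snapshots can be sketched as follows. The helper name month_bounds is hypothetical and only illustrates the rule described above, not the library's own code.

```python
import calendar
from datetime import date


def month_bounds(start_snapshot: str, end_snapshot: str) -> tuple[date, date]:
    """Return (first day of the start month, last day of the end month)."""
    start_year, start_month = (int(part) for part in start_snapshot.split("-"))
    end_year, end_month = (int(part) for part in end_snapshot.split("-"))
    # monthrange returns (weekday of day 1, number of days in the month).
    last_day = calendar.monthrange(end_year, end_month)[1]
    return date(start_year, start_month, 1), date(end_year, end_month, last_day)


start_date, end_date = month_bounds("2021-04", "2024-02")
```

Using calendar.monthrange avoids hard-coding month lengths and handles leap years, so an end snapshot of 2024-02 resolves to 29 February.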
Fetches all relevant warc.paths.gz files, decompresses them, and returns a list of all individual WARC file URLs.
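The decompress-and-expand step can be sketched without any network access. The function below takes the raw bytes of one warc.paths.gz file and returns full WARC URLs. Its name, and the default prefix https://data.commoncrawl.org, are assumptions made for illustration; in this class the prefix would come from data_prefix.

```python
import gzip


def warc_urls_from_paths_gz(
    payload: bytes, data_prefix: str = "https://data.commoncrawl.org"
) -> list[str]:
    """Decompress a warc.paths.gz payload and prefix each relative path."""
    text = gzip.decompress(payload).decode("utf-8")
    prefix = data_prefix.rstrip("/")
    # Each non-empty line is one WARC path relative to the data prefix.
    return [f"{prefix}/{line.strip()}" for line in text.splitlines() if line.strip()]


# Build a small in-memory payload instead of downloading a real listing.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2023-50/a.warc.gz\n\ncrawl-data/CC-MAIN-2023-50/b.warc.gz\n"
)
urls = warc_urls_from_paths_gz(sample)
```

Keeping the download separate from the parsing makes the parsing step easy to test with in-memory data.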
abstract
Generates the list of URLs pointing to warc.paths.gz files.
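For the monthly ‘news’ case, one plausible shape for this method is to walk the months between the parsed start and end dates and emit one warc.paths.gz URL per month. The sketch below assumes the crawl-data/CC-NEWS/YYYY/MM/ layout and the https://data.commoncrawl.org prefix; the function name is hypothetical, and the library's own implementation may differ.

```python
from datetime import date


def news_path_urls(
    start_date: date, end_date: date, data_prefix: str = "https://data.commoncrawl.org"
) -> list[str]:
    """List one warc.paths.gz URL per month in [start_date, end_date]."""
    prefix = data_prefix.rstrip("/")
    urls = []
    year, month = start_date.year, start_date.month
    # Compare (year, month) tuples so the end month is included.
    while (year, month) <= (end_date.year, end_date.month):
        urls.append(f"{prefix}/crawl-data/CC-NEWS/{year}/{month:02d}/warc.paths.gz")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


path_urls = news_path_urls(date(2023, 11, 1), date(2024, 1, 31))
```

If the start date falls after the end date, the loop body never runs and the result is an empty list.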
Processes the task and returns a list of WARC URLs.