stages.text.download.wikipedia.url_generation

Module Contents

Classes

WikipediaUrlGenerator

Generates URLs for Wikipedia dump files.

Data

REQUEST_TIMEOUT

API

stages.text.download.wikipedia.url_generation.REQUEST_TIMEOUT

30
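The value is presumably a timeout, in seconds, for the HTTP requests this module issues while looking up dump files; the source states neither the unit nor where the constant is used. A minimal standard-library sketch of that pattern follows. `fetch_bytes` is a hypothetical helper for illustration and is not part of this module.

```python
import urllib.request

# Mirrors the documented value; treating it as seconds is an assumption.
REQUEST_TIMEOUT = 30


def fetch_bytes(url: str, timeout: float = REQUEST_TIMEOUT) -> bytes:
    """Hypothetical helper: read a URL's body, giving up after `timeout` seconds."""
    # urlopen raises an error instead of blocking forever when the
    # server does not respond within the timeout.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Passing an explicit timeout keeps a stalled connection from blocking URL generation indefinitely.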

class stages.text.download.wikipedia.url_generation.WikipediaUrlGenerator

Bases: nemo_curator.stages.text.download.URLGenerator

Generates URLs for Wikipedia dump files.

dump_date: str | None

None

generate_urls() -> list[str]

Generate Wikipedia dump URLs.

Returns: List of URLs pointing to Wikipedia dump files

language: str

'en'

wikidumps_index_prefix: str

'https://dumps.wikimedia.org'
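To illustrate how the three attributes could combine into dump URLs, here is a self-contained sketch. It is not the class's actual implementation. It assumes the public Wikimedia layout `<prefix>/<language>wiki/<dump_date>/` and a `dumpstatus.json`-style dictionary with a `jobs.articlesmultistreamdump.files` mapping. The function name `sketch_dump_urls`, the requirement of an explicit `dump_date`, and the sample status data are all hypothetical.

```python
def sketch_dump_urls(
    status: dict,
    dump_date: str,
    language: str = "en",
    wikidumps_index_prefix: str = "https://dumps.wikimedia.org",
) -> list[str]:
    """Hypothetical sketch: build dump file URLs from a status dictionary."""
    # Assumed directory layout: <prefix>/<language>wiki/<dump_date>/
    base = f"{wikidumps_index_prefix.rstrip('/')}/{language}wiki/{dump_date}/"
    # Assumed status structure: jobs -> articlesmultistreamdump -> files
    files = status["jobs"]["articlesmultistreamdump"]["files"]
    # Keep only the XML article archives and skip the index files.
    return [base + name for name in sorted(files) if "xml" in name]


# Hand-written sample data for illustration only, not a real dump listing.
sample_status = {
    "jobs": {
        "articlesmultistreamdump": {
            "files": {
                "enwiki-20240101-pages-articles-multistream1.xml-p1p41242.bz2": {},
                "enwiki-20240101-pages-articles-multistream-index1.txt-p1p41242.bz2": {},
            }
        }
    }
}

urls = sketch_dump_urls(sample_status, dump_date="20240101")
```

The sketch works offline from a supplied dictionary. The documented `dump_date` default of `None` suggests the real class can choose a dump on its own, which this sketch does not attempt.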