nemo_curator.stages.text.download.wikipedia.url_generation

Module Contents

Classes

Name | Description
WikipediaUrlGenerator | Generates URLs for Wikipedia dump files.

Data

REQUEST_TIMEOUT

API

class nemo_curator.stages.text.download.wikipedia.url_generation.WikipediaUrlGenerator(
language: str = 'en',
dump_date: str | None = None,
wikidumps_index_prefix: str = 'https://dumps.wikimedia.org'
)
Dataclass

Bases: URLGenerator

Generates URLs for Wikipedia dump files.

dump_date: str | None = None
language: str = 'en'
wikidumps_index_prefix: str = 'https://dumps.wikimedia.org'

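
To illustrate how the three fields could combine into a dump location, here is a minimal sketch. `DumpLocation`, `index_url`, and `dump_url` are hypothetical names, not part of nemo_curator. The path layout `<prefix>/<language>wiki/<dump_date>/` follows the public directory structure of dumps.wikimedia.org. How the real class handles `dump_date=None` is not stated on this page, so the sketch simply returns None in that case.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DumpLocation:
    """Hypothetical stand-in that mirrors the three documented fields."""

    language: str = "en"
    dump_date: str | None = None
    wikidumps_index_prefix: str = "https://dumps.wikimedia.org"

    def index_url(self) -> str:
        # Each wiki is published under "<prefix>/<language>wiki".
        prefix = self.wikidumps_index_prefix.rstrip("/")
        return f"{prefix}/{self.language}wiki"

    def dump_url(self) -> str | None:
        # Assumption: without a date there is no fixed directory to point at.
        if self.dump_date is None:
            return None
        return f"{self.index_url()}/{self.dump_date}/"
```

With the defaults, `DumpLocation().index_url()` gives the English Wikipedia index; setting `language="de"` switches the path to `dewiki`.
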
nemo_curator.stages.text.download.wikipedia.url_generation.WikipediaUrlGenerator._get_data_for_dump(
dump_date: str,
wiki_index_url: str
) -> dict | None

Get the JSON dump data for a given dump date. Returns None if the dump is not found.
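
The "return None if the dump is not found" behavior can be sketched as follows. This is not the library's implementation: `http_get`, `get_data_for_dump`, and the injectable `fetch` parameter are hypothetical, the sketch uses the standard library rather than whatever HTTP client the module uses, and it assumes that the status file is named `dumpstatus.json` (as on dumps.wikimedia.org) and that the timeout value is in seconds.

```python
from __future__ import annotations

import json
import urllib.error
import urllib.request
from collections.abc import Callable

REQUEST_TIMEOUT = 30  # same value as the module constant; treated as seconds here


def http_get(url: str) -> bytes | None:
    """Return the response body, or None when the server answers 404."""
    try:
        with urllib.request.urlopen(url, timeout=REQUEST_TIMEOUT) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def get_data_for_dump(
    dump_date: str,
    wiki_index_url: str,
    fetch: Callable[[str], bytes | None] = http_get,
) -> dict | None:
    """Fetch and parse the status JSON for one dump; None if it is missing."""
    status_url = f"{wiki_index_url.rstrip('/')}/{dump_date}/dumpstatus.json"
    body = fetch(status_url)
    if body is None:
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # Assumption: an unparseable body is treated like a missing dump.
        return None
```

Passing `fetch` as a parameter keeps the network call replaceable, so the lookup logic can be exercised offline.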

nemo_curator.stages.text.download.wikipedia.url_generation.WikipediaUrlGenerator._get_wikipedia_urls() -> list[str]

Retrieves all URLs pointing to Wikipedia dumps for the specified language and date.

Returns: list[str]

List of URLs for Wikipedia dump files
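
One plausible way to turn dump data into a URL list is sketched below. `urls_from_dump_data` is a hypothetical helper. The sketch assumes the status JSON lists file names under `jobs` → `articlesmultistreamdump` → `files`, and that only names containing "xml" are wanted (skipping index files). Neither assumption is confirmed by this page.

```python
from __future__ import annotations

from urllib.parse import urljoin


def urls_from_dump_data(dump_data: dict, dump_url: str) -> list[str]:
    """Build absolute URLs for the XML files listed in one dump's status data."""
    # Assumed schema: jobs -> articlesmultistreamdump -> files -> {file name: info}
    job = dump_data.get("jobs", {}).get("articlesmultistreamdump", {})
    files = job.get("files", {})
    # urljoin replaces the last path segment unless the base ends with "/".
    base = dump_url if dump_url.endswith("/") else dump_url + "/"
    return [urljoin(base, name) for name in files if "xml" in name]
```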

nemo_curator.stages.text.download.wikipedia.url_generation.WikipediaUrlGenerator.generate_urls() -> list[str]

Generate Wikipedia dump URLs.

Returns: list[str]

List of URLs pointing to Wikipedia dump files
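
A minimal usage sketch, based only on the constructor signature and the `generate_urls()` return type documented here. `list_wikipedia_dump_urls` is a hypothetical wrapper. Calling it presumably needs network access to the dump index, and the accepted `dump_date` format is not specified on this page.

```python
from __future__ import annotations


def list_wikipedia_dump_urls(
    language: str = "en",
    dump_date: str | None = None,
) -> list[str]:
    """Return dump file URLs for one language using WikipediaUrlGenerator."""
    # Imported lazily so this snippet loads even where nemo_curator is absent.
    from nemo_curator.stages.text.download.wikipedia.url_generation import (
        WikipediaUrlGenerator,
    )

    generator = WikipediaUrlGenerator(language=language, dump_date=dump_date)
    return generator.generate_urls()
```

The wrapper leaves `wikidumps_index_prefix` at its default; pass it to the constructor when pointing at a different index host.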

nemo_curator.stages.text.download.wikipedia.url_generation.REQUEST_TIMEOUT = 30