nemo_curator.stages.text.download.wikipedia.download

Module Contents

Classes

Name                   Description
WikipediaDownloader    Downloads Wikipedia dump files (.bz2) from wikimedia.org.

API

class nemo_curator.stages.text.download.wikipedia.download.WikipediaDownloader(
download_dir: str,
verbose: bool = False
)

Bases: DocumentDownloader

Downloads Wikipedia dump files (.bz2) from wikimedia.org.

nemo_curator.stages.text.download.wikipedia.download.WikipediaDownloader._download_to_path(
url: str,
path: str
) -> tuple[bool, str | None]

Download a Wikipedia dump file to the specified path.

Parameters:

url
str

URL to download

path
str

Local path to save file

Returns: tuple[bool, str | None]

Tuple of (success, error_message). If success is True, error_message is None.
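The return contract above can be illustrated with a small stand-in. This is a minimal sketch, not the library's implementation: `download_to_path_sketch` is a hypothetical helper, and fetching with `urllib.request` is an assumption made only so that the example is self-contained.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Optional


def download_to_path_sketch(url: str, path: str) -> tuple[bool, Optional[str]]:
    """Hypothetical stand-in that mirrors the (success, error_message) contract.

    Assumption: the transfer uses urllib; the real downloader may use a
    different mechanism.
    """
    try:
        # Make sure the destination directory exists before writing.
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        # Stream the response to disk in chunks instead of loading it into memory.
        with urllib.request.urlopen(url) as response, open(path, "wb") as out:
            shutil.copyfileobj(response, out)
    except (OSError, ValueError) as exc:
        # URLError and HTTPError are OSError subclasses; malformed URLs raise ValueError.
        return False, str(exc)
    return True, None
```

A caller would typically unpack the tuple and branch on the first element, for example `ok, err = download_to_path_sketch(url, path)`, and read `err` only when `ok` is False.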

nemo_curator.stages.text.download.wikipedia.download.WikipediaDownloader._get_output_filename(
url: str
) -> str

Generate output filename from URL.
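The page does not document the naming scheme, so the following is only one plausible way to derive a filename from a URL: drop the leading slash of the URL path and flatten the remaining separators into hyphens. `output_filename_sketch` is a hypothetical name, and the URL in the usage lines is illustrative rather than a real dump file.

```python
from urllib.parse import urlparse


def output_filename_sketch(url: str) -> str:
    """Hypothetical helper: flatten the URL path into a single filename.

    Assumption: path separators become hyphens; the actual scheme used by
    WikipediaDownloader may differ.
    """
    # Keep only the path component, without scheme, host, or query string.
    path = urlparse(url).path
    # Remove the leading "/" and replace the remaining separators.
    return path.lstrip("/").replace("/", "-")


# Illustrative URL only; it does not point at a real dump file.
name = output_filename_sketch("https://dumps.wikimedia.org/enwiki/20240101/example.xml.bz2")
print(name)  # enwiki-20240101-example.xml.bz2
```

Flattening the whole path, rather than taking only the last segment, keeps files from different dump dates distinct when they share a base name.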

nemo_curator.stages.text.download.wikipedia.download.WikipediaDownloader.num_workers_per_node() -> int | None