stages.text.download.wikipedia.download

Module Contents

Classes

WikipediaDownloader

Downloads Wikipedia dump files (.bz2) from wikimedia.org.

API

class stages.text.download.wikipedia.download.WikipediaDownloader(download_dir: str, verbose: bool = False)

Bases: nemo_curator.stages.text.download.DocumentDownloader

Downloads Wikipedia dump files (.bz2) from wikimedia.org.

Initialization

Creates a Wikipedia downloader.

Args:
    download_dir: Path to the directory where the raw compressed .bz2 files are stored.
    verbose: If True, logs the stdout and stderr of the download command.
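A minimal construction sketch based on the signature above. The import path is an assumption inferred from the module name and the Bases entry, and the fallback class is a hypothetical stand-in that only mirrors the documented constructor so the snippet runs without the library installed. It does not reflect the real implementation.

```python
import tempfile

try:
    # Assumed import path, inferred from the module name and the Bases entry.
    from nemo_curator.stages.text.download.wikipedia.download import WikipediaDownloader
except ImportError:
    # Hypothetical stand-in that only mirrors the documented constructor signature.
    class WikipediaDownloader:
        def __init__(self, download_dir: str, verbose: bool = False):
            self.download_dir = download_dir
            self.verbose = verbose

# A temporary directory stands in for a real download location.
download_dir = tempfile.mkdtemp(prefix="wikipedia_dumps_")

# verbose=True asks the downloader to log stdout and stderr of the download command.
downloader = WikipediaDownloader(download_dir=download_dir, verbose=True)
```

In practice, download_dir would point at persistent storage with enough room for the compressed dump files.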

num_workers_per_node() int | None

Number of workers per node used for downloading. A limit is sometimes needed to avoid overloading the network.

Returns: The number of workers per node, or None if there is no limit and downloads can run as fast as possible.
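One way a caller could interpret this return value is sketched below. The names resolve_worker_count and available_workers are hypothetical and are not part of the library. The sketch assumes only the documented contract that None means no limit.

```python
from typing import Optional


def resolve_worker_count(limit: Optional[int], available_workers: int) -> int:
    """Return how many download workers to start on one node.

    `limit` plays the role of the value returned by num_workers_per_node().
    """
    if available_workers < 1:
        raise ValueError("available_workers must be at least 1")
    if limit is None:
        # No limit: use every worker the node offers.
        return available_workers
    if limit < 1:
        raise ValueError("limit must be at least 1 when it is not None")
    # Cap the worker count so the network is not overloaded.
    return min(limit, available_workers)


print(resolve_worker_count(None, 8))  # 8
print(resolve_worker_count(2, 8))     # 2
```

Treating None as "use everything available" keeps the cap optional, so a downloader that does not need throttling can simply return None.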