stages.text.download.wikipedia.download
Module Contents
Classes
WikipediaDownloader | Downloads Wikipedia dump files (.bz2) from wikimedia.org.
API
- class stages.text.download.wikipedia.download.WikipediaDownloader(download_dir: str, verbose: bool = False)
Bases: nemo_curator.stages.text.download.DocumentDownloader
Downloads Wikipedia dump files (.bz2) from wikimedia.org.
Initialization
Creates a Wikipedia downloader.
Args:
download_dir: Path to store raw compressed .bz2 files.
verbose: If True, logs stdout and stderr of the download command.
- num_workers_per_node() -> int | None
Number of workers per node to use for downloading. A limit is sometimes needed to avoid overloading the network.
Returns: The number of workers per node, or None if there is no limit and downloads can run as fast as possible.
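One way a caller might interpret the `int | None` return value is sketched below. This is an illustration only: `StubDownloader` and `effective_workers` are hypothetical names that do not come from the library, and the way NeMo Curator actually consumes this value may differ. The sketch assumes that None means "no cap" and that an integer is an upper bound on workers per node.

```python
from __future__ import annotations


class StubDownloader:
    """Hypothetical stand-in that mimics the num_workers_per_node() contract."""

    def __init__(self, limit: int | None) -> None:
        self._limit = limit

    def num_workers_per_node(self) -> int | None:
        # None signals "no limit"; an int is the per-node cap.
        return self._limit


def effective_workers(downloader: StubDownloader, available_cpus: int) -> int:
    """Resolve how many download workers to start on one node (hypothetical helper)."""
    limit = downloader.num_workers_per_node()
    if limit is None:
        # No cap requested: use every available CPU.
        return available_cpus
    # Respect the cap, but never exceed the CPUs present or drop below one worker.
    return max(1, min(limit, available_cpus))


if __name__ == "__main__":
    print(effective_workers(StubDownloader(None), available_cpus=8))
    print(effective_workers(StubDownloader(2), available_cpus=8))
```

Clamping to the available CPU count is a design choice of this sketch, not documented behavior of the library.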