download.wikipedia#

Module Contents#

Classes#

WikipediaDownloader

Downloads Wikipedia dump files to disk

WikipediaExtractor

Extracts article text from raw Wikipedia dump records read from disk

WikipediaIterator

Iterator for reading raw Wikipedia dump records that have been downloaded to disk

Functions#

download_wikipedia

Downloads and extracts articles from a Wikipedia dump.

Data#

API#

download.wikipedia.CAT_ALIASES#

None

download.wikipedia.MEDIA_ALIASES#

None

class download.wikipedia.WikipediaDownloader(download_dir: str, verbose: bool = False)#

Bases: nemo_curator.download.doc_builder.DocumentDownloader

Downloads Wikipedia dump files to disk

Initialization

download(url: str) -> str#
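A minimal usage sketch for the downloader. The import path, the example dump URL, and the dump date are assumptions for illustration; they are not taken from this page.

```python
# Hypothetical sketch: import path and dump URL are assumptions, not
# confirmed by this reference page.
from nemo_curator.download import WikipediaDownloader

downloader = WikipediaDownloader(download_dir="./wiki_downloads", verbose=True)

# Per the signature above, download() fetches the remote file into
# download_dir and returns the local file path as a string.
dump_url = (
    "https://dumps.wikimedia.org/enwiki/20240401/"
    "enwiki-20240401-pages-articles-multistream1.xml-p1p41242.bz2"
)  # example URL pattern; the actual dump date and file name will differ
local_path = downloader.download(dump_url)
print(local_path)
```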
class download.wikipedia.WikipediaExtractor(language: str = 'en', parser=mwparserfromhell)#

Bases: nemo_curator.download.doc_builder.DocumentExtractor

Extracts article text from raw Wikipedia dump records read from disk

Initialization

extract(content) -> dict[str, str]#
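A hedged sketch of calling the extractor on a single raw record. The import path and the assumption that `content` is raw wikitext markup are illustrative, not stated on this page.

```python
# Sketch only: the import path and the exact shape of `content` are assumptions.
from nemo_curator.download import WikipediaExtractor

extractor = WikipediaExtractor(language="en")  # parser defaults to mwparserfromhell

raw_wikitext = "'''Example''' is an [[article]] used for illustration."
record = extractor.extract(raw_wikitext)

# Per the signature above, extract() returns a dict[str, str]; the exact
# keys (e.g. a cleaned text field) depend on the implementation.
print(record)
```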
class download.wikipedia.WikipediaIterator(language: str = 'en', log_frequency: int = 1000)#

Bases: nemo_curator.download.doc_builder.DocumentIterator

Iterator for reading raw Wikipedia dump records that have been downloaded to disk

Initialization

iterate(file_path: str) -> collections.abc.Iterator[tuple[dict[str, str], str]]#
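A sketch of iterating over a previously downloaded dump file and feeding each record to the extractor. The import path, the file path, and the reading of the yielded tuple as (metadata dict, raw article content) are assumptions; only the annotated return type above is given by this page.

```python
# Sketch: pairs the iterator with the extractor. The file path and the
# interpretation of the (dict, str) tuple are assumptions.
from nemo_curator.download import WikipediaExtractor, WikipediaIterator

iterator = WikipediaIterator(language="en", log_frequency=1000)
extractor = WikipediaExtractor(language="en")

bz2_path = "./wiki_downloads/enwiki-latest-pages-articles-multistream1.xml.bz2"

for metadata, raw_content in iterator.iterate(bz2_path):
    record = extractor.extract(raw_content)
    if record:  # extraction may yield nothing for redirects or non-article pages
        print(metadata, list(record.keys()))
        break
```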
download.wikipedia.download_wikipedia(
    output_path: str,
    language: str = 'en',
    dump_date: str | None = None,
    output_type: Literal['jsonl', 'parquet'] = 'jsonl',
    raw_download_dir: str | None = None,
    keep_raw_download: bool = False,
    force_download: bool = False,
    url_limit: int | None = None,
    record_limit: int | None = None,
) -> nemo_curator.datasets.DocumentDataset#

Downloads and extracts articles from a Wikipedia dump.

This function retrieves a list of Wikipedia dump URLs for the specified language and dump date, downloads the compressed bz2 dump files (if they are not already present), and extracts their articles using mwparserfromhell. The resulting articles are saved in the specified output format (e.g., “jsonl”) along with relevant metadata.

Args:
output_path (str): The root directory where the final extracted files and intermediate outputs (if any) are stored.
language (str, optional): The language code for the Wikipedia dump to download. Default is “en”.
dump_date (Optional[str], optional): The dump date in “YYYYMMDD” format. If None, the latest available dump is downloaded.
output_type (Literal[“jsonl”, “parquet”], optional): The file format/extension for saving the extracted documents (e.g., “jsonl”). Defaults to “jsonl”. This value is not used to name the output file, but it is used to check whether an extracted output already exists and, if so, to read it.
raw_download_dir (Optional[str], optional): Directory used for temporary storage of raw bz2 dump files. If None, a subdirectory named “downloads” under output_path is used.
keep_raw_download (bool, optional): If True, retains the raw bz2 files after extraction. Default is False.
force_download (bool, optional): If False, skips re-downloading or re-extracting files that already exist.
url_limit (Optional[int], optional): The maximum number of dump file URLs to process. If None, all available URLs are processed.
record_limit (Optional[int], optional): The maximum number of records to extract from each file. If None, all available records are extracted.

Returns:
DocumentDataset: A dataset object containing the extracted Wikipedia articles along with associated metadata.
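An end-to-end sketch of a typical call. The import path, the dump date, and the idea of inspecting an underlying dataframe on the returned dataset are assumptions for illustration.

```python
# Sketch: the import path, dump_date value, and any .df attribute on
# DocumentDataset are assumptions, not confirmed by this page.
from nemo_curator.download import download_wikipedia

dataset = download_wikipedia(
    output_path="./wikipedia_extracted",
    language="en",
    dump_date="20240401",   # YYYYMMDD; None would fetch the latest available dump
    output_type="jsonl",
    url_limit=1,            # keep the example small: process only one dump file
    record_limit=1000,      # and extract only the first 1000 records from it
)

# The returned DocumentDataset holds the extracted articles plus metadata;
# if it exposes a dataframe (e.g. dataset.df), a few rows could be inspected here.
print(type(dataset))
```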