download.wikipedia

Module Contents

Classes

| Class | Description |
| --- | --- |
| WikipediaDownloader | Abstract class for downloading remote data to disk |
| WikipediaExtractor | Abstract class for extracting text from records read from disk |
| WikipediaIterator | Abstract iterator class for reading in raw records that have been downloaded to disk |

Functions

| Function | Description |
| --- | --- |
| download_wikipedia | Downloads and extracts articles from a Wikipedia dump. |

Data

- CAT_ALIASES
- MEDIA_ALIASES

API
- download.wikipedia.CAT_ALIASES
  None
- download.wikipedia.MEDIA_ALIASES
  None
- class download.wikipedia.WikipediaDownloader(download_dir: str, verbose: bool = False)
  Bases: nemo_curator.download.doc_builder.DocumentDownloader
  Abstract class for downloading remote data to disk
  Initialization
  - download(url: str) -> str
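A minimal usage sketch for the downloader, assuming WikipediaDownloader is importable from the module path shown above. The dump URL is a placeholder in the format served by dumps.wikimedia.org, not a guaranteed live file.

```python
from nemo_curator.download.wikipedia import WikipediaDownloader

# Download one Wikipedia dump file into a local directory.
downloader = WikipediaDownloader(download_dir="./wiki_downloads", verbose=True)

# Placeholder URL for illustration; real URLs come from the Wikimedia dump
# index for the chosen language and dump date.
dump_url = (
    "https://dumps.wikimedia.org/enwiki/latest/"
    "enwiki-latest-pages-articles-multistream1.xml-p1p41242.bz2"
)
local_path = downloader.download(dump_url)
print(local_path)  # path of the downloaded .bz2 file on disk
```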
- class download.wikipedia.WikipediaExtractor(language: str = 'en', parser=mwparserfromhell)
  Bases: nemo_curator.download.doc_builder.DocumentExtractor
  Abstract class for extracting text from records read from disk
  Initialization
  - extract(content) -> dict[str, str]
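A minimal sketch of calling the extractor directly. The record format that extract() expects is whatever WikipediaIterator produces; the hand-built record below (with a raw_content field of wikitext) is purely illustrative and may not match the real record keys.

```python
from nemo_curator.download.wikipedia import WikipediaExtractor

extractor = WikipediaExtractor(language="en")

# Hypothetical record for illustration only; in practice records come from
# WikipediaIterator.iterate().
sample_record = {
    "title": "Python (programming language)",
    "raw_content": "'''Python''' is a high-level programming language.",
}

extracted = extractor.extract(sample_record)  # dict[str, str] on success
print(extracted)
```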
- class download.wikipedia.WikipediaIterator(language: str = 'en', log_frequency: int = 1000)
  Bases: nemo_curator.download.doc_builder.DocumentIterator
  Abstract iterator class for reading in raw records that have been downloaded to disk
  Initialization
  - iterate(file_path: str)
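A sketch of pairing the iterator with the extractor on one downloaded dump file. The file path is a placeholder for a .bz2 file already written by WikipediaDownloader, and feeding each yielded record straight into extract() follows the downloader/iterator/extractor pattern of this module rather than a verified internal call sequence.

```python
from nemo_curator.download.wikipedia import WikipediaExtractor, WikipediaIterator

iterator = WikipediaIterator(language="en", log_frequency=1000)
extractor = WikipediaExtractor(language="en")

# Placeholder path to a dump file previously downloaded to disk.
dump_file = "./wiki_downloads/enwiki-latest-pages-articles-multistream1.xml-p1p41242.bz2"

for record in iterator.iterate(dump_file):
    extracted = extractor.extract(record)
    if extracted:
        print(extracted)  # extracted text plus any per-article fields
        break  # inspect only the first article
```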
- download.wikipedia.download_wikipedia(
    output_path: str,
    language: str = 'en',
    dump_date: str | None = None,
    output_type: Literal['jsonl', 'parquet'] = 'jsonl',
    raw_download_dir: str | None = None,
    keep_raw_download: bool = False,
    force_download: bool = False,
    url_limit: int | None = None,
    record_limit: int | None = None,
  ) -> DocumentDataset
Downloads and extracts articles from a Wikipedia dump.
This function retrieves a list of Wikipedia dump URLs for the specified language and dump date, downloads the compressed bz2 dump file (if it is not already present), and extracts its articles using mwparserfromhell. The resulting articles are saved in the specified output format (e.g., “jsonl”) along with relevant metadata.
Args:
- output_path (str): The root directory where the final extracted files and intermediate outputs (if any) are stored.
- language (str, optional): The language code for the Wikipedia dump to download. Default is “en”.
- dump_date (Optional[str], optional): The dump date in “YYYYMMDD” format. If None, the latest available dump is downloaded.
- output_type (Literal[“jsonl”, “parquet”], optional): The file format/extension for saving the extracted documents (e.g., “jsonl”). Defaults to “jsonl”. This is not used for the output file, but is used to check if an extracted output already exists and read it if so.
- raw_download_dir (Optional[str], optional): Directory used for temporary storage of raw bz2 dump files. If None, a subdirectory named “downloads” under output_path is used.
- keep_raw_download (bool, optional): If True, retains the raw bz2 files after extraction. Default is False.
- force_download (bool, optional): If False, skips re-downloading or re-extracting files that already exist.
- url_limit (Optional[int], optional): The maximum number of dump file URLs to process. If None, all available URLs are processed.
- record_limit (Optional[int], optional): Limit the number of records to extract from each file. If None, all available records are extracted.
Returns:
- DocumentDataset: A dataset object containing the extracted Wikipedia articles along with associated metadata.
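A minimal end-to-end sketch, assuming download_wikipedia is imported from the module path shown above and that the returned DocumentDataset exposes its underlying dataframe as .df, as elsewhere in NeMo Curator. The limits keep the run small for a quick test.

```python
from nemo_curator.download.wikipedia import download_wikipedia

dataset = download_wikipedia(
    output_path="./wikipedia_output",
    language="en",
    dump_date=None,       # None -> download the latest available dump
    output_type="jsonl",
    keep_raw_download=False,
    url_limit=1,          # process only the first dump file URL
    record_limit=100,     # extract at most 100 articles from that file
)

# Inspect a few extracted articles and their metadata.
print(dataset.df.head())
```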