Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Download and Extract

Base Classes

class nemo_curator.download.DocumentDownloader

Abstract class for downloading remote data to disk

class nemo_curator.download.DocumentIterator

Abstract iterator class for reading in raw records that have been downloaded to disk

class nemo_curator.download.DocumentExtractor

Abstract class for extracting text from records read from disk
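
These three classes are designed to be subclassed together. Below is a minimal sketch of a hypothetical pipeline for a line-oriented plain-text format; it assumes the abstract methods are download(url), iterate(file_path), and extract(content), and the class names, parsing logic, and single "text" field are illustrative, not part of the library:

    import os
    import urllib.request

    from nemo_curator.download import (
        DocumentDownloader,
        DocumentExtractor,
        DocumentIterator,
    )


    class MyDownloader(DocumentDownloader):
        # Hypothetical downloader: fetches a URL with urllib and returns
        # the local path of the saved file.
        def __init__(self, download_dir):
            super().__init__()
            self._download_dir = download_dir

        def download(self, url):
            path = os.path.join(self._download_dir, os.path.basename(url))
            if not os.path.exists(path):
                urllib.request.urlretrieve(url, path)
            return path


    class MyIterator(DocumentIterator):
        # Hypothetical iterator: treats each line of the downloaded file
        # as one raw record.
        def iterate(self, file_path):
            with open(file_path) as f:
                for line in f:
                    yield line


    class MyExtractor(DocumentExtractor):
        # Hypothetical extractor: strips whitespace and exposes the record
        # under a single "text" field.
        def extract(self, content):
            return {"text": content.strip()}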

nemo_curator.download.batch_download(urls: List[str], downloader: nemo_curator.download.doc_builder.DocumentDownloader)

Downloads all of the URLs in parallel using the given downloader
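
For example, reusing the hypothetical MyDownloader sketched under Base Classes (the URLs are placeholders, and treating the return value as the downloaded files' paths is an assumption):

    from nemo_curator.download import batch_download

    urls = [
        "https://example.com/shard-00.txt",  # placeholder URLs
        "https://example.com/shard-01.txt",
    ]
    raw_paths = batch_download(urls, MyDownloader("/tmp/raw"))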

nemo_curator.download.download_and_extract(urls: List[str], output_paths: List[str], downloader: nemo_curator.download.doc_builder.DocumentDownloader, iterator: nemo_curator.download.doc_builder.DocumentIterator, extractor: nemo_curator.download.doc_builder.DocumentExtractor, output_format: dict, output_type: str = 'jsonl', keep_raw_download=False, force_download=False, input_meta: Optional[Union[str, dict]] = None)

Downloads and extracts a dataset into a format accepted by NeMo Curator

Parameters
  • urls – A list of urls to download the dataset from

  • output_paths – A list of paths to save the final extracted output to. The raw output of the downloader will be saved using the path given by downloader.download(url).

  • downloader – A DocumentDownloader that handles retrieving each file from its url and saving it to storage

  • iterator – A DocumentIterator that handles iterating through the downloaded file’s format

  • extractor – A DocumentExtractor that handles extracting the data from its raw format into text

  • output_format – A dictionary mapping columns to datatypes for the fields of each datapoint after extraction.

  • output_type – The file type to save the dataset as.

  • keep_raw_download – Whether to keep the pre-extracted download file.

  • force_download – If False, skips processing any files in output_paths that already exist and reads directly from them instead.

  • input_meta – A dictionary or a string formatted as a dictionary, which outlines the field names and their respective data types within the JSONL input file.

Returns

A DocumentDataset of the downloaded data
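
An end-to-end sketch using the hypothetical components from the Base Classes example; the URLs, output paths, and the single "text" column are illustrative assumptions:

    from nemo_curator.download import download_and_extract

    urls = [
        "https://example.com/shard-00.txt",  # placeholder URLs
        "https://example.com/shard-01.txt",
    ]
    output_paths = [
        "/output/shard-00.jsonl",  # placeholder output paths
        "/output/shard-01.jsonl",
    ]
    dataset = download_and_extract(
        urls=urls,
        output_paths=output_paths,
        downloader=MyDownloader("/tmp/raw"),
        iterator=MyIterator(),
        extractor=MyExtractor(),
        output_format={"text": str},
    )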

nemo_curator.download.import_downloader(downloader_path)
nemo_curator.download.import_extractor(extractor_path)
nemo_curator.download.import_iterator(iterator_path)
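
These helpers load a component class from a path at runtime, which is useful for config-driven pipelines. A hedged sketch, assuming each helper takes a dotted "module.ClassName" string and returns the class it names (the path below is hypothetical):

    from nemo_curator.download import import_downloader

    # "my_package.MyDownloader" is a hypothetical dotted path to a
    # DocumentDownloader subclass; the string format is an assumption.
    downloader_cls = import_downloader("my_package.MyDownloader")
    downloader = downloader_cls("/tmp/raw")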

Common Crawl

nemo_curator.download.download_common_crawl(output_path: str, start_snapshot: str, end_snapshot: str, output_type: str = 'jsonl', algorithm=<nemo_curator.download.commoncrawl.JusTextExtractor object>, news=False, aws=False, raw_download_dir=None, keep_raw_download=False, force_download=False, url_limit=None)

Downloads Common Crawl WARC snapshots and extracts them using jusText or Resiliparse

Parameters
  • output_path – The path to the root directory of the files

  • start_snapshot – The first Common Crawl snapshot to include. Snapshots must be specified by YYYY-WeekNumber (e.g., ‘2020-50’ or ‘2021-04’). For the CC-NEWS dataset (specified with the news=True flag), this changes to Year-Month (YYYY-MM).

  • end_snapshot – The last Common Crawl snapshot to include. Must be chronologically after the start snapshot.

  • output_type – The file type to save the data as.

  • algorithm – A JusTextExtractor or ResiliparseExtractor object.

  • news – If True, gets WARC URLs for the CC-NEWS dataset instead of the CC-MAIN datasets. Also assumes that the format for the start and end snapshots is ‘YYYY-MM’ (Year-Month).

  • aws – Whether to download from Common Crawl’s S3 bucket. If True, uses s5cmd to download. If False, uses wget.

  • raw_download_dir – Path to store the raw download files for intermediate processing. If None, they are stored in a folder named “downloads” under output_path.

  • keep_raw_download – If True, keeps the compressed WARC files that have not been extracted.

  • force_download – If False, skips processing any files that already exist in the output directory and reads directly from them instead.

  • url_limit – The maximum number of raw files to download from the snapshot. If None, all files from the range of snapshots are downloaded.
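
A typical call, with a placeholder output directory:

    from nemo_curator.download import download_common_crawl

    common_crawl = download_common_crawl(
        "/extracted/output/folder",  # placeholder output directory
        "2020-50",
        "2021-04",
        output_type="jsonl",
    )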

class nemo_curator.download.CommonCrawlWARCDownloader(download_dir, aws=False, verbose=False)

Downloads WARC files from the Common Crawl

class nemo_curator.download.CommonCrawlWARCExtractor(algorithm=<nemo_curator.download.commoncrawl.JusTextExtractor object>)
class nemo_curator.download.CommonCrawlWARCIterator(log_frequency=1000)
class nemo_curator.download.CommonCrawlWARCDownloaderExtractOnly(aws=False, verbose=False)

A ‘dummy’ downloader that simply puts pre-downloaded files on the queue

class nemo_curator.download.JusTextExtractor(length_low=70, length_high=200, stopwords_low=0.3, stopwords_high=0.32, max_link_density=0.2, max_heading_distance=200, no_headings=False, logger=None)
class nemo_curator.download.ResiliparseExtractor(required_stopword_density=0.32, main_content=True, alt_texts=False)
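
To switch extraction algorithms, pass an extractor instance to download_common_crawl via the algorithm parameter (the output directory is a placeholder):

    from nemo_curator.download import (
        ResiliparseExtractor,
        download_common_crawl,
    )

    common_crawl = download_common_crawl(
        "/extracted/output/folder",  # placeholder output directory
        "2020-50",
        "2021-04",
        algorithm=ResiliparseExtractor(),
    )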

Wikipedia

nemo_curator.download.download_wikipedia(output_path: str, language: str = 'en', dump_date=None, output_type: str = 'jsonl', raw_download_dir=None, keep_raw_download=False, force_download=False, url_limit=None)

Downloads Wikipedia dumps (the latest by default) and extracts them using mwparserfromhell

Parameters
  • output_path – The path to the root directory of the files

  • language – The language of the Wikipedia articles to download

  • dump_date – A string formatted as “YYYYMMDD” specifying the Wikipedia dump to use. If None, the latest dump is used.

  • output_type – The file type to save the data as.

  • raw_download_dir – Path to store the raw download files for intermediate processing. If None, they are stored in a folder named “downloads” under output_path.

  • keep_raw_download – If True, keeps the bz2 files that have not been extracted.

  • force_download – If False, skips processing any files that already exist in the output directory and reads directly from them instead.

  • url_limit – The maximum number of raw files to download. If None, all files from the dump are downloaded.
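
A typical call, pinning a specific dump date for reproducibility (the output directory and date are placeholders):

    from nemo_curator.download import download_wikipedia

    wikipedia = download_wikipedia(
        "/extracted/output/folder",  # placeholder output directory
        dump_date="20240201",
    )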

class nemo_curator.download.WikipediaDownloader(download_dir, verbose=False)
class nemo_curator.download.WikipediaIterator(language='en', log_frequency=1000)
class nemo_curator.download.WikipediaExtractor(language='en', parser=mwparserfromhell)

arXiv

nemo_curator.download.download_arxiv(output_path: str, output_type: str = 'jsonl', raw_download_dir=None, keep_raw_download=False, force_download=False, url_limit=None)

Downloads arXiv tar files and extracts them

Parameters
  • output_path – The path to the root directory of the files

  • output_type – The file type to save the data as.

  • raw_download_dir – Path to store the raw download files for intermediate processing. If None, they are stored in a folder named “downloads” under output_path.

  • keep_raw_download – If True, keeps the compressed tar files that have not been extracted.

  • force_download – If False, skips processing any files that already exist in the output directory and reads directly from them instead.

  • url_limit – The maximum number of raw tar files to download. If None, all tar files are downloaded.
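
A typical call, with a placeholder output directory:

    from nemo_curator.download import download_arxiv

    arxiv = download_arxiv("/extracted/output/folder")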

class nemo_curator.download.ArxivDownloader(download_dir, verbose=False)
class nemo_curator.download.ArxivIterator(log_frequency=1000)
class nemo_curator.download.ArxivExtractor()