Download and Extract#

Base Classes#

class nemo_curator.download.DocumentDownloader#

Abstract class for downloading remote data to disk

class nemo_curator.download.DocumentIterator#

Abstract iterator class for reading in raw records that have been downloaded to disk

class nemo_curator.download.DocumentExtractor#

Abstract class for extracting text from records read from disk
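
As a rough sketch, a new data source plugs into the pipeline by subclassing these three classes. The MySource* classes below are hypothetical; only download(url) is referenced explicitly in this documentation, so the iterate() and extract() method names and record formats shown here are assumptions meant only to illustrate the shape of an implementation.

import os
import urllib.request

from nemo_curator.download import (
    DocumentDownloader,
    DocumentExtractor,
    DocumentIterator,
)


class MySourceDownloader(DocumentDownloader):
    """Hypothetical downloader that fetches a URL into a local directory."""

    def __init__(self, download_dir):
        super().__init__()
        self._download_dir = download_dir

    def download(self, url):
        # Save the remote file to disk and return its local path
        local_path = os.path.join(self._download_dir, os.path.basename(url))
        if not os.path.exists(local_path):
            urllib.request.urlretrieve(url, local_path)
        return local_path


class MySourceIterator(DocumentIterator):
    """Hypothetical iterator that yields one raw record per line."""

    def iterate(self, file_path):  # method name assumed
        with open(file_path) as f:
            for line in f:
                yield {"raw_content": line.rstrip("\n")}


class MySourceExtractor(DocumentExtractor):
    """Hypothetical extractor that maps a raw record to text fields."""

    def extract(self, content):  # method name assumed
        return {"text": content["raw_content"]}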

nemo_curator.download.batch_download(
urls: List[str],
downloader: DocumentDownloader,
)#

Downloads all the URLs in parallel using the downloader
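
For example, reusing the hypothetical MySourceDownloader from the sketch above (the URLs and directory are illustrative only):

from nemo_curator.download import batch_download

downloader = MySourceDownloader(download_dir="/raw/mysource")

# Fetches every URL in parallel via downloader.download(url); the return
# value is assumed to be the list of resulting local paths.
local_paths = batch_download(
    urls=[
        "https://example.com/shard-000.gz",
        "https://example.com/shard-001.gz",
    ],
    downloader=downloader,
)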

nemo_curator.download.download_and_extract(
urls: List[str],
output_paths: List[str],
downloader: DocumentDownloader,
iterator: DocumentIterator,
extractor: DocumentExtractor,
output_format: dict,
output_type: str = 'jsonl',
keep_raw_download=False,
force_download=False,
input_meta: str | dict | None = None,
)#

Downloads and extracts a dataset into a format accepted by NeMo Curator

Parameters:
  • urls – A list of URLs to download the dataset from

  • output_paths – A list of paths to save the final extracted output to. The raw output of the downloader will be saved using the path given by downloader.download(url).

  • downloader – A DocumentDownloader that handles retrieving each file from its url and saving it to storage

  • iterator – A DocumentIterator that handles iterating through the downloaded file’s format

  • extractor – A DocumentExtractor that handles extracting the data from its raw format into text

  • output_format – A dictionary mapping columns to datatypes for the fields of each datapoint after extraction.

  • output_type – The file type to save the dataset as.

  • keep_raw_download – Whether to keep the pre-extracted download file.

  • force_download – If False, files in output_paths that already exist are not re-processed; they are read directly instead.

  • input_meta – A dictionary or a string formatted as a dictionary, which outlines the field names and their respective data types within the JSONL input file.

Returns:

A DocumentDataset of the downloaded data
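
A minimal invocation, reusing the hypothetical MySource* classes from the Base Classes sketch; the URLs, paths, and output_format columns are illustrative only:

from nemo_curator.download import download_and_extract

dataset = download_and_extract(
    urls=["https://example.com/shard-000.gz"],
    output_paths=["/extracted/mysource/shard-000.jsonl"],
    downloader=MySourceDownloader(download_dir="/raw/mysource"),
    iterator=MySourceIterator(),
    extractor=MySourceExtractor(),
    output_format={"text": str},  # columns assumed to match the extractor output
    output_type="jsonl",
)
# `dataset` is a DocumentDataset backed by the extracted records.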

nemo_curator.download.import_downloader(downloader_path)#
nemo_curator.download.import_extractor(extractor_path)#
nemo_curator.download.import_iterator(iterator_path)#

Common Crawl#

nemo_curator.download.download_common_crawl(
output_path: str,
start_snapshot: str,
end_snapshot: str,
output_type: str = 'jsonl',
algorithm=JusTextExtractor(),
news=False,
aws=False,
raw_download_dir=None,
keep_raw_download=False,
force_download=False,
url_limit=None,
)#

Downloads Common Crawl WARC snapshots and extracts them using jusText or Resiliparse

Parameters:
  • output_path – The path to the root directory of the files

  • start_snapshot – The first Common Crawl snapshot to include. Snapshots must be specified by YYYY-WeekNumber (e.g., ‘2020-50’ or ‘2021-04’). For the CC-NEWS dataset (specified with the news=True flag), this changes to Year-Month (YYYY-MM).

  • end_snapshot – The last Common Crawl snapshot to include. Must be chronologically after the starting snapshot.

  • output_type – The file type to save the data as.

  • algorithm – A JusTextExtractor or ResiliparseExtractor object.

  • news – If True, gets WARC URLs for the CC-NEWS dataset instead of the CC-MAIN datasets. Also assumes that the format for the start and end snapshots is ‘YYYY-MM’ (Year-Month).

  • aws – Whether to download from Common Crawl’s S3 bucket. If True, uses s5cmd to download. If False, uses wget.

  • raw_download_dir – Path to store the raw download files for intermediate processing. If None, they are stored in a folder named “downloads” under output_path.

  • keep_raw_download – If True, keeps the compressed WARC files that have not been extracted.

  • force_download – If False, extracted output files that already exist are not re-processed; they are read directly instead.

  • url_limit – The maximum number of raw files to download from the snapshot. If None, all files from the range of snapshots are downloaded.
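
For example (the output directory is hypothetical, and the return value is assumed to be a DocumentDataset, as with download_and_extract):

from nemo_curator.download import download_common_crawl

common_crawl = download_common_crawl(
    output_path="/extracted/common_crawl",
    start_snapshot="2020-50",
    end_snapshot="2021-04",
    output_type="jsonl",
    url_limit=5,  # only grab a handful of WARC files for a quick test
)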

class nemo_curator.download.CommonCrawlWARCDownloader(download_dir, aws=False, verbose=False)#

Downloads WARC files from the Common Crawl

class nemo_curator.download.CommonCrawlWARCExtractor(
algorithm=JusTextExtractor(),
)#
class nemo_curator.download.CommonCrawlWARCIterator(log_frequency=1000)#
class nemo_curator.download.CommonCrawlWARCDownloaderExtractOnly(aws=False, verbose=False)#

A ‘dummy’ downloader that simply puts pre-downloaded files on the queue

class nemo_curator.download.JusTextExtractor(
length_low=70,
length_high=200,
stopwords_low=0.3,
stopwords_high=0.32,
max_link_density=0.2,
max_heading_distance=200,
no_headings=False,
logger=None,
)#
class nemo_curator.download.ResiliparseExtractor(
required_stopword_density=0.32,
main_content=True,
alt_texts=False,
)#
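
For instance, to extract with Resiliparse instead of the default jusText (the output directory is hypothetical):

from nemo_curator.download import ResiliparseExtractor, download_common_crawl

extraction_algorithm = ResiliparseExtractor(
    required_stopword_density=0.32,
    main_content=True,
)

common_crawl = download_common_crawl(
    output_path="/extracted/common_crawl_resiliparse",
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    algorithm=extraction_algorithm,
    url_limit=5,
)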

Wikipedia#

nemo_curator.download.download_wikipedia(
output_path: str,
language: str = 'en',
dump_date=None,
output_type: str = 'jsonl',
raw_download_dir=None,
keep_raw_download=False,
force_download=False,
url_limit=None,
)#

Downloads Wikipedia dumps (the latest by default) and extracts them using mwparserfromhell

Parameters:
  • output_path – The path to the root directory of the files

  • language – The language of the Wikipedia articles to download

  • dump_date – A string formatted as “YYYYMMDD” for the Wikipedia dump to use. If None, the latest dump is used.

  • output_type – The file type to save the data as.

  • raw_download_dir – Path to store the raw download files for intermediate processing. If None, they are stored in a folder named “downloads” under output_path.

  • keep_raw_download – If True, keeps the bz2 files that have not been extracted.

  • force_download – If False, extracted output files that already exist are not re-processed; they are read directly instead.

  • url_limit – The maximum number of raw files to download from the dump. If None, all files are downloaded.
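
For example (the output directory and dump date are hypothetical):

from nemo_curator.download import download_wikipedia

wikipedia = download_wikipedia(
    output_path="/extracted/wikipedia",
    language="en",
    dump_date="20240201",  # hypothetical "YYYYMMDD" dump; omit to use the latest
    url_limit=5,
)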

class nemo_curator.download.WikipediaDownloader(download_dir, verbose=False)#
class nemo_curator.download.WikipediaIterator(language='en', log_frequency=1000)#
class nemo_curator.download.WikipediaExtractor(language='en', parser=mwparserfromhell)#

Arxiv#

nemo_curator.download.download_arxiv(
output_path: str,
output_type: str = 'jsonl',
raw_download_dir=None,
keep_raw_download=False,
force_download=False,
url_limit=None,
)#

Downloads Arxiv tar files and extracts them

Parameters:
  • output_path – The path to the root directory of the files

  • output_type – The file type to save the data as.

  • raw_download_dir – Path to store the raw download files for intermediate processing. If None, they are stored in a folder named “downloads” under output_path.

  • keep_raw_download – If True, keeps the raw tar files that have not been extracted.

  • force_download – If False, extracted output files that already exist are not re-processed; they are read directly instead.

  • url_limit – The maximum number of raw files to download. If None, all files are downloaded.
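
For example (the output directory is hypothetical):

from nemo_curator.download import download_arxiv

arxiv = download_arxiv(
    output_path="/extracted/arxiv",
    output_type="jsonl",
    url_limit=5,  # only grab a few tar files for a quick test
)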

class nemo_curator.download.ArxivDownloader(download_dir, verbose=False)#
class nemo_curator.download.ArxivIterator(log_frequency=1000)#
class nemo_curator.download.ArxivExtractor#