Important

NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Download and Extract Text

Background

Publicly hosted text datasets are stored in various locations and formats. Downloading a massive public dataset is usually the first step in data curation, and it can be cumbersome due to the dataset’s size and hosting method. In addition, massive pretraining text datasets are rarely in a format that can be immediately operated on for further curation and training. For example, Common Crawl stores its raw crawl data in a compressed web archive format (.warc.gz), while formats like .jsonl are more common for data curation due to their ease of use. However, extraction can be by far the most computationally expensive step of the data curation pipeline, so it can be beneficial to do some filtering prior to the extraction step to limit the number of documents that undergo this heavy computation.

NeMo Curator provides example utilities for downloading and extracting Common Crawl, ArXiv, and Wikipedia data. In addition, it provides a flexible interface to extend the utility to other datasets. Our Common Crawl example demonstrates how to process a crawl by downloading the data from S3, doing preliminary language filtering with pyCLD2, and extracting the relevant text with jusText or Resiliparse to output .jsonl files.

NeMo Curator does not currently provide out-of-the-box support for web crawling or web scraping. Instead, it provides utilities for downloading and extracting data from the preexisting online sources listed above. If you need another source, you can implement the download and extraction functions yourself and automatically scale them with the framework described below, as sketched in the following example.
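As a rough sketch of what such an extension could look like, the example below subclasses the DocumentDownloader, DocumentIterator, and DocumentExtractor base classes from nemo_curator.download (described further in the Usage section below). Everything here is illustrative: the data source, the file layout, and the exact method contracts are assumptions, so consult the base classes in your installed version for the precise signatures.

import os
import subprocess

from nemo_curator.download import (
    DocumentDownloader,
    DocumentExtractor,
    DocumentIterator,
)


class MyDownloader(DocumentDownloader):
    # Downloads one raw file per URL into a local directory (hypothetical source).
    def __init__(self, download_dir):
        super().__init__()
        self._download_dir = download_dir

    def download(self, url):
        output_file = os.path.join(self._download_dir, os.path.basename(url))
        if not os.path.exists(output_file):
            subprocess.run(["wget", "-q", url, "-O", output_file], check=True)
        return output_file


class MyIterator(DocumentIterator):
    # Yields one raw record at a time; here, each line of the downloaded file is a record.
    def iterate(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield {"id": f"{os.path.basename(file_path)}-{i}"}, line


class MyExtractor(DocumentExtractor):
    # Converts a raw record into the fields written to the output .jsonl files.
    def extract(self, content):
        text = content.strip()
        return {"text": text} if text else None

With these pieces in place, they can be scaled across many URLs with the same machinery that powers the built-in downloaders; see the nemo_curator.download module for the utilities that wire a downloader, iterator, and extractor together.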

Usage

nemo_curator.download has a collection of functions for handling the download and extraction of online datasets. By “download”, we typically mean the transfer of data from a web-hosted data source to local file storage. By “extraction”, we typically mean the process of converting a data format from its raw form (e.g., .warc.gz) to a standardized format (e.g., .jsonl) and discarding irrelevant data.

  • download_common_crawl will download and extract the compressed web archive files of Common Crawl snapshots to a target directory. Common Crawl has an S3 bucket and a direct HTTPS endpoint. If you want to use the S3 bucket, ensure you have properly set up your credentials with s5cmd. Otherwise, the HTTPS endpoints will be used with wget. Here is a small example of how to use it:

    from nemo_curator.download import download_common_crawl
    
    common_crawl = download_common_crawl("/extracted/output/folder", "2020-50", "2021-04", output_type="jsonl")
    
    • "/extracted/output/folder" is the path to on your local filesystem where the final extracted files will be placed.

    • "2020-50" is the first common crawl snapshot that will be included in the download. Note: Not every year and week has a snapshot. Ensure that your range includes at least one valid Common Crawl snapshot. A list of valid Common Crawl snapshots can be found here.

    • "2021-04" is the last common crawl snapshot that will be included in the download.

    • output_type="jsonl" is the file format that will be used for storing the data on disk. Currently "jsonl" and "parquet" are supported.

You can choose to modify the HTML text extraction algorithm used in download_common_crawl. See an example below.

from nemo_curator.download import (
  ResiliparseExtractor,
  download_common_crawl,
)

# Change the extraction algorithm
extraction_algorithm = ResiliparseExtractor()
common_crawl = download_common_crawl(
  "/extracted/output/folder",
  "2020-50",
  "2021-04",
  output_type="jsonl",
  algorithm=extraction_algorithm,
)

Above, we changed the extraction algorithm from the default JusTextExtractor.

The return value common_crawl will be in NeMo Curator’s standard DocumentDataset format. Check out the function’s docstring for more parameters you can use.
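Because the return value is a DocumentDataset, it can be inspected or handed directly to downstream curation modules. Below is a minimal sketch, assuming the usual DocumentDataset helpers (the .df attribute exposing the underlying Dask DataFrame and the to_json method); the output path is illustrative only.

from nemo_curator.download import download_common_crawl

common_crawl = download_common_crawl(
    "/extracted/output/folder",
    "2020-50",
    "2021-04",
    output_type="jsonl",
)

# Peek at a few extracted records; the dataset wraps a lazily evaluated Dask DataFrame.
print(common_crawl.df.head())

# Persist the extracted documents as .jsonl files for later curation steps.
common_crawl.to_json("/curated/output/folder")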

NeMo Curator’s Common Crawl extraction process looks like this under the hood:

  1. Decode the HTML within the record from binary to text.

  2. If the HTML can be properly decoded, perform language detection on it with pyCLD2.

  3. Finally, extract the relevant text from the HTML with jusText or Resiliparse and write it out as a single string within the ‘text’ field of a JSON entry within a .jsonl file.
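As a rough, self-contained illustration of those three steps (not NeMo Curator’s actual implementation), here is how decoding, language detection with pyCLD2, and extraction with jusText look when the libraries are used directly:

import justext
import pycld2 as cld2


def extract_record(raw_bytes: bytes):
    # 1. Decode the HTML within the record from binary to text.
    try:
        html = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return None

    # 2. Perform language detection on the decoded HTML with pyCLD2.
    try:
        _, _, details = cld2.detect(html)
    except cld2.error:
        return None
    language = details[0][0]  # e.g., "ENGLISH"

    # 3. Extract the relevant text with jusText and join it into a single string.
    #    In practice, the stop list would be chosen based on the detected language.
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
    return {"text": text, "language": language} if text else None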

  • download_wikipedia will download and extract the latest Wikipedia dump. Files are downloaded using wget. Wikipedia may download more slowly than the other datasets because Wikipedia limits the number of downloads allowed per IP address.

    from nemo_curator.download import download_wikipedia
    
    wikipedia = download_wikipedia("/extracted/output/folder", dump_date="20240201")
    
    • "/extracted/output/folder" is the path to on your local filesystem where the final extracted files will be placed.

    • dump_date="20240201" fixes the Wikipedia dump to a specific date. If no date is specified, the latest dump is used.

  • download_arxiv will download and extract the LaTeX versions of ArXiv papers. These are hosted on S3, so ensure you have properly set up your credentials with s5cmd.

    from nemo_curator.download import download_arxiv
    
    arxiv = download_arxiv("/extracted/output/folder")
    
    • "/extracted/output/folder" is the path to on your local filesystem where the final extracted files will be placed.

All of these functions return a DocumentDataset containing the underlying dataset and the metadata obtained during extraction. If the dataset has already been downloaded and extracted at the path passed to the function, it will read from the files there instead of downloading and extracting them again. Because each of these datasets is massive (a Common Crawl snapshot is on the order of hundreds of terabytes), they are all sharded across many files. Each function has a url_limit parameter that allows you to download only a small number of shards.
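For example, to smoke-test a pipeline before committing to a full snapshot, you can restrict a download to a handful of shards (the value of 5 here is arbitrary):

from nemo_curator.download import download_common_crawl

# Download and extract only the first 5 shards of the snapshot range.
small_sample = download_common_crawl(
    "/extracted/output/folder",
    "2020-50",
    "2021-04",
    output_type="jsonl",
    url_limit=5,
)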