Download and Extract Text#

Background#

Publicly hosted text datasets are stored in various locations and formats. Downloading a massive public dataset is usually the first step in data curation, and it can be cumbersome due to the dataset’s massive size and hosting method. Also, massive pretraining text datasets are rarely in a format that can be immediately operated on for further curation and training. For example, the Common Crawl stores its data in a compressed web archive format (.warc.gz) for its raw crawl data, but formats like .jsonl are more common for data curation due to their ease of use. However, extraction can be by far the most computational expensive step of the data curation pipeline, so it can be beneifical to do some filtering prior to the extraction step to limit the amount of documents that undergo this heavy computation.

NeMo Curator provides example utilities for downloading and extracting Common Crawl, ArXiv, and Wikipedia data. In addition, it provides a flexible interface to extend the utility to other datasets. Our Common Crawl example demonstrates how to process a crawl by downloading the data from S3, doing preliminary language filtering with pyCLD2, and extracting the relevant text with multiple extraction methods including jusText, Resiliparse, and Trafilatura to output .jsonl files.

NeMo Curator currently does not provide out-of-the-box support for web-crawling or web-scraping. It provides utilities for downloading and extracting data from the preexisting online sources given above. Users can easily implement these functions themselves and automatically scale them with the framework described below if they would like.

Usage#

nemo_curator.download has a collection of functions for handling the download and extraction of online datasets. By “download”, we typically mean the transfer of data from a web-hosted data source to local file storage. By “extraction”, we typically mean the process of converting a data format from its raw form (e.g., .warc.gz) to a standardized format (e.g., .jsonl) and discarding irrelvant data.

download_common_crawl will download and extract the compressed web archive files of common crawl snapshots to a target directory. Common crawl has an S3 bucket and a direct HTTPS endpoint. If you want to use the S3 bucket, ensure you have properly set up your credentials with s5cmd. Otherwise, the HTTPS endpoints will be used with wget. Here is a small example of how to use it:

import os
from nemo_curator import get_client
from nemo_curator.download import download_common_crawl
from nemo_curator.datasets import DocumentDataset

def main():
    # Initialize a distributed Dask client
    client = get_client(cluster_type="cpu")

    # Parameters for downloading Common Crawl data.
    # - output_folder: directory for temporary download/extraction files
    # - start_snapshot and end_snapshot define the range to fetch
    # - output_type: specifies file format for the extracted data (e.g., "jsonl")
    output_folder = "/extracted/output/folder"
    start_snapshot = "2020-50"
    end_snapshot = "2021-04"
    output_type = "jsonl"
    os.makedirs(output_folder, exist_ok=True)

    # Download and extract the Common Crawl data.
    # The function returns a DocumentDataset that contains the extracted documents.
    # Note: The output folder and output type are passed here to store intermediate files
    # and check if the data has already been downloaded. They should match the final location
    # and format of the extracted data.
    common_crawl_dataset = download_common_crawl(
        output_folder, start_snapshot, end_snapshot, output_type=output_type
    )

    # Write the extracted dataset to JSON format.
    # The 'to_json' method will write one JSON document per line,
    # preserving the original shard information if write_to_filename is True.
    common_crawl_dataset.to_json(output_path=output_folder, write_to_filename=True)
    print("Extracted dataset saved to:", output_folder)

if __name__ == "__main__":
    main()

"/extracted/output/folder" is the path to on your local filesystem where the final extracted files will be placed.
"2020-50" is the first common crawl snapshot that will be included in the download. Note: Not every year and week has a snapshot. Ensure that your range includes at least one valid Common Crawl snapshot. A list of valid Common Crawl snapshots can be found here.
"2021-04" is the last common crawl snapshot that will be included in the download.
output_type="jsonl" is the file format that will be used for storing the data on disk. Currently "jsonl" and "parquet" are supported.

You can choose to modify the HTML text extraction algorithm used in download_common_crawl. See an example below.

import os
from nemo_curator import get_client
from nemo_curator.download import (
    ResiliparseExtractor,
    TrafilaturaExtractor,
    download_common_crawl,
)
from nemo_curator.datasets import DocumentDataset

def main():
    # Initialize a distributed Dask client
    client = get_client(cluster_type="cpu")

    # Parameters for downloading Common Crawl data.
    # - output_folder: directory for temporary download/extraction files
    # - start_snapshot and end_snapshot define the range to fetch
    # - output_type: specifies file format for the extracted data (e.g., "jsonl")
    output_folder = "/extracted/output/folder"
    start_snapshot = "2020-50"
    end_snapshot = "2021-04"
    output_type = "jsonl"
    os.makedirs(output_folder, exist_ok=True)

    # Change the extraction algorithm to Resiliparse
    extraction_algorithm = ResiliparseExtractor()
    # Alternatively, change the extraction algorithm to Trafilatura
    # extraction_algorithm = TrafilaturaExtractor()

    # Download and extract the Common Crawl data using the Resiliparse extraction algorithm.
    # The function returns a DocumentDataset that contains the extracted documents.
    common_crawl_dataset = download_common_crawl(
        output_folder,
        start_snapshot,
        end_snapshot,
        output_type=output_type,
        algorithm=extraction_algorithm,
    )

    # Write the extracted dataset to JSON format.
    # The 'to_json' method writes one JSON document per line,
    # preserving the original shard information if write_to_filename is True.
    common_crawl_dataset.to_json(output_path=output_folder, write_to_filename=True)
    print("Extracted dataset saved to:", output_folder)

if __name__ == "__main__":
    main()

Above, we changed the extraction algorithm from the default JusTextExtractor. Note: The JusTextExtractor, ResiliparseExtractor, and TrafilaturaExtractor classes each have their own unique parameters which are specific to their extraction algorithms. Please see the docstrings for each class for more details.

You can set your own dictionary of stop words by language to be used when extracting text:

from nemo_curator.download import download_common_crawl

# Change the default stop list used
stop_lists = {"ENGLISH": frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"])}

common_crawl = download_common_crawl(
    "/extracted/output/folder",
    "2020-50",
    "2021-04",
    output_type="jsonl",
    stop_lists=stop_lists,
)

This may be desirable to further customize your text extraction pipeline, or to enable text extraction support for languages not included by jusText and NeMo Curator.

The return value common_crawl will be in NeMo Curator’s standard DocumentDataset format. Check out the function’s docstring for more parameters you can use.

NeMo Curator’s Common Crawl extraction process looks like this under the hood:

Decode the HTML within the record from binary to text.
If the HTML can be properly decoded, then with pyCLD2, perform language detection on the input HTML.
Finally, the extract the relevant text with jusText, Resiliparse, or Trafilatura from the HTML and write it out as a single string within the “text” field of a JSON entry within a .jsonl file.

download_wikipedia will download and extract the latest wikipedia dump. Files are downloaded using wget. Wikipedia might download slower than the other datasets. This is because they limit the number of downloads that can occur per-ip address.
```
from nemo_curator.download import download_wikipedia

wikipedia = download_wikipedia("/extracted/output/folder", dump_date="20240201")
```
- "/extracted/output/folder" is the path to on your local filesystem where the final extracted files will be placed.
- dump_date="20240201" fixes the Wikipedia dump to a specific date. If no date is specified, the latest dump is used.
download_arxiv will download and extract latex versions of ArXiv papers. They are hosted on S3, so ensure you have properly set up your credentials with s5cmd.
```
from nemo_curator.download import download_arxiv

arxiv = download_arxiv("/extracted/output/folder")
```
- "/extracted/output/folder" is the path to on your local filesystem where the final extracted files will be placed.

All of these functions return a DocumentDataset of the underlying dataset and metadata that was obtained during extraction. If the dataset has been downloaded and extracted at the path passed to it, it will read from the files there instead of downloading and extracting them again. Due to how massive each of these datasets are (with Common Crawl snapshots being on the order of hundreds of terrabytes) all of these datasets are sharded accross different files. They all have a url_limit parameter that allows you to only download a small number of shards.

Download and Extract Text#

Background#

Usage#

Related Scripts#

Common Crawl Example#

Set Up Common Crawl#

Download and Extract Common Crawl#