Text Data Loading#

Load text data from a variety of data sources using NeMo Curator.

NeMo Curator provides tools for downloading and processing large-scale public text datasets. Common data formats like Common Crawl’s .warc.gz are automatically converted to more processing-friendly formats like .jsonl.

How it Works#

NeMo Curator’s data loading framework consists of three main components:

  1. Downloaders: Responsible for retrieving data from source locations (DocumentDownloader)

  2. Iterators: Parse through downloaded data to identify individual documents (DocumentIterator)

  3. Extractors: Extract and clean text from raw document formats (DocumentExtractor)

Each supported data source has specific implementations of these components optimized for that data type. The result is a standardized DocumentDataset that can be used for further curation steps.
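For data sources that are not supported out of the box, the same three interfaces can be subclassed. The sketch below is illustrative only: the source, field names, and return conventions are assumptions, and the exact abstract method signatures should be checked against the DocumentDownloader, DocumentIterator, and DocumentExtractor base classes.

import os

from nemo_curator.download import (
    DocumentDownloader,
    DocumentExtractor,
    DocumentIterator,
)


class PlainTextDownloader(DocumentDownloader):
    """Hypothetical downloader that fetches a remote file into a local cache."""

    def __init__(self, download_dir):
        super().__init__()
        self._download_dir = download_dir

    def download(self, url):
        # Fetch the file at `url` and return the local path it was saved to
        local_path = os.path.join(self._download_dir, os.path.basename(url))
        # ... fetch url into local_path (e.g., with requests or wget) ...
        return local_path


class PlainTextIterator(DocumentIterator):
    """Hypothetical iterator that treats each line of a file as one document."""

    def iterate(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            for line_number, line in enumerate(f):
                # Yield per-document metadata alongside the raw content
                yield {"id": f"{os.path.basename(file_path)}-{line_number}"}, line


class PlainTextExtractor(DocumentExtractor):
    """Hypothetical extractor that strips whitespace from the raw content."""

    def extract(self, content):
        text = content.strip()
        # Returning None signals that the record should be skipped
        return {"text": text} if text else None

The built-in helpers shown below wrap source-specific implementations of this same pattern.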

from nemo_curator import get_client
from nemo_curator.download import download_common_crawl, download_wikipedia, download_arxiv

# Initialize a Dask client
client = get_client(cluster_type="cpu")

# Download Common Crawl snapshots and extract text into a DocumentDataset
dataset = download_common_crawl(
    output_path="/output/folder", 
    start_snapshot="2020-50", 
    end_snapshot="2021-04"
)

# Write to disk in the desired format
dataset.to_json(output_path="/output/folder", write_to_filename=True)
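Once written, the extracted files can be reloaded as a DocumentDataset for downstream curation steps. A minimal sketch, assuming the JSONL output produced above and that read_json accepts the output directory path:

from nemo_curator.datasets import DocumentDataset

# Reload the extracted JSONL files as a DocumentDataset for further curation
dataset = DocumentDataset.read_json("/output/folder", add_filename=True)
print(dataset.df.head())
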
# Generic command-line download and extract utility
# Requires a YAML builder configuration file specifying the downloader, iterator, and extractor implementations
# Example config files: config/cc_warc_builder.yaml, config/arxiv_builder.yaml, config/wikipedia_builder.yaml
# (an illustrative config sketch follows the commands below)
download_and_extract \
  --input-url-file=<Path to URL list> \
  --builder-config-file=<Path to YAML config file> \
  --output-json-dir=<Output directory>

# Alternative: Extract from pre-downloaded files (extraction-only mode)
download_and_extract \
  --input-data-dir=<Path to downloaded files> \
  --builder-config-file=<Path to YAML config file> \
  --output-json-dir=<Output directory>
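The builder configuration wires concrete downloader, iterator, and extractor classes together. The snippet below is a rough sketch modeled on the Common Crawl example; the exact keys and module paths are assumptions and should be checked against the shipped config files.

# Illustrative builder config; keys and module paths are assumptions
# (compare with config/cc_warc_builder.yaml)
download_module: nemo_curator.download.commoncrawl.CommonCrawlWARCDownloader
download_params: {}
iterator_module: nemo_curator.download.commoncrawl.CommonCrawlWARCIterator
iterator_params: {}
extract_module: nemo_curator.download.commoncrawl.CommonCrawlWARCExtractor
extract_params: {}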

# Common Crawl URL retrieval utility
# Generates a list of WARC file URLs for specified snapshot range
get_common_crawl_urls \
  --starting-snapshot="2020-50" \
  --ending-snapshot="2020-50" \
  --output-warc-url-file=./warc_urls.txt
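The generated URL list can then be fed to the generic utility above via --input-url-file; the output directory below is a placeholder.

# Feed the generated WARC URL list into the generic download/extract utility
download_and_extract \
  --input-url-file=./warc_urls.txt \
  --builder-config-file=config/cc_warc_builder.yaml \
  --output-json-dir=/output/common_crawl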

Data Sources & File Formats#

Load data from public, local, and custom data sources.

  * arXiv: Extract and process scientific papers from arXiv

  * Common Crawl: Load and preprocess text data from Common Crawl web archives

  * Custom Data: Load your own text datasets in various formats

  * Wikipedia: Import and process Wikipedia articles for training datasets