# Text Data Loading
Load text data from a variety of data sources using NeMo Curator.
NeMo Curator provides tools for downloading and processing large-scale public text datasets. Common data formats like Common Crawl's `.warc.gz` are automatically converted to more processing-friendly formats like `.jsonl`.
## How it Works
NeMo Curator’s data loading framework consists of three main components:
- **Downloaders**: Responsible for retrieving data from source locations (`DocumentDownloader`)
- **Iterators**: Parse through downloaded data to identify individual documents (`DocumentIterator`)
- **Extractors**: Extract and clean text from raw document formats (`DocumentExtractor`)
Each supported data source has specific implementations of these components optimized for that data type. The result is a standardized `DocumentDataset` that can be used for further curation steps.
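For a custom source, you subclass these three interfaces. Below is a minimal, illustrative sketch for a hypothetical plain-text source: the class and method names mirror the abstract interfaces exposed by `nemo_curator.download`, but the exact return conventions (for example, whether `extract` returns a dict or a tuple) vary by version, so check the base classes in your installation.

```python
import os
import urllib.request

from nemo_curator.download import (
    DocumentDownloader,
    DocumentExtractor,
    DocumentIterator,
)


class PlainTextDownloader(DocumentDownloader):
    """Hypothetical downloader: fetches one plain-text file per URL."""

    def __init__(self, download_dir):
        super().__init__()
        self._download_dir = download_dir

    def download(self, url):
        # Return the local path of the downloaded file
        path = os.path.join(self._download_dir, os.path.basename(url))
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)
        return path


class PlainTextIterator(DocumentIterator):
    """Treats each downloaded file as a single document."""

    def iterate(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            # Yield (metadata, raw_content) pairs for the extractor
            yield {"file_name": os.path.basename(file_path)}, f.read()


class PlainTextExtractor(DocumentExtractor):
    """The raw content is already clean text, so extraction is trivial."""

    def extract(self, content):
        return {"text": content}
```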
The high-level helpers in `nemo_curator.download` wrap these components for each supported source:

```python
from nemo_curator import get_client
from nemo_curator.download import download_common_crawl, download_wikipedia, download_arxiv

# Initialize a Dask client
client = get_client(cluster_type="cpu")

# Download Common Crawl snapshots in the given range and extract their text
dataset = download_common_crawl(
    output_path="/output/folder",
    start_snapshot="2020-50",
    end_snapshot="2021-04",
)

# Write to disk in the desired format
dataset.to_json(output_path="/output/folder", write_to_filename=True)
```
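The other helpers imported above follow the same pattern. A hedged sketch, assuming the documented parameter names (`dump_date` in particular may differ across versions):

```python
# Download and extract a specific Wikipedia dump (the dump_date value is an
# assumed example; omitting it typically selects the latest available dump)
wiki_dataset = download_wikipedia(
    output_path="/output/wikipedia",
    dump_date="20240420",
)

# Download and extract arXiv papers
arxiv_dataset = download_arxiv(output_path="/output/arxiv")
```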
```bash
# Generic download-and-extract utility. Requires a YAML configuration file
# specifying the downloader, iterator, and extractor implementations.
# Example config files: config/cc_warc_builder.yaml, config/arxiv_builder.yaml,
# config/wikipedia_builder.yaml
download_and_extract \
  --input-url-file=<Path to URL list> \
  --builder-config-file=<Path to YAML config file> \
  --output-json-dir=<Output directory>
```
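The builder config maps each pipeline stage to the class implementing it. A sketch modeled on `config/cc_warc_builder.yaml`; the key names and module paths shown here are assumptions, so consult the shipped example configs for the authoritative layout:

```yaml
# Assumed layout modeled on config/cc_warc_builder.yaml
download_module: nemo_curator.download.commoncrawl.CommonCrawlWARCDownloader
download_params: {}
iterator_module: nemo_curator.download.commoncrawl.CommonCrawlWARCIterator
iterator_params: {}
extract_module: nemo_curator.download.commoncrawl.CommonCrawlWARCExtractor
extract_params: {}
```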
```bash
# Alternative: extract from pre-downloaded files (extraction-only mode)
download_and_extract \
  --input-data-dir=<Path to downloaded files> \
  --builder-config-file=<Path to YAML config file> \
  --output-json-dir=<Output directory>
```
```bash
# Common Crawl URL retrieval utility: generates a list of WARC file URLs
# for the specified snapshot range
get_common_crawl_urls \
  --starting-snapshot="2020-50" \
  --ending-snapshot="2020-50" \
  --output-warc-url-file=./warc_urls.txt
```
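The two utilities compose naturally: generate the URL list first, then feed it to `download_and_extract`. A sketch assuming the Common Crawl example config path mentioned above:

```bash
# Assumed end-to-end flow: list the WARC URLs for one snapshot, then
# download and extract them with the Common Crawl builder config
get_common_crawl_urls \
  --starting-snapshot="2020-50" \
  --ending-snapshot="2020-50" \
  --output-warc-url-file=./warc_urls.txt

download_and_extract \
  --input-url-file=./warc_urls.txt \
  --builder-config-file=config/cc_warc_builder.yaml \
  --output-json-dir=./extracted_json
```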
## Data Sources & File Formats

Load data from public, local, and custom data sources:

- Extract and process scientific papers from arXiv
- Load and preprocess text data from Common Crawl web archives
- Load your own text datasets in various formats
- Import and process Wikipedia articles for training datasets