This guide covers the core concepts for acquiring and processing text data from remote sources in NeMo Curator. Data acquisition focuses on downloading, extracting, and converting remote data sources into the DocumentBatch format for further processing.
Data acquisition in NeMo Curator follows a four-stage architecture: URL generation, download, iteration, and extraction.
This process transforms diverse remote data sources into a standardized DocumentBatch that can be used throughout the text curation pipeline.
The data acquisition framework consists of four abstract base classes that define the acquisition workflow:
Generates URLs for downloading from minimal input configuration. Override generate_urls, which returns the list of URLs to download.
Example Implementation:
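A minimal sketch of the pattern, not the library's own code. The URLGenerator base class below is a local stand-in whose shape is assumed from the description above, and MonthlyDumpURLGenerator, base_url, and the example.com address are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class URLGenerator(ABC):
    """Local stand-in for the URL-generation base class (assumed shape)."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        """Return the URLs that should be downloaded."""


@dataclass
class MonthlyDumpURLGenerator(URLGenerator):
    """Hypothetical generator: one archive URL per month of a given year."""

    base_url: str  # hypothetical endpoint, not a real data source
    year: int

    def generate_urls(self) -> list[str]:
        # Expand the minimal configuration (base URL + year) into 12 URLs.
        return [
            f"{self.base_url}/{self.year}-{month:02d}.jsonl.gz"
            for month in range(1, 13)
        ]


urls = MonthlyDumpURLGenerator("https://example.com/dumps", 2024).generate_urls()
```

The point of the design is that the caller supplies only a small configuration and the generator expands it into the full list of download targets.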
Connects to and downloads data from remote repositories. You must override _get_output_filename and _download_to_path. Both are called by the underlying download method, which is designed to be idempotent.
Example Implementation:
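The sketch below illustrates how the two overrides could plug into an idempotent download method. The DocumentDownloader class here is a local stand-in written for this example; its constructor, return types, and the temp-file-then-rename strategy are assumptions, not the library's confirmed behavior. LocalCopyDownloader is hypothetical and treats each "URL" as a local file path so the example runs offline.

```python
import os
import shutil
import tempfile
from abc import ABC, abstractmethod
from typing import Optional


class DocumentDownloader(ABC):
    """Local stand-in for the downloader base class (assumed shape)."""

    def __init__(self, download_dir: str) -> None:
        self.download_dir = download_dir
        os.makedirs(download_dir, exist_ok=True)

    @abstractmethod
    def _get_output_filename(self, url: str) -> str:
        """Map a URL to the file name used on local disk."""

    @abstractmethod
    def _download_to_path(self, url: str, path: str) -> bool:
        """Fetch the URL into path; return True on success."""

    def download(self, url: str) -> Optional[str]:
        # Idempotence: if the final file already exists, do nothing.
        final_path = os.path.join(self.download_dir, self._get_output_filename(url))
        if os.path.exists(final_path):
            return final_path
        # Write to a temporary path so a failed transfer never looks complete.
        tmp_path = final_path + ".tmp"
        if not self._download_to_path(url, tmp_path):
            return None
        os.replace(tmp_path, final_path)
        return final_path


class LocalCopyDownloader(DocumentDownloader):
    """Hypothetical downloader that treats each URL as a local file path."""

    def __init__(self, download_dir: str) -> None:
        super().__init__(download_dir)
        self.transfers = 0  # counts real transfers, to show idempotence

    def _get_output_filename(self, url: str) -> str:
        return os.path.basename(url)

    def _download_to_path(self, url: str, path: str) -> bool:
        shutil.copyfile(url, path)
        self.transfers += 1
        return True


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "sample.txt")
with open(source, "w", encoding="utf-8") as handle:
    handle.write("hello")

downloader = LocalCopyDownloader(os.path.join(workdir, "downloads"))
first = downloader.download(source)
second = downloader.download(source)  # second call reuses the existing file
```

Calling download twice for the same source performs only one transfer, which is the property a rerun of a partially completed job relies on.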
Extracts individual records from downloaded containers. Override only iterate and output_columns. iterate must load the file at the given local path and return its documents. The resulting list[dict] is then converted to a pandas DataFrame, which is passed to the DocumentExtractor.
Example Implementation:
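As a rough illustration, the iterator below reads a JSON Lines file and yields one dict per record. The DocumentIterator base class is a local stand-in with an assumed shape, and JsonlIterator plus the id and body field names are hypothetical. The conversion to a pandas DataFrame is left out so the sketch stays dependency-free.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from collections.abc import Iterator
from typing import Any


class DocumentIterator(ABC):
    """Local stand-in for the iterator base class (assumed shape)."""

    @abstractmethod
    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        """Yield one dict per record found in the downloaded file."""

    @abstractmethod
    def output_columns(self) -> list[str]:
        """Name the keys that every yielded record contains."""


class JsonlIterator(DocumentIterator):
    """Hypothetical iterator for a JSON Lines container."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        with open(file_path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                # Rename source fields to the declared output columns.
                yield {"id": record["id"], "content": record["body"]}

    def output_columns(self) -> list[str]:
        return ["id", "content"]


path = os.path.join(tempfile.mkdtemp(), "part-0.jsonl")
with open(path, "w", encoding="utf-8") as handle:
    handle.write('{"id": "a", "body": "first"}\n\n{"id": "b", "body": "second"}\n')

iterator = JsonlIterator()
records = list(iterator.iterate(path))
```

Keeping output_columns in sync with the keys that iterate yields gives downstream steps a fixed schema to rely on.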
DocumentExtractor is optional and operates on the pandas DataFrame produced by the iteration step.
Example Implementation:
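The following sketch shows one way an extraction step might turn raw record content into clean text. To stay dependency-free it applies a per-record extract method to a list of dicts that stands in for the DataFrame rows. That per-record signature, the local DocumentExtractor stand-in, and the TagStrippingExtractor class are all assumptions made for illustration. The regex-based tag removal is deliberately naive and is not a real HTML parser.

```python
import html
import re
from abc import ABC, abstractmethod
from typing import Any, Optional


class DocumentExtractor(ABC):
    """Local stand-in for the extractor base class (assumed shape)."""

    @abstractmethod
    def extract(self, record: dict[str, Any]) -> Optional[dict[str, Any]]:
        """Return the cleaned record, or None to drop it."""


class TagStrippingExtractor(DocumentExtractor):
    """Hypothetical extractor: strips markup and drops empty documents."""

    _TAG = re.compile(r"<[^>]+>")

    def extract(self, record: dict[str, Any]) -> Optional[dict[str, Any]]:
        # Replace tags with spaces, decode entities, then collapse whitespace.
        text = html.unescape(self._TAG.sub(" ", record["content"]))
        text = " ".join(text.split())
        if not text:
            return None
        return {"id": record["id"], "text": text}


rows = [
    {"id": "a", "content": "<p>Hello &amp; welcome</p>"},
    {"id": "b", "content": "<div>   </div>"},
]
extractor = TagStrippingExtractor()
extracted = [out for row in rows if (out := extractor.extract(row)) is not None]
```

Returning None for records with no usable text lets the extraction step filter as well as transform.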
NeMo Curator provides built-in support for major public text datasets:
Download and extract web archive data from Common Crawl
Download and extract scientific papers from arXiv
Download and extract Wikipedia articles from Wikipedia dumps
Implement a download and extract pipeline for a custom data source
The data acquisition process seamlessly integrates with NeMo Curator’s pipeline-based architecture. The DocumentDownloadExtractStage handles parallel processing through the distributed computing framework.
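To make the composition concrete, here is a toy sketch that chains the four steps per URL and fans the URLs out across workers. It does not use DocumentDownloadExtractStage or any NeMo Curator API: a thread pool stands in for the distributed framework, and FAKE_REMOTE, the mem:// URLs, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "remote" so the example runs offline.
FAKE_REMOTE = {
    "mem://part-0": "alpha\nbeta",
    "mem://part-1": "gamma",
}


def generate_urls() -> list[str]:
    return sorted(FAKE_REMOTE)


def download(url: str) -> str:
    return FAKE_REMOTE[url]


def iterate(url: str, payload: str):
    # One record per line of the downloaded payload.
    for index, line in enumerate(payload.splitlines()):
        yield {"id": f"{url}#{index}", "content": line}


def extract(record: dict) -> dict:
    return {"id": record["id"], "text": record["content"].upper()}


def process_url(url: str) -> list[dict]:
    """Run download, iterate, and extract for a single URL."""
    return [extract(record) for record in iterate(url, download(url))]


# Each URL is an independent unit of work, so URLs can be processed in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    batches = list(pool.map(process_url, generate_urls()))

documents = [doc for batch in batches for doc in batch]
```

Because each URL is processed independently, the same per-URL function can be handed to whatever executor the pipeline runs on.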
Data acquisition leverages distributed computing frameworks for scalable processing:
Data acquisition produces a standardized output that integrates seamlessly with Curator’s Data Loading Concepts:
Data acquisition includes basic content-level deduplication during extraction (such as removing duplicate HTML content within individual web pages). This is separate from the main deduplication pipeline stages (exact, fuzzy, and semantic deduplication) that operate on the full dataset after acquisition.
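As an illustration of what content-level deduplication can mean, the sketch below removes repeated paragraphs inside a single document. It is an assumed, simplified technique, not NeMo Curator's actual extraction logic, and dedupe_paragraphs is a hypothetical helper.

```python
def dedupe_paragraphs(text: str) -> str:
    """Drop repeated paragraphs within one document, keeping first occurrences."""
    seen: set[str] = set()
    kept: list[str] = []
    for paragraph in text.split("\n\n"):
        # Compare on a normalized form: collapsed whitespace, case-insensitive.
        key = " ".join(paragraph.split()).lower()
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(paragraph.strip())
    return "\n\n".join(kept)


page = "Subscribe now\n\nMain article text.\n\nsubscribe  now\n\nFooter"
cleaned = dedupe_paragraphs(page)
```

The scope here is a single document. Duplicates that span documents are left to the later dataset-level deduplication stages.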
This enables you to: