Data Acquisition Concepts
This guide covers the core concepts for acquiring and processing text data from remote sources in NeMo Curator. Data acquisition focuses on downloading, extracting, and converting remote data sources into the DocumentBatch format for further processing.
Overview
Data acquisition in NeMo Curator follows a three-stage architecture:
- Generate URLs: Discover and generate download URLs from minimal input.
- Download: Retrieve raw data files from remote sources.
- Iterate and Extract: Extract individual records from downloaded containers and convert raw content to clean, structured text.
This process transforms diverse remote data sources into a standardized DocumentBatch that can be used throughout the text curation pipeline.
Core Components
The data acquisition framework consists of abstract base classes that define the acquisition workflow:
URLGenerator
Generates download URLs from a minimal input configuration. Override the generate_urls method, which returns the list of URLs to download.
Example Implementation:
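As a sketch, a generator for a hypothetical monthly-dump layout might look like the following. The URLGenerator stand-in here only mirrors the generate_urls contract described above; it is not NeMo Curator's actual import path or base class, and the URL naming scheme is an assumption.

```python
class URLGenerator:
    """Stand-in for the base class so this sketch is self-contained."""

    def generate_urls(self) -> list[str]:
        raise NotImplementedError


class MonthlyDumpURLGenerator(URLGenerator):
    """Builds one download URL per month for a hypothetical dump layout."""

    def __init__(self, base_url: str, months: list[str]):
        self.base_url = base_url
        self.months = months

    def generate_urls(self) -> list[str]:
        # One archive per month, following an assumed naming scheme.
        return [f"{self.base_url}/dump-{month}.jsonl.gz" for month in self.months]


urls = MonthlyDumpURLGenerator("https://example.com/data", ["2024-01", "2024-02"]).generate_urls()
```

The generator holds only lightweight configuration (a base URL and a list of months); the heavy lifting of fetching data belongs to the downloader.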
DocumentDownloader
Connects to remote repositories and downloads data from them. Override _get_output_filename and _download_to_path, which are called by the underlying download function; that function is designed to be idempotent, so re-running it does not repeat completed downloads.
Example Implementation:
DocumentIterator
Extracts individual records from downloaded containers. Override iterate to load the file at the given path and yield records, and override output_columns to declare the output schema. Records are passed directly to the extractor (if provided) inline during iteration.
Example Implementation:
DocumentExtractor (Optional)
DocumentExtractor transforms individual records and is optional. When provided to DocumentIterateExtractStage, it processes each record inline during iteration rather than as a separate stage.
Example Implementation:
Supported Data Sources
NeMo Curator provides built-in support for major public text datasets:
- Common Crawl: Download and extract web archive data from Common Crawl (web-scale, multilingual).
- arXiv: Download and extract scientific papers from arXiv (academic, scientific).
- Wikipedia: Download and extract Wikipedia articles from Wikipedia dumps (encyclopedic, structured).
- Custom: Implement a download and extract pipeline for a custom data source (extensible, specialized).
Integration with Pipeline Architecture
The data acquisition process seamlessly integrates with NeMo Curator’s pipeline-based architecture. The DocumentDownloadExtractStage handles parallel processing through the distributed computing framework.
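Conceptually, the stage chains the three components (plus an optional extractor) in sequence. The sketch below shows that flow in plain Python; the function and method names are illustrative, not NeMo Curator's actual API, and the real stage additionally fans the work out across distributed workers.

```python
def acquire(url_generator, downloader, iterator, extractor=None):
    """Run generate -> download -> iterate/extract and collect the records."""
    records = []
    for url in url_generator.generate_urls():
        path = downloader.download(url)             # retrieve the raw file
        for record in iterator.iterate(path):       # pull out individual records
            if extractor is not None:
                record = extractor.extract(record)  # inline transformation
                if record is None:
                    continue                        # extractor dropped the record
            records.append(record)
    return records
```

In the real pipeline each URL can be handled by a separate worker, so the outer loop above is what the framework parallelizes.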
Acquisition Workflow
Performance Optimization
Parallel Processing
Data acquisition leverages distributed computing frameworks for scalable processing:
- Parallel Downloads: Each URL in the generated list is downloaded by a separate worker.
- Concurrent Extraction: Downloaded files are processed in parallel across workers.
- Memory Management: Large files are processed as streams to keep memory use bounded.
Integration with Data Loading
Data acquisition produces a standardized output that integrates seamlessly with Curator’s Data Loading Concepts:
:::{note} Data acquisition includes basic content-level deduplication during extraction (such as removing duplicate HTML content within individual web pages). This is separate from the main deduplication pipeline stages (exact, fuzzy, and semantic deduplication) that operate on the full dataset after acquisition. :::
This enables you to:
- Separate acquisition from processing for better workflow management.
- Cache acquired data to avoid re-downloading.
- Mix acquired and local data in the same processing pipeline.
- Use standard loading patterns regardless of data origin.