Custom Data Loading
Create custom data loading pipelines using Curator. This guide shows how to build modular stages that run on Curator's distributed processing backend.
How It Works
For custom data loading, Curator uses the same four-step pipeline pattern described in Data Acquisition Concepts. Each step is defined by an abstract base class with a corresponding processing stage, and the stages compose into pipelines.
Architecture Overview
For detailed information about the core components and data flow, see Data Acquisition Concepts and Data Loading Concepts.
Implementation Guide
1. Create Directory Structure
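A layout like the following keeps each component in its own module, matching the file names used in the steps below (the package name is illustrative):

```
my_dataset/
├── __init__.py
├── url_generation.py   # step 1: URL generator
├── download.py         # step 2: document downloader
├── iterator.py         # step 3: document iterator
├── extract.py          # step 4: document extractor
└── stage.py            # composite stage wiring the four together
```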
2. Build Core Components
URL Generator (url_generation.py)
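A minimal URL generator sketch, assuming a `generate_urls()` method that returns the list of files to fetch. The class name, base URL, and shard naming scheme are hypothetical.

```python
# url_generation.py — hypothetical sketch; names are illustrative, not the
# actual Curator API.
from dataclasses import dataclass


@dataclass
class SnapshotURLGenerator:
    """Step 1: produce one URL per snapshot shard, up to `limit` shards."""

    base_url: str = "https://example.com/snapshots"
    limit: int = 5

    def generate_urls(self) -> list[str]:
        # Zero-padded shard indices keep downloaded files lexically sorted.
        return [f"{self.base_url}/shard-{i:04d}.jsonl.gz" for i in range(self.limit)]
```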
Document Downloader (download.py)
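A downloader sketch, assuming a `download(url)` method that returns a local path. Skipping files that already exist makes re-runs idempotent, which matters when a distributed job is retried. Class and method names are illustrative.

```python
# download.py — hypothetical sketch; not the actual Curator API.
import os
import urllib.request


class SnapshotDownloader:
    """Step 2: download each URL into a local directory."""

    def __init__(self, download_dir: str):
        self.download_dir = download_dir
        os.makedirs(download_dir, exist_ok=True)

    def download(self, url: str) -> str:
        # Derive the local filename from the last URL path segment.
        filename = url.rsplit("/", 1)[-1]
        path = os.path.join(self.download_dir, filename)
        if not os.path.exists(path):  # skip files already present (idempotent)
            urllib.request.urlretrieve(url, path)
        return path
```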
Document Iterator (iterator.py)
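An iterator sketch for a JSON Lines file, assuming an `iterate(file_path)` method that yields one raw record per line. Yielding (rather than returning a list) keeps memory bounded for large shards.

```python
# iterator.py — hypothetical sketch; not the actual Curator API.
import json
from typing import Iterator


class JsonlIterator:
    """Step 3: yield one raw record dict per line of a JSON Lines file."""

    def iterate(self, file_path: str) -> Iterator[dict]:
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # tolerate blank lines
                    yield json.loads(line)
```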
Document Extractor (extract.py)
3. Create Composite Stage (stage.py)
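The composite stage wires the four components together: generate URLs, download each one, iterate its records, and extract documents. The sketch below takes the four steps as plain callables to stay self-contained; the real Curator composite stage API may look different.

```python
# stage.py — hypothetical sketch; not the actual Curator API.
class CustomLoadStage:
    """Composes URL generation, download, iteration, and extraction."""

    def __init__(self, generate_urls, download, iterate, extract):
        self.generate_urls = generate_urls
        self.download = download
        self.iterate = iterate
        self.extract = extract

    def run(self) -> list[dict]:
        documents = []
        for url in self.generate_urls():          # step 1
            local_path = self.download(url)       # step 2
            for record in self.iterate(local_path):  # step 3
                doc = self.extract(record)        # step 4
                if doc is not None:               # extractor may drop records
                    documents.append(doc)
        return documents
```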
Usage Examples
Basic Pipeline
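A basic pipeline runs stages in order, feeding each stage's output to the next. The `Pipeline` class below is a minimal stand-in so the example is self-contained; the real Curator pipeline and executor classes have their own names and options.

```python
# Hypothetical end-to-end sketch; Pipeline is a stand-in, not the Curator API.
class Pipeline:
    """Runs stages in order, piping each stage's output into the next."""

    def __init__(self, stages):
        self.stages = stages

    def run(self, data=None):
        for stage in self.stages:
            data = stage(data)
        return data


# Usage: two toy stages — load documents, then filter out short ones.
def load(_):
    return [{"text": "hello world"}, {"text": "hi"}]


def drop_short(docs):
    return [d for d in docs if len(d["text"]) > 5]


pipeline = Pipeline([load, drop_short])
```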
For executor options and configuration, see the Execution Backends reference.
Parameters Reference
Custom Data Loading Parameters
Output Format
Processed data flows through the pipeline as DocumentBatch tasks containing Pandas DataFrames or PyArrow Tables:
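A batch payload might look like the following, shown here as a plain pandas DataFrame (the `DocumentBatch` wrapper itself is a Curator task type; the column names are those produced by the extractor sketch above and are not prescribed by Curator):

```python
# Illustrative batch payload only; DocumentBatch is approximated by its
# underlying pandas DataFrame.
import pandas as pd

batch = pd.DataFrame(
    {
        "text": ["First document body.", "Second document body."],
        "id": ["doc-0", "doc-1"],
        "source": ["https://example.com/a", "https://example.com/b"],
    }
)
```

Downstream stages receive one such batch at a time, so per-batch memory stays bounded regardless of total dataset size.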