Create an Image Curation Pipeline
Learn the basics of creating an image curation pipeline in Curator by following a complete workflow that filters images by aesthetic quality and NSFW content.
Before You Start
- Follow the Get Started guide to install the package, prepare the model directory, and set up your data paths.
Concepts and Mental Model
Use this overview to understand how stages pass data through the pipeline.
- Pipeline: An ordered list of stages that process data.
- Stage: A modular operation (for example, read, embed, filter, write).
- Executor: Runs the pipeline (XennaExecutor backend).
- Data units: Input images → embeddings → quality scores → filtered output.
- Common choices:
- Embeddings: CLIP ViT-L/14 for semantic understanding
- Quality filters: Aesthetic predictor and NSFW classifier
- Thresholds: Configurable scoring thresholds for filtering
- Outputs: Filtered tar archives with embeddings, quality scores, and image data.
For more information, refer to the Image Concepts section.
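The stage/pipeline/executor flow above can be sketched with a toy, dependency-free implementation. The classes below are illustrative stand-ins for the mental model, not Curator's actual API:

```python
# Toy illustration of the pipeline/stage/executor mental model.
# These minimal classes are stand-ins, not Curator's real API.

class Stage:
    """A modular operation: receives a batch, returns a transformed batch."""
    def process(self, batch):
        raise NotImplementedError

class ScoreFilterStage(Stage):
    """Keep items whose score (from score_fn) meets a threshold."""
    def __init__(self, score_fn, threshold):
        self.score_fn = score_fn
        self.threshold = threshold

    def process(self, batch):
        return [item for item in batch if self.score_fn(item) >= self.threshold]

class Pipeline:
    """An ordered list of stages that process data."""
    def __init__(self, name):
        self.name = name
        self.stages = []

    def add_stage(self, stage):
        self.stages.append(stage)
        return self

    def run(self, batch):
        # A real executor (e.g. XennaExecutor) would schedule stages across
        # workers; here we simply apply them in order.
        for stage in self.stages:
            batch = stage.process(batch)
        return batch

# "Images" as dicts with precomputed scores, filtered the same way the
# aesthetic and NSFW stages work: score each item, then threshold.
images = [
    {"id": "a", "aesthetic": 0.9, "nsfw": 0.1},
    {"id": "b", "aesthetic": 0.3, "nsfw": 0.1},
    {"id": "c", "aesthetic": 0.8, "nsfw": 0.9},
]
pipe = Pipeline("image-curation")
pipe.add_stage(ScoreFilterStage(lambda x: x["aesthetic"], 0.5))
pipe.add_stage(ScoreFilterStage(lambda x: 1 - x["nsfw"], 0.5))
kept = pipe.run(images)  # only "a" passes both filters
```

Each stage sees only the output of the stage before it, which is why the data units progress strictly from images to embeddings to scores to filtered output.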
1. Define Imports and Paths
Import required classes and define paths used throughout the example.
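For example (the directory locations below are placeholders; the Curator imports themselves come from the package installed in the Get Started guide):

```python
from pathlib import Path

# Placeholder paths -- adjust these to your environment.
DATA_DIR = Path("/datasets/images/tars")        # input tar archives
MODEL_DIR = Path("/models")                     # downloaded model weights
OUTPUT_DIR = Path("/datasets/images/curated")   # filtered output destination
```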
2. Create the Pipeline
Instantiate a named pipeline to orchestrate the stages.
3. Define Stages
Add modular stages to partition, read, embed, filter, and write images.
Partition Input Files
Divide tar files across workers for parallel processing.
Read Images
Load images from tar archives and extract metadata.
Generate CLIP Embeddings
Create semantic embeddings for each image using CLIP ViT-L/14.
Filter by Aesthetic Quality
Score and filter images based on aesthetic quality using a trained predictor.
Filter NSFW Content
Detect and filter out NSFW (Not Safe For Work) content.
Write Filtered Dataset
Save the curated images and metadata to the output directory.
4. Run the Pipeline
Execute the configured pipeline. If you don't pass an executor, the pipeline uses the XennaExecutor backend by default.
Complete Example
Here’s the full pipeline code:
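The full listing did not survive extraction into this copy, so the sketch below shows the overall shape such a pipeline takes. Every class name and parameter here (FilePartitioningStage, ImageReaderStage, ImageEmbeddingStage, ImageAestheticFilterStage, ImageNSFWFilterStage, ImageWriterStage, and their arguments) is an assumption based on the stage descriptions above; check your installed Curator version for the actual imports and signatures.

```python
# Hypothetical sketch -- stage class names, import paths, and parameters are
# assumptions; consult your Curator version's API reference.
# from nemo_curator... import Pipeline, XennaExecutor, ...  # actual paths vary

DATA_DIR = "/path/to/tar_files"   # input tar archives
OUTPUT_DIR = "/path/to/output"    # curated dataset destination

pipeline = Pipeline(name="image_curation")

# Partition input tar files across workers for parallel processing
pipeline.add_stage(FilePartitioningStage(file_paths=DATA_DIR))
# Load images and metadata from the tar archives
pipeline.add_stage(ImageReaderStage())
# Generate CLIP ViT-L/14 embeddings for each image
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/path/to/models",
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
))
# Keep images scoring above the aesthetic threshold
pipeline.add_stage(ImageAestheticFilterStage(score_threshold=0.5))
# Drop images scoring above the NSFW threshold
pipeline.add_stage(ImageNSFWFilterStage(score_threshold=0.5))
# Write filtered tar archives with embeddings, scores, and image data
pipeline.add_stage(ImageWriterStage(output_dir=OUTPUT_DIR))

# XennaExecutor is the default backend if no executor is specified
pipeline.run(XennaExecutor())
```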
Performance Tuning
Batch Size Guidelines
The model_inference_batch_size parameter controls how many images the model processes at once, which directly affects GPU memory usage and throughput. Higher batch sizes improve throughput but require more GPU memory; lower batch sizes reduce the risk of out-of-memory (OOM) errors at the cost of speed.
Memory Requirements and Actor Allocation
For the embedding stage (which uses the most GPU memory), each actor requires 0.25 GPU allocation, corresponding to about 20GB of GPU memory on an 80GB GPU (0.25 × 80GB = 20GB). When adjusting batch sizes, ensure memory usage stays below this 20GB threshold per actor.
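The per-actor budget is simply the fractional allocation times the GPU's total memory:

```python
def per_actor_memory_gb(num_gpus_per_worker: float, gpu_memory_gb: float) -> float:
    """GPU memory available to each actor under fractional allocation."""
    return num_gpus_per_worker * gpu_memory_gb

# 0.25 allocation on an 80GB GPU -> 20GB budget per actor
budget = per_actor_memory_gb(0.25, 80.0)
```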
Performance vs Memory Trade-offs
- Higher batch sizes: Better GPU usage and faster processing, but risk OOM errors
- Lower batch sizes: Safer memory usage and OOM prevention, but slower processing
- Rule of thumb: Start with the recommended batch size for your GPU, then increase gradually while monitoring memory usage
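One way to apply the rule of thumb programmatically is to double the batch size while an estimated memory footprint stays under the per-actor budget. The per-image memory figure below is a hypothetical placeholder; in practice you would measure real usage while monitoring the GPU:

```python
def tune_batch_size(start: int, mem_per_image_gb: float,
                    budget_gb: float, max_batch: int = 1024) -> int:
    """Grow the batch size while the estimated footprint fits the budget."""
    batch = start
    while batch * 2 <= max_batch and (batch * 2) * mem_per_image_gb <= budget_gb:
        batch *= 2
    return batch

# Hypothetical numbers: 0.15 GB per image, 20 GB per-actor budget
best = tune_batch_size(start=8, mem_per_image_gb=0.15, budget_gb=20.0)
```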
GPU Allocation
The default num_gpus_per_worker=0.25 works well for most scenarios and allows 4 workers per GPU. You typically don’t need to change this unless you have specific requirements:
- Default (recommended): 0.25 allows multiple workers per GPU for better utilization
- Single worker per GPU: Set to 1.0 if you want exclusive GPU access per worker
- CPU processing: Set to 0 (embedding and filtering stages require GPUs for optimal performance)
The system automatically calculates the optimal number of actors based on available resources and the num_gpus_per_worker setting.
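The derived actor count follows directly from the fractional setting. This is a simplified sketch; the system's real calculation also weighs CPU and memory availability:

```python
def max_actors(total_gpus: int, num_gpus_per_worker: float) -> int:
    """Upper bound on concurrent actors for a fractional GPU allocation."""
    return int(total_gpus / num_gpus_per_worker)

# 8 GPUs at the default 0.25 allocation -> 32 actors (4 per GPU)
actors = max_actors(8, 0.25)
```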
Next Steps
After running this basic curation pipeline, you can:
- Adjust thresholds: Experiment with different aesthetic and NSFW score thresholds
- Add custom filters: Create domain-specific quality filters
- Run duplicate removal: Remove similar or duplicate images using semantic duplicate removal
- Export for training: Prepare curated data for downstream ML training tasks
For more advanced workflows, refer to the Image Duplicate Removal Tutorial.